--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/2020.markdown Mon May 08 13:46:28 2023 -0400
@@ -0,0 +1,1576 @@
+[TOC]
+
+# January 2020
+
+## 2020-01-03
+
+Converted `paste.stevelosh.com` to HTTPS. Forgot it in the previous conversion.
+
+## 2020-01-14
+
+Moved all my repositories from Source Hut to `hg.stevelosh.com`. Wasn't
+expecting to have to shave this yak so soon, but such is life. On the plus
+side: pushing is *so much faster* now. Rough timing for an empty push was ~5
+seconds on Source Hut, down to around a second now. No more sitting and
+staring at the command line waiting to get back to work.
+
+## 2020-01-15
+
+Finished (mostly) reskinning `hg.stevelosh.com`. Had to shave a couple of yaks
+in Mercurial itself (diff headers getting truncated) and the hgweb Markdown
+extension. Ended up just completely forking the extension and tearing out
+almost all of its guts. The result should (hopefully) be easier to maintain as
+Mercurial breaks internal shit over time.
+
+Still need to finish the CSS for the rendered Markdown, but that can wait for
+now.
+
+## 2020-01-17
+
+Got the RIT VPN set up on my Linux machine so I can connect for class. I was
+worried I'd have to install some horrible Cisco VPN client thing, but it turns
+out you can use the `openconnect` client to connect, so `sudo apt install
+openconnect` and then `sudo openconnect vpn.rit.edu` just magically works.
+Fantastic.
+
+Finished the `hg.stevelosh.com` reskin on my lunch break. Just had to get the
+`README`s looking decent.
+
+Finally figured out a strategy for syncing a small subset of my dotfiles onto
+remote servers easily, so I don't have to be completely miserable on them.
+Goals:
+
+* Needs to provide a reasonable subset of the stuff I use to get around.
+* Needs to handle servers with both bash and fish.
+* Needs to be easily syncable with a single command and no extra setup.
+* Needs to handle servers that don't have git or Mercurial.
+
+Ended up making a `remote/` folder in my dotfiles. Almost everything inside it
+is a symlink except for a `bootstrap.sh`. The folder gets `rsync --copy-links`ed
+over and then `bootstrap.sh` will symlink all the resulting files into their
+places. Hopefully this should be easier to maintain. Did a bunch of cleanup on
+my bash and fish setups while I was at it.
+
+Still need to figure out how to get a decent Vim setup remotely. Syncing all my
+plugins is probably not ideal, but maybe. Who knows.
+
+## 2020-01-18
+
+KA biology and coffee.
+
+Cellular respiration:
+
+* Starts with glucose (C₆H₁₂O₆).
+* The process is called glycolysis, and starts in the cytosol.
+* Glycolysis split the 6-carbon glucose into two 3-carbon molecules (pyruvate).
+* In the process we produce 2 (net) ATP's, and reduce 2 NAD⁺ molecules to 2 NADH
+ (note the move from a positive charge to negative, i.e. gaining electrons,
+ i.e. RIG (reduction is gaining)).
+* Some organisms stop here and use the pyruvates for fermentation.
+* Otherwise, the carboxyl group is stripped from each pyruvate and released as
+ CO₂.
+* The rest of the pyruvate (essentially an acetyl group) latches onto "coenzyme
+ A" (aka "CoA"), forming acetyl-CoA.
+* The enzyme transfers the acetyl group to an oxaloacetic acid, which forms
+ citric acid.
+* Next step is the "Krebs cycle"/"citric acid cycle", which is… complicated, and
+ happens in the matrix of the mitochondria.
+* As the citric acid cycle progresses we keep reducing to NADH.
+* The ATPs produced in cellular respiration can be used directly.
+* The reduced NADHs (and other products like QH₂) are used to create a proton
+ gradient in the mitochondria, which is then used to produce more ATP (each
+ NADH results in ~2-3 ATPs). This happens in the crista of the mitochondria.
+* Whole result is ~27-38 ATPs (typically 29-30 in most cells).
+* Review: the four steps of cellular respiration:
+ 1. Glycolysis: splits glucose into two pyruvates, produces ATP and reduces NAD⁺ to NADH. Happens in the cytosol.
+ 2. Pyruvate Oxidation: each pyruvate is stripped of CO₂, binds to CoA, and generates more NADH.
+ 3. Citric Acid/Krebs Cycle: each acetyl-CoA goes through a cycle and produces ATP, NADH, and FADH₂.
+ 4. Oxidative Phosphorylation: the NADH and FADH₂ are used to make a proton gradient, and that is used to make more ATP.
+
+Did the `TRIE` problem in Rosalind. Making an actual trie was easy, but getting
+an adjacency list for the flattened/copypasteable output and using `cl-dot` to
+graph it nicely was fiddly. Annoying bits about `cl-dot` to remember for next
+time:
+
+* There's no `:shape :square`, it's `:shape :box`.
+* `:fillcolor` does nothing without `:style :filled`.
+* The solution to "edge labels are too close to the edges" is apparently "add
+ spaces to your label string". Ugh.
+* To change the size of a node, use `:height` and `:width`, not `:size`. They
+ will be "helpfully" be ignored if you have a label that's too big to fit.
+* `cl-dot` provides implicit labels, so if you want no label you need `:label ""`.
+
+Cleanup up the styling of `hg.stevelosh.com` a bit more. This time it was the
+`<code>` tags inside `<li>`'s that weren't styled. I styled `<p><code>` before,
+because I wanted to *avoid* applying those styles to `<pre><code>` content
+(otherwise it'd have double borders all over the place). But that missed out on
+the `<code>` tags that weren't inside paragraphs. Ended up just styling *all*
+`<code>`s and then undoing the borders in `<pre><code>`. Sigh.
+
+Also realized that the repository descriptions served by hgweb can contain
+arbitrary HTML. Neat. Used that to clean a few more things up.
+
+Updated all the Bitbucket repository `README`s to point to `hg.stevelosh.com`.
+The Great Source Hut Yak Shaving of 2020 is finally complete, thank christ.
+Time to move on to doing actual work.
+
+## 2020-01-19
+
+Finally got around to setting up packages for all of my Rosalind problems. In
+all my "coding challenge" repos (Rosalind, Project Euler, Advent of Code, etc)
+I originally just shoved everything into one big package. This was easy to
+start with, but as I solved more and more problems I started to hit naming
+conflicts more and more often. Eventually I realized that for my own sanity
+I should really give each problem its own package, with all the utilities in
+a `utils` package. I did this for my Advent of Code repo a while back and while
+it was a pain in the ass to refactor everything, the result is much nicer.
+Today I refactored the Rosalind repo. The Project Euler repo is a work in
+progress, mostly because it's so much bigger than the other two.
+
+Did Rosalind problem `PDST`. Pretty easy. Most annoying part was the output,
+but I extended my float printer to handle 2D arrays to make it easier to output
+what they want in the future.
+
+## 2020-01-20
+
+Wanted to do a Rosalind problem over lunch. On a whim I decided to use CCL just
+to keep myself honest — I usually use SBCL, so using another implementation now
+and then helps ensure I'm writing standards-compliant code. But apparently
+a recent Vlime commit broke Vlime in CCL. So I had to shave *that* yak first.
+Sent a PR.
+
+Did Rosalind problem `ASPC`. The problem itself was trivial. Made a handy
+iterate driver to make it even more trivial.
+
+Running the test suite in CCL gave me exactly what I wanted: examples of where
+I was unintentionally relying on SBCL's behavior. Today there were two:
+
+1. Relying on the traversal order of iterate's `:in-hashtable` (which uses
+ `with-hash-table-iterator`) to get the output of `TRIE`. Made a helper to
+ convert the hash tables to sorted alists to fix the output.
+2. Relying on the `~,…F` format directive having a half-up rounding strategy.
+ Turns out that when format is rounding a float to a given number of decimal
+ places, the implementation is allowed to round half however it wants. SBCL
+ seems to round half up (which matches what Rosalind shows in their examples,
+ which is why I never noticed) but CCL seems to use banker's rounding. Had to
+ do an ugly hack in `float-string` to make the output consistently match
+ Rosalind's style everywhere.
+
+## 2020-01-22
+
+Started doing Rosalind problems in shell/Awk, to join in with a group of folks
+at work doing them for the first time. The first few problems are simple, but
+I still managed to learn a few tricks by implementing them with UNIXy tools.
+
+Fixed a few remaining broken links on my site. It's been a week or two since
+I poked around at it, and that was on another machine, but `./deploy.sh` Just
+Worked. It's so nice to have a site generator I designed myself that's not
+going to change out from under me when I just want to fix something simple and
+move on with my life.
+
+## 2020-01-23
+
+Lesk book exercises, chapter 1:
+
+1. Average density would be `(/ 3r4 3r9) = 1/100000` or 1 gene every ~100kb.
+2. Parts:
+ * A: Two humans would have roughly 250k differences.
+ * B: A human and a chimpanzee would have roughly 3m differences.
+3. Parts:
+ * A: Average density would be 2 SNPs per 1000 base pairs.
+ * B: Is this a trick question? I think 1.1% of the differences would be in protein-coding regions.
+4. Parts:
+ * A: I am confused. The glossary defines a "haplotype" as a group of closely-linked genes that are typically inherited together. So… 1 haplotype? But the *text* talks about a haplotype as being a combination of SNPs in a recombination-poor region. So if "haplotype" means the combination of SNPs, not a set of genes, then I think this would be `4^10 = 1048576`.
+ * B: On a diploid chromosome, you'd have two separate sets of SNPs to combine (and order doesn't matter), so I think it's `4^10 * 4^10 / 2`.
+ * C: `(1 SNP / 5 kb) * (100 kb) = 20 SNPs = 2^20 sequences = 1048576 sequences`
+5. TODO
+6. TODO
+7. TODO
+8. A much less drastic change, though it could still cause issues if it's at
+ a critical point in the RNA molecule (e.g. the anticodon in tRNA).
+9. TODO
+10. If the gene Kangaroo coat color is not on the X chromosome, you wouldn't be able to get the calico pattern that comes from X chromosome inactivation.
+11. Parts:
+ * A: No, because if it were a mutation in the gene itself it would act by affecting the structure of the protein, and should have the same effect in adults and children. But it doesn't, so it's more likely to be a mutation in regulatory region.
+ * B: It's not on the Y chromosome?
+12. TODO
+13. TODO
+14. TODO
+15. TODO
+16. TODO
+17. TODO
+18. TODO
+
+## 2020-01-24
+
+Solved the `FIB` Rosalind problem in my shell/Awk repo. Tried doing it in
+straight bash, but I couldn't figure out how to get the bash function to use
+a global bash array for memoization after like 10 minutes of searching. Ended
+up rewriting it in Awk, which took two minutes, was simple, and runs fast.
+Great.
+
+Solved `MMCH` in my Lisp repo. Much like the previous problem in this line, it
+was way easier than they made it seem. Also fixed the Uniprot cache to use the
+filesystem instead of just keeping a hash table, to save a few seconds of my
+life every time I `(run-tests)` in a fresh Lisp session.
+
+## 2020-01-25
+
+Rebooted my Mansion repo to try to simplify it enough that I can start
+dogfooding it for actual use, then turn it into something publicly-usable.
+
+## 2020-01-27
+
+Trying to decide on data to use for the BIOL-650 project. A list of
+possibilities:
+
+* <https://www.ncbi.nlm.nih.gov/sra/ERX1735780>
+* <https://www.cell.com/cell/fulltext/S0092-8674(15)01339-2#back-bib72>
+* <https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/studied-cancers>
+* <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3673022/>
+
+## 2020-01-29
+
+Figured out how to get the mutt HTMLified email working on my local machine,
+like I had done for work. Had to install NeoMutt from source because the Ubuntu
+version is too old to include `group-alternatives`:
+
+* Clone the NeoMutt repo.
+* Install Tokyo Cabinet for the header cache with `sudo apt install libtokyocabinet-dev`.
+* Configure the NeoMutt build with `./configure --disable-idn --idn2 --disable-doc --prefix=/usr/local --gnutls --with-ui=ncurses --tokyocabinet`.
+ * Note the `--disable-idn --idn2` to make it use `libidn2` which is installed, instead of `libidn` which isn't.
+ * Is this the correct set of options? Who knows!?
+* Install into `/usr/local` with `sudo make install`.
+
+More searching around for projects for HTSA:
+
+ https://www.sciencedirect.com/science/article/pii/S2352340918302920
+ doi 10.1016/j.dib.2018.03.079
+ Illumina HiSeq 2500
+ https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA432903
+
+ https://www.sciencedirect.com/science/article/pii/S2352340919303750
+ doi 10.1016/j.dib.2019.104022
+ Ion Proton sequencer
+ https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=506922
+
+Started going through the gnuplot book at night. Going to try to do that in
+parallel with the class.
+
+## 2020-01-30
+
+Drove to RIT only to discover class was not being held in person, and we're
+supposed to work online instead. **Welp.**
+
+Installed FastQC locally and got it running. Really just downloaded the binary
+version and symlinked it into place.
+
+Downloading the actual data is taking a while. While I wait, thinking about
+disk setups: now that I have a NAS and have moved my local backups to it, I can
+format my previously-for-backup internal drives and use them for actual data.
+I really need to spend the time to grok how Linux formats and mounts drives
+— I've been flailing and slapping things together and it has mostly worked until
+now, but I really should just learn it. Ordered a copy of the new edition of
+the sysadmin book that I remember being really good from my Dumbwaiter days.
+Hopefully that will be a good reference.
+
+Finished the FastQC stuff early, so spent a few minutes going through the
+gnuplot book. Random notes:
+
+* The user config file is `~/.gnuplot` (note the lack of an `rc` suffix).
+* To stop gnuplot from stealing my goddamn focus: `set terminal … noraise`.
+* To show the current value of a setting: `show …`.
+
+Finally got around to writing a first draft of a `lines` utility. Discovered
+and fixed a stupid bug in Adopt in the process.
+
+## 2020-01-31
+
+Now that I've got a more thorough backup system, went through and did my yearly
+pruning of unnecessary crap on my hard drive, and adding restic excludes to
+avoid backin up track. Number of scans was cut drastically:
+
+* Directories: 125k to 30k
+* Files: 567k to 285k
+
+Much cleaner now.
+
+# February 2020
+
+## 2020-02-01
+
+Fixed up my `$PATH` in my bash profile, for when I'm forced to use bash.
+
+Also had to fix up the uses of `set_color` in my Fish shell config to only
+happen for interactive connections, because Fish shits out some errors when you
+call `set_color` on a noninteractive connection (e.g. when using `scp` or
+`rsync`). Sigh.
+
+Now that I've got my backups pushed to my NAS (plus off-site) I can repurpose
+those 4 TB drives sitting empty in my tower. Gonna use them for a data dump
+site on the Linux side of things, e.g. for all the FASTQs and resulting data
+I'll be dealing with for my class this semester. Read through the sysadmin book
+chapters on the filesystem which were *super* helpful. Decided to pair the
+drives as a `btrfs` raid 0 array. I'm not super concerned with durability here
+— everything should be reproducible from the original raw FASTQs from SRA if I'm
+doing my job right, so I may as well just have one giant pool of storage. Also
+I don't need this `/dump` to be encrypted — it'll be handling non-sensitive
+data.
+
+To get them ready for `btrfs` I used:
+
+* `fdisk` to create new GPT partition tables on each and blow away the old ones.
+* Had to use `sudo cryptsetup luksClose` on the previously-encrypted backup
+ drive to remove the lingering `sda_crypt` "partition" I was seeing.
+* `wipefs` to finish cleaning them completely.
+
+Even with all that, `mkfs.btrfs` complained about a partition table still
+existing on one of the drives. At that point I gave up and passed `-f` to it,
+and everything Just Worked. In hindsight, I bet I could have just done that to
+begin with and let `mkfs.btrfs` handle blowing away everything, but it was still
+a fun exercise.
+
+Attempted to figure out how to acquire MS Word for the class. Apparently
+there's a web version of Word now? If I can use that, at least I won't have to
+spin up a full Windows VM to write the papers…
+
+Installed `sra-toolkit` from Ubuntu's repository. Unfortunately it doesn't seem
+to come with `fasterq-dump`, only the older `fastq-dump`, at least according to
+`dpkg-query -L sra-toolkit`. Installed the latest binary from their site
+instead.
+
+Poked around at both `fastq-dump` and `fasterq-dump`. `fasterq-dump` seems more
+documented and user friendly, and is definitely faster. The biggest speedup
+came from telling it to use RAM as its temporary filesystem with `-t /dev/shm`,
+which cut the time down to 1/3. Guess all this RAM *is* actually good for
+something.
+
+After rebooting later, the boot process was suddenly hanging for over a minute.
+Another fucking yak to shave. Figured out how to disable the goddamn splash
+screen (edit `/etc/default/grub` to remove `quiet` and `default` and then
+`update-grub2`) so I could get some logs. Eventually saw some systemd horseshit
+about `a job is running for dev-disk-by\x2duuid-…`. Tried to figure out which
+drive had that UUID, but none of them did. After a bunch of flailing in `fstab`
+I realized that it might be the UUID of *the old backup drive I killed*. Sure
+enough, there was still an entry in `/etc/crypttab` for it, and a bunch of
+systemd shit in `/run/systemd/generator/` left over that I had to clean out.
+After that, everything boots fast(ish) again. I hate computing.
+
+## 2020-02-02
+
+Spent some time cleaning up the `Makefile` for my individual class project.
+Learned some stuff about `make` in the process:
+
+* When you have a pattern rule `foo/bar/%.baz`, whatever matches the `%` is
+ called the "stem", and you can use `$*` to get the stem inside the rule.
+* There are a bunch of ways to allow users to override variables, but the one
+ I settled on is to use `VAR := val` in the `Makefile`. Users can override
+ these with `make VAR=newval …`. This doesn't allow overriding with
+ environment variables, but that seems fraught with danger anyway, so let's
+ consider it a feature, not a bug.
+* I was dicking around a lot with `addprefix` and `addsuffix` unnecessarily
+ before. Much simpler is to use `$(FOO:%=prefix%suffix)`, e.g. `FASTQ_FILES
+ := $(KEYS:%=data/00-input/%.fq.gz)`. This works with multiple-word variables
+ too (god help me if filenames ever contain spaces).
+* Prefixing a command with `@` prevents `make` from echoing the command before
+ executing it (but not when doing a dry run). Handy for the `echo …` commands
+ in `make debug`.
+* Management of directory creation is… annoying. Ended up going with
+ placeholder `.ph` files. Simple but inelegant. Oh well.
+
+Finished writing up my class project pick paper.
+
+Did more in the gnuplot book.
+
+Updated my `lines` tool to add a `--limit` option and handle broken pipes.
+I could have sworn I handled broken pipes in some other Lisp script at some
+point, but can't seem to find it now. I used `sb-int:broken-pipe` but I'm not
+sure if that's something I should be relying on. Need to ask in `#sbcl`.
+
+## 2020-02-04
+
+Class today. Did a bunch of various things after the lecture (I hope I can
+remember it all).
+
+Got the group permissions sorted out on the server in two steps:
+
+* Had the admin [add a `setgid` bit to the team directory][dir-setgid], so new
+ stuff in there will get created with the team group as the owner by default.
+* Then we all added `umask 002` to our `rc` files to change our default `umask`,
+ so when we create files they'll be `664` by default instead of `644`. This
+ isn't a problem for non-team files because they get created with your personal
+ group as the group.
+
+[dir-setgid]: https://www.gnu.org/software/coreutils/manual/html_node/Directory-Setuid-and-Setgid.html
+
+Once that was sorted out, started diving into aligning the Trapnell data with
+tophat and bowtie. This is… not pleasant.
+
+First I need to install both tools. There's an Ubuntu package for `tophat`, but
+it wanted to install NodeJS as a prerequisite. Why on earth does tophat want
+NodeJS? `SIGINT`ed that immediately and downloaded the binary version instead.
+
+The wrapper script `tophat2` tries to set up `$PATH` magically so everything
+just works:
+
+ #!/usr/bin/env bash
+ pbin=""
+ fl=$(readlink $0)
+ if [[ -z "$fl" ]]; then
+ pbin=$(dirname $0)
+ else
+ pbin=$(dirname $fl)
+ fi
+ export PATH=$pbin:$PATH
+ $pbin/tophat "$@"
+
+Unfortunately it doesn't try hard enough. The `readlink` is *attempting* to
+deal with symbolic links there, but only works with absolute links — relative
+symlinks (e.g. `ln -s ./tophat-…/tophat2 tophat2`) just return the relative
+path, which then gets passed to `dirname` and treated as relative from whatever
+directory you happen to be in. And let's ignore all those unquoted variable
+expansions I guess. Sigh. Better version:
+
+ #!/usr/bin/env bash
+ pbin=""
+ fl=$(readlink -f "$0")
+ if [[ -z "$fl" ]]; then
+ pbin=$(dirname -- "$0")
+ else
+ pbin=$(dirname -- "$fl")
+ fi
+ export PATH="$pbin:$PATH"
+ "$pbin"/tophat "$@"
+
+Once it's installed, I can finally run it. This is where things get even more
+confusing. After a whole bunch of flailing I think I managed to narrow down
+what the hell all these magic parameters are, and how things have changed since
+the paper was written. Notes:
+
+* Bowtie is an aligner — it aligns reads to a reference genome.
+* Tophat uses bowtie to align reads, but adds splice-awareness on top.
+* To use Bowtie to align, you need to preprocess your reference genome and
+ create an index of the genome.
+* Additionally/separately, you also need to create a *transcriptome* reference
+ genome for the splice awareness. You do this with tophat, which invokes
+ bowtie under the hood. This means you end up with *two* bowtie indexes:
+ a genome index and a transcriptome index.
+* To create the genome index: `bowtie2-build --threads 15 foo.fa some/prefix/foo`
+ * This creates a bunch of files in `some/prefix` named `foo.…`.
+ * Why does this take a basename and not just a directory name?
+ * Do the name of the FASTA and the name of the index files have to match?
+* To create the transcriptome index: `tophat2 -p 15 --GTF foo.gff --transcriptome-index=another/prefix/bar some/prefix/foo`
+ * There are three separate things specified here, each in a different way:
+ * The gene annotation file is provided with the `--GTF` option, as a path. (Even if it's a GFF file, not a GTF file. There's no `--GFF` option. Of course.)
+ * The genome index is provided with a positional parameter, as a basename.
+ * The destination for the transcriptome index is a basename, and it'll splat out a bunch of files with various suffixes there.
+* Note that the chromosome names in the `.gtf`/`.gff` file and the sequence
+ headers in the `.fa` file must match.
+ * This caused me much pain when I was trying to use the `.gtf` file (foolishly assuming `--GTF` required a `.gtf`), the inclusion of which seems to be a trap in the input data laid for the unwary student. I will not be fooled again.
+ * Tophat gave me a completely useless error (`"Couldn't build bowtie index with err = 1"`) when I tried this, rather than tell me anything useful that might have pointed me in the right direction. Extremely Good Software.
+* Some/all of the prefixes/basenames may or may not have to match. By the time
+ I flailed enough to get everything working I had renamed everything to
+ a single prefix, but I'm not sure if it was strictly necessary. I'll
+ investigate tomorrow or Thursday, when I'm less annoyed.
+* Once you've got the two indexes created, you can run the alignment:
+ `tophat2 -p 15 --transcriptome=another/prefix/bar --output-dir baz/ some/prefix/foo *.fq`
+
+That's as far as I got today. Next step is to clean up this mess and get it
+into the `Makefile`.
+
+## 2020-02-06
+
+Refactored the `Makefile` and added in all the alignment stuff. Wrote up the
+weekly report far too late.
+
+## 2020-02-08
+
+Decided to try using gnuplot to do the graphing exercises in the Campbell
+Biology textbook, to combine the things I've been reading lately.
+
+The first exercise I did was from chapter 9. Data:
+
+ Low 4.3
+ Normal 4.8
+ Elevated 8.7
+
+Gnuplot file:
+
+ set title "{/:Bold*1.5 Thyroid Hormome Levels and Oxygen Consumption}"
+ set xlabel "Thyroid Hormone Level"
+ set ylabel "Oxygen Consumption Rate [nmol O_2 / (min * mg cells)]"
+
+ unset key
+ set boxwidth 0.8
+
+ plot [][0:10] "bio/thyroid" u 0:2:xtic(1) w boxes fillstyle solid fc rgb "black"
+
+That was pretty simple. Then I went back to a previous chapter. Data:
+
+ # Time (min) Concentration of inorganic Phosphate (μmol/mL)
+ 0 0
+ 5 10
+ 10 90
+ 15 180
+ 20 270
+ 25 330
+ 30 355
+ 35 355
+ 40 355
+
+Gnuplot file:
+
+ set title "{/:Bold*1.3 Glucose-6-Phosphate Activity in Isolated Liver Cells}"
+ set xlabel "Time (minutes)"
+ set ylabel "Concentration of ℗_i (μmol/mL)"
+
+ set xtics format "%h min"
+
+ unset key
+
+ set grid xtics ytics
+
+ set lmargin 12
+ set rmargin 6
+
+ plot [0:45][0:400] "liver-glucose" w lines linewidth 2
+
+This one was a little trickier. I was trying to play around in the gnuplot
+REPL, but pasting the Unicode characters was producing garbage at the gnuplot
+command line. Pasting into a normal terminal/fish shell worked fine. After
+much flailing I tracked down the problem to gnuplot itself:
+
+* Gnuplot uses one of several different libraries to handle readline-like
+ editing: `libeditline`, GNU `readline`, and some others (I think).
+* The library it uses is specified as an argument to `./configure` and linked in
+ at compile time.
+* `libeditline` does not handle UTF-8 input correctly in this, the year of our
+ Lord, 2020.
+* GNU `readline` *does* handle UTF-8 input correctly.
+* Unfortunately, `readline` is GPL-licensed, which is incompatible with
+ gnuplot's license.
+* Debian (and Ubuntu) therefore link gnuplot with the broken `libeditline`,
+ rather than linking it to `libreadline` and violating the GPL.
+* As far as I can tell, there's no way to simply disable all line editing in
+ gnuplot so it can be wrapped with `rlwrap` or something.
+
+At this point I just gave up on using Unicode at the REPL and created `.gp`
+files that I then `load`. The GPL strikes again.
+
+## 2020-02-09
+
+Installed R Studio from the `.deb`. Needed to `apt install r-base` first to get
+the dependencies. It installed… a lot of stuff. Let's hope it hasn't wreaked
+too much havoc.
+
+Moved on to differential gene expression with `cuffdiff`. Finally starting to
+understand what the actual data files are. The input FASTQs are:
+
+ GSM794483_C1_R1_1.fq.gz
+ GSM794483_C1_R1_2.fq.gz
+ GSM794484_C1_R2_1.fq.gz
+ GSM794484_C1_R2_2.fq.gz
+ GSM794485_C1_R3_1.fq.gz
+ GSM794485_C1_R3_2.fq.gz
+ GSM794486_C2_R1_1.fq.gz
+ GSM794486_C2_R1_2.fq.gz
+ GSM794487_C2_R2_1.fq.gz
+ GSM794487_C2_R2_2.fq.gz
+ GSM794488_C2_R3_1.fq.gz
+ GSM794488_C2_R3_2.fq.gz
+
+The `C` here apparently stands for "condition". I still have no idea what
+`R[123]` is trying to tell me. The `_[12]` is the paired-end read pairs. Would
+it kill people to name things more descriptively and/or include a `README` that
+explains what all their one-letter abbreviations mean?
+
+Thinking about restructuring the filesystem layout to make the relationships
+between the FASTQ files more explicit:
+
+ C1/
+ R1/
+ GSM794483_C1_R1_1.fq.gz
+ GSM794483_C1_R1_2.fq.gz
+ R3/
+ GSM794484_C1_R2_1.fq.gz
+ GSM794484_C1_R2_2.fq.gz
+ R3/
+ GSM794485_C1_R3_1.fq.gz
+ GSM794485_C1_R3_2.fq.gz
+ C2/
+ R1/
+ GSM794486_C2_R1_1.fq.gz
+ GSM794486_C2_R1_2.fq.gz
+ R2/
+ GSM794487_C2_R2_1.fq.gz
+ GSM794487_C2_R2_2.fq.gz
+ R3/
+ GSM794488_C2_R3_1.fq.gz
+ GSM794488_C2_R3_2.fq.gz
+
+But I might be too lazy at this point to do it.
+
+The more I work with `make` the more I realize how limited it is. The general
+idea behind it (computing a dependency graph and using it to rebuild things) is
+sound, but the implementation is a clusterfuck of string processing and horrific
+syntax. It really gives me the urge to create Yet Another Build System. Sigh.
+
+Eventually `cuffdiff` finished. Time to move on. Tried to install cummerbund
+in R, but ran into a bunch of errors. Had to install numerous third-party
+dependencies outside of R:
+
+* `libcurl4-openssl-dev`
+* `libxml2-dev`
+* `libmariadbclient-dev`
+
+Why on earth does installing cummerbund require a goddamn **MySQL** client?!
+This is completely nuts.
+
+Ran into a giant pile of inscrutable errors even after installing all these
+bananas dependencies. Eventually tracked it down to the R version shipped in
+the Ubuntu package repository being too out of date to work with some of the
+required packages. I hate volatile software. Installed a newer version of
+R with [this guide](https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-18-04).
+
+Finally managed to get cummerbund working and started graphing stuff. It mostly
+works, but my graphs don't really look quite like theirs. Need to ask the
+professor about this.
+
+In the mean time, played around with using gnuplot to plot the data instead of
+R. Turned out to be not too bad, since cuffdiff already dumps its output in
+a textual format. Example of the plot comparing the expression levels of the
+various genes:
+
+ set logscale
+ unset key
+ plot "gene_exp.diff" using \
+ "value_1":"value_2" lt 3 pt 1 lw 1, \
+ x+(0.5*x) lt -1 lw 1 dt 2, \
+ x-(0.5*x) lt -1 lw 1 dt 2
+
+I also played around with adding some nicer default `linetype`s to my
+`~/.gnuplot` using colors from ColorBrewer. Fairly happy with the result so
+far, but I'll need to play around with them more over time to see if they hold
+up.
+
+## 2020-02-10
+
+Started making some actual gnuplot scripts to draw the stuff from yesterday.
+Got a couple of scripts working, but went down a horrific rabbit hole.
+
+First: the graphs produced by cummerbund are titled very poorly ("Genes" is not
+particularly helpful). I'm going to need to chat with the professor to try to
+decipher what these graphs are trying to show me.
+
+I reproduced the scatter and volcano plots without too much trouble. Used the
+PDF plotter and came up with the simple printable line styles that should be
+okay.
+
+Then I tried to reproduce the first graph, the "distribution of expression
+levels for each sample", and all hell broke loose. First of all, what is this
+graph even trying to show? The title is just "Genes", which is useless garbage.
+The y-axis is labelled "Density", the x-axis is pabelled `log_10(FPKM)`, and the
+function used to create it is `csDensity`, so maybe this is a kernel density
+plot of the `log10(FPKM)` values? Well, I tried that, and I ended up with
+fucking spaghetti in gnuplot — there were hundreds of lines all overlaying each
+other instead of the expected single line. I spent like an hour dicking around
+online trying to figure out what the hell gnuplot was doing. Eventually,
+*finally*, I realized no one online was going to help me, and decided to break
+things down myself. I extracted the problematic column from the file and
+computed the `log10` values myself in Lisp, and found an issue: some of them are
+0, which means their `log10` value is undefined. After removing those and
+replotting, I confirmed that those undefined values were what was causing all
+the discontinuities in the original gnuplot kernel density plot — instead of
+ignoring the values, it would start a new line every time it hit one. Christ.
+So then I cloned down the cummerbund repo to see how they were handling this.
+The code is… not particularly easy to read. I searched around more online, now
+that I realized the problem, and eventually found a post where someone notes
+that they add `1` to the FPKM values first, before taking the `log`, to avoid
+negative and undefined log values. So the x-axis label that says `log_10(FPKM)`
+is fucking lying — what's actually on the graph is `log_10(FPKM+1)`. Once
+I plugged that in, everything works and the graph looks roughly like theirs. To
+hell with all this, I'm done for tonight.
+
+## 2020-02-11
+
+Class today. Chatted about various stuff. The XLOC identifiers aren't actual
+identifiers, they're just arbitrary IDs generated during the alignment process.
+That explains why I was seeing `0`s when trying to plot the `XLOC_012662`, which
+was regucalcin in the paper — it's a different gene in my run, so I'd need to
+look up the gene in my own results. Also talked about P and Q values, and
+plotting each to find interesting cutoffs. I really need to refresh my memory
+on how to more easily make histograms in gnuplot.
+
+Started reading [A Curious Moon](https://bigmachine.io/products/a-curious-moon/)
+and got about 1/3 of the way through. An interesting approach to teaching SQL,
+though the writing voice is a little bit annoying. Still turning out to be
+a good book though.
+
+Finished the gnuplot book. It's really, really good.
+
+Finished my weekly report and homework for this week. Good to have a couple
+days of buffer. Figured out how to build the assignment zip file in
+a `Makefile` which was overkill, but "fun":
+
+ assignments/sjl7678-alignment.zip: $(ALIGNMENTS)
+ rm -rf assignments/sjl7678-alignment
+ mkdir -p assignments/sjl7678-alignment/
+ cd assignments/sjl7678-alignment && mkdir -p $(KEY_PREFIXES)
+ mcp 'data/02-alignment/*/align_summary.txt' 'assignments/sjl7678-alignment/#1/align_summary.txt'
+ mcp 'data/02-alignment/*/logs/tophat.log' 'assignments/sjl7678-alignment/#1/tophat.log'
+ cd assignments && zip -r sjl7678-alignment.zip sjl7678-alignment
+
+I'm slowly becoming more and more disillusioned of `make`. Even though I'm only
+using it for what it's actually designed for (running commands to turn files
+into new files) it's still a tremendous pain in the ass to do anything
+nontrivial. The "scripting" support is completely batshit string-based munging,
+and even the basic definition of any dependencies beyond explicit lists of files
+is miserable.
+
+## 2020-02-12
+
+Spent some (far too much) time getting the Trapnell data into Postgres, figuring
+out how to ergonomically plot it with gnuplot, and figuring out how to do decent
+histogram and CDF plots. I'm not thrilled with what I came up with, but it'll
+have to do for now because it's already 11 PM.
+
+## 2020-02-15
+
+Tracked down the root cause of https://github.com/dimitri/pgloader/issues/1005
+— need to write it up in a comment.
+
+## 2020-02-16
+
+Finished the `make` book. I'm becoming less and less enthusiastic about `make`
+the more I learn about it and see real-world examples of it.
+
+Wrote up the pgloader issue debugging comment.
+
+## 2020-02-18
+
+Class today was a lot of chatting and planning ahead for the group project.
+Things to think about:
+
+* Need to get a CSV or something together with the sample names and SRA IDs so
+ I can map the FASTQ files back to the actual conditions. Not sure if there's
+ a way to pull this metadata out of the raw data files or whether I have to
+ type it out by hand like some uncultured swine.
+* Need to figure out what the "transcript integrity number" from the paper
+ means.
+ * Professor wasn't sure.
+ * I think I found the paper that created it, so I just need to read it and hope it doesn't go over my head.
+* Need to think more about the filesystem layout for the project.
+ * Pipeline directories mostly worked in the individual one. Try that again?
+* Need to figure out how we're going to automate the project.
+ * Considered trying `redo` but I want it to run on the server, so that's probably out.
+ * Could use `make` again even though I'm more and more disenchanted with it every time I try to use it.
+ * Could use bash scripts but that seems miserable in different ways.
+* I'd really like to try to reproduce the paper with the exact versions of the
+ tools they used, as a baseline.
+ * This will likely be fucking miserable because modern software has a half life of a year.
+ * If I can manage to do that, then I could update the tools and see what changes.
+ * Then try different tools and see what changes.
+
+One more big question: need to figure out what the overrepresented sequences
+FastQC is complaining about are. I figured I'd do BLAST searches for them, so
+I wanted to get a list of all the overrepresented sequences. So I need to pull
+them out of the FastQC reports… which are HTML files… all on one line. Did some
+horrific `grep`ing to get what I wanted:
+
+ sjl at ouroboros in /dump/karkinos/fastqs/qc
+ ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' ' ' | cut -d' ' -f3 | sort | uniq -c
+ 2 CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
+ 2 CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
+ 2 CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
+ 2 CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
+ 2 CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
+ 2 CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG
+
+ sjl at ouroboros in /dump/karkinos/fastqs/qc
+ ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' ' ' | cut -d' ' -f3 | sort | uniq | xargs -ISEQ grep --files-with-match SEQ *.html | sort | uniq
+ SRR6671104.1_1_fastqc.html
+ SRR6671104.1_2_fastqc.html
+
+Interesting — only one of the samples has this issue, and the same sequences
+appear in both sides of the pair. Immediately after doing this horrible `grep`
+incantation I remembered the `.zip` produced by FastQC and sure enough, there's
+a text file in it with the results in a much less awful format. But I still
+needed a janky awk incantation:
+
+ ><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=1;i<=NF;i++) print FILENAME, $i }' */fastqc_data.txt
+ SRR6671104.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences warn
+ SRR6671104.1_1_fastqc/fastqc_data.txt #Sequence Count Percentage Possible Source
+ SRR6671104.1_1_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT 58594 0.18785731410785536 No Hit
+ SRR6671104.1_1_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC 39157 0.12554065004132323 No Hit
+ SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG 36886 0.11825963218388151 No Hit
+ SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT 34457 0.11047205297836592 No Hit
+ SRR6671104.1_1_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT 33273 0.10667604895229328 No Hit
+ SRR6671104.1_1_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT 31923 0.1023478349022949 No Hit
+ SRR6671104.1_1_fastqc/fastqc_data.txt
+ SRR6671104.1_2_fastqc/fastqc_data.txt >>Overrepresented sequences warn
+ SRR6671104.1_2_fastqc/fastqc_data.txt #Sequence Count Percentage Possible Source
+ SRR6671104.1_2_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT 58938 0.18896020716948458 No Hit
+ SRR6671104.1_2_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC 39362 0.12619789736002668 No Hit
+ SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG 36752 0.11783001686336315 No Hit
+ SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT 34782 0.11151403043484702 No Hit
+ SRR6671104.1_2_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT 34471 0.11051693816110664 No Hit
+ SRR6671104.1_2_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT 31900 0.10227409495922085 No Hit
+ SRR6671104.1_2_fastqc/fastqc_data.txt
+ SRR6671105.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences pass
+ SRR6671105.1_1_fastqc/fastqc_data.txt
+ …
+
+Some quick `cut`ing and such and I've got what I need, in a slightly less
+horrible way. I guess.
+
+ ><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=3;i<NF;i++) printf(">%s/%d %s\n", FILENAME, x++, $i) }' */fastqc_data.txt | cut -f1 | sort -k2 | uniq -f1 | tr ' ' '\n'
+ >SRR6671104.1_1_fastqc/fastqc_data.txt/0
+ CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
+ >SRR6671104.1_1_fastqc/fastqc_data.txt/5
+ CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
+ >SRR6671104.1_1_fastqc/fastqc_data.txt/1
+ CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
+ >SRR6671104.1_1_fastqc/fastqc_data.txt/4
+ CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
+ >SRR6671104.1_1_fastqc/fastqc_data.txt/3
+ CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
+ >SRR6671104.1_1_fastqc/fastqc_data.txt/2
+ CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG
+
+Did a BLAST search for these sequences and they're all 18S Ribosomal RNA.
+Professor wasn't sure how that could have gotten into the mix, since the paper
+says they only used poly(A) RNA. Need to ask around at work to see if anyone
+has any ideas what the heck happened.
+
+(This weirdness is only on sample C0. Unfortunately that's not the sample the
+paper identified as being low quality (C3) so that isn't the issue.)
+
+Did a bit more on the Enceladus book. Decided to just read the rest, because
+the DB/Makefile crap is getting way too messy. I really need a `sqlfmt` tool,
+but every time I try to use smug to write a parser I get depressed.
+
+Merged a GitHub PR into badwolf and tried to pull it down locally with hg-git,
+which fucking broke for the millionth time, with a lying error message blaming
+hg-prompt. Fucking fantastic. Updated hg-git (apparently there's a new repo on
+Heptapod since Atlassian is shitcanning BitBucket's Mercurial support) and now
+there are four heads on default. I have no idea which one I'm supposed to use.
+I picked the latest one arbitrarily and it seems to work, at least well enough
+to pull down the PR. Good enough.
+
+## 2020-02-19
+
+Finished the weekly report for class.
+
+Fixed the scroll overflow of the code blocks in this site.
+
+## 2020-02-22
+
+Put together some initial structure for the group project. Got the files
+unzipped and FastQC run, so tomorrow I can dive into new stuff.
+
+Tried to track down a human reference genome for use later. I *think* I got
+what I need, but the sites are… not very user friendly. Will confirm with the
+professor in class.
+
+Started reading one of Tufte's books. Found a beautiful graph from a 1979 paper
+(`10.1126/science.206.4421.987`) with a gorgeous plotter font. Need to try to
+find a similar font to use myself.
+
+## 2020-02-23
+
+Did another graphing exercise from Campbell Biology in gnuplot for fun.
+
+Figured out that `dashtype ''` in my style files was causing the *line* to show
+as solid, but the entry in the key was blank! Turns out `dashtype solid` is
+what I actually want. Also learned that I need to explicitly reset that when
+I want a solid line, otherwise the dashtypes I set in previous styles will carry
+over.
+
+Refined my color plotting style to use dashed lines instead of lighter colors
+for types 6-10. Seems easier to read that way — the light colors were *really*
+light.
+
+Figured out how to do linear regression in gnuplot, and made some helper
+functions to make it less annoying:
+
+ # lr = linear regression (the function to fit)
+ # lrt = linear regression title
+
+ baselrt(mm, bb, mprecision, bprecision) = sprintf(\
+ sprintf("%%.%dfx + %%.%df", mprecision, bprecision), \
+ mm, bb)
+
+ lr(x) = m * x + b
+ lrt(mp, bp) = baselrt(m, b, mp, bp)
+
+ lr2(x) = m2 * x + b2
+ lrt2(mp, bp) = baselrt(m2, b2, mp, bp)
+
+ fit lr(x) "plant-co2" using "CO2":"Corn" via m, b
+ fit lr2(x) "plant-co2" using "CO2":"Velvetleaf" via m2, b2
+
+ plot [300:1050] \
+ "plant-co2" u "CO2":"Corn" w lp, \
+ "plant-co2" u "CO2":"Velvetleaf" w lp, \
+ lr(x) t lrt(3,0) lt 6, \
+ lr2(x) t lrt2(3, 0) lt 7
+
+Looked into why our first few bases are called at lower quality. Found
+[this](http://seqanswers.com/forums/showthread.php?t=61936) which says that it's
+because Illumina uses the first few bases to calibrate their quality score
+analyzer, and so they mark them as being less confident to account for the
+calibration process. Signed up for that forum (over bare-ass HTTP, sigh) to get
+the damn tech note PDF too.
+
+Dug out the old AxiDraw to play around with
+<https://anvaka.github.io/city-roads/?q=rochester%2C%20ny&areaId=3600176069>.
+First problem: after installing the new AxiDraw version, connecting to the
+AxiDraw in Inkscape failed with `Failed to connect to AxiDraw`. Sigh. After
+much flailing, it turns out the problem was that something had `pip install`ed
+the `serial` module, which is some kind of serialization/deserialization
+library, and so `import serial` was importing that instead of pySerial, which is
+the serial port communication library that happens to use the module name
+`serial` too. Ugh. So `pip uninstall serial; pip install pySerial` fixed that.
+Then I tried to get the `Render > Hershey Text` thing working again, but it
+just… did nothing at all. Looks like there's a *second* Hershey Text plugin,
+this time under the AxiDraw extension menu. I guess I'm supposed to use that
+one now? Who knows?
+
+Played around with how to make a dot plot in gnuplot while the AxiDraw drew
+Rochester. First I grabbed the read counts for our various FASTQ files, pulled
+from the FastQC output files:
+
+ ffind _data.txt \
+ | xargs grep 'Total Sequences' \
+ | sed -Ee 's|./([^/]+)_fastqc/.*\t|\1 |' \
+ | sort --stable -k1,1 -r \
+ | sort -k2,2 -n \
+ > seqs.txt
+
+ cat seqs.txt | head -n2
+ C3_1 21081615
+ C3_2 21081615
+
+Then I can plot a dot plot with something like:
+
+ # https://twitter.com/stevelosh/status/1231434196213796864
+ set termoption font "Routed Gothic,12"
+
+ # I wonder if there's a way to capitalize stylistically
+ set title "READ COUNTS OF INDIVIDUAL FASTQ FILES"
+ set xlabel "READS (MILLIONS)"
+
+ # major x tics every 2 million, with 2 minor divisions per major (i.e. minor tics are every 1 million)
+ set xtics 2
+ set mxtics 2
+
+ set grid y nox
+
+ # left justify
+ set ytics left offset -2.2, 0.0
+
+ unset key
+
+ # scale to millions by hand
+ # read FASTQ names from file
+ plot [20:35][-1:26] "seqs.txt" using ($2 / 1000000.0):0:ytic(1) lc rgb 'black'
+
+And I get something like:
+
+![plot](https://i.imgur.com/YAPXHaQ.png)
+
+Neat!
+
+Hacked together some Awk to remove overrepresented sequences. **But** I don't
+think a simple `grep -v` approach works, because the two FASTQ files are
+expected to have the paired reads at the same positions in the file. So if we
+remove a read from one file but not the other, now all the reads are going to be
+offset. So we need to remove these reads a bit more carefully (really, we need
+tools that process the paired-end reads together). Need to think about this
+a little bit more.
+
+## 2020-02-24
+
+Figured out how to mount a directory on my home machine on my laptop with SSHFS,
+in advance for tomorrow in case I need it:
+
+ # mount
+ mkdir ~/foo
+ sshfs -o idmap=user sjl@server:/whatever ~/foo
+
+ # unmount
+ fusermount -u ~/foo
+
+Downloaded STAR and got it running. Read through the manual which, despite the
+broken English, was fairly useful. Tried running it on the gzipped reference
+genome but it complains that it requires the genome to be uncompressed. Tried
+to trick it into working by psubbing `zcat` but that didn't work. Gave up and
+decompressed the damn reference.
+
+Once the genome index was built, started aligning the reads. It failed right
+away complaining about a zero-length read. `zcat | grep`ing showed the read as
+being there in both files, *but* I bet this was an issue of the pairs getting
+out of sync because we grepped out the overrepresented sequences naively. Tried
+restarting the alignment on the original data and it's still going, so that's
+probably it. Need to figure out how to filter those bad reads properly tomorrow.
+
+## 2020-02-25
+
+Class. Chatting about QC and such.
+
+Professor says he asked around and people haven't seen the first 5 bases being
+lower quality before, but that the explanation from Illumina makes sense, and
+that we should *not* trim those bases unless they seem to be causing issues.
+Also talked about how to filter out the overrepresented sequences — he thought
+fastx had something for this, but I can't seem to find anything in the
+documentation. I may need to write some code.
+
+Professor installed `pigz` on the server, so I can remove my hacky `xargs`
+workaround.
+
+Need to find a way to filter bad sequences from BOTH files at once.
+
+Need to align to the NCBI reference GTF instead of the other weird one.
+
+## 2020-02-26
+
+Tried rebuilding the genome index with the other GTF file. Initially got:
+
+ Feb 26 21:00:45 ..... inserting junctions into the genome indices
+ terminate called after throwing an instance of 'std::bad_alloc'
+ what(): std::bad_alloc
+ ./src/03-index-genome.sh: line 19: 14902 Aborted (core dumped) STAR --runThreadN $CORES --runMode genomeGenerate --genomeDir data/03-genome-index …
+
+That seems… bad. But perhaps this is because I didn't have enough RAM, because
+I still had my Lisp process from 2 days ago open consuming ~25 GB of RAM.
+I killed that so I've got about 60 GB now. Let's see if that helps. Otherwise
+it sounds like I may need to look at [this](https://github.com/alexdobin/STAR/issues/103).
+
+Still need to find that other paper my partner found from the same people.
+
+Submitted weekly report.
+
+# March 2020
+
+## 2020-03-16
+
+Installed TrimGalore. Had to install `cutadapt` first, which is some Python
+package. Luckily it's `apt install`able so I did that, then grabbed TrimGalore
+which is blessedly self-contained.
+
+Installed pardre. The "release" is just a source dump of C++ files with no
+`README`, cool. Had to install `mpi-default-dev` to get `make` to work. Gross.
+
+Used ParDRe to dedupe a couple of the samples as a test. It took 927 seconds to
+finish the first pair, where I used the `.fastq.gz` files as input and the `-z`
+option to enable compression. I noticed a bunch of the time (the last 700
+seconds or so) was spent bottlenecked on a single CPU core while writing the
+output. So then I tried the next sample without compression, and it took 309
+seconds. Welp.
+
+Ran FastQC on the first two sample results. For the most part FastQC is happy
+now, but there *are* still some duplicated sequences left. I'm guessing this is
+because ParDRe is paired-end aware, and will only remove a pair if *the entire
+thing* is a duplicate, while FastQC works only on a per-fastq basis. I don't
+know this for sure though.
+
+I did the initial run with 0 mismatches allowed. We should also try with
+N mismatches, though note that this will not work perfectly because ParDRe
+clusters reads by their prefix, so reads with `<=N` mismatches in their prefix
+(and none in the suffix) will not be caught because they'll never get compared.
+Welp.
+
+Cleaned up the scripts to run all the ParDRe runs (0/1/2 mismatches for every
+pair) and FastQC them. Seems to take roughly 5 minutes per pair to do ParDRe.
+We have 13 pairs, so it should take around an hour per mismatch setting (i.e.
+3 hours) to do them all.
+
+ParDRe finished. Results look reasonable. There's not a *ton* of difference
+between 0 and 1 mismatch (there is *some* though). The difference between 1 and
+2 is minimal. Next step is to trim low-quality stuff with trim galore. Gonna
+do that during "class" tomorrow and let it run while I'm at work.
+
+## 2020-03-17
+
+Started sketching out some trimming scripts using Trim Galore. It has a bunch
+of options — I'm not really sure how to go about deciding what the correct
+values for each are, other than… try some and see if the resulting FastQC run
+seems reasonable?
+
+Attempted to do something foolish: update software. I wanted to fix the StumpWM
+window/frame focus thing, but I figured I would update StumpWM and SBCL and my
+dotfiles first. Clearly, this was a mistake. I updated all of them, added the
+fix, and restarted StumpWM, and everything went to shit. StumpWM suddenly
+started chewing up tons of a CPU core, lagging to hell, and Firefox and Chrome
+wouldn't even open. Tried reverting back to the previous versions but it didn't
+work. Eventually I SSH'ed into the machine and rebooted it, and *then*
+everything started working again. I think. We'll see. Going to give up on
+updating for today — wasted enough time dicking around with this tonight.
+I hate computers.
+
+## 2020-03-18
+
+Debugged a gnuplot issue a bit in IRC with someone. I think I'm losing my mind.
+
+Submitted weekly report for class.
+
+## 2020-03-19
+
+Met with the professor for office hours to chat about the deduplication/trimming
+results.
+
+For deduplication, we first talked about what might cause the actual problem
+it's trying to solve. A cell may have many duplicate copies of a particular
+mRNA floating around if the gene is highly expressed — this is exactly what
+RNAseq is trying to measure. However, once those mRNAs are fragmented, it's
+extremely unlikely that they would fragment *exactly* in the same place, and so
+when you sequence the resulting fragments they should be offset from each other.
+If you see exact duplicate reads, it means that the *fragment* was somehow
+duplicated, which is probably an artifact of your laboratory process and not
+biologically significant.
+
+Next: mismatches. ParDRe supports removing duplicates with up to N mismatches,
+so we talked about why we might want to do this. If two reads match with `N
+> 0` mismatches it indicates one of the following:
+
+* It's the same problem as the exact duplicates, but with an extra error
+ accumulated during the process (e.g. a sequencing read error or a PCR
+ duplication problem).
+* It's from two genes that are very biologically similar, and so is genuine
+ useful information. However, this is still unlikely because the fragments
+ would likely be shifted as described above.
+
+At the end he recommended deduplicating with 1 mismatch.
+
+Another complication: ParDRe can operate in paired-end mode, which means that in
+some cases FastQC still finds some duplication. This happens when one side of
+the read is an exact duplicate while the other is not. ParDRe does not remove
+this (because of the non-duplicate side) but FastQC only looks at the individual
+FASTQs. He said this is not a problem, and that ParDRe is doing what we want.
+
+Also chatted a bit about trimming. We mostly talked about the issue mentioned
+in the Trim Galore documentation: because it trims *any* matching adapter
+sequences, it will *always* trim off a particular base at the end of a read
+(because that one base always matches the adapter sequence's last base). This
+is not ideal. He recommended running Trim Galore and requiring at least adapter
+5 bases to match before trimming to deal with that issue.
+
+Plan for today: rerun the trimming with the 5 base requirement and using the
+1 mismatch deduplication data as the source, and then kick off an alignment if
+I still have time (unlikely).
+
+Neomutt has suddenly started shitting the bed and failing to quite with `Unable
+to write` errors. I *hate* software. Tried updating to the latest `master` and
+rebuilding. I guess we'll see if it fixes things.
+
+Trimming finished. It looks like increasing the minimum adapter length to
+5 fixed the base imbalance issue. Good.
+
+Decided to rebuild the STAR index I had made because I can't remember whether
+I had already rebuilt it with the different reference file. Shouldn't take too
+long.
+
+Index rebuild finished in ~35 minutes. Kicked off the 26 alignment runs. That
+should take a lot longer.
+
+Checked in on the STAR progress. It's actually only taking ~5-10 minutes to run
+each alignment. That's not nearly as bad as I expected.
+
+Initial alignments finished. Started digging into what I can actually *do* with
+the results. I found the Integrative Genomics Viewer (IGV) that seems
+promising. Downloaded it and loaded the genome into it, then tried to load the
+alignments but apparently it wants BAMs, not SAMs, and additionally it wants
+indexes of the BAMs.
+
+I looked into how to convert the SAMs to BAMs+indexes. Went down a bit of an
+unnecessary rabbit hole, but it was still productive. My initial attempt was
+using `samtools` to do a bunch of conversion:
+
+ STAR … # produces Aligned.out.sam
+
+ samtools view -S -b Aligned.out.sam > sample.unsorted.bam
+ samtools sort sample.unsorted.bam > sample.bam
+ samtools index sample.bam
+
+This worked, but took a while and wrote a bunch of intermediate files I don't
+really need. Eventually I realized that STAR can produce sorted BAM files
+itself, so all I need to do with `samtools` is the final indexing. Learned
+a bunch about running STAR too. One nice way to save time across all the
+alignments is to have STAR load the genome index into shared memory once at the
+beginning and use it for all the alignments, the flush it out at the end:
+
+ function cleanup {
+ STAR --genomeDir "${genome}" --genomeLoad Remove
+ }
+
+ trap cleanup EXIT
+
+ STAR --genomeDir "${genome}" --genomeLoad LoadAndExit
+
+Then the STAR invocations look like:
+
+ STAR \
+ --runMode alignReads \
+ --runThreadN $CORES \
+ --genomeDir "${genome}" \
+ --genomeLoad LoadAndKeep \
+ --limitBAMsortRAM "${sortram}" \
+ --outSAMtype BAM SortedByCoordinate \
+ --outBAMcompression 0 \
+ --outBAMsortingThreadN $CORES \
+ --outFileNamePrefix "${outdir}/" \
+ --readFilesIn "${in1}" "${in2}"
+
+I tried putting the temporary directory on `/dev/shm` but it ended up being too
+much with the persistent genome index also in RAM. Unfortunately my order from
+Crucial is still backordered. Oh well, I'll just watch more TNG while I wait
+for all the alignments to complete.
+
+## 2020-03-31
+
+Ran the Trapnell plots from the paper on our data. Results seem at least
+plausible on the surface. Pinged my partner to see if his plots match.
+
+Attempted to install RSeQC to do the TIN score stuff mentioned in the paper.
+Ran `pip3 install RSeQC` as described in their docs, but that failed with:
+
+ ValueError: no cython installed, but can not find pysam/libchtslib.c.Make sure that cython is installed when building from the repository
+
+After some searching online I found that this can be fixed by upgrading pip with
+`pip3 install --upgrade pip`. It's… fine, I guess?
+
+Once I got it installed, I tried running `tin.py` to produce the TIN scores.
+The script wants the BAM (which I have) and a "BED" file. BED apparently stands
+for Browser Extensible Data and is a TSV file with lines like this:
+
+ chr1 67092175 67109072 XM_011541469.1 0 - 67093004 67103382 0 5 1429,187,70,145,44, 0,3059,4076,11062,16853,
+ chr1 67092175 67131183 XM_011541467.1 0 - 67093004 67127240 0 9 1429,187,70,106,68,113,158,92,42, 0,3059,4076,11062,19401,23176,33576,34990,38966,
+ chr1 67092175 67131227 XM_017001276.1 0 - 67093004 67127240 0 9 1429,187,70,145,68,113,158,92,86, 0,3059,4076,11062,19401,23176,33576,34990,38966,
+
+The collections of stuff I downloaded from UCSC didn't include one of these, but
+the RSeQC site has one available for download. Actually, it has several BED
+files. I'm not sure which one is correct. I downloaded `hg38_RefSeq.bed`
+because it sounds the most like the GTF I used (`hg38.ncbiRefSeq.gtf`). Was
+that correct? Who knows?
+
+Now that I have that, I'm doing an initial run with:
+
+ tin.py --input data/05-alignment/clean/C0/C0.bam -r data/00-raw/hg38_RefSeq.bed
+
+It seems like it's going to take a long time. Guess I'll wait.
+
+The goddamn thing didn't output an ending timestamp. Come *on*. It took
+somewhere between 2-5 hours, but I have no idea how long because I wasn't
+watching it when it finished. I guess I'll do another run with a `time` prefix.
+Sigh.
+
+# April 2020
+
+## 2020-04-04
+
+Spent an hour wrangling gnuplot for class to try to make a multiplot of the TIN
+scores. Mostly it was just a bunch of fiddly stuff to try to make the graph
+look exactly like I wanted. Some of the key points follow.
+
+Basic multiplotting happens with:
+
+ set multiplot \
+ title "TIN Score Distributions by Sample" font ",14" \
+ layout 3,5
+
+ …
+
+ unset multiplot
+
+Looped through the samples with this:
+
+ do for[c=0:9] {
+ set title "C" . c
+ bad = (c == 3)
+ plot [][0:1] "data/05-alignment/tin/C" . c . ".tin.xls" using 5:(1) smooth cnorm notitle lw 3 \
+ lc (bad ? 3 : '#000000') \
+ lt (bad ? 6 : 1)
+ }
+
+ do for[n=1:3] {
+ set title "N" . n
+ bad = (n == 2 || n == 3)
+ plot [][0:1] "data/05-alignment/tin/N" . n . ".tin.xls" using 5:(1) smooth cnorm notitle lw 3 \
+ lc (bad ? 3 : '#000000') \
+ lt (bad ? 6 : 1)
+ }
+
+Note the `smooth cnorm` requiring the columns to be `using
+data-column:literal-value`. This tripped me up for a while — I was doing `using
+0:data-column` and that was fucking things up completely. I realized the
+problem when I noticed there wasn't a peak at zero, and also that sorting
+changed the results (which it shouldn't). `smooth cnormal` needs exactly one
+column of the value, plus a weight value for the point. Remember this!
+
+You can skip around the plot with `set multiplot next` and `set multiplot prev`.
+Handy for adjusting the layout.
+
+Instead of trying to draw labels on every graph, I decided to try putting a key
+at the bottom right. This was very fiddly:
+
+ set multiplot next
+
+ $empty << EOF
+ EOF
+
+ set termoption fontscale "0.1"
+ set title "{/*0.9 Key}"
+
+ set ylabel "{/*0.9 CDF}"
+ set xlabel "{/*0.9 TIN Score}"
+ set format x "{/*0.9 %g}"
+ set format y "{/*0.9 %.2f}"
+
+ set origin 0.75,0.0
+ set size 0.25,0.33
+
+ plot [0:100][0:1] $empty using 1:(1) smooth cnorm notitle
+
+Note the `%g` for "humanized" formatting. I'm going to remember the `g` as
+standing for `god dammit, what is the magic character?`.
+
+## 2020-04-11
+
+Merged some PRs for `t` and did some basic test maintenance. I still love
+`cram` for CLI tests — it's such an intuitive, usable way to test things.
+
+# July 2020
+
+## 2020-07-18
+
+More work on Jarl.
+
+Started going through The Data Compression Book, doing the examples in Common
+Lisp. First code was a bit stream wrapper. I initially did
+a `trivial-gray-streams` wrapper but then refactored it into simple structs,
+since it's going to be writing a lot of bits and performance probably matters.
+Maybe.
+
+## 2020-07-19
+
+Did some server upgrades I've put off too long. Need to script up some stuff
+for automagically doing this in the future. Might be a good excuse to learn
+`expect`?
+
+PR'ed Quicklisp to move my projects from Bitbucket/Github to my self-hosted
+setup. At least now I own my availability and won't have to do this bullshit
+again.
+
+Imported a couple of contributed patches to `hg-prompt`. I'm glad people still
+find it useful. Also took the time to port the documentation to `d` because it
+was using some Python wiki bullshit thing before.
+
+Did some more work on The Data Compression Book. Wrote up a little `main`
+skeleton for the binaries. Might use this as an excuse to try out subtask-style
+interfaces in Adopt. I think it'll work, but I haven't actually tried it yet.
+
+## 2020-07-26
+
+Finally finished reading ULSAH. Took a few months, but aside from skimming over
+20% that wasn't super relevant it was as good as I remember back from 2008.
+Lots of things I need to look deeper into and try out on my own:
+
+* `strace` and friends.
+* Performance monitoring stuff: `sar`, `vmstat`, `iostat`, etc.
+* VMs using KVM and Qemu instead of Virtualbox.
+* Learn more about `systemd` (unfortunately).
+* Poke at `btrfs` a bit more using my existing setup.
+* Look into Ansible as a configuration management system — it seems the simplest
+ of the Puppet/Chef/Salt/Ansible crew.
+* Building machine images with Packer.
+* Look into Terraform for deployment.
+* Restructure local wifi into main/guest networks.
+
+## 2020-07-28
+
+Did a bit more work on Jarl. It's actually coming together. Still got a bunch
+of things on the TODO list but I poked at the GitHub API with it a bit and it
+already works surprisingly smoothly. Neat.
+
+# August 2020
+
+## 2020-08-27
+
+Fixed Chancery to work around a bug in CCL's `RANDOM`. In the process I had to
+shave another yak and rotate my SSH keys so I could push to GitHub from my
+laptop. Ended up going with `ed25519` keys this time.
+
+## 2020-08-28
+
+Took a stab at having Travis run Lisp tests. After flailing at it for a while,
+there's just too much jank. I'll stick to running them locally.
+
+Also tried installing Clasp, but the process is too bonkers. If they ever
+provide binaries I'll run my tests on it, but the current install process is
+just way too much work to be worth my time.
+
+## 2020-08-28
+
+A bunch more work on Jarl. Wrapped up a couple of loose ends. Ran into some
+more MOP craziness that I *think* I've solved, but am not 100% sure.
+
+## 2020-08-30
+
+Threw together a little `weather` CLI because trying to use any weather
+website/app is just maddening. All I want is the temp/rain chance for the next
+few hours to know if I can go sit in a park… why is this so hard to find?
+Learned a bit about ZIP codes in the process of making it.
+
+# September 2020
+
+## 2020-09-19
+
+First day of PTO. My Pine64 arrived, so I downloaded their installer/MicroSD
+flasher thing and put Ubuntu on it for now, at least to make sure it's all
+working. Worked pretty well (aside from very slow download speeds) but the
+thing seems to work!
+
+Wanted to watch some stuff on my projector, but for some reason VLC would
+*always* display on the main monitor when fullscreen was activated, no matter
+where the window was on the screen beforehand and no matter whether I set the
+(multiple!) settings in the preferences to the other screen. Absolutely
+*maddening*, especially because this worked fine a week ago. Turns out the
+solution was to run `refresh-heads` in StumpWM. Christ. I hate computers.
+
+## 2020-09-27
+
+TIL about `ncdu`. It's like a text-mode DaisyDisk, which is something I've
+wanted for a long time. Nice!
+
+# October 2020
+
+## 2020-10-01
+
+New Thinkpad finally arrived. Time to shave the computer setup yak once again.
+
+Figured out how to get into the BIOS and disable the GUI BIOS. Disabled
+Trackpad, etc. Can't seem to figure out how to disable hyperthreading yet.
+
+Downloading an Ubuntu 20.04 `iso`. Since the machine is so new I figure I'll
+need a relatively new release. Documentation says to open the "startup disk
+creator" in Ubuntu's gnome search thing but since I don't use that, some
+googling around shows it's actually called `usb-creator-gtk`. Selected my 4GB
+USB drive and the 2.6 GB ISO and pressed the start button, and got an error
+dialog saying "Could not write the disk image … to the device …." *Fantastic*
+start. Tried another USB stick, same thing. Eventually something clicked in my
+brain and I realized it probably wanted me to `sudo`. Restarted the creator
+with `sudo` and it worked. You couldn't have fucking *told* me that in the
+error message? Christ.
+
+Plugged in the USB stick and rebooted. Went straight to Windows setup. Killed
+power and rebooted and used `F12` to get to the boot menu. Ubuntu started, ran
+a drive scan, found errors in a file. Christ.
+
+Eventually the install finished. Booted into Ubuntu at a tiny resolution.
+Wanted to increase the resolution but the settings menu isn't tall enough to see
+the display settings I guess. Christ.
+
+Dove into a giant rabbit hole to update my kernel. Once that was finally done
+my resolution was better.
+
+Generated an SSH key and bootstrapped it by typing in the pubkey manually into
+an `authorized_keys` file because I don't want to install Dropbox on this
+machine now that Dropbox is such a horrible mess. Typed in my
+`hg.stevelosh.com` SSH config by hand, and was finally able to clone my
+dotfiles. Immediately failed because Git subrepos are disabled by default in
+Mercurial for security reasons, and you have to enable them in your `hgrc` which
+is in the repo I'm trying to clone. Kill me. Created a simple `hgrc` to enable
+them. It worked until one of the Github repo clones randomly hung. Christ.
+Ran `hg up -C` to unfuck it.
+
+Installed an old SBCL through `apt`, cloned and rebuilt a new version, installed
+Quicklisp, cloned and built Stump.
+
+Got through the hellscape of `hg-git` installation by pinning versions that work
+on my desktop. Some day this isn't going to work and I'll have to unravel this
+hairball or abandon Mercurial, but that day is not today.
+
+Building `st` on the new machine. The `README` only mentions "the X header
+files" as a dependency, which is a lie. You actually need `libx11-dev`,
+`libxft-dev`, and `libxext-dev`.
+
+## 2020-10-06
+
+Going back through the ARM book and doing some of the exercises. To help me
+check my answers, I extended CACL to support reading characters, added
+`char-code` and `code-char` commands, added `hex` and `bits` commands.
+
+Also started going through BIAL again, because now we're doing it for the book
+club at work. Maybe I'll actually finish this time.
+
+I keep going back and forth on whether it's worth making a special sequence
+class for DNA/RNA sequences while doing Rosalind. I poke around at it every
+time, but I don't think I have the time to truly do it right (yet), so for now
+I'm going to stick with vanilla strings.
+
+## 2020-10-07
+
+Finished chapter 1 of BIAL again. Took a little of my old code (mainly
+utilities) but mostly just wrote it from scratch. It was a good exercise.
+I feel a little rusty at Lisp though…
+
+Sending emails with `msmtp` from Fastmail suddenly broke with:
+
+ msmtp: TLS certificate verification failed: the certificate fingerprint does not match fastmail
+
+I guess Fastmail updated their certs. The fingerprints are at
+<https://www.fastmail.com/help/technical/certs.html>.
+
+## 2020-10-22
+
+Continued setting up the new laptop. Was going to set up mutt for mail, but one
+of my settings files is still in Dropbox and I was hoping to avoid installing
+Dropbox on this machine since Dropbox is a pile of garbage now. Maybe it's time
+to shave that yak and set up some kind of `rsync` based syncing solution for the
+bulk of my files.
+
+## 2020-10-26
+
+Finally got around to merging some `hg-prompt` patches. They add Python
+3 compatibility, but break Python 2 compatibility because it used `bytes(foo,
+'utf-8')` which breaks with an inscrutable error in Python 2:
+
+ Python 2.7.17 (default, Sep 30 2020, 13:38:04)
+ [GCC 7.5.0] on linux2
+ Type "help", "copyright", "credits" or "license" for more information.
+ >>> bytes('foo', 'utf-8')
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ TypeError: str() takes at most 1 argument (2 given)
+
+I'm calling `bytes`, not `str`, what the fuck are you talking about, Python?
+Eventually I found `bytes(foo.encode('utf-8'))` which will work in both. I'm
+glad I don't use this language any more.
+
+# November 2020
+
+## 2020-11-02
+
+Had to update SBCL today and wanted to parallelize the build to make it run
+a bit faster. I remember Rainer Joswig tweeting about this but can't find the
+tweet, Googling for `SBCL parallel` is useless, there's no documentation of the
+magic env vars, and I can't find them by scanning `make-config.sh` or `make.sh`
+so I eventually gave up and asked in `#sbcl`. The magical incantation to
+parallelize SBCL with multiple jobs is:
+
+ SBCL_MAKE_PARALLEL=12 SBCL_MAKE_JOBS=-j12 ./make.sh
+
+Need to investigate <https://github.com/troglobit/xplugd> for dealing with
+keyboard bullshit in Linux. Looks promising.
+
+## 2020-11-08
+
+Getting `pass` set up on the new machine. Steps, for the next time I need to do
+this:
+
+ git clone … ~/.password-store
+ apt install pass scdaemon
+ switch-yubikeys
+ gpg --import gpg-pubkey-main.txt
+
+ pass show …
--- a/README.markdown Sun Nov 08 23:08:49 2020 -0500
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,1576 +0,0 @@
-[TOC]
-
-# January 2020
-
-## 2020-01-03
-
-Converted `paste.stevelosh.com` to HTTPS. Forgot it in the previous conversion.
-
-## 2020-01-14
-
-Moved all my repositories from Source Hut to `hg.stevelosh.com`. Wasn't
-expecting to have to shave this yak so soon, but such is life. On the plus
-side: pushing is *so much faster* now. Rough timing for an empty push was ~5
-seconds on Source Hut, down to around a second now. No more sitting and
-staring at the command line waiting to get back to work.
-
-## 2020-01-15
-
-Finished (mostly) reskinning `hg.stevelosh.com`. Had to shave a couple of yaks
-in Mercurial itself (diff headers getting truncated) and the hgweb Markdown
-extension. Ended up just completely forking the extension and tearing out
-almost all of its guts. The result should (hopefully) be easier to maintain as
-Mercurial breaks internal shit over time.
-
-Still need to finish the CSS for the rendered Markdown, but that can wait for
-now.
-
-## 2020-01-17
-
-Got the RIT VPN set up on my Linux machine so I can connect for class. I was
-worried I'd have to install some horrible Cisco VPN client thing, but it turns
-out you can use the `openconnect` client to connect, so `sudo apt install
-openconnect` and then `sudo openconnect vpn.rit.edu` just magically works.
-Fantastic.
-
-Finished the `hg.stevelosh.com` reskin on my lunch break. Just had to get the
-`README`s looking decent.
-
-Finally figured out a strategy for syncing a small subset of my dotfiles onto
-remote servers easily, so I don't have to be completely miserable on them.
-Goals:
-
-* Needs to provide a reasonable subset of the stuff I use to get around.
-* Needs to handle servers with both bash and fish.
-* Needs to be easily syncable with a single command and no extra setup.
-* Needs to handle servers that don't have git or Mercurial.
-
-Ended up making a `remote/` folder in my dotfiles. Almost everything inside it
-is a symlink except for a `bootstrap.sh`. The folder gets `rsync --copy-links`ed
-over and then `bootstrap.sh` will symlink all the resulting files into their
-places. Hopefully this should be easier to maintain. Did a bunch of cleanup on
-my bash and fish setups while I was at it.
-
-Still need to figure out how to get a decent Vim setup remotely. Syncing all my
-plugins is probably not ideal, but maybe. Who knows.
-
-## 2020-01-18
-
-KA biology and coffee.
-
-Cellular respiration:
-
-* Starts with glucose (C₆H₁₂O₆).
-* The process is called glycolysis, and starts in the cytosol.
-* Glycolysis split the 6-carbon glucose into two 3-carbon molecules (pyruvate).
-* In the process we produce 2 (net) ATP's, and reduce 2 NAD⁺ molecules to 2 NADH
- (note the move from a positive charge to negative, i.e. gaining electrons,
- i.e. RIG (reduction is gaining)).
-* Some organisms stop here and use the pyruvates for fermentation.
-* Otherwise, the carboxyl group is stripped from each pyruvate and released as
- CO₂.
-* The rest of the pyruvate (essentially an acetyl group) latches onto "coenzyme
- A" (aka "CoA"), forming acetyl-CoA.
-* The enzyme transfers the acetyl group to an oxaloacetic acid, which forms
- citric acid.
-* Next step is the "Krebs cycle"/"citric acid cycle", which is… complicated, and
- happens in the matrix of the mitochondria.
-* As the citric acid cycle progresses we keep reducing to NADH.
-* The ATPs produced in cellular respiration can be used directly.
-* The reduced NADHs (and other products like QH₂) are used to create a proton
- gradient in the mitochondria, which is then used to produce more ATP (each
- NADH results in ~2-3 ATPs). This happens in the crista of the mitochondria.
-* Whole result is ~27-38 ATPs (typically 29-30 in most cells).
-* Review: the four steps of cellular respiration:
- 1. Glycolysis: splits glucose into two pyruvates, produces ATP and reduces NAD⁺ to NADH. Happens in the cytosol.
- 2. Pyruvate Oxidation: each pyruvate is stripped of CO₂, binds to CoA, and generates more NADH.
- 3. Citric Acid/Krebs Cycle: each acetyl-CoA goes through a cycle and produces ATP, NADH, and FADH₂.
- 4. Oxidative Phosphorylation: the NADH and FADH₂ are used to make a proton gradient, and that is used to make more ATP.
-
-Did the `TRIE` problem in Rosalind. Making an actual trie was easy, but getting
-an adjacency list for the flattened/copypasteable output and using `cl-dot` to
-graph it nicely was fiddly. Annoying bits about `cl-dot` to remember for next
-time:
-
-* There's no `:shape :square`, it's `:shape :box`.
-* `:fillcolor` does nothing without `:style :filled`.
-* The solution to "edge labels are too close to the edges" is apparently "add
- spaces to your label string". Ugh.
-* To change the size of a node, use `:height` and `:width`, not `:size`. They
- will be "helpfully" be ignored if you have a label that's too big to fit.
-* `cl-dot` provides implicit labels, so if you want no label you need `:label ""`.
-
-Cleanup up the styling of `hg.stevelosh.com` a bit more. This time it was the
-`<code>` tags inside `<li>`'s that weren't styled. I styled `<p><code>` before,
-because I wanted to *avoid* applying those styles to `<pre><code>` content
-(otherwise it'd have double borders all over the place). But that missed out on
-the `<code>` tags that weren't inside paragraphs. Ended up just styling *all*
-`<code>`s and then undoing the borders in `<pre><code>`. Sigh.
-
-Also realized that the repository descriptions served by hgweb can contain
-arbitrary HTML. Neat. Used that to clean a few more things up.
-
-Updated all the Bitbucket repository `README`s to point to `hg.stevelosh.com`.
-The Great Source Hut Yak Shaving of 2020 is finally complete, thank christ.
-Time to move on to doing actual work.
-
-## 2020-01-19
-
-Finally got around to setting up packages for all of my Rosalind problems. In
-all my "coding challenge" repos (Rosalind, Project Euler, Advent of Code, etc)
-I originally just shoved everything into one big package. This was easy to
-start with, but as I solved more and more problems I started to hit naming
-conflicts more and more often. Eventually I realized that for my own sanity
-I should really give each problem its own package, with all the utilities in
-a `utils` package. I did this for my Advent of Code repo a while back and while
-it was a pain in the ass to refactor everything, the result is much nicer.
-Today I refactored the Rosalind repo. The Project Euler repo is a work in
-progress, mostly because it's so much bigger than the other two.
-
-Did Rosalind problem `PDST`. Pretty easy. Most annoying part was the output,
-but I extended my float printer to handle 2D arrays to make it easier to output
-what they want in the future.
-
-## 2020-01-20
-
-Wanted to do a Rosalind problem over lunch. On a whim I decided to use CCL just
-to keep myself honest — I usually use SBCL, so using another implementation now
-and then helps ensure I'm writing standards-compliant code. But apparently
-a recent Vlime commit broke Vlime in CCL. So I had to shave *that* yak first.
-Sent a PR.
-
-Did Rosalind problem `ASPC`. The problem itself was trivial. Made a handy
-iterate driver to make it even more trivial.
-
-Running the test suite in CCL gave me exactly what I wanted: examples of where
-I was unintentionally relying on SBCL's behavior. Today there were two:
-
-1. Relying on the traversal order of iterate's `:in-hashtable` (which uses
- `with-hash-table-iterator`) to get the output of `TRIE`. Made a helper to
- convert the hash tables to sorted alists to fix the output.
-2. Relying on the `~,…F` format directive having a half-up rounding strategy.
- Turns out that when format is rounding a float to a given number of decimal
- places, the implementation is allowed to round half however it wants. SBCL
- seems to round half up (which matches what Rosalind shows in their examples,
- which is why I never noticed) but CCL seems to use banker's rounding. Had to
- do an ugly hack in `float-string` to make the output consistently match
- Rosalind's style everywhere.
-
-## 2020-01-22
-
-Started doing Rosalind problems in shell/Awk, to join in with a group of folks
-at work doing them for the first time. The first few problems are simple, but
-I still managed to learn a few tricks by implementing them with UNIXy tools.
-
-Fixed a few remaining broken links on my site. It's been a week or two since
-I poked around at it, and that was on another machine, but `./deploy.sh` Just
-Worked. It's so nice to have a site generator I designed myself that's not
-going to change out from under me when I just want to fix something simple and
-move on with my life.
-
-## 2020-01-23
-
-Lesk book exercises, chapter 1:
-
-1. Average density would be `(/ 3r4 3r9) = 1/100000` or 1 gene every ~100kb.
-2. Parts:
- * A: Two humans would have roughly 250k differences.
- * B: A human and a chimpanzee would have roughly 3m differences.
-3. Parts:
- * A: Average density would be 2 SNPs per 1000 base pairs.
- * B: Is this a trick question? I think 1.1% of the differences would be in protein-coding regions.
-4. Parts:
- * A: I am confused. The glossary defines a "haplotype" as a group of closely-linked genes that are typically inherited together. So… 1 haplotype? But the *text* talks about a haplotype as being a combination of SNPs in a recombination-poor region. So if "haplotype" means the combination of SNPs, not a set of genes, then I think this would be `4^10 = 1048576`.
- * B: On a diploid chromosome, you'd have two separate sets of SNPs to combine (and order doesn't matter), so I think it's `4^10 * 4^10 / 2`.
- * C: `(1 SNP / 5 kb) * (100 kb) = 20 SNPs = 2^20 sequences = 1048576 sequences`
-5. TODO
-6. TODO
-7. TODO
-8. A much less drastic change, though it could still cause issues if it's at
- a critical point in the RNA molecule (e.g. the anticodon in tRNA).
-9. TODO
-10. If the gene Kangaroo coat color is not on the X chromosome, you wouldn't be able to get the calico pattern that comes from X chromosome inactivation.
-11. Parts:
- * A: No, because if it were a mutation in the gene itself it would act by affecting the structure of the protein, and should have the same effect in adults and children. But it doesn't, so it's more likely to be a mutation in regulatory region.
- * B: It's not on the Y chromosome?
-12. TODO
-13. TODO
-14. TODO
-15. TODO
-16. TODO
-17. TODO
-18. TODO
-
-## 2020-01-24
-
-Solved the `FIB` Rosalind problem in my shell/Awk repo. Tried doing it in
-straight bash, but I couldn't figure out how to get the bash function to use
-a global bash array for memoization after like 10 minutes of searching. Ended
-up rewriting it in Awk, which took two minutes, was simple, and runs fast.
-Great.
-
-Solved `MMCH` in my Lisp repo. Much like the previous problem in this line, it
-was way easier than they made it seem. Also fixed the Uniprot cache to use the
-filesystem instead of just keeping a hash table, to save a few seconds of my
-life every time I `(run-tests)` in a fresh Lisp session.
-
-## 2020-01-25
-
-Rebooted my Mansion repo to try to simplify it enough that I can start
-dogfooding it for actual use, then turn it into something publicly-usable.
-
-## 2020-01-27
-
-Trying to decide on data to use for the BIOL-650 project. A list of
-possibilities:
-
-* <https://www.ncbi.nlm.nih.gov/sra/ERX1735780>
-* <https://www.cell.com/cell/fulltext/S0092-8674(15)01339-2#back-bib72>
-* <https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/studied-cancers>
-* <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3673022/>
-
-## 2020-01-29
-
-Figured out how to get the mutt HTMLified email working on my local machine,
-like I had done for work. Had to install NeoMutt from source because the Ubuntu
-version is too old to include `group-alternatives`:
-
-* Clone the NeoMutt repo.
-* Install Tokyo Cabinet for the header cache with `sudo apt install libtokyocabinet-dev`.
-* Configure the NeoMutt build with `./configure --disable-idn --idn2 --disable-doc --prefix=/usr/local --gnutls --with-ui=ncurses --tokyocabinet`.
- * Note the `--disable-idn --idn2` to make it use `libidn2` which is installed, instead of `libidn` which isn't.
- * Is this the correct set of options? Who knows!?
-* Install into `/usr/local` with `sudo make install`.
-
-More searching around for projects for HTSA:
-
- https://www.sciencedirect.com/science/article/pii/S2352340918302920
- doi 10.1016/j.dib.2018.03.079
- Illumina HiSeq 2500
- https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA432903
-
- https://www.sciencedirect.com/science/article/pii/S2352340919303750
- doi 10.1016/j.dib.2019.104022
- Ion Proton sequencer
- https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=506922
-
-Started going through the gnuplot book at night. Going to try to do that in
-parallel with the class.
-
-## 2020-01-30
-
-Drove to RIT only to discover class was not being held in person, and we're
-supposed to work online instead. **Welp.**
-
-Installed FastQC locally and got it running. Really just downloaded the binary
-version and symlinked it into place.
-
-Downloading the actual data is taking a while. While I wait, thinking about
-disk setups: now that I have a NAS and have moved my local backups to it, I can
-format my previously-for-backup internal drives and use them for actual data.
-I really need to spend the time to grok how Linux formats and mounts drives
-— I've been flailing and slapping things together and it has mostly worked until
-now, but I really should just learn it. Ordered a copy of the new edition of
-the sysadmin book that I remember being really good from my Dumbwaiter days.
-Hopefully that will be a good reference.
-
-Finished the FastQC stuff early, so spent a few minutes going through the
-gnuplot book. Random notes:
-
-* The user config file is `~/.gnuplot` (note the lack of an `rc` suffix).
-* To stop gnuplot from stealing my goddamn focus: `set terminal … noraise`.
-* To show the current value of a setting: `show …`.
-
-Finally got around to writing a first draft of a `lines` utility. Discovered
-and fixed a stupid bug in Adopt in the process.
-
-## 2020-01-31
-
-Now that I've got a more thorough backup system, went through and did my yearly
-pruning of unnecessary crap on my hard drive, and adding restic excludes to
-avoid backin up track. Number of scans was cut drastically:
-
-* Directories: 125k to 30k
-* Files: 567k to 285k
-
-Much cleaner now.
-
-# February 2020
-
-## 2020-02-01
-
-Fixed up my `$PATH` in my bash profile, for when I'm forced to use bash.
-
-Also had to fix up the uses of `set_color` in my Fish shell config to only
-happen for interactive connections, because Fish shits out some errors when you
-call `set_color` on a noninteractive connection (e.g. when using `scp` or
-`rsync`). Sigh.
-
-Now that I've got my backups pushed to my NAS (plus off-site) I can repurpose
-those 4 TB drives sitting empty in my tower. Gonna use them for a data dump
-site on the Linux side of things, e.g. for all the FASTQs and resulting data
-I'll be dealing with for my class this semester. Read through the sysadmin book
-chapters on the filesystem which were *super* helpful. Decided to pair the
-drives as a `btrfs` raid 0 array. I'm not super concerned with durability here
-— everything should be reproducible from the original raw FASTQs from SRA if I'm
-doing my job right, so I may as well just have one giant pool of storage. Also
-I don't need this `/dump` to be encrypted — it'll be handling non-sensitive
-data.
-
-To get them ready for `btrfs` I used:
-
-* `fdisk` to create new GPT partition tables on each and blow away the old ones.
-* Had to use `sudo cryptsetup luksClose` on the previously-encrypted backup
- drive to remove the lingering `sda_crypt` "partition" I was seeing.
-* `wipefs` to finish cleaning them completely.
-
-Even with all that, `mkfs.btrfs` complained about a partition table still
-existing on one of the drives. At that point I gave up and passed `-f` to it,
-and everything Just Worked. In hindsight, I bet I could have just done that to
-begin with and let `mkfs.btrfs` handle blowing away everything, but it was still
-a fun exercise.
-
-Attempted to figure out how to acquire MS Word for the class. Apparently
-there's a web version of Word now? If I can use that, at least I won't have to
-spin up a full Windows VM to write the papers…
-
-Installed `sra-toolkit` from Ubuntu's repository. Unfortunately it doesn't seem
-to come with `fasterq-dump`, only the older `fastq-dump`, at least according to
-`dpkg-query -L sra-toolkit`. Installed the latest binary from their site
-instead.
-
-Poked around at both `fastq-dump` and `fasterq-dump`. `fasterq-dump` seems more
-documented and user friendly, and is definitely faster. The biggest speedup
-came from telling it to use RAM as its temporary filesystem with `-t /dev/shm`,
-which cut the time down to 1/3. Guess all this RAM *is* actually good for
-something.
-
-After rebooting later, the boot process was suddenly hanging for over a minute.
-Another fucking yak to shave. Figured out how to disable the goddamn splash
-screen (edit `/etc/default/grub` to remove `quiet` and `default` and then
-`update-grub2`) so I could get some logs. Eventually saw some systemd horseshit
-about `a job is running for dev-disk-by\x2duuid-…`. Tried to figure out which
-drive had that UUID, but none of them did. After a bunch of flailing in `fstab`
-I realized that it might be the UUID of *the old backup drive I killed*. Sure
-enough, there was still an entry in `/etc/crypttab` for it, and a bunch of
-systemd shit in `/run/systemd/generator/` left over that I had to clean out.
-After that, everything boots fast(ish) again. I hate computing.
-
-## 2020-02-02
-
-Spent some time cleaning up the `Makefile` for my individual class project.
-Learned some stuff about `make` in the process:
-
-* When you have a pattern rule `foo/bar/%.baz`, whatever matches the `%` is
- called the "stem", and you can use `$*` to get the stem inside the rule.
-* There are a bunch of ways to allow users to override variables, but the one
- I settled on is to use `VAR := val` in the `Makefile`. Users can override
- these with `make VAR=newval …`. This doesn't allow overriding with
- environment variables, but that seems fraught with danger anyway, so let's
- consider it a feature, not a bug.
-* I was dicking around a lot with `addprefix` and `addsuffix` unnecessarily
- before. Much simpler is to use `$(FOO:%=prefix%suffix)`, e.g. `FASTQ_FILES
- := $(KEYS:%=data/00-input/%.fq.gz)`. This works with multiple-word variables
- too (god help me if filenames ever contain spaces).
-* Prefixing a command with `@` prevents `make` from echoing the command before
- executing it (but not when doing a dry run). Handy for the `echo …` commands
- in `make debug`.
-* Management of directory creation is… annoying. Ended up going with
- placeholder `.ph` files. Simple but inelegant. Oh well.
-
-Finished writing up my class project pick paper.
-
-Did more in the gnuplot book.
-
-Updated my `lines` tool to add a `--limit` option and handle broken pipes.
-I could have sworn I handled broken pipes in some other Lisp script at some
-point, but can't seem to find it now. I used `sb-int:broken-pipe` but I'm not
-sure if that's something I should be relying on. Need to ask in `#sbcl`.
-
-## 2020-02-04
-
-Class today. Did a bunch of various things after the lecture (I hope I can
-remember it all).
-
-Got the group permissions sorted out on the server in two steps:
-
-* Had the admin [add a `setgid` bit to the team directory][dir-setgid], so new
- stuff in there will get created with the team group as the owner by default.
-* Then we all added `umask 002` to our `rc` files to change our default `umask`,
- so when we create files they'll be `664` by default instead of `644`. This
- isn't a problem for non-team files because they get created with your personal
- group as the group.
-
-[dir-setgid]: https://www.gnu.org/software/coreutils/manual/html_node/Directory-Setuid-and-Setgid.html
-
-Once that was sorted out, started diving into aligning the Trapnell data with
-tophat and bowtie. This is… not pleasant.
-
-First I need to install both tools. There's an Ubuntu package for `tophat`, but
-it wanted to install NodeJS as a prerequisite. Why on earth does tophat want
-NodeJS? `SIGINT`ed that immediately and downloaded the binary version instead.
-
-The wrapper script `tophat2` tries to set up `$PATH` magically so everything
-just works:
-
- #!/usr/bin/env bash
- pbin=""
- fl=$(readlink $0)
- if [[ -z "$fl" ]]; then
- pbin=$(dirname $0)
- else
- pbin=$(dirname $fl)
- fi
- export PATH=$pbin:$PATH
- $pbin/tophat "$@"
-
-Unfortunately it doesn't try hard enough. The `readlink` is *attempting* to
-deal with symbolic links there, but only works with absolute links — relative
-symlinks (e.g. `ln -s ./tophat-…/tophat2 tophat2`) just return the relative
-path, which then gets passed to `dirname` and treated as relative from whatever
-directory you happen to be in. And let's ignore all those unquoted variable
-expansions I guess. Sigh. Better version:
-
- #!/usr/bin/env bash
- pbin=""
- fl=$(readlink -f "$0")
- if [[ -z "$fl" ]]; then
- pbin=$(dirname -- "$0")
- else
- pbin=$(dirname -- "$fl")
- fi
- export PATH="$pbin:$PATH"
- "$pbin"/tophat "$@"
-
-Once it's installed, I can finally run it. This is where things get even more
-confusing. After a whole bunch of flailing I think I managed to narrow down
-what the hell all these magic parameters are, and how things have changed since
-the paper was written. Notes:
-
-* Bowtie is an aligner — it aligns reads to a reference genome.
-* Tophat uses bowtie to align reads, but adds splice-awareness on top.
-* To use Bowtie to align, you need to preprocess your reference genome and
- create an index of the genome.
-* Additionally/separately, you also need to create a *transcriptome* reference
- genome for the splice awareness. You do this with tophat, which invokes
- bowtie under the hood. This means you end up with *two* bowtie indexes:
- a genome index and a transcriptome index.
-* To create the genome index: `bowtie2-build --threads 15 foo.fa some/prefix/foo`
- * This creates a bunch of files in `some/prefix` named `foo.…`.
- * Why does this take a basename and not just a directory name?
- * Do the name of the FASTA and the name of the index files have to match?
-* To create the transcriptome index: `tophat2 -p 15 --GTF foo.gff --transcriptome-index=another/prefix/bar some/prefix/foo`
- * There are three separate things specified here, each in a different way:
- * The gene annotation file is provided with the `--GTF` option, as a path. (Even if it's a GFF file, not a GTF file. There's no `--GFF` option. Of course.)
- * The genome index is provided with a positional parameter, as a basename.
- * The destination for the transcriptome index is a basename, and it'll splat out a bunch of files with various suffixes there.
-* Note that the chromosome names in the `.gtf`/`.gff` file and the sequence
- headers in the `.fa` file must match.
- * This caused me much pain when I was trying to use the `.gtf` file (foolishly assuming `--GTF` required a `.gtf`), the inclusion of which seems to be a trap in the input data laid for the unwary student. I will not be fooled again.
- * Tophat gave me a completely useless error (`"Couldn't build bowtie index with err = 1"`) when I tried this, rather than tell me anything useful that might have pointed me in the right direction. Extremely Good Software.
-* Some/all of the prefixes/basenames may or may not have to match. By the time
- I flailed enough to get everything working I had renamed everything to
- a single prefix, but I'm not sure if it was strictly necessary. I'll
- investigate tomorrow or Thursday, when I'm less annoyed.
-* Once you've got the two indexes created, you can run the alignment:
- `tophat2 -p 15 --transcriptome=another/prefix/bar --output-dir baz/ some/prefix/foo *.fq`
-
-That's as far as I got today. Next step is to clean up this mess and get it
-into the `Makefile`.
-
-## 2020-02-06
-
-Refactored the `Makefile` and added in all the alignment stuff. Wrote up the
-weekly report far too late.
-
-## 2020-02-08
-
-Decided to try using gnuplot to do the graphing exercises in the Campbell
-Biology textbook, to combine the things I've been reading lately.
-
-The first exercise I did was from chapter 9. Data:
-
- Low 4.3
- Normal 4.8
- Elevated 8.7
-
-Gnuplot file:
-
- set title "{/:Bold*1.5 Thyroid Hormome Levels and Oxygen Consumption}"
- set xlabel "Thyroid Hormone Level"
- set ylabel "Oxygen Consumption Rate [nmol O_2 / (min * mg cells)]"
-
- unset key
- set boxwidth 0.8
-
- plot [][0:10] "bio/thyroid" u 0:2:xtic(1) w boxes fillstyle solid fc rgb "black"
-
-That was pretty simple. Then I went back to a previous chapter. Data:
-
- # Time (min) Concentration of inorganic Phosphate (μmol/mL)
- 0 0
- 5 10
- 10 90
- 15 180
- 20 270
- 25 330
- 30 355
- 35 355
- 40 355
-
-Gnuplot file:
-
- set title "{/:Bold*1.3 Glucose-6-Phosphate Activity in Isolated Liver Cells}"
- set xlabel "Time (minutes)"
- set ylabel "Concentration of ℗_i (μmol/mL)"
-
- set xtics format "%h min"
-
- unset key
-
- set grid xtics ytics
-
- set lmargin 12
- set rmargin 6
-
- plot [0:45][0:400] "liver-glucose" w lines linewidth 2
-
-This one was a little trickier. I was trying to play around in the gnuplot
-REPL, but pasting the Unicode characters was producing garbage at the gnuplot
-command line. Pasting into a normal terminal/fish shell worked fine. After
-much flailing I tracked down the problem to gnuplot itself:
-
-* Gnuplot uses one of several different libraries to handle readline-like
- editing: `libeditline`, GNU `readline`, and some others (I think).
-* The library it uses is specified as an argument to `./configure` and linked in
- at compile time.
-* `libeditline` does not handle UTF-8 input correctly in this, the year of our
- Lord, 2020.
-* GNU `readline` *does* handle UTF-8 input correctly.
-* Unfortunately, `readline` is GPL-licensed, which is incompatible with
- gnuplot's license.
-* Debian (and Ubuntu) therefore link gnuplot with the broken `libeditline`,
- rather than linking it to `libreadline` and violating the GPL.
-* As far as I can tell, there's no way to simply disable all line editing in
- gnuplot so it can be wrapped with `rlwrap` or something.
-
-At this point I just gave up on using Unicode at the REPL and created `.gp`
-files that I then `load`. The GPL strikes again.
-
-## 2020-02-09
-
-Installed R Studio from the `.deb`. Needed to `apt install r-base` first to get
-the dependencies. It installed… a lot of stuff. Let's hope it hasn't wreaked
-too much havoc.
-
-Moved on to differential gene expression with `cuffdiff`. Finally starting to
-understand what the actual data files are. The input FASTQs are:
-
- GSM794483_C1_R1_1.fq.gz
- GSM794483_C1_R1_2.fq.gz
- GSM794484_C1_R2_1.fq.gz
- GSM794484_C1_R2_2.fq.gz
- GSM794485_C1_R3_1.fq.gz
- GSM794485_C1_R3_2.fq.gz
- GSM794486_C2_R1_1.fq.gz
- GSM794486_C2_R1_2.fq.gz
- GSM794487_C2_R2_1.fq.gz
- GSM794487_C2_R2_2.fq.gz
- GSM794488_C2_R3_1.fq.gz
- GSM794488_C2_R3_2.fq.gz
-
-The `C` here apparently stands for "condition". I still have no idea what
-`R[123]` is trying to tell me. The `_[12]` is the paired-end read pairs. Would
-it kill people to name things more descriptively and/or include a `README` that
-explains what all their one-letter abbreviations mean?
-
-Thinking about restructuring the filesystem layout to make the relationships
-between the FASTQ files more explicit:
-
- C1/
- R1/
- GSM794483_C1_R1_1.fq.gz
- GSM794483_C1_R1_2.fq.gz
- R3/
- GSM794484_C1_R2_1.fq.gz
- GSM794484_C1_R2_2.fq.gz
- R3/
- GSM794485_C1_R3_1.fq.gz
- GSM794485_C1_R3_2.fq.gz
- C2/
- R1/
- GSM794486_C2_R1_1.fq.gz
- GSM794486_C2_R1_2.fq.gz
- R2/
- GSM794487_C2_R2_1.fq.gz
- GSM794487_C2_R2_2.fq.gz
- R3/
- GSM794488_C2_R3_1.fq.gz
- GSM794488_C2_R3_2.fq.gz
-
-But I might be too lazy at this point to do it.
-
-The more I work with `make` the more I realize how limited it is. The general
-idea behind it (computing a dependency graph and using it to rebuild things) is
-sound, but the implementation is a clusterfuck of string processing and horrific
-syntax. It really gives me the urge to create Yet Another Build System. Sigh.
-
-Eventually `cuffdiff` finished. Time to move on. Tried to install cummerbund
-in R, but ran into a bunch of errors. Had to install numerous third-party
-dependencies outside of R:
-
-* `libcurl4-openssl-dev`
-* `libxml2-dev`
-* `libmariadbclient-dev`
-
-Why on earth does installing cummerbund require a goddamn **MySQL** client?!
-This is completely nuts.
-
-Ran into a giant pile of inscrutable errors even after installing all these
-bananas dependencies. Eventually tracked it down to the R version shipped in
-the Ubuntu package repository being too out of date to work with some of the
-required packages. I hate volatile software. Installed a newer version of
-R with [this guide](https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-18-04).
-
-Finally managed to get cummerbund working and started graphing stuff. It mostly
-works, but my graphs don't really look quite like theirs. Need to ask the
-professor about this.
-
-In the mean time, played around with using gnuplot to plot the data instead of
-R. Turned out to be not too bad, since cuffdiff already dumps its output in
-a textual format. Example of the plot comparing the expression levels of the
-various genes:
-
- set logscale
- unset key
- plot "gene_exp.diff" using \
- "value_1":"value_2" lt 3 pt 1 lw 1, \
- x+(0.5*x) lt -1 lw 1 dt 2, \
- x-(0.5*x) lt -1 lw 1 dt 2
-
-I also played around with adding some nicer default `linetype`s to my
-`~/.gnuplot` using colors from ColorBrewer. Fairly happy with the result so
-far, but I'll need to play around with them more over time to see if they hold
-up.
-
-## 2020-02-10
-
-Started making some actual gnuplot scripts to draw the stuff from yesterday.
-Got a couple of scripts working, but went down a horrific rabbit hole.
-
-First: the graphs produced by cummerbund are titled very poorly ("Genes" is not
-particularly helpful). I'm going to need to chat with the professor to try to
-decipher what these graphs are trying to show me.
-
-I reproduced the scatter and volcano plots without too much trouble. Used the
-PDF plotter and came up with the simple printable line styles that should be
-okay.
-
-Then I tried to reproduce the first graph, the "distribution of expression
-levels for each sample", and all hell broke loose. First of all, what is this
-graph even trying to show? The title is just "Genes", which is useless garbage.
-The y-axis is labelled "Density", the x-axis is pabelled `log_10(FPKM)`, and the
-function used to create it is `csDensity`, so maybe this is a kernel density
-plot of the `log10(FPKM)` values? Well, I tried that, and I ended up with
-fucking spaghetti in gnuplot — there were hundreds of lines all overlaying each
-other instead of the expected single line. I spent like an hour dicking around
-online trying to figure out what the hell gnuplot was doing. Eventually,
-*finally*, I realized no one online was going to help me, and decided to break
-things down myself. I extracted the problematic column from the file and
-computed the `log10` values myself in Lisp, and found an issue: some of them are
-0, which means their `log10` value is undefined. After removing those and
-replotting, I confirmed that those undefined values were what was causing all
-the discontinuities in the original gnuplot kernel density plot — instead of
-ignoring the values, it would start a new line every time it hit one. Christ.
-So then I cloned down the cummerbund repo to see how they were handling this.
-The code is… not particularly easy to read. I searched around more online, now
-that I realized the problem, and eventually found a post where someone notes
-that they add `1` to the FPKM values first, before taking the `log`, to avoid
-negative and undefined log values. So the x-axis label that says `log_10(FPKM)`
-is fucking lying — what's actually on the graph is `log_10(FPKM+1)`. Once
-I plugged that in, everything works and the graph looks roughly like theirs. To
-hell with all this, I'm done for tonight.
-
-## 2020-02-11
-
-Class today. Chatted about various stuff. The XLOC identifiers aren't actual
-identifiers, they're just arbitrary IDs generated during the alignment process.
-That explains why I was seeing `0`s when trying to plot the `XLOC_012662`, which
-was regucalcin in the paper — it's a different gene in my run, so I'd need to
-look up the gene in my own results. Also talked about P and Q values, and
-plotting each to find interesting cutoffs. I really need to refresh my memory
-on how to more easily make histograms in gnuplot.
-
-Started reading [A Curious Moon](https://bigmachine.io/products/a-curious-moon/)
-and got about 1/3 of the way through. An interesting approach to teaching SQL,
-though the writing voice is a little bit annoying. Still turning out to be
-a good book though.
-
-Finished the gnuplot book. It's really, really good.
-
-Finished my weekly report and homework for this week. Good to have a couple
-days of buffer. Figured out how to build the assignment zip file in
-a `Makefile` which was overkill, but "fun":
-
- assignments/sjl7678-alignment.zip: $(ALIGNMENTS)
- rm -rf assignments/sjl7678-alignment
- mkdir -p assignments/sjl7678-alignment/
- cd assignments/sjl7678-alignment && mkdir -p $(KEY_PREFIXES)
- mcp 'data/02-alignment/*/align_summary.txt' 'assignments/sjl7678-alignment/#1/align_summary.txt'
- mcp 'data/02-alignment/*/logs/tophat.log' 'assignments/sjl7678-alignment/#1/tophat.log'
- cd assignments && zip -r sjl7678-alignment.zip sjl7678-alignment
-
-I'm slowly becoming more and more disillusioned of `make`. Even though I'm only
-using it for what it's actually designed for (running commands to turn files
-into new files) it's still a tremendous pain in the ass to do anything
-nontrivial. The "scripting" support is completely batshit string-based munging,
-and even the basic definition of any dependencies beyond explicit lists of files
-is miserable.
-
-## 2020-02-12
-
-Spent some (far too much) time getting the Trapnell data into Postgres, figuring
-out how to ergonomically plot it with gnuplot, and figuring out how to do decent
-histogram and CDF plots. I'm not thrilled with what I came up with, but it'll
-have to do for now because it's already 11 PM.
-
-## 2020-02-15
-
-Tracked down the root cause of https://github.com/dimitri/pgloader/issues/1005
-— need to write it up in a comment.
-
-## 2020-02-16
-
-Finished the `make` book. I'm becoming less and less enthusiastic about `make`
-the more I learn about it and see real-world examples of it.
-
-Wrote up the pgloader issue debugging comment.
-
-## 2020-02-18
-
-Class today was a lot of chatting and planning ahead for the group project.
-Things to think about:
-
-* Need to get a CSV or something together with the sample names and SRA IDs so
- I can map the FASTQ files back to the actual conditions. Not sure if there's
- a way to pull this metadata out of the raw data files or whether I have to
- type it out by hand like some uncultured swine.
-* Need to figure out what the "transcript integrity number" from the paper
- means.
- * Professor wasn't sure.
- * I think I found the paper that created it, so I just need to read it and hope it doesn't go over my head.
-* Need to think more about the filesystem layout for the project.
- * Pipeline directories mostly worked in the individual one. Try that again?
-* Need to figure out how we're going to automate the project.
- * Considered trying `redo` but I want it to run on the server, so that's probably out.
- * Could use `make` again even though I'm more and more disenchanted with it every time I try to use it.
- * Could use bash scripts but that seems miserable in different ways.
-* I'd really like to try to reproduce the paper with the exact versions of the
- tools they used, as a baseline.
- * This will likely be fucking miserable because modern software has a half life of a year.
- * If I can manage to do that, then I could update the tools and see what changes.
- * Then try different tools and see what changes.
-
-One more big question: need to figure out what the overrepresented sequences
-FastQC is complaining about are. I figured I'd do BLAST searches for them, so
-I wanted to get a list of all the overrepresented sequences. So I need to pull
-them out of the FastQC reports… which are HTML files… all on one line. Did some
-horrific `grep`ing to get what I wanted:
-
- sjl at ouroboros in /dump/karkinos/fastqs/qc
- ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' ' ' | cut -d' ' -f3 | sort | uniq -c
- 2 CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
- 2 CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
- 2 CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
- 2 CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
- 2 CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
- 2 CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG
-
- sjl at ouroboros in /dump/karkinos/fastqs/qc
- ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' ' ' | cut -d' ' -f3 | sort | uniq | xargs -ISEQ grep --files-with-match SEQ *.html | sort | uniq
- SRR6671104.1_1_fastqc.html
- SRR6671104.1_2_fastqc.html
-
-Interesting — only one of the samples has this issue, and the same sequences
-appear in both sides of the pair. Immediately after doing this horrible `grep`
-incantation I remembered the `.zip` produced by FastQC and sure enough, there's
-a text file in it with the results in a much less awful format. But I still
-needed a janky awk incantation:
-
- ><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=1;i<=NF;i++) print FILENAME, $i }' */fastqc_data.txt
- SRR6671104.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences warn
- SRR6671104.1_1_fastqc/fastqc_data.txt #Sequence Count Percentage Possible Source
- SRR6671104.1_1_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT 58594 0.18785731410785536 No Hit
- SRR6671104.1_1_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC 39157 0.12554065004132323 No Hit
- SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG 36886 0.11825963218388151 No Hit
- SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT 34457 0.11047205297836592 No Hit
- SRR6671104.1_1_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT 33273 0.10667604895229328 No Hit
- SRR6671104.1_1_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT 31923 0.1023478349022949 No Hit
- SRR6671104.1_1_fastqc/fastqc_data.txt
- SRR6671104.1_2_fastqc/fastqc_data.txt >>Overrepresented sequences warn
- SRR6671104.1_2_fastqc/fastqc_data.txt #Sequence Count Percentage Possible Source
- SRR6671104.1_2_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT 58938 0.18896020716948458 No Hit
- SRR6671104.1_2_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC 39362 0.12619789736002668 No Hit
- SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG 36752 0.11783001686336315 No Hit
- SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT 34782 0.11151403043484702 No Hit
- SRR6671104.1_2_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT 34471 0.11051693816110664 No Hit
- SRR6671104.1_2_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT 31900 0.10227409495922085 No Hit
- SRR6671104.1_2_fastqc/fastqc_data.txt
- SRR6671105.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences pass
- SRR6671105.1_1_fastqc/fastqc_data.txt
- …
-
-Some quick `cut`ing and such and I've got what I need, in a slightly less
-horrible way. I guess.
-
- ><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=3;i<NF;i++) printf(">%s/%d %s\n", FILENAME, x++, $i) }' */fastqc_data.txt | cut -f1 | sort -k2 | uniq -f1 | tr ' ' '\n'
- >SRR6671104.1_1_fastqc/fastqc_data.txt/0
- CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
- >SRR6671104.1_1_fastqc/fastqc_data.txt/5
- CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
- >SRR6671104.1_1_fastqc/fastqc_data.txt/1
- CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
- >SRR6671104.1_1_fastqc/fastqc_data.txt/4
- CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
- >SRR6671104.1_1_fastqc/fastqc_data.txt/3
- CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
- >SRR6671104.1_1_fastqc/fastqc_data.txt/2
- CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG
-
-Did a BLAST search for these sequences and they're all 18S Ribosomal RNA.
-Professor wasn't sure how that could have gotten into the mix, since the paper
-says they only used poly(A) RNA. Need to ask around at work to see if anyone
-has any ideas what the heck happened.
-
-(This weirdness is only on sample C0. Unfortunately that's not the sample the
-paper identified as being low quality (C3) so that isn't the issue.)
-
-Did a bit more on the Enceladus book. Decided to just read the rest, because
-the DB/Makefile crap is getting way too messy. I really need a `sqlfmt` tool,
-but every time I try to use smug to write a parser I get depressed.
-
-Merged a GitHub PR into badwolf and tried to pull it down locally with hg-git,
-which fucking broke for the millionth time, with a lying error message blaming
-hg-prompt. Fucking fantastic. Updated hg-git (apparently there's a new repo on
-Heptapod since Atlassian is shitcanning BitBucket's Mercurial support) and now
-there are four heads on default. I have no idea which one I'm supposed to use.
-I picked the latest one arbitrarily and it seems to work, at least well enough
-to pull down the PR. Good enough.
-
-## 2020-02-19
-
-Finished the weekly report for class.
-
-Fixed the scroll overflow of the code blocks in this site.
-
-## 2020-02-22
-
-Put together some initial structure for the group project. Got the files
-unzipped and FastQC run, so tomorrow I can dive into new stuff.
-
-Tried to track down a human reference genome for use later. I *think* I got
-what I need, but the sites are… not very user friendly. Will confirm with the
-professor in class.
-
-Started reading one of Tufte's books. Found a beautiful graph from a 1979 paper
-(`10.1126/science.206.4421.987`) with a gorgeous plotter font. Need to try to
-find a similar font to use myself.
-
-## 2020-02-23
-
-Did another graphing exercise from Campbell Biology in gnuplot for fun.
-
-Figured out that `dashtype ''` in my style files was causing the *line* to show
-as solid, but the entry in the key was blank! Turns out `dashtype solid` is
-what I actually want. Also learned that I need to explicitly reset that when
-I want a solid line, otherwise the dashtypes I set in previous styles will carry
-over.
-
-Refined my color plotting style to use dashed lines instead of lighter colors
-for types 6-10. Seems easier to read that way — the light colors were *really*
-light.
-
-Figured out how to do linear regression in gnuplot, and made some helper
-functions to make it less annoying:
-
- # lr = linear regression (the function to fit)
- # lrt = linear regression title
-
- baselrt(mm, bb, mprecision, bprecision) = sprintf(\
- sprintf("%%.%dfx + %%.%df", mprecision, bprecision), \
- mm, bb)
-
- lr(x) = m * x + b
- lrt(mp, bp) = baselrt(m, b, mp, bp)
-
- lr2(x) = m2 * x + b2
- lrt2(mp, bp) = baselrt(m2, b2, mp, bp)
-
- fit lr(x) "plant-co2" using "CO2":"Corn" via m, b
- fit lr2(x) "plant-co2" using "CO2":"Velvetleaf" via m2, b2
-
- plot [300:1050] \
- "plant-co2" u "CO2":"Corn" w lp, \
- "plant-co2" u "CO2":"Velvetleaf" w lp, \
- lr(x) t lrt(3,0) lt 6, \
- lr2(x) t lrt2(3, 0) lt 7
-
-Looked into why our first few bases are called at lower quality. Found
-[this](http://seqanswers.com/forums/showthread.php?t=61936) which says that it's
-because Illumina uses the first few bases to calibrate their quality score
-analyzer, and so they mark them as being less confident to account for the
-calibration process. Signed up for that forum (over bare-ass HTTP, sigh) to get
-the damn tech note PDF too.
-
-Dug out the old AxiDraw to play around with
-<https://anvaka.github.io/city-roads/?q=rochester%2C%20ny&areaId=3600176069>.
-First problem: after installing the new AxiDraw version, connecting to the
-AxiDraw in Inkscape failed with `Failed to connect to AxiDraw`. Sigh. After
-much flailing, it turns out the problem was that something had `pip install`ed
-the `serial` module, which is some kind of serialization/deserialization
-library, and so `import serial` was importing that instead of pySerial, which is
-the serial port communication library that happens to use the module name
-`serial` too. Ugh. So `pip uninstall serial; pip install pySerial` fixed that.
-Then I tried to get the `Render > Hershey Text` thing working again, but it
-just… did nothing at all. Looks like there's a *second* Hershey Text plugin,
-this time under the AxiDraw extension menu. I guess I'm supposed to use that
-one now? Who knows?
-
-Played around with how to make a dot plot in gnuplot while the AxiDraw drew
-Rochester. First I grabbed the read counts for our various FASTQ files, pulled
-from the FastQC output files:
-
- ffind _data.txt \
- | xargs grep 'Total Sequences' \
- | sed -Ee 's|./([^/]+)_fastqc/.*\t|\1 |' \
- | sort --stable -k1,1 -r \
- | sort -k2,2 -n \
- > seqs.txt
-
- cat seqs.txt | head -n2
- C3_1 21081615
- C3_2 21081615
-
-Then I can plot a dot plot with something like:
-
- # https://twitter.com/stevelosh/status/1231434196213796864
- set termoption font "Routed Gothic,12"
-
- # I wonder if there's a way to capitalize stylistically
- set title "READ COUNTS OF INDIVIDUAL FASTQ FILES"
- set xlabel "READS (MILLIONS)"
-
- # major x tics every 2 million, with 2 minor divisions per major (i.e. minor tics are every 1 million)
- set xtics 2
- set mxtics 2
-
- set grid y nox
-
- # left justify
- set ytics left offset -2.2, 0.0
-
- unset key
-
- # scale to millions by hand
- # read FASTQ names from file
- plot [20:35][-1:26] "seqs.txt" using ($2 / 1000000.0):0:ytic(1) lc rgb 'black'
-
-And I get something like:
-
-![plot](https://i.imgur.com/YAPXHaQ.png)
-
-Neat!
-
-Hacked together some Awk to remove overrepresented sequences. **But** I don't
-think a simple `grep -v` approach works, because the two FASTQ files are
-expected to have the paired reads at the same positions in the file. So if we
-remove a read from one file but not the other, now all the reads are going to be
-offset. So we need to remove these reads a bit more carefully (really, we need
-tools that process the paired-end reads together). Need to think about this
-a little bit more.
-
-## 2020-02-24
-
-Figured out how to mount a directory on my home machine on my laptop with SSHFS,
-in advance for tomorrow in case I need it:
-
- # mount
- mkdir ~/foo
- sshfs -o idmap=user sjl@server:/whatever ~/foo
-
- # unmount
- fusermount -u ~/foo
-
-Downloaded STAR and got it running. Read through the manual which, despite the
-broken English, was fairly useful. Tried running it on the gzipped reference
-genome but it complains that it requires the genome to be uncompressed. Tried
-to trick it into working by psubbing `zcat` but that didn't work. Gave up and
-decompressed the damn reference.
-
-Once the genome index was built, started aligning the reads. It failed right
-away complaining about a zero-length read. `zcat | grep`ing showed the read as
-being there in both files, *but* I bet this was an issue of the pairs getting
-out of sync because we grepped out the overrepresented sequences naively. Tried
-restarting the alignment on the original data and it's still going, so that's
-probably it. Need to figure out how to filter those bad reads properly tomorrow.
-
-## 2020-02-25
-
-Class. Chatting about QC and such.
-
-Professor says he asked around and people haven't seen the first 5 bases being
-lower quality before, but that the explanation from Illumina makes sense, and
-that we should *not* trim those bases unless they seem to be causing issues.
-Also talked about how to filter out the overrepresented sequences — he thought
-fastx had something for this, but I can't seem to find anything in the
-documentation. I may need to write some code.
-
-Professor installed `pigz` on the server, so I can remove my hacky `xargs`
-workaround.
-
-Need to find a way to filter bad sequences from BOTH files at once.
-
-Need to align to the NCBI reference GTF instead of the other weird one.
-
-## 2020-02-26
-
-Tried rebuilding the genome index with the other GTF file. Initially got:
-
- Feb 26 21:00:45 ..... inserting junctions into the genome indices
- terminate called after throwing an instance of 'std::bad_alloc'
- what(): std::bad_alloc
- ./src/03-index-genome.sh: line 19: 14902 Aborted (core dumped) STAR --runThreadN $CORES --runMode genomeGenerate --genomeDir data/03-genome-index …
-
-That seems… bad. But perhaps this is because I didn't have enough RAM, because
-I still had my Lisp process from 2 days ago open consuming ~25 GB of RAM.
-I killed that so I've got about 60 GB now. Let's see if that helps. Otherwise
-it sounds like I may need to look at [this](https://github.com/alexdobin/STAR/issues/103).
-
-Still need to find that other paper my partner found from the same people.
-
-Submitted weekly report.
-
-# March 2020
-
-## 2020-03-16
-
-Installed TrimGalore. Had to install `cutadapt` first, which is some Python
-package. Luckily it's `apt install`able so I did that, then grabbed TrimGalore
-which is blessedly self-contained.
-
-Installed pardre. The "release" is just a source dump of C++ files with no
-`README`, cool. Had to install `mpi-default-dev` to get `make` to work. Gross.
-
-Used ParDRe to dedupe a couple of the samples as a test. It took 927 seconds to
-finish the first pair, where I used the `.fastq.gz` files as input and the `-z`
-option to enable compression. I noticed a bunch of the time (the last 700
-seconds or so) was spent bottlenecked on a single CPU core while writing the
-output. So then I tried the next sample without compression, and it took 309
-seconds. Welp.
-
-Ran FastQC on the first two sample results. For the most part FastQC is happy
-now, but there *are* still some duplicated sequences left. I'm guessing this is
-because ParDRe is paired-end aware, and will only remove a pair if *the entire
-thing* is a duplicate, while FastQC works only on a per-fastq basis. I don't
-know this for sure though.
-
-I did the initial run with 0 mismatches allowed. We should also try with
-N mismatches, though note that this will not work perfectly because ParDRe
-clusters reads by their prefix, so reads with `<=N` mismatches in their prefix
-(and none in the suffix) will not be caught because they'll never get compared.
-Welp.
-
-Cleaned up the scripts to run all the ParDRe runs (0/1/2 mismatches for every
-pair) and FastQC them. Seems to take roughly 5 minutes per pair to do ParDRe.
-We have 13 pairs, so it should take around an hour per mismatch setting (i.e.
-3 hours) to do them all.
-
-ParDRe finished. Results look reasonable. There's not a *ton* of difference
-between 0 and 1 mismatch (there is *some* though). The difference between 1 and
-2 is minimal. Next step is to trim low-quality stuff with trim galore. Gonna
-do that during "class" tomorrow and let it run while I'm at work.
-
-## 2020-03-17
-
-Started sketching out some trimming scripts using Trim Galore. It has a bunch
-of options — I'm not really sure how to go about deciding what the correct
-values for each are, other than… try some and see if the resulting FastQC run
-seems reasonable?
-
-Attempted to do something foolish: update software. I wanted to fix the StumpWM
-window/frame focus thing, but I figured I would update StumpWM and SBCL and my
-dotfiles first. Clearly, this was a mistake. I updated all of them, added the
-fix, and restarted StumpWM, and everything went to shit. StumpWM suddenly
-started chewing up tons of a CPU core, lagging to hell, and Firefox and Chrome
-wouldn't even open. Tried reverting back to the previous versions but it didn't
-work. Eventually I SSH'ed into the machine and rebooted it, and *then*
-everything started working again. I think. We'll see. Going to give up on
-updating for today — wasted enough time dicking around with this tonight.
-I hate computers.
-
-## 2020-03-18
-
-Debugged a gnuplot issue a bit in IRC with someone. I think I'm losing my mind.
-
-Submitted weekly report for class.
-
-## 2020-03-19
-
-Met with the professor for office hours to chat about the deduplication/trimming
-results.
-
-For deduplication, we first talked about what might cause the actual problem
-it's trying to solve. A cell may have many duplicate copies of a particular
-mRNA floating around if the gene is highly expressed — this is exactly what
-RNAseq is trying to measure. However, once those mRNAs are fragmented, it's
-extremely unlikely that they would fragment *exactly* in the same place, and so
-when you sequence the resulting fragments they should be offset from each other.
-If you see exact duplicate reads, it means that the *fragment* was somehow
-duplicated, which is probably an artifact of your laboratory process and not
-biologically significant.
-
-Next: mismatches. ParDRe supports removing duplicates with up to N mismatches,
-so we talked about why we might want to do this. If two reads match with `N
-> 0` mismatches it indicates one of the following:
-
-* It's the same problem as the exact duplicates, but with an extra error
- accumulated during the process (e.g. a sequencing read error or a PCR
- duplication problem).
-* It's from two genes that are very biologically similar, and so is genuine
- useful information. However, this is still unlikely because the fragments
- would likely be shifted as described above.
-
-At the end he recommended deduplicating with 1 mismatch.
-
-Another complication: ParDRe can operate in paired-end mode, which means that in
-some cases FastQC still finds some duplication. This happens when one side of
-the read is an exact duplicate while the other is not. ParDRe does not remove
-this (because of the non-duplicate side) but FastQC only looks at the individual
-FASTQs. He said this is not a problem, and that ParDRe is doing what we want.
-
-Also chatted a bit about trimming. We mostly talked about the issue mentioned
-in the Trim Galore documentation: because it trims *any* matching adapter
-sequences, it will *always* trim off a particular base at the end of a read
-(because that one base always matches the adapter sequence's last base). This
-is not ideal. He recommended running Trim Galore and requiring at least adapter
-5 bases to match before trimming to deal with that issue.
-
-Plan for today: rerun the trimming with the 5 base requirement and using the
-1 mismatch deduplication data as the source, and then kick off an alignment if
-I still have time (unlikely).
-
-Neomutt has suddenly started shitting the bed and failing to quite with `Unable
-to write` errors. I *hate* software. Tried updating to the latest `master` and
-rebuilding. I guess we'll see if it fixes things.
-
-Trimming finished. It looks like increasing the minimum adapter length to
-5 fixed the base imbalance issue. Good.
-
-Decided to rebuild the STAR index I had made because I can't remember whether
-I had already rebuilt it with the different reference file. Shouldn't take too
-long.
-
-Index rebuild finished in ~35 minutes. Kicked off the 26 alignment runs. That
-should take a lot longer.
-
-Checked in on the STAR progress. It's actually only taking ~5-10 minutes to run
-each alignment. That's not nearly as bad as I expected.
-
-Initial alignments finished. Started digging into what I can actually *do* with
-the results. I found the Integrative Genomics Viewer (IGV) that seems
-promising. Downloaded it and loaded the genome into it, then tried to load the
-alignments but apparently it wants BAMs, not SAMs, and additionally it wants
-indexes of the BAMs.
-
-I looked into how to convert the SAMs to BAMs+indexes. Went down a bit of an
-unnecessary rabbit hole, but it was still productive. My initial attempt was
-using `samtools` to do a bunch of conversion:
-
- STAR … # produces Aligned.out.sam
-
- samtools view -S -b Aligned.out.sam > sample.unsorted.bam
- samtools sort sample.unsorted.bam > sample.bam
- samtools index sample.bam
-
-This worked, but took a while and wrote a bunch of intermediate files I don't
-really need. Eventually I realized that STAR can produce sorted BAM files
-itself, so all I need to do with `samtools` is the final indexing. Learned
-a bunch about running STAR too. One nice way to save time across all the
-alignments is to have STAR load the genome index into shared memory once at the
-beginning and use it for all the alignments, the flush it out at the end:
-
- function cleanup {
- STAR --genomeDir "${genome}" --genomeLoad Remove
- }
-
- trap cleanup EXIT
-
- STAR --genomeDir "${genome}" --genomeLoad LoadAndExit
-
-Then the STAR invocations look like:
-
- STAR \
- --runMode alignReads \
- --runThreadN $CORES \
- --genomeDir "${genome}" \
- --genomeLoad LoadAndKeep \
- --limitBAMsortRAM "${sortram}" \
- --outSAMtype BAM SortedByCoordinate \
- --outBAMcompression 0 \
- --outBAMsortingThreadN $CORES \
- --outFileNamePrefix "${outdir}/" \
- --readFilesIn "${in1}" "${in2}"
-
-I tried putting the temporary directory on `/dev/shm` but it ended up being too
-much with the persistent genome index also in RAM. Unfortunately my order from
-Crucial is still backordered. Oh well, I'll just watch more TNG while I wait
-for all the alignments to complete.
-
-## 2020-03-31
-
-Ran the Trapnell plots from the paper on our data. Results seem at least
-plausible on the surface. Pinged my partner to see if his plots match.
-
-Attempted to install RSeQC to do the TIN score stuff mentioned in the paper.
-Ran `pip3 install RSeQC` as described in their docs, but that failed with:
-
- ValueError: no cython installed, but can not find pysam/libchtslib.c.Make sure that cython is installed when building from the repository
-
-After some searching online I found that this can be fixed by upgrading pip with
-`pip3 install --upgrade pip`. It's… fine, I guess?
-
-Once I got it installed, I tried running `tin.py` to produce the TIN scores.
-The script wants the BAM (which I have) and a "BED" file. BED apparently stands
-for Browser Extensible Data and is a TSV file with lines like this:
-
- chr1 67092175 67109072 XM_011541469.1 0 - 67093004 67103382 0 5 1429,187,70,145,44, 0,3059,4076,11062,16853,
- chr1 67092175 67131183 XM_011541467.1 0 - 67093004 67127240 0 9 1429,187,70,106,68,113,158,92,42, 0,3059,4076,11062,19401,23176,33576,34990,38966,
- chr1 67092175 67131227 XM_017001276.1 0 - 67093004 67127240 0 9 1429,187,70,145,68,113,158,92,86, 0,3059,4076,11062,19401,23176,33576,34990,38966,
-
-The collections of stuff I downloaded from UCSC didn't include one of these, but
-the RSeQC site has one available for download. Actually, it has several BED
-files. I'm not sure which one is correct. I downloaded `hg38_RefSeq.bed`
-because it sounds the most like the GTF I used (`hg38.ncbiRefSeq.gtf`). Was
-that correct? Who knows?
-
-Now that I have that, I'm doing an initial run with:
-
- tin.py --input data/05-alignment/clean/C0/C0.bam -r data/00-raw/hg38_RefSeq.bed
-
-It seems like it's going to take a long time. Guess I'll wait.
-
-The goddamn thing didn't output an ending timestamp. Come *on*. It took
-somewhere between 2-5 hours, but I have no idea how long because I wasn't
-watching it when it finished. I guess I'll do another run with a `time` prefix.
-Sigh.
-
-# April 2020
-
-## 2020-04-04
-
-Spent an hour wrangling gnuplot for class to try to make a multiplot of the TIN
-scores. Mostly it was just a bunch of fiddly stuff to try to make the graph
-look exactly like I wanted. Some of the key points follow.
-
-Basic multiplotting happens with:
-
- set multiplot \
- title "TIN Score Distributions by Sample" font ",14" \
- layout 3,5
-
- …
-
- unset multiplot
-
-Looped through the samples with this:
-
- do for[c=0:9] {
- set title "C" . c
- bad = (c == 3)
- plot [][0:1] "data/05-alignment/tin/C" . c . ".tin.xls" using 5:(1) smooth cnorm notitle lw 3 \
- lc (bad ? 3 : '#000000') \
- lt (bad ? 6 : 1)
- }
-
- do for[n=1:3] {
- set title "N" . n
- bad = (n == 2 || n == 3)
- plot [][0:1] "data/05-alignment/tin/N" . n . ".tin.xls" using 5:(1) smooth cnorm notitle lw 3 \
- lc (bad ? 3 : '#000000') \
- lt (bad ? 6 : 1)
- }
-
-Note the `smooth cnorm` requiring the columns to be `using
-data-column:literal-value`. This tripped me up for a while — I was doing `using
-0:data-column` and that was fucking things up completely. I realized the
-problem when I noticed there wasn't a peak at zero, and also that sorting
-changed the results (which it shouldn't). `smooth cnormal` needs exactly one
-column of the value, plus a weight value for the point. Remember this!
-
-You can skip around the plot with `set multiplot next` and `set multiplot prev`.
-Handy for adjusting the layout.
-
-Instead of trying to draw labels on every graph, I decided to try putting a key
-at the bottom right. This was very fiddly:
-
- set multiplot next
-
- $empty << EOF
- EOF
-
- set termoption fontscale "0.1"
- set title "{/*0.9 Key}"
-
- set ylabel "{/*0.9 CDF}"
- set xlabel "{/*0.9 TIN Score}"
- set format x "{/*0.9 %g}"
- set format y "{/*0.9 %.2f}"
-
- set origin 0.75,0.0
- set size 0.25,0.33
-
- plot [0:100][0:1] $empty using 1:(1) smooth cnorm notitle
-
-Note the `%g` for "humanized" formatting. I'm going to remember the `g` as
-standing for `god dammit, what is the magic character?`.
-
-## 2020-04-11
-
-Merged some PRs for `t` and did some basic test maintenance. I still love
-`cram` for CLI tests — it's such an intuitive, usable way to test things.
-
-# July 2020
-
-## 2020-07-18
-
-More work on Jarl.
-
-Started going through The Data Compression Book, doing the examples in Common
-Lisp. First code was a bit stream wrapper. I initially did
-a `trivial-gray-streams` wrapper but then refactored it into simple structs,
-since it's going to be writing a lot of bits and performance probably matters.
-Maybe.
-
-## 2020-07-19
-
-Did some server upgrades I've put off too long. Need to script up some stuff
-for automagically doing this in the future. Might be a good excuse to learn
-`expect`?
-
-PR'ed Quicklisp to move my projects from Bitbucket/Github to my self-hosted
-setup. At least now I own my availability and won't have to do this bullshit
-again.
-
-Imported a couple of contributed patches to `hg-prompt`. I'm glad people still
-find it useful. Also took the time to port the documentation to `d` because it
-was using some Python wiki bullshit thing before.
-
-Did some more work on The Data Compression Book. Wrote up a little `main`
-skeleton for the binaries. Might use this as an excuse to try out subtask-style
-interfaces in Adopt. I think it'll work, but I haven't actually tried it yet.
-
-## 2020-07-26
-
-Finally finished reading ULSAH. Took a few months, but aside from skimming over
-20% that wasn't super relevant it was as good as I remember back from 2008.
-Lots of things I need to look deeper into and try out on my own:
-
-* `strace` and friends.
-* Performance monitoring stuff: `sar`, `vmstat`, `iostat`, etc.
-* VMs using KVM and Qemu instead of Virtualbox.
-* Learn more about `systemd` (unfortunately).
-* Poke at `btrfs` a bit more using my existing setup.
-* Look into Ansible as a configuration management system — it seems the simplest
- of the Puppet/Chef/Salt/Ansible crew.
-* Building machine images with Packer.
-* Look into Terraform for deployment.
-* Restructure local wifi into main/guest networks.
-
-## 2020-07-28
-
-Did a bit more work on Jarl. It's actually coming together. Still got a bunch
-of things on the TODO list but I poked at the GitHub API with it a bit and it
-already works surprisingly smoothly. Neat.
-
-# August 2020
-
-## 2020-08-27
-
-Fixed Chancery to work around a bug in CCL's `RANDOM`. In the process I had to
-shave another yak and rotate my SSH keys so I could push to GitHub from my
-laptop. Ended up going with `ed25519` keys this time.
-
-## 2020-08-28
-
-Took a stab at having Travis run Lisp tests. After flailing at it for a while,
-there's just too much jank. I'll stick to running them locally.
-
-Also tried installing Clasp, but the process is too bonkers. If they ever
-provide binaries I'll run my tests on it, but the current install process is
-just way too much work to be worth my time.
-
-## 2020-08-28
-
-A bunch more work on Jarl. Wrapped up a couple of loose ends. Ran into some
-more MOP craziness that I *think* I've solved, but am not 100% sure.
-
-## 2020-08-30
-
-Threw together a little `weather` CLI because trying to use any weather
-website/app is just maddening. All I want is the temp/rain chance for the next
-few hours to know if I can go sit in a park… why is this so hard to find?
-Learned a bit about ZIP codes in the process of making it.
-
-# September 2020
-
-## 2020-09-19
-
-First day of PTO. My Pine64 arrived, so I downloaded their installer/MicroSD
-flasher thing and put Ubuntu on it for now, at least to make sure it's all
-working. Worked pretty well (aside from very slow download speeds) but the
-thing seems to work!
-
-Wanted to watch some stuff on my projector, but for some reason VLC would
-*always* display on the main monitor when fullscreen was activated, no matter
-where the window was on the screen beforehand and no matter whether I set the
-(multiple!) settings in the preferences to the other screen. Absolutely
-*maddening*, especially because this worked fine a week ago. Turns out the
-solution was to run `refresh-heads` in StumpWM. Christ. I hate computers.
-
-## 2020-09-27
-
-TIL about `ncdu`. It's like a text-mode DaisyDisk, which is something I've
-wanted for a long time. Nice!
-
-# October 2020
-
-## 2020-10-01
-
-New Thinkpad finally arrived. Time to shave the computer setup yak once again.
-
-Figured out how to get into the BIOS and disable the GUI BIOS. Disabled
-Trackpad, etc. Can't seem to figure out how to disable hyperthreading yet.
-
-Downloading an Ubuntu 20.04 `iso`. Since the machine is so new I figure I'll
-need a relatively new release. Documentation says to open the "startup disk
-creator" in Ubuntu's gnome search thing but since I don't use that, some
-googling around shows it's actually called `usb-creator-gtk`. Selected my 4GB
-USB drive and the 2.6 GB ISO and pressed the start button, and got an error
-dialog saying "Could not write the disk image … to the device …." *Fantastic*
-start. Tried another USB stick, same thing. Eventually something clicked in my
-brain and I realized it probably wanted me to `sudo`. Restarted the creator
-with `sudo` and it worked. You couldn't have fucking *told* me that in the
-error message? Christ.
-
-Plugged in the USB stick and rebooted. Went straight to Windows setup. Killed
-power and rebooted and used `F12` to get to the boot menu. Ubuntu started, ran
-a drive scan, found errors in a file. Christ.
-
-Eventually the install finished. Booted into Ubuntu at a tiny resolution.
-Wanted to increase the resolution but the settings menu isn't tall enough to see
-the display settings I guess. Christ.
-
-Dove into a giant rabbit hole to update my kernel. Once that was finally done
-my resolution was better.
-
-Generated an SSH key and bootstrapped it by typing in the pubkey manually into
-an `authorized_keys` file because I don't want to install Dropbox on this
-machine now that Dropbox is such a horrible mess. Typed in my
-`hg.stevelosh.com` SSH config by hand, and was finally able to clone my
-dotfiles. Immediately failed because Git subrepos are disabled by default in
-Mercurial for security reasons, and you have to enable them in your `hgrc` which
-is in the repo I'm trying to clone. Kill me. Created a simple `hgrc` to enable
-them. It worked until one of the Github repo clones randomly hung. Christ.
-Ran `hg up -C` to unfuck it.
-
-Installed an old SBCL through `apt`, cloned and rebuilt a new version, installed
-Quicklisp, cloned and built Stump.
-
-Got through the hellscape of `hg-git` installation by pinning versions that work
-on my desktop. Some day this isn't going to work and I'll have to unravel this
-hairball or abandon Mercurial, but that day is not today.
-
-Building `st` on the new machine. The `README` only mentions "the X header
-files" as a dependency, which is a lie. You actually need `libx11-dev`,
-`libxft-dev`, and `libxext-dev`.
-
-## 2020-10-06
-
-Going back through the ARM book and doing some of the exercises. To help me
-check my answers, I extended CACL to support reading characters, added
-`char-code` and `code-char` commands, added `hex` and `bits` commands.
-
-Also started going through BIAL again, because now we're doing it for the book
-club at work. Maybe I'll actually finish this time.
-
-I keep going back and forth on whether it's worth making a special sequence
-class for DNA/RNA sequences while doing Rosalind. I poke around at it every
-time, but I don't think I have the time to truly do it right (yet), so for now
-I'm going to stick with vanilla strings.
-
-## 2020-10-07
-
-Finished chapter 1 of BIAL again. Took a little of my old code (mainly
-utilities) but mostly just wrote it from scratch. It was a good exercise.
-I feel a little rusty at Lisp though…
-
-Sending emails with `msmtp` from Fastmail suddenly broke with:
-
- msmtp: TLS certificate verification failed: the certificate fingerprint does not match fastmail
-
-I guess Fastmail updated their certs. The fingerprints are at
-<https://www.fastmail.com/help/technical/certs.html>.
-
-## 2020-10-22
-
-Continued setting up the new laptop. Was going to set up mutt for mail, but one
-of my settings files is still in Dropbox and I was hoping to avoid installing
-Dropbox on this machine since Dropbox is a pile of garbage now. Maybe it's time
-to shave that yak and set up some kind of `rsync` based syncing solution for the
-bulk of my files.
-
-## 2020-10-26
-
-Finally got around to merging some `hg-prompt` patches. They add Python
-3 compatibility, but break Python 2 compatibility because it used `bytes(foo,
-'utf-8')` which breaks with an inscrutable error in Python 2:
-
- Python 2.7.17 (default, Sep 30 2020, 13:38:04)
- [GCC 7.5.0] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>> bytes('foo', 'utf-8')
- Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- TypeError: str() takes at most 1 argument (2 given)
-
-I'm calling `bytes`, not `str`, what the fuck are you talking about, Python?
-Eventually I found `bytes(foo.encode('utf-8'))` which will work in both. I'm
-glad I don't use this language any more.
-
-# November 2020
-
-## 2020-11-02
-
-Had to update SBCL today and wanted to parallelize the build to make it run
-a bit faster. I remember Rainer Joswig tweeting about this but can't find the
-tweet, Googling for `SBCL parallel` is useless, there's no documentation of the
-magic env vars, and I can't find them by scanning `make-config.sh` or `make.sh`
-so I eventually gave up and asked in `#sbcl`. The magical incantation to
-parallelize SBCL with multiple jobs is:
-
- SBCL_MAKE_PARALLEL=12 SBCL_MAKE_JOBS=-j12 ./make.sh
-
-Need to investigate <https://github.com/troglobit/xplugd> for dealing with
-keyboard bullshit in Linux. Looks promising.
-
-## 2020-11-08
-
-Getting `pass` set up on the new machine. Steps, for the next time I need to do
-this:
-
- git clone … ~/.password-store
- apt install pass scdaemon
- switch-yubikeys
- gpg --import gpg-pubkey-main.txt
-
- pass show …
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bio/2021.markdown Mon May 08 13:46:28 2023 -0400
@@ -0,0 +1,152 @@
+# Biology Notes
+
+## 2021-04-26
+
+Homework:
+
+* Watch Ken Burns: the Gene.
+* Go through <https://www.coursera.org/learn/introduction-genomics>
+* Find practice exams from biology courses.
+* Find a paper to go over.
+
+Started Introduction to Genomic Tech class on Coursera.
+
+In one of the videos he talked about the word gene. What does it mean?
+
+* A heritable unit of the genome?
+* A sequence of the genome that codes for a protein?
+* What about introns? Are they part of the gene?
+* What about the upstream region with enhancers/repressors/promoter? Aren't
+ those just as important as the content itself, since they manage whether the
+ gene gets transcribed at all?
+* What about parts that code for non-protein-making RNA (TODO what is this
+ called again?)?
+
+## 2021-04-27
+
+More Coursera.
+
+* Any given gene might have a bunch of exons that get spliced in many different
+ ways. Is there a way to tell just by the sequence which stretches are exons?
+ I.e. is there something like the start/stop codons for exons/introns?
+
+## 2021-04-28
+
+More Coursera. Learned about some sequencing variants I hadn't known before:
+
+* CHiP-Seq is used to figure out where repressor/enhancer proteins bind to the DNA:
+ * Start by "cross-linking" the protein to the DNA (basically "freezing" the proteins to the DNA so they can't detach).
+ * Fragment the DNA.
+ * Use antibodies (how?) to pull/separate out the fragments that have a protein fused to them.
+ * Sequence just those fragments to find all the stretches of DNA that have proteins (presumably repressors/enhancers) bound to them.
+* Bisulfite sequencing is used to figure out which C's are methylated:
+ * Start with two identical DNA samples (unsure why we need this).
+ * In one sample, do some chemical magic that converts all *non*-methylated C's into U's.
+ * Sequence both samples.
+ * Any remaining C's in the modified sample must have been methylated.
+
+## 2021-04-29
+
+Coursera CS section. Mostly review for me, but a couple of interesting tidbits:
+
+* I liked how he emphasized how you have to *understand* what the software you
+ use is doing, and not just treat it as a black box.
+* The example of RNA editing and misalignments was interesting.
+* Need to look into "NCBI Genome Workbench" program. How does it compare to IGV?
+* The example RNAseq pipeline he used was bowtie → tophat → cufflinks → cuffdiff
+ which was what we used in my class at RIT.
+
+## 2021-05-01
+
+Coursera statistics section:
+
+* I like the term "ridiculogram". I see these a lot.
+* Batch effects are very common and very problematic.
+
+## 2021-05-03
+
+Lesson.
+
+Gene definition: a region of the genome with some function. Alternatives might
+be called "genomic elements".
+
+Introns/exons *might* use epigenetic information to determine splice sites.
+Search for "regulation of RNA splicing" to find more.
+
+Ligand: a molecule that can interact.
+
+Homework:
+
+* Might want to do DNA Algorithms class on Coursera.
+* Do at least one of <https://ocw.mit.edu/courses/biology/7-012-introduction-to-biology-fall-2004/exams/>
+* Look into supplementary material to find out if we were used in the journal club paper.
+
+## 2021-05-09
+
+Finally getting to the practice tests and the Coursera course. This week has
+been crazy.
+
+Started with practice test 2. Some parts were easy, some others I vaguely
+recall but am going to need to use the book to refresh myself on.
+
+## 2021-05-16
+
+Trying the scavenger hunt Emily gave me.
+
+There seem to be a LOT of results. Maybe I should try limiting them by date?
+Tried to figure that out, looks like there's a way to do it with an "Entrez
+Query". <https://www.ncbi.nlm.nih.gov/books/NBK3837/> is the handbook that
+describes those queries. I think it should be something like:
+
+ 2000/1/1:2020/5/5[Publication Date]
+
+Not sure if there's a standard place to find the location, or if it's always
+just randomly mixed into the text and you have to figure it out.
+
+## 2021-05-19
+
+Lesson.
+
+Chatted about the COVID scavenger.
+
+Question from last time: how does DNA replication terminate? It goes until the
+telomeres every time — it never partially replicates.
+
+## 2021-05-23
+
+Caught up on the coding side of the Coursera course. Redid the gnuplot
+interface in my Lisp utilities to match what the gnuplot book recommends, and
+it's working pretty well.
+
+## 2021-05-24
+
+Lesson.
+
+CD34+ cells: CD34 is a protein present only in young stem cells, used very often
+as a marker to identify them.
+
+Flow cytometry:
+
+1. Pick 1 or more markers (proteins on the surface of a cell).
+2. Buy antibodies that will bind to those markers, which have a fluorescent tag
+ tag attached to them.
+3. Draw blood.
+4. Lyse the red blood cells with a special reagent.
+5. Mix in the antibodies. Antibodies bind to the cells of interest.
+6. Use a microfludics thing to pipe cells one at a time through a channel.
+7. Use lasers to excite the fluorescent tags.
+8. Measure cell counts (and abundance of the marker on the individual cells!).
+
+## 2021-06-01
+
+Flow cytometry presentation. FCS is a super interesting tool that I didn't
+really know about before. Interesting tidbits:
+
+* About grill/refrigerator sized, $100k to $200k+ (depending on number of
+ lasers, more = better).
+* You can not only count cells, but also *sort* them on the fly.
+* It analyzes basic non-wavelength stuff like the size of the cell (by shadow
+ size) and density (by how much light is scattered at various depths in the
+ cell).
+* It also analyzes wavelength-specific stuff to figure out which
+ receptors/antibodies got dyed.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bio/essential-cell-bio/01.markdown Mon May 08 13:46:28 2023 -0400
@@ -0,0 +1,172 @@
+Chapter 1
+=========
+
+Questions
+---------
+
+### Q 1-1
+
+* Person: 7
+* Potato: 7
+* Vacuum: 3
+
+### Q 1-2
+
+No, because a mistake in producing a shoe will usually result in it being thrown
+out. In a rare case of a beneficial mistake, the one who notices the mistake
+(the eventual wearer) is probably not in a position to give feedback that would
+let the improvement be incorporated back into the shoes, unless they happen to
+work at the factory (unlikely).
+
+### Q 1-3
+
+It doesn't prove you've generated life, it could easily be contamination from
+airborne microorganisms. You would need to filter or at least sterilize the air
+first to remove microorganisms.
+
+Pasteur did it by using long, bent-neck flasks, assuming that the organisms
+wouldn't be able to fight gravity and make it up the neck of the tube. I guess
+it worked, but that seems iffy to me — I could easily imagine an errant air
+current managing to push a dust speck into the neck.
+
+### Q 1-4
+
+ 10e-12 g * (N₀ * 2^(t/20 min)) = 6e24 kg
+ 10e-12 g * ( 1 * 2^(t/20 min)) = 6e27 g
+ 2^(t/20min) = 6e27 g / 10e-12 g
+ 2^(t/20min) = 6e38
+ log₂(2^(t/20min)) = log₂(6e38)
+ t/20min = 128.818
+ t ≈ 43 hours
+
+ set logscale y
+ set xlabel "Time (hours)"
+ set ylabel "Biomass (kg)"
+ plot [0:44][] ((2**(x/0.333333)) * 10.0**-15)
+
+It's not a paradox because in reality the bacteria are resource-constrained long
+before they approach the mass of Earth.
+
+### Q 1-5
+
+Certain substances are harmful if they get into the cell, while others are
+helpful (and some can be either, depending on what else it happening in the
+cell). Having a membrane that can selectively import specific substances allows
+a cell to keep out the bad stuff but allow the good stuff.
+
+### Q 1-6
+
+Light microscopy is cheaper, usually less labor intensive, and allows studying
+of living cells. Electron microscopy allows much higher
+magnification/resolution.
+
+* Living skin cell: must be done with light, electron cannot image living cells.
+* Yeast mitochondrion: probably requires the magnification of electron.
+* Bacterium: light is probably sufficient.
+* Microtubule: required electron magnification.
+
+### Q 1-7
+
+Many genes are similar between humans and yeast. Studying a particular gene in
+yeast can be many times more efficient than trying to study it in humans, so her
+money goes much further.
+
+### Q 1-8
+
+A. Cytosol: subset of the cytoplasm that is not contained inside an organelle.
+ A gel where many of the chemical reactions inside the cell happen.
+B. Cytoplasm: everything inside the cell membrane and outside the nucleus.
+ Does… lots of different things.
+C. Mitochondria: small organelles where cellular respiration happens ("the
+ powerhouse of the cell"). Originated from a bacteria being engulfed by an
+ ancestral cell (Margulis' endosymbiont hypothesis). Have their own genomes,
+ though many genes have been lost/relocated to the nuclear genome. The
+ compatibility between a mitochondrial genome and nuclear genome is important:
+ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4981548/
+D. Nucleus: membrane-surrounded organelle that contains the cell's DNA. Site of
+ DNA→ RNA transcription and DNA replication. Only found in eukaryotes (by
+ definition).
+E. Chloroplasts: plant organelles where photosynthesis happens. Thought to be
+ another endosymbiont. Also have their own genomes. Is chloroplast/nuclear
+ genome compatibility also important?
+F. Lysosomes: small organelles that digest food particles and break down other
+ unwanted particles.
+G. Chromosomes: packaged DNA (includes e.g. histones, etc). Condense during
+ cell division so they can line up along the metaphase plate and microtubules
+ can attach to the kinetochores and pull apart/sort them.
+H. Golgi Apparatus: organelle that modifies/packages things to be moved around
+ inside of or out of the cell.
+I. Peroxisomes: small vesicles used for reactions where hydrogen peroxide is
+ used to inactivate something toxic.
+J. Plasma Membrane: phospholipid bilayer separating the cell from the world.
+ Selectively permeable to things, studded with various things (e.g. channel
+ proteins, receptors, etc).
+K. Endoplasmic Reticulum: extension of the nuclear membrane. Partly studded
+ with ribosomes (rough ER), partly not (smooth ER). Site where parts of the
+ cell membrane and things destined for outside the cell are made.
+L. Cytoskeleton: three kinds of proteins (actin, intermediate filaments,
+ microtubules) in the cytosol that act as the internal structure/scaffolding
+ of the cell.
+M. Ribosome: small machine made of RNA and protein, used to translate RNA into
+ protein.
+
+### Q 1-9
+
+A. No, it's passed by DNA.
+B. Yes, bacteria are prokaryotes and don't have a nucleus.
+C. No, plants are eukaryotes.
+D. Pretty much always yes.
+E. No, the cytosol is everything that's NOT an organelle. The cytoplasm
+ includes the organelles.
+F. Yes.
+G. No, protozoans are unicellular.
+H. Yes (if you count digested food particles as "unwanted").
+
+### Q 1-10
+
+A. Nucleus
+B. Vacuole
+C. Cell wall
+D. Chloroplast
+
+### Q 1-11
+
+Actin
+Intermediate filaments
+Microtubules
+
+Muscle cells have a lot of actin.
+
+Skin would probably have more microtubules (helpful for communication).
+
+### Q 1-12
+
+ 1e6 * 2^(t/20) = 1 * 2^(t/15)
+ 1e6 = (2^(t/15) / 2^(t/20))
+ 1e6 = 2^(t/15 - t/20)
+ log₂(1e6) = log₂(2^(t/15 - t/20))
+ 20 ≈ t/15 - t/20
+ 20 ≈ t * (1/15 - 1/20)
+ 20 ≈ t * (1/60)
+ 1200 minutes ≈ t
+ 20 hours ≈ t
+
+ set logscale y
+ set xlabel "Time (hours)"
+ set ylabel "Population"
+ plot [0:24][] (1.0e6 * 2**(x/0.333333)), (1 * 2**(x/0.25))
+
+### Q 1-13
+
+Eventually one bacterium may acquire a mutation that resists the antibiotic.
+Selection pressure will rapidly cause its descendants to take over the
+population, at which point it continues growing normally.
+
+### Q 1-14
+
+ 10^13 = 1 * 2^(t/24)
+ 2^43 = 2^(t/24)
+ 43 = t/24
+ 43 * (24 hours) = t
+ 43 days = t
+
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bio/essential-cell-bio/02.markdown Mon May 08 13:46:28 2023 -0400
@@ -0,0 +1,126 @@
+Chapter 2
+=========
+
+Questions
+---------
+
+### Q 2-1
+
+Number of molecules in the ocean:
+
+ 1.5e24 cm³ = 1.5e24 g = 1.5e24 g * 6e23 molecules = 9e37
+ ^^^^^^^^^^^^^^^^^^^^^^^
+ Not sure if this is right -- do I need the divide by the molecular weight of
+ water here?
+
+Picking up a mole of water means you have 6e23 shots at getting one of the
+original molecules, each with a (roughly) 6e23/9e37 chance (1/150000000000000)
+of success, or 149999999999999/150000000000000 chance of failure:
+
+ 1 - ((1 - 6e23/9e37) ^ 6e23)
+
+ set logscale x
+ plot [1:600000000000000000000000][] (1 - (0.9999999999999933 ** x))
+
+The chance is almost 100%.
+
+### Q 2-2
+
+A. Number: 6. Weight: 12.
+B. 6.
+C. 4. It can form covalent bonds with up to 4 other atoms.
+D. It has 2 additional neutrons. It will eventually radioactively decay, but
+ otherwise acts like Carbon.
+
+### Q 2-3
+
+This seems like a reasonable statement to make.
+
+### Q 2-4
+
+True. In water molecules the O atom has a partial negative charge δ⁻, so it
+will be attracted to the positively charged Na⁺ cations and will repelled from the
+negatively charged Cl⁻ anions.
+
+### Q 2-5
+
+Yes, at a concentration of 10⁻⁷ M/L. They're formed by protons moving between
+water molecules.
+
+### Q 2-6
+
+Using two separate types of amino acids would require separate versions of all
+the various translation machinery for no benefit, and it's simpler to just use
+one kind of machinery for everything.
+
+### Q 2-7
+
+The polarity of a chemical bond is referring to electrical polarity (+/-). When
+the word is used for a polypeptide it's just referring to the general
+asymmetry/directionality of the different ends, not electrical charge.
+
+### Q 2-8
+
+Having a single type of reaction mechanism means enzymes can be reused for
+multiple purposes, instead of having to build many different types of metabolism
+pathways. Condensation/hydrolysis is nice because water is already so abundant
+in the cell.
+
+### Q 2-9
+
+Covalent bonds are much harder to break apart when finished than noncovalent
+bonds.
+
+Because covalent bonds are very strong, a single covalent bond might be enough
+to hold macromolecules together. But this is not necessarily a good thing
+— requiring multiple noncovalent bonds allows more specificity and lets
+a molecule be more selective about the other molecules it binds to.
+
+### Q 2-10
+
+A. Usually, although H sometimes has no neutron.
+B. If it does, it's an ion (an anion).
+C. True.
+D. No, different isotopes have different number of neutrons.
+E. True (but is it the *only* factor? not sure).
+F. True.
+G. True.
+
+### Q 2-11
+
+C 12
+H 2
+O 16
+Total 30
+
+A. 12/30 * 5 * 6e23 = 1.2e24
+B. I'm not really sure what they're asking here.
+C. 2e-10 * n = 0.07e-3
+ n = 350,000
+
+### Q 2-12
+
+A. 4, 8, 8
+B.
+ Gain Lose
+ Helium 0 0
+ Oxygen 2 6
+ Carbon 4 4
+ Sodium 7 1
+ Chlorine 1 7
+
+C. Helium is not very reactive. Sodium and Chlorine are well suited to bonding
+ with each other.
+
+### Q 2-13
+
+Oxygen is much more electronegative than Sulfur (3.44 > 2.58). This means H₂O
+is a much more polar molecule than H₂S, so water molecules have much more
+hydrogen bonding between themselves which helps keep water liquid at higher
+temperatures than H₂S.
+
+### Q 2-14
+
+Glycine + Alanine
+H₅NC₂O₂ + H₇NC₃O₂ = H₂O + H₁₀N₂C₅O₄
+
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bio/essential-cell-bio/03.markdown Mon May 08 13:46:28 2023 -0400
@@ -0,0 +1,120 @@
+Chapter 3
+=========
+
+Questions
+---------
+
+### Q 3-1
+
+It would probably not happen in a single step, because sugars have the form
+(CH₂O)ₙ. The smallest sugars have n=3, and so multiple steps are required to
+build the sugar from individual H₂O and CO₂ molecules.
+
+Heat must be generated because sugars are more complex molecules than the
+components. As order increases, the overall system must gain disorder in the
+form of heat.
+
+### Q 3-2
+
+A. Yes
+B. No
+C. No
+D. Yes
+E. No
+
+### Q 3-3
+
+A. ΔG depends on the starting proportion of heads and tails. If there are
+ unequal amounts, there will be a tendency for the box to end up closer to
+ 1:1. The more unequal the starting proportions the larger magnitude ΔG will
+ be. ΔG⁰ is zero, because H → T and T → H are equally likely.
+
+B. The temperature is how vigorously you shake the box. The activation energy
+ is the minimum force needed to possibly flip a coin. The enzyme would be
+ something that lowered the minimum force needed to flip a coin (e.g. it might
+ make the coins lighter).
+
+### Q 3-4
+
+Enzymes would reduce the activation energy required at each step, but wouldn't
+change the starting or ending values at all.
+
+### Q 3-5
+
+It's probably limited by the concentration of the reactants.
+
+The area under the curve for the catalyzed portion should be 10,000,000 times
+the area under the uncatalyzed portion.
+
+### Q 3-6
+
+Enzymes don't catalyze a reaction in a single direction, they catalyze it it
+*either* direction depending on the relative concentrations of the
+reactants/products.
+
+### Q 3-7
+
+A. Rocks: ATP
+ Debris: ADP + Pi
+ Bucket Bottom: X
+ Bucket Top: Y
+
+B. Rocks hitting the ground without machine: ATP → ADP + Pi + lots of heat
+ Rocks hitting the ground WITH machine: X + ATP → Y + ADP + Pi + some heat
+
+### Q 3-8
+
+ΔG depends not only on ΔG⁰ but also on the concentration of the reactants and
+products in the cell:
+
+ ΔG = ΔG⁰ + RT ln([X]/[Y])
+ ΔG = ΔG⁰ + RT ln([ADP]*[Pi]/[ATP])
+
+If the proportion `[ADP]*[Pi]/[ATP]` is less than 1 (i.e. there's more ATP in
+the cell than ADP/Pi) then `ln([ADP]*[Pi]/[ATP])` will be negative and will
+increase the amount of free energy liberated when the reaction happens.
+
+A range is given because the exact concentration of ATP, ADP, and Pi in the cell
+is not always constant.
+
+### Q 3-9
+
+A. No, this is cellular respiration.
+B. Yes, this is the second stage of photosynthesis.
+C. No, once you have the nucleoside triphosphates you're all set.
+D. Yes, cellular respiration is used to "charge up" the carriers like ADP into ATP.
+E. Yes, this is an example of D.
+
+### Q 3-10
+
+A. No, they might slow to a crawl but they won't stop completely.
+B. No, the "high energy" here is referring to how much energy is release from
+ giving up the electron, not how much energy the electron itself has.
+C. Yes, roughly (109 is roughly 2 * 54)
+D. If it's oxidized it's lost some of its electrons, so… not sure.
+E. Yes, e.g. CoA.
+F. No, this is true in the presence of oxygen but not necessarily true everywhere.
+G. What?
+H. No, constants are constant. But the equilibrium constant of the chain of
+ reactions *as a whole* will be shifted when compared to X → Y.
+
+### Q 3-11
+
+ 4.2 kJ/mole * 2 = 12.6 kJ/mole
+
+According to the table, the ratio of X to Y will be a little larger than 10².
+
+Doubling the number of hydrogen bonds would double the energy, and result in
+a ratio a little more than 10⁴.
+
+### Q 3-12
+
+### Q 3-13
+
+True:
+
+ ΔG = ΔG⁰ + RT ln([X]/[Y])
+
+If X is more prevalent than Y, then x/y will be more than 1 and the log will be
+positive. If Y is more prevalent than X then x/y will be between 0 and 1 and
+the log will be negative. RT is always a positive constant.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bio/intro.markdown Mon May 08 13:46:28 2023 -0400
@@ -0,0 +1,74 @@
+# About Me
+
+Graduated in 2008 with a BS in Computer Science.
+
+Worked in the tech industry:
+
+* 1 year at telecom veal farm.
+* 2 years at small web development startup.
+* 2 years at a non profit (started working remotely).
+* 2 years at a bank.
+
+Simple got acquired, I got some money. Decided to go back to school. Packed up
+everything and moved to grad school for two years. Graduated with an MS in
+Computer Science.
+
+Moved back to Rochester. Got my old job at Simple back.
+
+Pretty soon it was clear that Simple had changed and I was miserable. Started
+looking around for new jobs.
+
+Eventually I found a job posting at a biotech company (10x). I didn't have any
+biology background but that was okay for them. Joined in 2018 and I've been
+there for about two years.
+
+# About Them
+
+# Learning Biology
+
+First, definitely want to learn more about biology. I've been learning on my
+own via:
+
+* Khan Academy
+* Textbooks
+ * Lots of pop science books
+ * Campbell Biology
+ * Genomics books from book club
+* A few partial Coursera courses
+* Two classes at RIT
+ * Bioinformatics 101
+ * High throughput sequencing
+
+Wondering what the next steps are. Microbiology of the Cell? Coursera classes?
+Something else?
+
+# Finding New Things
+
+How do people find areas that interest them? Genomics interests me, but I don't
+know if that's just because I'm working at a genomics company and that's what
+I have the most exposure to. I see advice like "find a lab whose work
+interests you" and "keep an eye on new papers" — for CS/programming I know how
+to do that, but with biology I don't really know where to start.
+
+# School?
+
+I like working at 10x because even if I'm not doing super sciency stuff myself,
+I still get to work on things that are used by e.g. cancer researchers.
+
+With that said, I really enjoyed going back to school a few years ago. I like
+learning new things a lot. I enjoyed the research end of things too, but didn't
+feel like it was very meaningful to humanity at large (esoteric CS topic).
+
+Learning a lot about biology and been really interesting to me, and I could see
+myself going back to school one more time for a PhD. Is this completely crazy?
+
+I have a BS and MS in computer science — would I have to do a full BS/MS in
+biology or bioinformatics first before being able to do a PhD in comp bio or
+bioinformatics? Or is it possible I could learn enough biology on my own over
+the next year or two and lean on my programming/CS experience in a comp
+bio/bioinformatics setting?
+
+I was lucky enough to join 10x when it was significantly smaller and get some
+stock options. I'm golden handcuffed to them for the next year at least, but
+after that I have more flexibility.
+
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bio/intro/2021-08-xx.markdown Mon May 08 13:46:28 2023 -0400
@@ -0,0 +1,41 @@
+Try to find places where biologists hang out.
+
+Email graduate admissions and just ask them.
+
+Look into more background on protein structure.
+
+
+-------------------
+
+Consider taking a few classes at RIT.
+
+Search around for schools with bioinformatics/computational biology programs.
+Read their papers and find things that seem interesting. Look over their code
+too.
+
+Look through recent journals and try to pick a couple of recent papers that seem
+approachable.
+
+-------------------------------------------
+
+
+Try branching out a little bit -- look at immunology, virology, microbiology.
+
+Start learning cell biology next.
+
+
+-------------------------------------
+
+Find the top 5 people/grad schools whose work sound interesting.
+
+Find faculty members look at lab websites, look at open positions and projects.
+
+----------------------------
+
+Get transcripts, see if any of my previous credits would transfer into a BS/MS.
+Might be able to cut some years off a BS.
+
+* MIT OpenCourseware
+* Harvard extension school
+* Start looking into cell and molecular biology
+* Biotechnology class