January 2020

2020-01-03

Converted paste.stevelosh.com to HTTPS. Forgot it in the previous conversion.

2020-01-14

Moved all my repositories from Source Hut to hg.stevelosh.com. Wasn't expecting to have to shave this yak so soon, but such is life. On the plus side: pushing is so much faster now. Rough timing for an empty push was ~5 seconds on Source Hut, down to around a second now. No more sitting and staring at the command line waiting to get back to work.

2020-01-15

Finished (mostly) reskinning hg.stevelosh.com. Had to shave a couple of yaks in Mercurial itself (diff headers getting truncated) and the hgweb Markdown extension. Ended up just completely forking the extension and tearing out almost all of its guts. The result should (hopefully) be easier to maintain as Mercurial breaks internal shit over time.

Still need to finish the CSS for the rendered Markdown, but that can wait for now.

2020-01-17

Got the RIT VPN set up on my Linux machine so I can connect for class. I was worried I'd have to install some horrible Cisco VPN client thing, but it turns out you can use the openconnect client to connect, so sudo apt install openconnect and then sudo openconnect vpn.rit.edu just magically works. Fantastic.

Finished the hg.stevelosh.com reskin on my lunch break. Just had to get the READMEs looking decent.

Finally figured out a strategy for syncing a small subset of my dotfiles onto remote servers easily, so I don't have to be completely miserable on them. Goals:

Ended up making a remote/ folder in my dotfiles. Almost everything inside it is a symlink except for a bootstrap.sh. The folder gets rsync --copy-linksed over and then bootstrap.sh will symlink all the resulting files into their places. Hopefully this should be easier to maintain. Did a bunch of cleanup on my bash and fish setups while I was at it.

Still need to figure out how to get a decent Vim setup remotely. Syncing all my plugins is probably not ideal, but maybe. Who knows.

2020-01-18

KA biology and coffee.

Cellular respiration:

Did the TRIE problem in Rosalind. Making an actual trie was easy, but getting an adjacency list for the flattened/copypasteable output and using cl-dot to graph it nicely was fiddly. Annoying bits about cl-dot to remember for next time:

Cleanup up the styling of hg.stevelosh.com a bit more. This time it was the <code> tags inside <li>'s that weren't styled. I styled <p><code> before, because I wanted to avoid applying those styles to <pre><code> content (otherwise it'd have double borders all over the place). But that missed out on the <code> tags that weren't inside paragraphs. Ended up just styling all <code>s and then undoing the borders in <pre><code>. Sigh.

Also realized that the repository descriptions served by hgweb can contain arbitrary HTML. Neat. Used that to clean a few more things up.

Updated all the Bitbucket repository READMEs to point to hg.stevelosh.com. The Great Source Hut Yak Shaving of 2020 is finally complete, thank christ. Time to move on to doing actual work.

2020-01-19

Finally got around to setting up packages for all of my Rosalind problems. In all my "coding challenge" repos (Rosalind, Project Euler, Advent of Code, etc) I originally just shoved everything into one big package. This was easy to start with, but as I solved more and more problems I started to hit naming conflicts more and more often. Eventually I realized that for my own sanity I should really give each problem its own package, with all the utilities in a utils package. I did this for my Advent of Code repo a while back and while it was a pain in the ass to refactor everything, the result is much nicer. Today I refactored the Rosalind repo. The Project Euler repo is a work in progress, mostly because it's so much bigger than the other two.

Did Rosalind problem PDST. Pretty easy. Most annoying part was the output, but I extended my float printer to handle 2D arrays to make it easier to output what they want in the future.

2020-01-20

Wanted to do a Rosalind problem over lunch. On a whim I decided to use CCL just to keep myself honest — I usually use SBCL, so using another implementation now and then helps ensure I'm writing standards-compliant code. But apparently a recent Vlime commit broke Vlime in CCL. So I had to shave that yak first. Sent a PR.

Did Rosalind problem ASPC. The problem itself was trivial. Made a handy iterate driver to make it even more trivial.

Running the test suite in CCL gave me exactly what I wanted: examples of where I was unintentionally relying on SBCL's behavior. Today there were two:

  1. Relying on the traversal order of iterate's :in-hashtable (which uses with-hash-table-iterator) to get the output of TRIE. Made a helper to convert the hash tables to sorted alists to fix the output.
  2. Relying on the ~,…F format directive having a half-up rounding strategy. Turns out that when format is rounding a float to a given number of decimal places, the implementation is allowed to round half however it wants. SBCL seems to round half up (which matches what Rosalind shows in their examples, which is why I never noticed) but CCL seems to use banker's rounding. Had to do an ugly hack in float-string to make the output consistently match Rosalind's style everywhere.

2020-01-22

Started doing Rosalind problems in shell/Awk, to join in with a group of folks at work doing them for the first time. The first few problems are simple, but I still managed to learn a few tricks by implementing them with UNIXy tools.

Fixed a few remaining broken links on my site. It's been a week or two since I poked around at it, and that was on another machine, but ./deploy.sh Just Worked. It's so nice to have a site generator I designed myself that's not going to change out from under me when I just want to fix something simple and move on with my life.

2020-01-23

Lesk book exercises, chapter 1:

  1. Average density would be (/ 3r4 3r9) = 1/100000 or 1 gene every ~100kb.
  2. Parts:
    • A: Two humans would have roughly 250k differences.
    • B: A human and a chimpanzee would have roughly 3m differences.
  3. Parts:
    • A: Average density would be 2 SNPs per 1000 base pairs.
    • B: Is this a trick question? I think 1.1% of the differences would be in protein-coding regions.
  4. Parts:
    • A: I am confused. The glossary defines a "haplotype" as a group of closely-linked genes that are typically inherited together. So… 1 haplotype? But the text talks about a haplotype as being a combination of SNPs in a recombination-poor region. So if "haplotype" means the combination of SNPs, not a set of genes, then I think this would be 4^10 = 1048576.
    • B: On a diploid chromosome, you'd have two separate sets of SNPs to combine (and order doesn't matter), so I think it's 4^10 * 4^10 / 2.
    • C: (1 SNP / 5 kb) * (100 kb) = 20 SNPs = 2^20 sequences = 1048576 sequences
  5. TODO
  6. TODO
  7. TODO
  8. A much less drastic change, though it could still cause issues if it's at a critical point in the RNA molecule (e.g. the anticodon in tRNA).
  9. TODO
  10. If the gene Kangaroo coat color is not on the X chromosome, you wouldn't be able to get the calico pattern that comes from X chromosome inactivation.
  11. Parts:
    • A: No, because if it were a mutation in the gene itself it would act by affecting the structure of the protein, and should have the same effect in adults and children. But it doesn't, so it's more likely to be a mutation in regulatory region.
    • B: It's not on the Y chromosome?
  12. TODO
  13. TODO
  14. TODO
  15. TODO
  16. TODO
  17. TODO
  18. TODO

2020-01-24

Solved the FIB Rosalind problem in my shell/Awk repo. Tried doing it in straight bash, but I couldn't figure out how to get the bash function to use a global bash array for memoization after like 10 minutes of searching. Ended up rewriting it in Awk, which took two minutes, was simple, and runs fast. Great.

Solved MMCH in my Lisp repo. Much like the previous problem in this line, it was way easier than they made it seem. Also fixed the Uniprot cache to use the filesystem instead of just keeping a hash table, to save a few seconds of my life every time I (run-tests) in a fresh Lisp session.

2020-01-25

Rebooted my Mansion repo to try to simplify it enough that I can start dogfooding it for actual use, then turn it into something publicly-usable.

2020-01-27

Trying to decide on data to use for the BIOL-650 project. A list of possibilities:

2020-01-29

Figured out how to get the mutt HTMLified email working on my local machine, like I had done for work. Had to install NeoMutt from source because the Ubuntu version is too old to include group-alternatives:

More searching around for projects for HTSA:

https://www.sciencedirect.com/science/article/pii/S2352340918302920
doi 10.1016/j.dib.2018.03.079
Illumina HiSeq 2500
https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA432903

https://www.sciencedirect.com/science/article/pii/S2352340919303750
doi 10.1016/j.dib.2019.104022
Ion Proton sequencer
https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=506922

Started going through the gnuplot book at night. Going to try to do that in parallel with the class.

2020-01-30

Drove to RIT only to discover class was not being held in person, and we're supposed to work online instead. Welp.

Installed FastQC locally and got it running. Really just downloaded the binary version and symlinked it into place.

Downloading the actual data is taking a while. While I wait, thinking about disk setups: now that I have a NAS and have moved my local backups to it, I can format my previously-for-backup internal drives and use them for actual data. I really need to spend the time to grok how Linux formats and mounts drives — I've been flailing and slapping things together and it has mostly worked until now, but I really should just learn it. Ordered a copy of the new edition of the sysadmin book that I remember being really good from my Dumbwaiter days. Hopefully that will be a good reference.

Finished the FastQC stuff early, so spent a few minutes going through the gnuplot book. Random notes:

Finally got around to writing a first draft of a lines utility. Discovered and fixed a stupid bug in Adopt in the process.

2020-01-31

Now that I've got a more thorough backup system, went through and did my yearly pruning of unnecessary crap on my hard drive, and adding restic excludes to avoid backin up track. Number of scans was cut drastically:

Much cleaner now.

February 2020

2020-02-01

Fixed up my $PATH in my bash profile, for when I'm forced to use bash.

Also had to fix up the uses of set_color in my Fish shell config to only happen for interactive connections, because Fish shits out some errors when you call set_color on a noninteractive connection (e.g. when using scp or rsync). Sigh.

Now that I've got my backups pushed to my NAS (plus off-site) I can repurpose those 4 TB drives sitting empty in my tower. Gonna use them for a data dump site on the Linux side of things, e.g. for all the FASTQs and resulting data I'll be dealing with for my class this semester. Read through the sysadmin book chapters on the filesystem which were super helpful. Decided to pair the drives as a btrfs raid 0 array. I'm not super concerned with durability here — everything should be reproducible from the original raw FASTQs from SRA if I'm doing my job right, so I may as well just have one giant pool of storage. Also I don't need this /dump to be encrypted — it'll be handling non-sensitive data.

To get them ready for btrfs I used:

Even with all that, mkfs.btrfs complained about a partition table still existing on one of the drives. At that point I gave up and passed -f to it, and everything Just Worked. In hindsight, I bet I could have just done that to begin with and let mkfs.btrfs handle blowing away everything, but it was still a fun exercise.

Attempted to figure out how to acquire MS Word for the class. Apparently there's a web version of Word now? If I can use that, at least I won't have to spin up a full Windows VM to write the papers…

Installed sra-toolkit from Ubuntu's repository. Unfortunately it doesn't seem to come with fasterq-dump, only the older fastq-dump, at least according to dpkg-query -L sra-toolkit. Installed the latest binary from their site instead.

Poked around at both fastq-dump and fasterq-dump. fasterq-dump seems more documented and user friendly, and is definitely faster. The biggest speedup came from telling it to use RAM as its temporary filesystem with -t /dev/shm, which cut the time down to 1/3. Guess all this RAM is actually good for something.

After rebooting later, the boot process was suddenly hanging for over a minute. Another fucking yak to shave. Figured out how to disable the goddamn splash screen (edit /etc/default/grub to remove quiet and default and then update-grub2) so I could get some logs. Eventually saw some systemd horseshit about a job is running for dev-disk-by\x2duuid-…. Tried to figure out which drive had that UUID, but none of them did. After a bunch of flailing in fstab I realized that it might be the UUID of the old backup drive I killed. Sure enough, there was still an entry in /etc/crypttab for it, and a bunch of systemd shit in /run/systemd/generator/ left over that I had to clean out. After that, everything boots fast(ish) again. I hate computing.

2020-02-02

Spent some time cleaning up the Makefile for my individual class project. Learned some stuff about make in the process:

Finished writing up my class project pick paper.

Did more in the gnuplot book.

Updated my lines tool to add a --limit option and handle broken pipes. I could have sworn I handled broken pipes in some other Lisp script at some point, but can't seem to find it now. I used sb-int:broken-pipe but I'm not sure if that's something I should be relying on. Need to ask in #sbcl.

2020-02-04

Class today. Did a bunch of various things after the lecture (I hope I can remember it all).

Got the group permissions sorted out on the server in two steps:

Once that was sorted out, started diving into aligning the Trapnell data with tophat and bowtie. This is… not pleasant.

First I need to install both tools. There's an Ubuntu package for tophat, but it wanted to install NodeJS as a prerequisite. Why on earth does tophat want NodeJS? SIGINTed that immediately and downloaded the binary version instead.

The wrapper script tophat2 tries to set up $PATH magically so everything just works:

#!/usr/bin/env bash
pbin=""
fl=$(readlink $0)
if [[ -z "$fl" ]]; then
   pbin=$(dirname $0)
 else
   pbin=$(dirname $fl)
fi
export PATH=$pbin:$PATH
$pbin/tophat "$@"

Unfortunately it doesn't try hard enough. The readlink is attempting to deal with symbolic links there, but only works with absolute links — relative symlinks (e.g. ln -s ./tophat-…/tophat2 tophat2) just return the relative path, which then gets passed to dirname and treated as relative from whatever directory you happen to be in. And let's ignore all those unquoted variable expansions I guess. Sigh. Better version:

#!/usr/bin/env bash
pbin=""
fl=$(readlink -f "$0")
if [[ -z "$fl" ]]; then
    pbin=$(dirname -- "$0")
 else
    pbin=$(dirname -- "$fl")
fi
export PATH="$pbin:$PATH"
"$pbin"/tophat "$@"

Once it's installed, I can finally run it. This is where things get even more confusing. After a whole bunch of flailing I think I managed to narrow down what the hell all these magic parameters are, and how things have changed since the paper was written. Notes:

That's as far as I got today. Next step is to clean up this mess and get it into the Makefile.

2020-02-06

Refactored the Makefile and added in all the alignment stuff. Wrote up the weekly report far too late.

2020-02-08

Decided to try using gnuplot to do the graphing exercises in the Campbell Biology textbook, to combine the things I've been reading lately.

The first exercise I did was from chapter 9. Data:

Low        4.3
Normal     4.8
Elevated   8.7

Gnuplot file:

set title "{/:Bold*1.5 Thyroid Hormome Levels and Oxygen Consumption}"
set xlabel "Thyroid Hormone Level"
set ylabel "Oxygen Consumption Rate [nmol O_2 / (min * mg cells)]"

unset key
set boxwidth 0.8

plot [][0:10] "bio/thyroid" u 0:2:xtic(1) w boxes fillstyle solid fc rgb "black"

That was pretty simple. Then I went back to a previous chapter. Data:

# Time (min)  Concentration of inorganic Phosphate (μmol/mL)
0               0
5              10
10             90
15            180
20            270
25            330
30            355
35            355
40            355

Gnuplot file:

set title "{/:Bold*1.3 Glucose-6-Phosphate Activity in Isolated Liver Cells}"
set xlabel "Time (minutes)"
set ylabel "Concentration of ℗_i (μmol/mL)"

set xtics format "%h min"

unset key

set grid xtics ytics

set lmargin 12
set rmargin  6

plot [0:45][0:400] "liver-glucose" w lines linewidth 2

This one was a little trickier. I was trying to play around in the gnuplot REPL, but pasting the Unicode characters was producing garbage at the gnuplot command line. Pasting into a normal terminal/fish shell worked fine. After much flailing I tracked down the problem to gnuplot itself:

At this point I just gave up on using Unicode at the REPL and created .gp files that I then load. The GPL strikes again.

2020-02-09

Installed R Studio from the .deb. Needed to apt install r-base first to get the dependencies. It installed… a lot of stuff. Let's hope it hasn't wreaked too much havoc.

Moved on to differential gene expression with cuffdiff. Finally starting to understand what the actual data files are. The input FASTQs are:

GSM794483_C1_R1_1.fq.gz
GSM794483_C1_R1_2.fq.gz
GSM794484_C1_R2_1.fq.gz
GSM794484_C1_R2_2.fq.gz
GSM794485_C1_R3_1.fq.gz
GSM794485_C1_R3_2.fq.gz
GSM794486_C2_R1_1.fq.gz
GSM794486_C2_R1_2.fq.gz
GSM794487_C2_R2_1.fq.gz
GSM794487_C2_R2_2.fq.gz
GSM794488_C2_R3_1.fq.gz
GSM794488_C2_R3_2.fq.gz

The C here apparently stands for "condition". I still have no idea what R[123] is trying to tell me. The _[12] is the paired-end read pairs. Would it kill people to name things more descriptively and/or include a README that explains what all their one-letter abbreviations mean?

Thinking about restructuring the filesystem layout to make the relationships between the FASTQ files more explicit:

C1/
    R1/
        GSM794483_C1_R1_1.fq.gz
        GSM794483_C1_R1_2.fq.gz
    R3/
        GSM794484_C1_R2_1.fq.gz
        GSM794484_C1_R2_2.fq.gz
    R3/
        GSM794485_C1_R3_1.fq.gz
        GSM794485_C1_R3_2.fq.gz
C2/
    R1/
        GSM794486_C2_R1_1.fq.gz
        GSM794486_C2_R1_2.fq.gz
    R2/
        GSM794487_C2_R2_1.fq.gz
        GSM794487_C2_R2_2.fq.gz
    R3/
        GSM794488_C2_R3_1.fq.gz
        GSM794488_C2_R3_2.fq.gz

But I might be too lazy at this point to do it.

The more I work with make the more I realize how limited it is. The general idea behind it (computing a dependency graph and using it to rebuild things) is sound, but the implementation is a clusterfuck of string processing and horrific syntax. It really gives me the urge to create Yet Another Build System. Sigh.

Eventually cuffdiff finished. Time to move on. Tried to install cummerbund in R, but ran into a bunch of errors. Had to install numerous third-party dependencies outside of R:

Why on earth does installing cummerbund require a goddamn MySQL client?! This is completely nuts.

Ran into a giant pile of inscrutable errors even after installing all these bananas dependencies. Eventually tracked it down to the R version shipped in the Ubuntu package repository being too out of date to work with some of the required packages. I hate volatile software. Installed a newer version of R with this guide.

Finally managed to get cummerbund working and started graphing stuff. It mostly works, but my graphs don't really look quite like theirs. Need to ask the professor about this.

In the mean time, played around with using gnuplot to plot the data instead of R. Turned out to be not too bad, since cuffdiff already dumps its output in a textual format. Example of the plot comparing the expression levels of the various genes:

set logscale
unset key
plot "gene_exp.diff" using \
    "value_1":"value_2" lt 3 pt 1 lw 1, \
    x+(0.5*x) lt -1 lw 1 dt 2, \
    x-(0.5*x) lt -1 lw 1 dt 2

I also played around with adding some nicer default linetypes to my ~/.gnuplot using colors from ColorBrewer. Fairly happy with the result so far, but I'll need to play around with them more over time to see if they hold up.

2020-02-10

Started making some actual gnuplot scripts to draw the stuff from yesterday. Got a couple of scripts working, but went down a horrific rabbit hole.

First: the graphs produced by cummerbund are titled very poorly ("Genes" is not particularly helpful). I'm going to need to chat with the professor to try to decipher what these graphs are trying to show me.

I reproduced the scatter and volcano plots without too much trouble. Used the PDF plotter and came up with the simple printable line styles that should be okay.

Then I tried to reproduce the first graph, the "distribution of expression levels for each sample", and all hell broke loose. First of all, what is this graph even trying to show? The title is just "Genes", which is useless garbage. The y-axis is labelled "Density", the x-axis is pabelled log_10(FPKM), and the function used to create it is csDensity, so maybe this is a kernel density plot of the log10(FPKM) values? Well, I tried that, and I ended up with fucking spaghetti in gnuplot — there were hundreds of lines all overlaying each other instead of the expected single line. I spent like an hour dicking around online trying to figure out what the hell gnuplot was doing. Eventually, finally, I realized no one online was going to help me, and decided to break things down myself. I extracted the problematic column from the file and computed the log10 values myself in Lisp, and found an issue: some of them are 0, which means their log10 value is undefined. After removing those and replotting, I confirmed that those undefined values were what was causing all the discontinuities in the original gnuplot kernel density plot — instead of ignoring the values, it would start a new line every time it hit one. Christ. So then I cloned down the cummerbund repo to see how they were handling this. The code is… not particularly easy to read. I searched around more online, now that I realized the problem, and eventually found a post where someone notes that they add 1 to the FPKM values first, before taking the log, to avoid negative and undefined log values. So the x-axis label that says log_10(FPKM) is fucking lying — what's actually on the graph is log_10(FPKM+1). Once I plugged that in, everything works and the graph looks roughly like theirs. To hell with all this, I'm done for tonight.

2020-02-11

Class today. Chatted about various stuff. The XLOC identifiers aren't actual identifiers, they're just arbitrary IDs generated during the alignment process. That explains why I was seeing 0s when trying to plot the XLOC_012662, which was regucalcin in the paper — it's a different gene in my run, so I'd need to look up the gene in my own results. Also talked about P and Q values, and plotting each to find interesting cutoffs. I really need to refresh my memory on how to more easily make histograms in gnuplot.

Started reading A Curious Moon and got about 1/3 of the way through. An interesting approach to teaching SQL, though the writing voice is a little bit annoying. Still turning out to be a good book though.

Finished the gnuplot book. It's really, really good.

Finished my weekly report and homework for this week. Good to have a couple days of buffer. Figured out how to build the assignment zip file in a Makefile which was overkill, but "fun":

assignments/sjl7678-alignment.zip: $(ALIGNMENTS)
    rm -rf assignments/sjl7678-alignment
    mkdir -p assignments/sjl7678-alignment/
    cd assignments/sjl7678-alignment && mkdir -p $(KEY_PREFIXES)
    mcp 'data/02-alignment/*/align_summary.txt' 'assignments/sjl7678-alignment/#1/align_summary.txt'
    mcp 'data/02-alignment/*/logs/tophat.log'   'assignments/sjl7678-alignment/#1/tophat.log'
    cd assignments && zip -r sjl7678-alignment.zip sjl7678-alignment

I'm slowly becoming more and more disillusioned of make. Even though I'm only using it for what it's actually designed for (running commands to turn files into new files) it's still a tremendous pain in the ass to do anything nontrivial. The "scripting" support is completely batshit string-based munging, and even the basic definition of any dependencies beyond explicit lists of files is miserable.

2020-02-12

Spent some (far too much) time getting the Trapnell data into Postgres, figuring out how to ergonomically plot it with gnuplot, and figuring out how to do decent histogram and CDF plots. I'm not thrilled with what I came up with, but it'll have to do for now because it's already 11 PM.

2020-02-15

Tracked down the root cause of https://github.com/dimitri/pgloader/issues/1005 — need to write it up in a comment.

2020-02-16

Finished the make book. I'm becoming less and less enthusiastic about make the more I learn about it and see real-world examples of it.

Wrote up the pgloader issue debugging comment.

2020-02-18

Class today was a lot of chatting and planning ahead for the group project. Things to think about:

One more big question: need to figure out what the overrepresented sequences FastQC is complaining about are. I figured I'd do BLAST searches for them, so I wanted to get a list of all the overrepresented sequences. So I need to pull them out of the FastQC reports… which are HTML files… all on one line. Did some horrific greping to get what I wanted:

 sjl at ouroboros in /dump/karkinos/fastqs/qc
 ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' '  ' | cut -d' ' -f3 | sort | uniq -c
       2 CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
       2 CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
       2 CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
       2 CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
       2 CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
       2 CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG

 sjl at ouroboros in /dump/karkinos/fastqs/qc
 ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' '  ' | cut -d' ' -f3 | sort | uniq | xargs -ISEQ grep --files-with-match SEQ *.html | sort | uniq
 SRR6671104.1_1_fastqc.html
 SRR6671104.1_2_fastqc.html

Interesting — only one of the samples has this issue, and the same sequences appear in both sides of the pair. Immediately after doing this horrible grep incantation I remembered the .zip produced by FastQC and sure enough, there's a text file in it with the results in a much less awful format. But I still needed a janky awk incantation:

 ><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=1;i<=NF;i++) print FILENAME, $i }' */fastqc_data.txt
 SRR6671104.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences       warn
 SRR6671104.1_1_fastqc/fastqc_data.txt #Sequence Count   Percentage      Possible Source
 SRR6671104.1_1_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT        58594   0.18785731410785536     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC        39157   0.12554065004132323     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG        36886   0.11825963218388151     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT        34457   0.11047205297836592     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT        33273   0.10667604895229328     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT        31923   0.1023478349022949      No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt
 SRR6671104.1_2_fastqc/fastqc_data.txt >>Overrepresented sequences       warn
 SRR6671104.1_2_fastqc/fastqc_data.txt #Sequence Count   Percentage      Possible Source
 SRR6671104.1_2_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT        58938   0.18896020716948458     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC        39362   0.12619789736002668     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG        36752   0.11783001686336315     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT        34782   0.11151403043484702     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT        34471   0.11051693816110664     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT        31900   0.10227409495922085     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt
 SRR6671105.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences       pass
 SRR6671105.1_1_fastqc/fastqc_data.txt
 …

Some quick cuting and such and I've got what I need, in a slightly less horrible way. I guess.

><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=3;i<NF;i++) printf(">%s/%d %s\n", FILENAME, x++, $i) }' */fastqc_data.txt | cut -f1 | sort -k2 | uniq -f1 | tr ' ' '\n'
>SRR6671104.1_1_fastqc/fastqc_data.txt/0
CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
>SRR6671104.1_1_fastqc/fastqc_data.txt/5
CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
>SRR6671104.1_1_fastqc/fastqc_data.txt/1
CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
>SRR6671104.1_1_fastqc/fastqc_data.txt/4
CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
>SRR6671104.1_1_fastqc/fastqc_data.txt/3
CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
>SRR6671104.1_1_fastqc/fastqc_data.txt/2
CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG

Did a BLAST search for these sequences and they're all 18S Ribosomal RNA. Professor wasn't sure how that could have gotten into the mix, since the paper says they only used poly(A) RNA. Need to ask around at work to see if anyone has any ideas what the heck happened.

(This weirdness is only on sample C0. Unfortunately that's not the sample the paper identified as being low quality (C3) so that isn't the issue.)

Did a bit more on the Enceladus book. Decided to just read the rest, because the DB/Makefile crap is getting way too messy. I really need a sqlfmt tool, but every time I try to use smug to write a parser I get depressed.

Merged a GitHub PR into badwolf and tried to pull it down locally with hg-git, which fucking broke for the millionth time, with a lying error message blaming hg-prompt. Fucking fantastic. Updated hg-git (apparently there's a new repo on Heptapod since Atlassian is shitcanning BitBucket's Mercurial support) and now there are four heads on default. I have no idea which one I'm supposed to use. I picked the latest one arbitrarily and it seems to work, at least well enough to pull down the PR. Good enough.

2020-02-19

Finished the weekly report for class.

Fixed the scroll overflow of the code blocks in this site.

2020-02-22

Put together some initial structure for the group project. Got the files unzipped and FastQC run, so tomorrow I can dive into new stuff.

Tried to track down a human reference genome for use later. I think I got what I need, but the sites are… not very user friendly. Will confirm with the professor in class.

Started reading one of Tufte's books. Found a beautiful graph from a 1979 paper (10.1126/science.206.4421.987) with a gorgeous plotter font. Need to try to find a similar font to use myself.

2020-02-23

Did another graphing exercise from Campbell Biology in gnuplot for fun.

Figured out that dashtype '' in my style files was causing the line to show as solid, but the entry in the key was blank! Turns out dashtype solid is what I actually want. Also learned that I need to explicitly reset that when I want a solid line, otherwise the dashtypes I set in previous styles will carry over.

Refined my color plotting style to use dashed lines instead of lighter colors for types 6-10. Seems easier to read that way — the light colors were really light.

Figured out how to do linear regression in gnuplot, and made some helper functions to make it less annoying:

# lr  = linear regression (the function to fit)
# lrt = linear regression title

baselrt(mm, bb, mprecision, bprecision) = sprintf(\
        sprintf("%%.%dfx + %%.%df", mprecision, bprecision), \
    mm, bb)

lr(x) = m * x + b
lrt(mp, bp) = baselrt(m, b, mp, bp)

lr2(x) = m2 * x + b2
lrt2(mp, bp) = baselrt(m2, b2, mp, bp)

fit lr(x)  "plant-co2" using "CO2":"Corn"       via m, b
fit lr2(x) "plant-co2" using "CO2":"Velvetleaf" via m2, b2

plot [300:1050] \
    "plant-co2" u "CO2":"Corn"       w lp, \
    "plant-co2" u "CO2":"Velvetleaf" w lp, \
    lr(x)  t lrt(3,0)   lt 6, \
    lr2(x) t lrt2(3, 0) lt 7

Looked into why our first few bases are called at lower quality. Found this which says that it's because Illumina uses the first few bases to calibrate their quality score analyzer, and so they mark them as being less confident to account for the calibration process. Signed up for that forum (over bare-ass HTTP, sigh) to get the damn tech note PDF too.

Dug out the old AxiDraw to play around with https://anvaka.github.io/city-roads/?q=rochester%2C%20ny&areaId=3600176069. First problem: after installing the new AxiDraw version, connecting to the AxiDraw in Inkscape failed with Failed to connect to AxiDraw. Sigh. After much flailing, it turns out the problem was that something had pip installed the serial module, which is some kind of serialization/deserialization library, and so import serial was importing that instead of pySerial, which is the serial port communication library that happens to use the module name serial too. Ugh. So pip uninstall serial; pip install pySerial fixed that. Then I tried to get the Render > Hershey Text thing working again, but it just… did nothing at all. Looks like there's a second Hershey Text plugin, this time under the AxiDraw extension menu. I guess I'm supposed to use that one now? Who knows?

Played around with how to make a dot plot in gnuplot while the AxiDraw drew Rochester. First I grabbed the read counts for our various FASTQ files, pulled from the FastQC output files:

ffind _data.txt \
    | xargs grep 'Total Sequences' \
    | sed -Ee 's|./([^/]+)_fastqc/.*\t|\1 |' \
    | sort --stable -k1,1 -r \
    | sort -k2,2 -n \
    > seqs.txt

cat seqs.txt | head -n2
C3_1 21081615
C3_2 21081615

Then I can plot a dot plot with something like:

# https://twitter.com/stevelosh/status/1231434196213796864
set termoption font "Routed Gothic,12"

# I wonder if there's a way to capitalize stylistically
set title "READ COUNTS OF INDIVIDUAL FASTQ FILES"
set xlabel "READS (MILLIONS)"

# major x tics every 2 million, with 2 minor divisions per major (i.e. minor tics are every 1 million)
set xtics 2
set mxtics 2

set grid y nox

# left justify
set ytics left offset -2.2, 0.0

unset key

#                                     scale to millions by hand
#                                                       read FASTQ names from file
plot [20:35][-1:26] "seqs.txt" using ($2 / 1000000.0):0:ytic(1) lc rgb 'black'

And I get something like:

plot

Neat!

Hacked together some Awk to remove overrepresented sequences. But I don't think a simple grep -v approach works, because the two FASTQ files are expected to have the paired reads at the same positions in the file. So if we remove a read from one file but not the other, now all the reads are going to be offset. So we need to remove these reads a bit more carefully (really, we need tools that process the paired-end reads together). Need to think about this a little bit more.

2020-02-24

Figured out how to mount a directory on my home machine on my laptop with SSHFS, in advance for tomorrow in case I need it:

# mount
mkdir ~/foo
sshfs -o idmap=user sjl@server:/whatever ~/foo

# unmount
fusermount -u ~/foo

Downloaded STAR and got it running. Read through the manual which, despite the broken English, was fairly useful. Tried running it on the gzipped reference genome but it complains that it requires the genome to be uncompressed. Tried to trick it into working by psubbing zcat but that didn't work. Gave up and decompressed the damn reference.

Once the genome index was built, started aligning the reads. It failed right away complaining about a zero-length read. zcat | greping showed the read as being there in both files, but I bet this was an issue of the pairs getting out of sync because we grepped out the overrepresented sequences naively. Tried restarting the alignment on the original data and it's still going, so that's probably it. Need to figure out how to filter those bad reads properly tomorrow.

2020-02-25

Class. Chatting about QC and such.

Professor says he asked around and people haven't seen the first 5 bases being lower quality before, but that the explanation from Illumina makes sense, and that we should not trim those bases unless they seem to be causing issues. Also talked about how to filter out the overrepresented sequences — he thought fastx had something for this, but I can't seem to find anything in the documentation. I may need to write some code.

Professor installed pigz on the server, so I can remove my hacky xargs workaround.

Need to find a way to filter bad sequences from BOTH files at once.

Need to align to the NCBI reference GTF instead of the other weird one.

2020-02-26

Tried rebuilding the genome index with the other GTF file. Initially got:

Feb 26 21:00:45 ..... inserting junctions into the genome indices
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
./src/03-index-genome.sh: line 19: 14902 Aborted                 (core dumped) STAR --runThreadN $CORES --runMode genomeGenerate --genomeDir data/03-genome-index …

That seems… bad. But perhaps this is because I didn't have enough RAM, because I still had my Lisp process from 2 days ago open consuming ~25 GB of RAM. I killed that so I've got about 60 GB now. Let's see if that helps. Otherwise it sounds like I may need to look at this.

Still need to find that other paper my partner found from the same people.

Submitted weekly report.

March 2020

2020-03-16

Installed TrimGalore. Had to install cutadapt first, which is some Python package. Luckily it's apt installable so I did that, then grabbed TrimGalore which is blessedly self-contained.

Installed pardre. The "release" is just a source dump of C++ files with no README, cool. Had to install mpi-default-dev to get make to work. Gross.

Used ParDRe to dedupe a couple of the samples as a test. It took 927 seconds to finish the first pair, where I used the .fastq.gz files as input and the -z option to enable compression. I noticed a bunch of the time (the last 700 seconds or so) was spent bottlenecked on a single CPU core while writing the output. So then I tried the next sample without compression, and it took 309 seconds. Welp.

Ran FastQC on the first two sample results. For the most part FastQC is happy now, but there are still some duplicated sequences left. I'm guessing this is because ParDRe is paired-end aware, and will only remove a pair if the entire thing is a duplicate, while FastQC works only on a per-fastq basis. I don't know this for sure though.

I did the initial run with 0 mismatches allowed. We should also try with N mismatches, though note that this will not work perfectly because ParDRe clusters reads by their prefix, so reads with <=N mismatches in their prefix (and none in the suffix) will not be caught because they'll never get compared. Welp.

Cleaned up the scripts to run all the ParDRe runs (0/1/2 mismatches for every pair) and FastQC them. Seems to take roughly 5 minutes per pair to do ParDRe. We have 13 pairs, so it should take around an hour per mismatch setting (i.e. 3 hours) to do them all.

ParDRe finished. Results look reasonable. There's not a ton of difference between 0 and 1 mismatch (there is some though). The difference between 1 and 2 is minimal. Next step is to trim low-quality stuff with trim galore. Gonna do that during "class" tomorrow and let it run while I'm at work.

2020-03-17

Started sketching out some trimming scripts using Trim Galore. It has a bunch of options — I'm not really sure how to go about deciding what the correct values for each are, other than… try some and see if the resulting FastQC run seems reasonable?

Attempted to do something foolish: update software. I wanted to fix the StumpWM window/frame focus thing, but I figured I would update StumpWM and SBCL and my dotfiles first. Clearly, this was a mistake. I updated all of them, added the fix, and restarted StumpWM, and everything went to shit. StumpWM suddenly started chewing up tons of a CPU core, lagging to hell, and Firefox and Chrome wouldn't even open. Tried reverting back to the previous versions but it didn't work. Eventually I SSH'ed into the machine and rebooted it, and then everything started working again. I think. We'll see. Going to give up on updating for today — wasted enough time dicking around with this tonight. I hate computers.

2020-03-18

Debugged a gnuplot issue a bit in IRC with someone. I think I'm losing my mind.

Submitted weekly report for class.

2020-03-19

Met with the professor for office hours to chat about the deduplication/trimming results.

For deduplication, we first talked about what might cause the actual problem it's trying to solve. A cell may have many duplicate copies of a particular mRNA floating around if the gene is highly expressed — this is exactly what RNAseq is trying to measure. However, once those mRNAs are fragmented, it's extremely unlikely that they would fragment exactly in the same place, and so when you sequence the resulting fragments they should be offset from each other. If you see exact duplicate reads, it means that the fragment was somehow duplicated, which is probably an artifact of your laboratory process and not biologically significant.

Next: mismatches. ParDRe supports removing duplicates with up to N mismatches, so we talked about why we might want to do this. If two reads match with `N

0` mismatches it indicates one of the following:

At the end he recommended deduplicating with 1 mismatch.

Another complication: ParDRe can operate in paired-end mode, which means that in some cases FastQC still finds some duplication. This happens when one side of the read is an exact duplicate while the other is not. ParDRe does not remove this (because of the non-duplicate side) but FastQC only looks at the individual FASTQs. He said this is not a problem, and that ParDRe is doing what we want.

Also chatted a bit about trimming. We mostly talked about the issue mentioned in the Trim Galore documentation: because it trims any matching adapter sequences, it will always trim off a particular base at the end of a read (because that one base always matches the adapter sequence's last base). This is not ideal. He recommended running Trim Galore and requiring at least adapter 5 bases to match before trimming to deal with that issue.

Plan for today: rerun the trimming with the 5 base requirement and using the 1 mismatch deduplication data as the source, and then kick off an alignment if I still have time (unlikely).

Neomutt has suddenly started shitting the bed and failing to quite with Unable to write errors. I hate software. Tried updating to the latest master and rebuilding. I guess we'll see if it fixes things.

Trimming finished. It looks like increasing the minimum adapter length to 5 fixed the base imbalance issue. Good.

Decided to rebuild the STAR index I had made because I can't remember whether I had already rebuilt it with the different reference file. Shouldn't take too long.

Index rebuild finished in ~35 minutes. Kicked off the 26 alignment runs. That should take a lot longer.

Checked in on the STAR progress. It's actually only taking ~5-10 minutes to run each alignment. That's not nearly as bad as I expected.

Initial alignments finished. Started digging into what I can actually do with the results. I found the Integrative Genomics Viewer (IGV) that seems promising. Downloaded it and loaded the genome into it, then tried to load the alignments but apparently it wants BAMs, not SAMs, and additionally it wants indexes of the BAMs.

I looked into how to convert the SAMs to BAMs+indexes. Went down a bit of an unnecessary rabbit hole, but it was still productive. My initial attempt was using samtools to do a bunch of conversion:

STAR … # produces Aligned.out.sam

samtools view -S -b Aligned.out.sam > sample.unsorted.bam
samtools sort sample.unsorted.bam > sample.bam
samtools index sample.bam

This worked, but took a while and wrote a bunch of intermediate files I don't really need. Eventually I realized that STAR can produce sorted BAM files itself, so all I need to do with samtools is the final indexing. Learned a bunch about running STAR too. One nice way to save time across all the alignments is to have STAR load the genome index into shared memory once at the beginning and use it for all the alignments, the flush it out at the end:

function cleanup {
    STAR --genomeDir "${genome}" --genomeLoad Remove
}

trap cleanup EXIT

STAR --genomeDir "${genome}" --genomeLoad LoadAndExit

Then the STAR invocations look like:

    STAR \
        --runMode alignReads                \
        --runThreadN $CORES                 \
        --genomeDir "${genome}"             \
        --genomeLoad LoadAndKeep            \
        --limitBAMsortRAM "${sortram}"      \
        --outSAMtype BAM SortedByCoordinate \
        --outBAMcompression 0               \
        --outBAMsortingThreadN $CORES       \
        --outFileNamePrefix "${outdir}/"    \
        --readFilesIn "${in1}" "${in2}"

I tried putting the temporary directory on /dev/shm but it ended up being too much with the persistent genome index also in RAM. Unfortunately my order from Crucial is still backordered. Oh well, I'll just watch more TNG while I wait for all the alignments to complete.

2020-03-31

Ran the Trapnell plots from the paper on our data. Results seem at least plausible on the surface. Pinged my partner to see if his plots match.

Attempted to install RSeQC to do the TIN score stuff mentioned in the paper. Ran pip3 install RSeQC as described in their docs, but that failed with:

ValueError: no cython installed, but can not find pysam/libchtslib.c.Make sure that cython is installed when building from the repository

After some searching online I found that this can be fixed by upgrading pip with pip3 install --upgrade pip. It's… fine, I guess?

Once I got it installed, I tried running tin.py to produce the TIN scores. The script wants the BAM (which I have) and a "BED" file. BED apparently stands for Browser Extensible Data and is a TSV file with lines like this:

chr1    67092175    67109072    XM_011541469.1  0   -   67093004    67103382    0   5   1429,187,70,145,44, 0,3059,4076,11062,16853,
chr1    67092175    67131183    XM_011541467.1  0   -   67093004    67127240    0   9   1429,187,70,106,68,113,158,92,42,   0,3059,4076,11062,19401,23176,33576,34990,38966,
chr1    67092175    67131227    XM_017001276.1  0   -   67093004    67127240    0   9   1429,187,70,145,68,113,158,92,86,   0,3059,4076,11062,19401,23176,33576,34990,38966,

The collections of stuff I downloaded from UCSC didn't include one of these, but the RSeQC site has one available for download. Actually, it has several BED files. I'm not sure which one is correct. I downloaded hg38_RefSeq.bed because it sounds the most like the GTF I used (hg38.ncbiRefSeq.gtf). Was that correct? Who knows?

Now that I have that, I'm doing an initial run with:

tin.py --input data/05-alignment/clean/C0/C0.bam -r data/00-raw/hg38_RefSeq.bed

It seems like it's going to take a long time. Guess I'll wait.

The goddamn thing didn't output an ending timestamp. Come on. It took somewhere between 2-5 hours, but I have no idea how long because I wasn't watching it when it finished. I guess I'll do another run with a time prefix. Sigh.

April 2020

2020-04-04

Spent an hour wrangling gnuplot for class to try to make a multiplot of the TIN scores. Mostly it was just a bunch of fiddly stuff to try to make the graph look exactly like I wanted. Some of the key points follow.

Basic multiplotting happens with:

set multiplot \
    title "TIN Score Distributions by Sample" font ",14" \
    layout 3,5

…

unset multiplot

Looped through the samples with this:

do for[c=0:9] {
    set title "C" . c
    bad = (c == 3)
    plot [][0:1] "data/05-alignment/tin/C" . c . ".tin.xls" using 5:(1) smooth cnorm notitle lw 3 \
        lc (bad ? 3 : '#000000') \
        lt (bad ? 6 : 1)
}

do for[n=1:3] {
    set title "N" . n
    bad = (n == 2 || n == 3)
    plot [][0:1] "data/05-alignment/tin/N" . n . ".tin.xls" using 5:(1) smooth cnorm notitle lw 3 \
        lc (bad ? 3 : '#000000') \
        lt (bad ? 6 : 1)
}

Note the smooth cnorm requiring the columns to be using data-column:literal-value. This tripped me up for a while — I was doing using 0:data-column and that was fucking things up completely. I realized the problem when I noticed there wasn't a peak at zero, and also that sorting changed the results (which it shouldn't). smooth cnormal needs exactly one column of the value, plus a weight value for the point. Remember this!

You can skip around the plot with set multiplot next and set multiplot prev. Handy for adjusting the layout.

Instead of trying to draw labels on every graph, I decided to try putting a key at the bottom right. This was very fiddly:

set multiplot next

$empty << EOF
EOF

set termoption fontscale "0.1"
set title "{/*0.9 Key}"

set ylabel "{/*0.9 CDF}"
set xlabel "{/*0.9 TIN Score}"
set format x "{/*0.9 %g}"
set format y "{/*0.9 %.2f}"

set origin 0.75,0.0
set size 0.25,0.33

plot [0:100][0:1] $empty using 1:(1) smooth cnorm notitle

Note the %g for "humanized" formatting. I'm going to remember the g as standing for god dammit, what is the magic character?.

2020-04-11

Merged some PRs for t and did some basic test maintenance. I still love cram for CLI tests — it's such an intuitive, usable way to test things.