January 2020

2020-01-03

Converted paste.stevelosh.com to HTTPS. Forgot it in the previous conversion.

2020-01-14

Moved all my repositories from Source Hut to hg.stevelosh.com. Wasn't expecting to have to shave this yak so soon, but such is life. On the plus side: pushing is so much faster now. Rough timing for an empty push was ~5 seconds on Source Hut, down to around a second now. No more sitting and staring at the command line waiting to get back to work.

2020-01-15

Finished (mostly) reskinning hg.stevelosh.com. Had to shave a couple of yaks in Mercurial itself (diff headers getting truncated) and the hgweb Markdown extension. Ended up just completely forking the extension and tearing out almost all of its guts. The result should (hopefully) be easier to maintain as Mercurial breaks internal shit over time.

Still need to finish the CSS for the rendered Markdown, but that can wait for now.

2020-01-17

Got the RIT VPN set up on my Linux machine so I can connect for class. I was worried I'd have to install some horrible Cisco VPN client thing, but it turns out you can use the openconnect client to connect, so sudo apt install openconnect and then sudo openconnect vpn.rit.edu just magically works. Fantastic.

Finished the hg.stevelosh.com reskin on my lunch break. Just had to get the READMEs looking decent.

Finally figured out a strategy for syncing a small subset of my dotfiles onto remote servers easily, so I don't have to be completely miserable on them. Goals:

Ended up making a remote/ folder in my dotfiles. Almost everything inside it is a symlink except for a bootstrap.sh. The folder gets rsync --copy-linksed over and then bootstrap.sh will symlink all the resulting files into their places. Hopefully this should be easier to maintain. Did a bunch of cleanup on my bash and fish setups while I was at it.

Still need to figure out how to get a decent Vim setup remotely. Syncing all my plugins is probably not ideal, but maybe. Who knows.

2020-01-18

KA biology and coffee.

Cellular respiration:

Did the TRIE problem in Rosalind. Making an actual trie was easy, but getting an adjacency list for the flattened/copypasteable output and using cl-dot to graph it nicely was fiddly. Annoying bits about cl-dot to remember for next time:

Cleanup up the styling of hg.stevelosh.com a bit more. This time it was the <code> tags inside <li>'s that weren't styled. I styled <p><code> before, because I wanted to avoid applying those styles to <pre><code> content (otherwise it'd have double borders all over the place). But that missed out on the <code> tags that weren't inside paragraphs. Ended up just styling all <code>s and then undoing the borders in <pre><code>. Sigh.

Also realized that the repository descriptions served by hgweb can contain arbitrary HTML. Neat. Used that to clean a few more things up.

Updated all the Bitbucket repository READMEs to point to hg.stevelosh.com. The Great Source Hut Yak Shaving of 2020 is finally complete, thank christ. Time to move on to doing actual work.

2020-01-19

Finally got around to setting up packages for all of my Rosalind problems. In all my "coding challenge" repos (Rosalind, Project Euler, Advent of Code, etc) I originally just shoved everything into one big package. This was easy to start with, but as I solved more and more problems I started to hit naming conflicts more and more often. Eventually I realized that for my own sanity I should really give each problem its own package, with all the utilities in a utils package. I did this for my Advent of Code repo a while back and while it was a pain in the ass to refactor everything, the result is much nicer. Today I refactored the Rosalind repo. The Project Euler repo is a work in progress, mostly because it's so much bigger than the other two.

Did Rosalind problem PDST. Pretty easy. Most annoying part was the output, but I extended my float printer to handle 2D arrays to make it easier to output what they want in the future.

2020-01-20

Wanted to do a Rosalind problem over lunch. On a whim I decided to use CCL just to keep myself honest — I usually use SBCL, so using another implementation now and then helps ensure I'm writing standards-compliant code. But apparently a recent Vlime commit broke Vlime in CCL. So I had to shave that yak first. Sent a PR.

Did Rosalind problem ASPC. The problem itself was trivial. Made a handy iterate driver to make it even more trivial.

Running the test suite in CCL gave me exactly what I wanted: examples of where I was unintentionally relying on SBCL's behavior. Today there were two:

  1. Relying on the traversal order of iterate's :in-hashtable (which uses with-hash-table-iterator) to get the output of TRIE. Made a helper to convert the hash tables to sorted alists to fix the output.
  2. Relying on the ~,…F format directive having a half-up rounding strategy. Turns out that when format is rounding a float to a given number of decimal places, the implementation is allowed to round half however it wants. SBCL seems to round half up (which matches what Rosalind shows in their examples, which is why I never noticed) but CCL seems to use banker's rounding. Had to do an ugly hack in float-string to make the output consistently match Rosalind's style everywhere.

2020-01-22

Started doing Rosalind problems in shell/Awk, to join in with a group of folks at work doing them for the first time. The first few problems are simple, but I still managed to learn a few tricks by implementing them with UNIXy tools.

Fixed a few remaining broken links on my site. It's been a week or two since I poked around at it, and that was on another machine, but ./deploy.sh Just Worked. It's so nice to have a site generator I designed myself that's not going to change out from under me when I just want to fix something simple and move on with my life.

2020-01-23

Lesk book exercises, chapter 1:

  1. Average density would be (/ 3r4 3r9) = 1/100000 or 1 gene every ~100kb.
  2. Parts:
    • A: Two humans would have roughly 250k differences.
    • B: A human and a chimpanzee would have roughly 3m differences.
  3. Parts:
    • A: Average density would be 2 SNPs per 1000 base pairs.
    • B: Is this a trick question? I think 1.1% of the differences would be in protein-coding regions.
  4. Parts:
    • A: I am confused. The glossary defines a "haplotype" as a group of closely-linked genes that are typically inherited together. So… 1 haplotype? But the text talks about a haplotype as being a combination of SNPs in a recombination-poor region. So if "haplotype" means the combination of SNPs, not a set of genes, then I think this would be 4^10 = 1048576.
    • B: On a diploid chromosome, you'd have two separate sets of SNPs to combine (and order doesn't matter), so I think it's 4^10 * 4^10 / 2.
    • C: (1 SNP / 5 kb) * (100 kb) = 20 SNPs = 2^20 sequences = 1048576 sequences
  5. TODO
  6. TODO
  7. TODO
  8. A much less drastic change, though it could still cause issues if it's at a critical point in the RNA molecule (e.g. the anticodon in tRNA).
  9. TODO
  10. If the gene Kangaroo coat color is not on the X chromosome, you wouldn't be able to get the calico pattern that comes from X chromosome inactivation.
  11. Parts:
    • A: No, because if it were a mutation in the gene itself it would act by affecting the structure of the protein, and should have the same effect in adults and children. But it doesn't, so it's more likely to be a mutation in regulatory region.
    • B: It's not on the Y chromosome?
  12. TODO
  13. TODO
  14. TODO
  15. TODO
  16. TODO
  17. TODO
  18. TODO

2020-01-24

Solved the FIB Rosalind problem in my shell/Awk repo. Tried doing it in straight bash, but I couldn't figure out how to get the bash function to use a global bash array for memoization after like 10 minutes of searching. Ended up rewriting it in Awk, which took two minutes, was simple, and runs fast. Great.

Solved MMCH in my Lisp repo. Much like the previous problem in this line, it was way easier than they made it seem. Also fixed the Uniprot cache to use the filesystem instead of just keeping a hash table, to save a few seconds of my life every time I (run-tests) in a fresh Lisp session.

2020-01-25

Rebooted my Mansion repo to try to simplify it enough that I can start dogfooding it for actual use, then turn it into something publicly-usable.

2020-01-27

Trying to decide on data to use for the BIOL-650 project. A list of possibilities:

2020-01-29

Figured out how to get the mutt HTMLified email working on my local machine, like I had done for work. Had to install NeoMutt from source because the Ubuntu version is too old to include group-alternatives:

More searching around for projects for HTSA:

https://www.sciencedirect.com/science/article/pii/S2352340918302920
doi 10.1016/j.dib.2018.03.079
Illumina HiSeq 2500
https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA432903

https://www.sciencedirect.com/science/article/pii/S2352340919303750
doi 10.1016/j.dib.2019.104022
Ion Proton sequencer
https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=506922

Started going through the gnuplot book at night. Going to try to do that in parallel with the class.

2020-01-30

Drove to RIT only to discover class was not being held in person, and we're supposed to work online instead. Welp.

Installed FastQC locally and got it running. Really just downloaded the binary version and symlinked it into place.

Downloading the actual data is taking a while. While I wait, thinking about disk setups: now that I have a NAS and have moved my local backups to it, I can format my previously-for-backup internal drives and use them for actual data. I really need to spend the time to grok how Linux formats and mounts drives — I've been flailing and slapping things together and it has mostly worked until now, but I really should just learn it. Ordered a copy of the new edition of the sysadmin book that I remember being really good from my Dumbwaiter days. Hopefully that will be a good reference.

Finished the FastQC stuff early, so spent a few minutes going through the gnuplot book. Random notes:

Finally got around to writing a first draft of a lines utility. Discovered and fixed a stupid bug in Adopt in the process.

2020-01-31

Now that I've got a more thorough backup system, went through and did my yearly pruning of unnecessary crap on my hard drive, and adding restic excludes to avoid backin up track. Number of scans was cut drastically:

Much cleaner now.

February 2020

2020-02-01

Fixed up my $PATH in my bash profile, for when I'm forced to use bash.

Also had to fix up the uses of set_color in my Fish shell config to only happen for interactive connections, because Fish shits out some errors when you call set_color on a noninteractive connection (e.g. when using scp or rsync). Sigh.

Now that I've got my backups pushed to my NAS (plus off-site) I can repurpose those 4 TB drives sitting empty in my tower. Gonna use them for a data dump site on the Linux side of things, e.g. for all the FASTQs and resulting data I'll be dealing with for my class this semester. Read through the sysadmin book chapters on the filesystem which were super helpful. Decided to pair the drives as a btrfs raid 0 array. I'm not super concerned with durability here — everything should be reproducible from the original raw FASTQs from SRA if I'm doing my job right, so I may as well just have one giant pool of storage. Also I don't need this /dump to be encrypted — it'll be handling non-sensitive data.

To get them ready for btrfs I used:

Even with all that, mkfs.btrfs complained about a partition table still existing on one of the drives. At that point I gave up and passed -f to it, and everything Just Worked. In hindsight, I bet I could have just done that to begin with and let mkfs.btrfs handle blowing away everything, but it was still a fun exercise.

Attempted to figure out how to acquire MS Word for the class. Apparently there's a web version of Word now? If I can use that, at least I won't have to spin up a full Windows VM to write the papers…

Installed sra-toolkit from Ubuntu's repository. Unfortunately it doesn't seem to come with fasterq-dump, only the older fastq-dump, at least according to dpkg-query -L sra-toolkit. Installed the latest binary from their site instead.

Poked around at both fastq-dump and fasterq-dump. fasterq-dump seems more documented and user friendly, and is definitely faster. The biggest speedup came from telling it to use RAM as its temporary filesystem with -t /dev/shm, which cut the time down to 1/3. Guess all this RAM is actually good for something.

After rebooting later, the boot process was suddenly hanging for over a minute. Another fucking yak to shave. Figured out how to disable the goddamn splash screen (edit /etc/default/grub to remove quiet and default and then update-grub2) so I could get some logs. Eventually saw some systemd horseshit about a job is running for dev-disk-by\x2duuid-…. Tried to figure out which drive had that UUID, but none of them did. After a bunch of flailing in fstab I realized that it might be the UUID of the old backup drive I killed. Sure enough, there was still an entry in /etc/crypttab for it, and a bunch of systemd shit in /run/systemd/generator/ left over that I had to clean out. After that, everything boots fast(ish) again. I hate computing.

2020-02-02

Spent some time cleaning up the Makefile for my individual class project. Learned some stuff about make in the process:

Finished writing up my class project pick paper.

Did more in the gnuplot book.

Updated my lines tool to add a --limit option and handle broken pipes. I could have sworn I handled broken pipes in some other Lisp script at some point, but can't seem to find it now. I used sb-int:broken-pipe but I'm not sure if that's something I should be relying on. Need to ask in #sbcl.

2020-02-04

Class today. Did a bunch of various things after the lecture (I hope I can remember it all).

Got the group permissions sorted out on the server in two steps:

Once that was sorted out, started diving into aligning the Trapnell data with tophat and bowtie. This is… not pleasant.

First I need to install both tools. There's an Ubuntu package for tophat, but it wanted to install NodeJS as a prerequisite. Why on earth does tophat want NodeJS? SIGINTed that immediately and downloaded the binary version instead.

The wrapper script tophat2 tries to set up $PATH magically so everything just works:

#!/usr/bin/env bash
pbin=""
fl=$(readlink $0)
if [[ -z "$fl" ]]; then
   pbin=$(dirname $0)
 else
   pbin=$(dirname $fl)
fi
export PATH=$pbin:$PATH
$pbin/tophat "$@"

Unfortunately it doesn't try hard enough. The readlink is attempting to deal with symbolic links there, but only works with absolute links — relative symlinks (e.g. ln -s ./tophat-…/tophat2 tophat2) just return the relative path, which then gets passed to dirname and treated as relative from whatever directory you happen to be in. And let's ignore all those unquoted variable expansions I guess. Sigh. Better version:

#!/usr/bin/env bash
pbin=""
fl=$(readlink -f "$0")
if [[ -z "$fl" ]]; then
    pbin=$(dirname -- "$0")
 else
    pbin=$(dirname -- "$fl")
fi
export PATH="$pbin:$PATH"
"$pbin"/tophat "$@"

Once it's installed, I can finally run it. This is where things get even more confusing. After a whole bunch of flailing I think I managed to narrow down what the hell all these magic parameters are, and how things have changed since the paper was written. Notes:

That's as far as I got today. Next step is to clean up this mess and get it into the Makefile.

2020-02-06

Refactored the Makefile and added in all the alignment stuff. Wrote up the weekly report far too late.

2020-02-08

Decided to try using gnuplot to do the graphing exercises in the Campbell Biology textbook, to combine the things I've been reading lately.

The first exercise I did was from chapter 9. Data:

Low        4.3
Normal     4.8
Elevated   8.7

Gnuplot file:

set title "{/:Bold*1.5 Thyroid Hormome Levels and Oxygen Consumption}"
set xlabel "Thyroid Hormone Level"
set ylabel "Oxygen Consumption Rate [nmol O_2 / (min * mg cells)]"

unset key
set boxwidth 0.8

plot [][0:10] "bio/thyroid" u 0:2:xtic(1) w boxes fillstyle solid fc rgb "black"

That was pretty simple. Then I went back to a previous chapter. Data:

# Time (min)  Concentration of inorganic Phosphate (μmol/mL)
0               0
5              10
10             90
15            180
20            270
25            330
30            355
35            355
40            355

Gnuplot file:

set title "{/:Bold*1.3 Glucose-6-Phosphate Activity in Isolated Liver Cells}"
set xlabel "Time (minutes)"
set ylabel "Concentration of ℗_i (μmol/mL)"

set xtics format "%h min"

unset key

set grid xtics ytics

set lmargin 12
set rmargin  6

plot [0:45][0:400] "liver-glucose" w lines linewidth 2

This one was a little trickier. I was trying to play around in the gnuplot REPL, but pasting the Unicode characters was producing garbage at the gnuplot command line. Pasting into a normal terminal/fish shell worked fine. After much flailing I tracked down the problem to gnuplot itself:

At this point I just gave up on using Unicode at the REPL and created .gp files that I then load. The GPL strikes again.

2020-02-09

Installed R Studio from the .deb. Needed to apt install r-base first to get the dependencies. It installed… a lot of stuff. Let's hope it hasn't wreaked too much havoc.

Moved on to differential gene expression with cuffdiff. Finally starting to understand what the actual data files are. The input FASTQs are:

GSM794483_C1_R1_1.fq.gz
GSM794483_C1_R1_2.fq.gz
GSM794484_C1_R2_1.fq.gz
GSM794484_C1_R2_2.fq.gz
GSM794485_C1_R3_1.fq.gz
GSM794485_C1_R3_2.fq.gz
GSM794486_C2_R1_1.fq.gz
GSM794486_C2_R1_2.fq.gz
GSM794487_C2_R2_1.fq.gz
GSM794487_C2_R2_2.fq.gz
GSM794488_C2_R3_1.fq.gz
GSM794488_C2_R3_2.fq.gz

The C here apparently stands for "condition". I still have no idea what R[123] is trying to tell me. The _[12] is the paired-end read pairs. Would it kill people to name things more descriptively and/or include a README that explains what all their one-letter abbreviations mean?

Thinking about restructuring the filesystem layout to make the relationships between the FASTQ files more explicit:

C1/
    R1/
        GSM794483_C1_R1_1.fq.gz
        GSM794483_C1_R1_2.fq.gz
    R3/
        GSM794484_C1_R2_1.fq.gz
        GSM794484_C1_R2_2.fq.gz
    R3/
        GSM794485_C1_R3_1.fq.gz
        GSM794485_C1_R3_2.fq.gz
C2/
    R1/
        GSM794486_C2_R1_1.fq.gz
        GSM794486_C2_R1_2.fq.gz
    R2/
        GSM794487_C2_R2_1.fq.gz
        GSM794487_C2_R2_2.fq.gz
    R3/
        GSM794488_C2_R3_1.fq.gz
        GSM794488_C2_R3_2.fq.gz

But I might be too lazy at this point to do it.

The more I work with make the more I realize how limited it is. The general idea behind it (computing a dependency graph and using it to rebuild things) is sound, but the implementation is a clusterfuck of string processing and horrific syntax. It really gives me the urge to create Yet Another Build System. Sigh.

Eventually cuffdiff finished. Time to move on. Tried to install cummerbund in R, but ran into a bunch of errors. Had to install numerous third-party dependencies outside of R:

Why on earth does installing cummerbund require a goddamn MySQL client?! This is completely nuts.

Ran into a giant pile of inscrutable errors even after installing all these bananas dependencies. Eventually tracked it down to the R version shipped in the Ubuntu package repository being too out of date to work with some of the required packages. I hate volatile software. Installed a newer version of R with this guide.

Finally managed to get cummerbund working and started graphing stuff. It mostly works, but my graphs don't really look quite like theirs. Need to ask the professor about this.

In the mean time, played around with using gnuplot to plot the data instead of R. Turned out to be not too bad, since cuffdiff already dumps its output in a textual format. Example of the plot comparing the expression levels of the various genes:

set logscale
unset key
plot "gene_exp.diff" using \
    "value_1":"value_2" lt 3 pt 1 lw 1, \
    x+(0.5*x) lt -1 lw 1 dt 2, \
    x-(0.5*x) lt -1 lw 1 dt 2

I also played around with adding some nicer default linetypes to my ~/.gnuplot using colors from ColorBrewer. Fairly happy with the result so far, but I'll need to play around with them more over time to see if they hold up.

2020-02-10

Started making some actual gnuplot scripts to draw the stuff from yesterday. Got a couple of scripts working, but went down a horrific rabbit hole.

First: the graphs produced by cummerbund are titled very poorly ("Genes" is not particularly helpful). I'm going to need to chat with the professor to try to decipher what these graphs are trying to show me.

I reproduced the scatter and volcano plots without too much trouble. Used the PDF plotter and came up with the simple printable line styles that should be okay.

Then I tried to reproduce the first graph, the "distribution of expression levels for each sample", and all hell broke loose. First of all, what is this graph even trying to show? The title is just "Genes", which is useless garbage. The y-axis is labelled "Density", the x-axis is pabelled log_10(FPKM), and the function used to create it is csDensity, so maybe this is a kernel density plot of the log10(FPKM) values? Well, I tried that, and I ended up with fucking spaghetti in gnuplot — there were hundreds of lines all overlaying each other instead of the expected single line. I spent like an hour dicking around online trying to figure out what the hell gnuplot was doing. Eventually, finally, I realized no one online was going to help me, and decided to break things down myself. I extracted the problematic column from the file and computed the log10 values myself in Lisp, and found an issue: some of them are 0, which means their log10 value is undefined. After removing those and replotting, I confirmed that those undefined values were what was causing all the discontinuities in the original gnuplot kernel density plot — instead of ignoring the values, it would start a new line every time it hit one. Christ. So then I cloned down the cummerbund repo to see how they were handling this. The code is… not particularly easy to read. I searched around more online, now that I realized the problem, and eventually found a post where someone notes that they add 1 to the FPKM values first, before taking the log, to avoid negative and undefined log values. So the x-axis label that says log_10(FPKM) is fucking lying — what's actually on the graph is log_10(FPKM+1). Once I plugged that in, everything works and the graph looks roughly like theirs. To hell with all this, I'm done for tonight.

2020-02-11

Class today. Chatted about various stuff. The XLOC identifiers aren't actual identifiers, they're just arbitrary IDs generated during the alignment process. That explains why I was seeing 0s when trying to plot the XLOC_012662, which was regucalcin in the paper — it's a different gene in my run, so I'd need to look up the gene in my own results. Also talked about P and Q values, and plotting each to find interesting cutoffs. I really need to refresh my memory on how to more easily make histograms in gnuplot.

Started reading A Curious Moon and got about 1/3 of the way through. An interesting approach to teaching SQL, though the writing voice is a little bit annoying. Still turning out to be a good book though.

Finished the gnuplot book. It's really, really good.

Finished my weekly report and homework for this week. Good to have a couple days of buffer. Figured out how to build the assignment zip file in a Makefile which was overkill, but "fun":

assignments/sjl7678-alignment.zip: $(ALIGNMENTS)
    rm -rf assignments/sjl7678-alignment
    mkdir -p assignments/sjl7678-alignment/
    cd assignments/sjl7678-alignment && mkdir -p $(KEY_PREFIXES)
    mcp 'data/02-alignment/*/align_summary.txt' 'assignments/sjl7678-alignment/#1/align_summary.txt'
    mcp 'data/02-alignment/*/logs/tophat.log'   'assignments/sjl7678-alignment/#1/tophat.log'
    cd assignments && zip -r sjl7678-alignment.zip sjl7678-alignment

I'm slowly becoming more and more disillusioned of make. Even though I'm only using it for what it's actually designed for (running commands to turn files into new files) it's still a tremendous pain in the ass to do anything nontrivial. The "scripting" support is completely batshit string-based munging, and even the basic definition of any dependencies beyond explicit lists of files is miserable.

2020-02-12

Spent some (far too much) time getting the Trapnell data into Postgres, figuring out how to ergonomically plot it with gnuplot, and figuring out how to do decent histogram and CDF plots. I'm not thrilled with what I came up with, but it'll have to do for now because it's already 11 PM.

2020-02-15

Tracked down the root cause of https://github.com/dimitri/pgloader/issues/1005 — need to write it up in a comment.

2020-02-16

Finished the make book. I'm becoming less and less enthusiastic about make the more I learn about it and see real-world examples of it.

Wrote up the pgloader issue debugging comment.

2020-02-18

Class today was a lot of chatting and planning ahead for the group project. Things to think about:

One more big question: need to figure out what the overrepresented sequences FastQC is complaining about are. I figured I'd do BLAST searches for them, so I wanted to get a list of all the overrepresented sequences. So I need to pull them out of the FastQC reports… which are HTML files… all on one line. Did some horrific greping to get what I wanted:

 sjl at ouroboros in /dump/karkinos/fastqs/qc
 ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' '  ' | cut -d' ' -f3 | sort | uniq -c
       2 CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
       2 CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
       2 CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
       2 CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
       2 CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
       2 CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG

 sjl at ouroboros in /dump/karkinos/fastqs/qc
 ><((°> grep -Po 'id="M9".*<h2' *.html | head | grep -Po '<td>.*?</td>' | grep -v 'No Hit' | grep '<td>[ACTG]' | tr -s '<>' '  ' | cut -d' ' -f3 | sort | uniq | xargs -ISEQ grep --files-with-match SEQ *.html | sort | uniq
 SRR6671104.1_1_fastqc.html
 SRR6671104.1_2_fastqc.html

Interesting — only one of the samples has this issue, and the same sequences appear in both sides of the pair. Immediately after doing this horrible grep incantation I remembered the .zip produced by FastQC and sure enough, there's a text file in it with the results in a much less awful format. But I still needed a janky awk incantation:

 ><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=1;i<=NF;i++) print FILENAME, $i }' */fastqc_data.txt
 SRR6671104.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences       warn
 SRR6671104.1_1_fastqc/fastqc_data.txt #Sequence Count   Percentage      Possible Source
 SRR6671104.1_1_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT        58594   0.18785731410785536     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC        39157   0.12554065004132323     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG        36886   0.11825963218388151     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT        34457   0.11047205297836592     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT        33273   0.10667604895229328     No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT        31923   0.1023478349022949      No Hit
 SRR6671104.1_1_fastqc/fastqc_data.txt
 SRR6671104.1_2_fastqc/fastqc_data.txt >>Overrepresented sequences       warn
 SRR6671104.1_2_fastqc/fastqc_data.txt #Sequence Count   Percentage      Possible Source
 SRR6671104.1_2_fastqc/fastqc_data.txt CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT        58938   0.18896020716948458     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC        39362   0.12619789736002668     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG        36752   0.11783001686336315     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT        34782   0.11151403043484702     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT        34471   0.11051693816110664     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT        31900   0.10227409495922085     No Hit
 SRR6671104.1_2_fastqc/fastqc_data.txt
 SRR6671105.1_1_fastqc/fastqc_data.txt >>Overrepresented sequences       pass
 SRR6671105.1_1_fastqc/fastqc_data.txt
 …

Some quick cuting and such and I've got what I need, in a slightly less horrible way. I guess.

><((°> awk 'BEGIN { RS=">>END_MODULE\n"; FS="\n" } $1 ~ /Overrepre/ { for (i=3;i<NF;i++) printf(">%s/%d %s\n", FILENAME, x++, $i) }' */fastqc_data.txt | cut -f1 | sort -k2 | uniq -f1 | tr ' ' '\n'
>SRR6671104.1_1_fastqc/fastqc_data.txt/0
CCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCT
>SRR6671104.1_1_fastqc/fastqc_data.txt/5
CGGACATCTAAGGGCATCACAGACCTGTTATTGCTCAATCTCGGGTGGCT
>SRR6671104.1_1_fastqc/fastqc_data.txt/1
CTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTC
>SRR6671104.1_1_fastqc/fastqc_data.txt/4
CTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGT
>SRR6671104.1_1_fastqc/fastqc_data.txt/3
CTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
>SRR6671104.1_1_fastqc/fastqc_data.txt/2
CTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACG

Did a BLAST search for these sequences and they're all 18S Ribosomal RNA. Professor wasn't sure how that could have gotten into the mix, since the paper says they only used poly(A) RNA. Need to ask around at work to see if anyone has any ideas what the heck happened.

(This weirdness is only on sample C0. Unfortunately that's not the sample the paper identified as being low quality (C3) so that isn't the issue.)

Did a bit more on the Enceladus book. Decided to just read the rest, because the DB/Makefile crap is getting way too messy. I really need a sqlfmt tool, but every time I try to use smug to write a parser I get depressed.

Merged a GitHub PR into badwolf and tried to pull it down locally with hg-git, which fucking broke for the millionth time, with a lying error message blaming hg-prompt. Fucking fantastic. Updated hg-git (apparently there's a new repo on Heptapod since Atlassian is shitcanning BitBucket's Mercurial support) and now there are four heads on default. I have no idea which one I'm supposed to use. I picked the latest one arbitrarily and it seems to work, at least well enough to pull down the PR. Good enough.

2020-02-19

Finished the weekly report for class.

Fixed the scroll overflow of the code blocks in this site.