75e74c98dc01

Update
[view raw] [browse files]
author Steve Losh <steve@stevelosh.com>
date Tue, 04 Feb 2020 21:27:33 -0500
parents 504550dd21d5
children 90a8a841f2ba
branches/tags (none)
files README.markdown

Changes

--- a/README.markdown	Sun Feb 02 21:12:04 2020 -0500
+++ b/README.markdown	Tue Feb 04 21:27:33 2020 -0500
@@ -387,3 +387,94 @@
 point, but can't seem to find it now.  I used `sb-int:broken-pipe` but I'm not
 sure if that's something I should be relying on.  Need to ask in `#sbcl`.
 
+## 2020-02-04
+
+Class today.  Did a bunch of various things after the lecture (I hope I can
+remember it all).
+
+Got the group permissions sorted out on the server in two steps:
+
+* Had the admin [add a `setgid` bit to the team directory][dir-setgid], so new
+  stuff in there will get created with the team group as the owner by default.
+* Then we all added `umask 002` to our `rc` files to change our default `umask`,
+  so when we create files they'll be `664` by default instead of `644`.  This
+  isn't a problem for non-team files because they get created with your personal
+  group as the group.
+
+[dir-setgid]: https://www.gnu.org/software/coreutils/manual/html_node/Directory-Setuid-and-Setgid.html
+
+Once that was sorted out, started diving into aligning the Trapnell data with
+tophat and bowtie.  This is… not pleasant.
+
+First I need to install both tools.  There's an Ubuntu package for `tophat`, but
+it wanted to install NodeJS as a prerequisite.  Why on earth does tophat want
+NodeJS?  `SIGINT`ed that immediately and downloaded the binary version instead.
+
+The wrapper script `tophat2` tries to set up `$PATH` magically so everything
+just works:
+
+    #!/usr/bin/env bash
+    pbin=""
+    fl=$(readlink $0)
+    if [[ -z "$fl" ]]; then
+       pbin=$(dirname $0)
+     else
+       pbin=$(dirname $fl)
+    fi
+    export PATH=$pbin:$PATH
+    $pbin/tophat "$@"
+
+Unfortunately it doesn't try hard enough.  The `readlink` is *attempting* to
+deal with symbolic links there, but only works with absolute links — relative
+symlinks (e.g. `ln -s ./tophat-…/tophat2 tophat2`) just return the relative
+path, which then gets passed to `dirname` and treated as relative from whatever
+directory you happen to be in.  And let's ignore all those unquoted variable
+expansions I guess.  Sigh.  Better version:
+
+    #!/usr/bin/env bash
+    set -euo pipefail
+    pbin=""
+    fl=$(readlink -f "$0")
+    if [[ -z "$fl" ]]; then
+        pbin=$(dirname -- "$0")
+     else
+        pbin=$(dirname -- "$fl")
+    fi
+    export PATH="$pbin:$PATH"
+    "$pbin"/tophat "$@"
+
+Once it's installed, I can finally run it.  This is where things get even more
+confusing.  After a whole bunch of flailing I think I managed to narrow down
+what the hell all these magic parameters are, and how things have changed since
+the paper was written.  Notes:
+
+* Bowtie is an aligner — it aligns reads to a reference genome.
+* Tophat uses bowtie to align reads, but adds splice-awareness on top.
+* To use Bowtie to align, you need to preprocess your reference genome and
+  create an index of the genome.
+* Additionally/separately, you also need to create a *transcriptome* reference
+  genome for the splice awareness.  You do this with tophat, which invokes
+  bowtie under the hood.  This means you end up with *two* bowtie indexes:
+  a genome index and a transcriptome index.
+* To create the genome index: `bowtie2-build --threads 15 foo.fa some/prefix/foo`
+  * This creates a bunch of files in `some/prefix` named `foo.…`.
+  * Why does this take a basename and not just a directory name?
+  * Do the name of the FASTA and the name of the index files have to match?
+* To create the transcriptome index: `tophat2 -p 15 --GTF foo.gff --transcriptome-index=another/prefix/bar some/prefix/foo`
+  * There are three separate things specified here, each in a different way:
+    * The gene annotation file is provided with the `--GTF` option, as a path.  (Even if it's a GFF file, not a GTF file.  There's no `--GFF` option.  Of course.)
+    * The genome index is provided with a positional parameter, as a basename.
+    * The destination for the transcriptome index is a basename, and it'll splat out a bunch of files with various suffixes there.
+* Note that the chromosome names in the `.gtf`/`.gff` file and the sequence
+  headers in the `.fa` file must match.
+  * This caused me much pain when I was trying to use the `.gtf` file (foolishly assuming `--GTF` required a `.gtf`), the inclusion of which seems to be a trap in the input data laid for the unwary student.  I will not be fooled again.
+  * Tophat gave me a completely useless error (`"Couldn't build bowtie index with err = 1"`) when I tried this, rather than tell me anything useful that might have pointed me in the right direction.  Extremely Good Software.
+* Some/all of the prefixes/basenames may or may not have to match.  By the time
+  I flailed enough to get everything working I had renamed everything to
+  a single prefix, but I'm not sure if it was strictly necessary.  I'll
+  investigate tomorrow or Thursday, when I'm less annoyed.
+* Once you've got the two indexes created, you can run the alignment:
+  `tophat2 -p 15 --transcriptome=another/prefix/bar --output-dir baz/ some/prefix/foo *.fq`
+
+That's as far as I got today.  Next step is to clean up this mess and get it
+into the `Makefile`.