# HG changeset patch # User Steve Losh # Date 1580869653 18000 # Node ID 75e74c98dc015885350ce62e39d7aabb052a2332 # Parent 504550dd21d546612e735ed623e06d6a0c203936 Update diff -r 504550dd21d5 -r 75e74c98dc01 README.markdown --- a/README.markdown Sun Feb 02 21:12:04 2020 -0500 +++ b/README.markdown Tue Feb 04 21:27:33 2020 -0500 @@ -387,3 +387,94 @@ point, but can't seem to find it now. I used `sb-int:broken-pipe` but I'm not sure if that's something I should be relying on. Need to ask in `#sbcl`. +## 2020-02-04 + +Class today. Did a bunch of various things after the lecture (I hope I can +remember it all). + +Got the group permissions sorted out on the server in two steps: + +* Had the admin [add a `setgid` bit to the team directory][dir-setgid], so new + stuff in there will get created with the team group as the owner by default. +* Then we all added `umask 002` to our `rc` files to change our default `umask`, + so when we create files they'll be `664` by default instead of `644`. This + isn't a problem for non-team files because they get created with your personal + group as the group. + +[dir-setgid]: https://www.gnu.org/software/coreutils/manual/html_node/Directory-Setuid-and-Setgid.html + +Once that was sorted out, started diving into aligning the Trapnell data with +tophat and bowtie. This is… not pleasant. + +First I need to install both tools. There's an Ubuntu package for `tophat`, but +it wanted to install NodeJS as a prerequisite. Why on earth does tophat want +NodeJS? `SIGINT`ed that immediately and downloaded the binary version instead. + +The wrapper script `tophat2` tries to set up `$PATH` magically so everything +just works: + + #!/usr/bin/env bash + pbin="" + fl=$(readlink $0) + if [[ -z "$fl" ]]; then + pbin=$(dirname $0) + else + pbin=$(dirname $fl) + fi + export PATH=$pbin:$PATH + $pbin/tophat "$@" + +Unfortunately it doesn't try hard enough. The `readlink` is *attempting* to +deal with symbolic links there, but only works with absolute links — relative +symlinks (e.g. `ln -s ./tophat-…/tophat2 tophat2`) just return the relative +path, which then gets passed to `dirname` and treated as relative from whatever +directory you happen to be in. And let's ignore all those unquoted variable +expansions I guess. Sigh. Better version: + + #!/usr/bin/env bash + set -euo pipefail + pbin="" + fl=$(readlink -f "$0") + if [[ -z "$fl" ]]; then + pbin=$(dirname -- "$0") + else + pbin=$(dirname -- "$fl") + fi + export PATH="$pbin:$PATH" + "$pbin"/tophat "$@" + +Once it's installed, I can finally run it. This is where things get even more +confusing. After a whole bunch of flailing I think I managed to narrow down +what the hell all these magic parameters are, and how things have changed since +the paper was written. Notes: + +* Bowtie is an aligner — it aligns reads to a reference genome. +* Tophat uses bowtie to align reads, but adds splice-awareness on top. +* To use Bowtie to align, you need to preprocess your reference genome and + create an index of the genome. +* Additionally/separately, you also need to create a *transcriptome* reference + genome for the splice awareness. You do this with tophat, which invokes + bowtie under the hood. This means you end up with *two* bowtie indexes: + a genome index and a transcriptome index. +* To create the genome index: `bowtie2-build --threads 15 foo.fa some/prefix/foo` + * This creates a bunch of files in `some/prefix` named `foo.…`. + * Why does this take a basename and not just a directory name? + * Do the name of the FASTA and the name of the index files have to match? +* To create the transcriptome index: `tophat2 -p 15 --GTF foo.gff --transcriptome-index=another/prefix/bar some/prefix/foo` + * There are three separate things specified here, each in a different way: + * The gene annotation file is provided with the `--GTF` option, as a path. (Even if it's a GFF file, not a GTF file. There's no `--GFF` option. Of course.) + * The genome index is provided with a positional parameter, as a basename. + * The destination for the transcriptome index is a basename, and it'll splat out a bunch of files with various suffixes there. +* Note that the chromosome names in the `.gtf`/`.gff` file and the sequence + headers in the `.fa` file must match. + * This caused me much pain when I was trying to use the `.gtf` file (foolishly assuming `--GTF` required a `.gtf`), the inclusion of which seems to be a trap in the input data laid for the unwary student. I will not be fooled again. + * Tophat gave me a completely useless error (`"Couldn't build bowtie index with err = 1"`) when I tried this, rather than tell me anything useful that might have pointed me in the right direction. Extremely Good Software. +* Some/all of the prefixes/basenames may or may not have to match. By the time + I flailed enough to get everything working I had renamed everything to + a single prefix, but I'm not sure if it was strictly necessary. I'll + investigate tomorrow or Thursday, when I'm less annoyed. +* Once you've got the two indexes created, you can run the alignment: + `tophat2 -p 15 --transcriptome=another/prefix/bar --output-dir baz/ some/prefix/foo *.fq` + +That's as far as I got today. Next step is to clean up this mess and get it +into the `Makefile`.