2ceedb8b12e9

Update
author Steve Losh <steve@stevelosh.com>
date Tue, 17 Mar 2020 16:15:30 -0400
parents 21840941f626
children f4dee6126a31
branches/tags (none)
files README.markdown

Changes

--- a/README.markdown	Fri Feb 28 18:18:42 2020 -0500
+++ b/README.markdown	Tue Mar 17 16:15:30 2020 -0400
@@ -1049,3 +1049,50 @@
 Still need to find that other paper my partner found from the same people.
 
 Submitted weekly report.
+
+# March 2020
+
+## 2020-03-16
+
+Installed Trim Galore.  Had to install `cutadapt` first, which is some Python
+package.  Luckily it's `apt install`able so I did that, then grabbed Trim Galore,
+which is blessedly self-contained.
+
+Installed ParDRe.  The "release" is just a source dump of C++ files with no
+`README`, cool.  Had to install `mpi-default-dev` to get `make` to work.  Gross.
+
+Used ParDRe to dedupe a couple of the samples as a test.  It took 927 seconds to
+finish the first pair, where I used the `.fastq.gz` files as input and the `-z`
+option to enable compression.  I noticed a bunch of the time (the last 700
+seconds or so) was spent bottlenecked on a single CPU core while writing the
+output.  So then I tried the next sample without compression, and it took 309
+seconds.  Welp.
+
+Ran FastQC on the first two sample results.  For the most part FastQC is happy
+now, but there *are* still some duplicated sequences left.  I'm guessing this is
+because ParDRe is paired-end aware, and will only remove a pair if *the entire
+thing* is a duplicate, while FastQC looks at each FASTQ file on its own.  I don't
+know this for sure though.
+
+I did the initial run with 0 mismatches allowed.  We should also try with
+N mismatches, though note that this will not work perfectly because ParDRe
+clusters reads by their prefix, so reads with `<=N` mismatches in their prefix
+(and none in the suffix) land in different clusters and never get compared.
+Welp.
+
+Cleaned up the scripts to run all the ParDRe runs (0/1/2 mismatches for every
+pair) and FastQC them.  Seems to take roughly 5 minutes per pair to do ParDRe.
+We have 13 pairs, so it should take around an hour per mismatch setting (i.e.
+about 3 hours total) to do them all.
+
+ParDRe finished.  Results look reasonable.  There's not a *ton* of difference
+between 0 and 1 mismatch (there is *some* though).  The difference between 1 and
+2 is minimal.  Next step is to trim low-quality stuff with Trim Galore.  Gonna
+do that during "class" tomorrow and let it run while I'm at work.
+
+## 2020-03-17
+
+Started sketching out some trimming scripts using Trim Galore.  It has a bunch
+of options — I'm not really sure how to go about deciding what the correct
+values for each are, other than… try some and see if the resulting FastQC run
+seems reasonable?
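
The runtime estimate in the 2020-03-16 entry can be sanity-checked with a minimal Python sketch. The numbers are taken from the notes; the function and variable names are illustrative only, and the 5-minutes-per-pair figure is itself a rough estimate. The two timed test runs used different samples, so the compression ratio is only indicative.

```python
# Figures from the 2020-03-16 notes; names here are hypothetical.
PAIRS = 13                      # number of paired-end samples
MINUTES_PER_PAIR = 5            # rough ParDRe time per pair
MISMATCH_SETTINGS = (0, 1, 2)   # mismatch values tried


def estimate_minutes(pairs, minutes_per_pair, settings):
    """Return (minutes per mismatch setting, total minutes over all settings)."""
    per_setting = pairs * minutes_per_pair
    return per_setting, per_setting * len(settings)


per_setting, total = estimate_minutes(PAIRS, MINUTES_PER_PAIR, MISMATCH_SETTINGS)
print(f"{per_setting} min per setting, {total / 60:.2f} h total")

# Wall-clock times observed for the two test pairs.
COMPRESSED_SECONDS = 927     # .fastq.gz input, -z output compression
UNCOMPRESSED_SECONDS = 309   # no compression
slowdown = COMPRESSED_SECONDS / UNCOMPRESSED_SECONDS
print(f"compressed run took {slowdown:.1f}x as long")
```

This gives 65 minutes per setting, so "around an hour" and "3 hours" are slightly optimistic but in the right range.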