# HG changeset patch
# User Steve Losh
# Date 1584476130 14400
# Node ID 2ceedb8b12e9716a645bb1c7bf35dfc83d8c79af
# Parent 21840941f62678a96213b0168e6ba46248defc7c
Update

diff -r 21840941f626 -r 2ceedb8b12e9 README.markdown
--- a/README.markdown Fri Feb 28 18:18:42 2020 -0500
+++ b/README.markdown Tue Mar 17 16:15:30 2020 -0400
@@ -1049,3 +1049,50 @@
 Still need to find that other paper my partner found from the same people.
 
 Submitted weekly report.
+
+# March 2020
+
+## 2020-03-16
+
+Installed TrimGalore. Had to install `cutadapt` first, which is some Python
+package. Luckily it's `apt install`able so I did that, then grabbed TrimGalore
+which is blessedly self-contained.
+
+Installed pardre. The "release" is just a source dump of C++ files with no
+`README`, cool. Had to install `mpi-default-dev` to get `make` to work. Gross.
+
+Used ParDRe to dedupe a couple of the samples as a test. It took 927 seconds to
+finish the first pair, where I used the `.fastq.gz` files as input and the `-z`
+option to enable compression. I noticed a bunch of the time (the last 700
+seconds or so) was spent bottlenecked on a single CPU core while writing the
+output. So then I tried the next sample without compression, and it took 309
+seconds. Welp.
+
+Ran FastQC on the first two sample results. For the most part FastQC is happy
+now, but there *are* still some duplicated sequences left. I'm guessing this is
+because ParDRe is paired-end aware, and will only remove a pair if *the entire
+thing* is a duplicate, while FastQC works only on a per-fastq basis. I don't
+know this for sure though.
+
+I did the initial run with 0 mismatches allowed. We should also try with
+N mismatches, though note that this will not work perfectly because ParDRe
+clusters reads by their prefix, so reads with `<=N` mismatches in their prefix
+(and none in the suffix) will not be caught because they'll never get compared.
+Welp.
+
+Cleaned up the scripts to run all the ParDRe runs (0/1/2 mismatches for every
+pair) and FastQC them. Seems to take roughly 5 minutes per pair to do ParDRe.
+We have 13 pairs, so it should take around an hour per mismatch setting (i.e.
+3 hours) to do them all.
+
+ParDRe finished. Results look reasonable. There's not a *ton* of difference
+between 0 and 1 mismatch (there is *some* though). The difference between 1 and
+2 is minimal. Next step is to trim low-quality stuff with trim galore. Gonna
+do that during "class" tomorrow and let it run while I'm at work.
+
+## 2020-03-17
+
+Started sketching out some trimming scripts using Trim Galore. It has a bunch
+of options — I'm not really sure how to go about deciding what the correct
+values for each are, other than… try some and see if the resulting FastQC run
+seems reasonable?