# HG changeset patch # User Steve Losh # Date 1698020148 14400 # Node ID 8de6c03950e7cfe625acccd39f7920231893d0cd # Parent f76b2070bb71dc2f84cf4cfeb4d1ba0edea969cd Move month diff -r f76b2070bb71 -r 8de6c03950e7 2023.markdown --- a/2023.markdown Sun Oct 22 20:15:21 2023 -0400 +++ b/2023.markdown Sun Oct 22 20:15:48 2023 -0400 @@ -795,3 +795,1092 @@ The number of recombination events (crossovers) per chromosome is random, but is usually relatively low (3-5 per chromosome). +# September 2023 + +## 2023-09-01 + +HG545. Looked over the slides last night and was a little worried, but felt +okay after the lecture for the most part. Still a few things I need to look up +and I do still need to get my fleeting notes into this, but I feel okay. + +Continuing the Snakemake tutorial. + +Threads can be specified for a given job with `threads: 8`, and you need to +propagate that to the command yourself with `{threads}`. Will be scaled down if +run with fewer cores than threads, otherwise will wait until that many are +available. + +Snakemake has some support for noticing log files, but it seems like you have to +manually create them yourself? This seems… tedious? + + rule bwa_map: + input: + "data/genome.fa", + lambda wc: SAMPLES[wc.sample] + output: + "mapped_reads/{sample}.bam" + threads: 8 + params: + rg=r"@RG\tID:{sample}\tSM:{sample}" + log: "logs/bwa_map/{sample}.log" + shell: + "(" + "bwa mem -R '{params.rg}' -t {threads} {input}" + " | samtools view -Sb - > {output}" + ") >{log} 2>&1" + +Do I really have to wrap everything in `(…) >{log} 2>&1` by hand myself? + +You can get a summary of file provenance with `snakemake --summary`. The output +is a TSV, so I went down a rathole of pretty-printing TSVs and eventually found +that `| column -s $'\t' -t` works (mnemonic: `s$tt`). I love how every UNIX +program gets to invent its own bespoke command line interface for specifying +special characters. Really great. + +Can mark outputs as `temp()` and `protected()`, which is nice. 
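On the question of wrapping every command in `(…) >{log} 2>&1` by hand: a Snakefile is Python, so one option is a tiny helper that does the wrapping. This is a minimal sketch, not a Snakemake built-in; the name `logged` is an assumption made up for illustration.

```python
def logged(command: str) -> str:
    """Wrap a shell command so stdout and stderr both land in the rule's log.

    The doubled braces emit a literal `{log}`, which is left for Snakemake
    to substitute later, along with any other placeholders in `command`.
    """
    return f"({command}) >{{log}} 2>&1"


# Placeholders such as {threads} pass through untouched.
wrapped = logged("bwa mem -t {threads} {input} | samtools view -Sb - > {output}")
print(wrapped)
```

In a rule this would be used as `shell: logged("…")`, assuming the rule also declares a `log:` entry for `{log}` to resolve to.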
+ +Need to install singularity *inside* my VM: + + # Ensure repositories are up-to-date + sudo apt-get update + + # Install debian packages for dependencies + sudo apt-get install -y \ + wget \ + build-essential \ + libseccomp-dev \ + libglib2.0-dev \ + pkg-config \ + squashfs-tools \ + cryptsetup \ + runc + + # Install Golang + export VERSION=1.21.0 OS=linux ARCH=amd64 && \ + wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz && \ + sudo tar -C /usr/local -xzvf go$VERSION.$OS-$ARCH.tar.gz && \ + rm go$VERSION.$OS-$ARCH.tar.gz + + echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc && \ + source ~/.bashrc + + # Install Singularity + export VERSION=3.11.4 && \ + wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz && \ + tar -xzf singularity-ce-${VERSION}.tar.gz && \ + cd singularity-ce-${VERSION} + + ./mconfig && \ + make -C builddir && \ + sudo make -C builddir install + +## 2023-09-02 + +It is time to shave the LaTeX yak again. Installed it with `texlive-latex-base` +to start, we'll see if I need to add some more crud in later. Going to go +through some guides for now. + +Going to note some things to remember. Skeleton of document: + + \documentclass{article} + \begin{document} + + Basic text. + + \end{document} + +Math: + + Inline math $y = 3 \sin x$ example. + + Block equation: + \[ + y = 3 \sin x + \] + + With reference: + \begin{equation}\label{equa} + y' = 3 \cos x + \end{equation} + refer to it by label, e.g. equation (\ref{equa}). + + More complicated: $x^2$ and $x^{2+\alpha}$ and $y_{n+1}$. + +Verbatim: + + Verbatim text: \verb"$x^{2+\alpha}$". Delimiter can be anything ala sed, + \verb_%%&_ or \verb+$$+. + + Must escape special characters \&, \$, \%, \_, \{, \}, and \#. + + \begin{verbatim} + A whole verbatim region. + + (defun square (x) + (* x x)) + \end{verbatim} + +Comments: + + Comments exist. % This is a comment. 
+ +Type styles: + + Shapes: + \textup{Upright} + \textit{Italic} + \textsl{Slanted} + \textsc{Small} + + Series (weight): + \textmd{Medium} + \textbf{Boldface} + + Families: + \textrm{Roman} + \textsf{Sans} + \texttt{Typewriter} + +Emphasis: + + \emph{Never} do Foo! + +"Environments" are sections that are treated differently, made with `\begin{…}` +and `\end{…}`. + +Lists: + + Unordered list: + \begin{itemize} + \item Foo + \item Bar + \item Baz + \end{itemize} + + Ordered list: + \begin{enumerate} + \item One + \item Two + \item Three + \end{enumerate} + + Customizable labels: + \begin{description} + \item[Rule 1.] Foo + \item[Rule 2.] Bar + \item[Rule 3.] Baz + \end{description} + +Sizes (note the brace comes BEFORE the command!): + + {\Huge Huge} + {\huge huge} + {\LARGE LARGE} + {\Large Large} + {\large large} + {\normalsize normalsize} + {\small small} + {\footnotesize footnotesize} + {\scriptsize scriptsize} + {\tiny tiny} + +Centering: + + \begin{center} + {\large\textbf{Assignment 1}}\\% The \\ linebreaks. + Steve Losh\\ + BS521 + \end{center} + +Example table. + + \begin{tabular}{l|rc} % lrc = cols should be left, right, centered, pipe for vertical line + Name & Mark & Grade \\ + \hline\hline + Foo & 99 & A+ \\ + Bar & 51 & C \\ + Baz & 5 & F + \end{tabular} + +Colspan with multicolumn command. + + \begin{tabular}{|l||r|r|} + \hline + & \multicolumn{2}{c|}{Grades} \\ + \cline{2-3} + Name & Class 1 & Class 2 \\ + \hline\hline + Foo & 99 & 88 \\ + Bar & 51 & 65 \\ + Baz & 5 & 58 \\ + \hline + \end{tabular} + +Full example, with referencing and caption, e.g. `Table~\ref{tab:a} on page~\pageref{tab:a}`. + + % b = try to put at Bottom. Also t top, h here, p separate page. + % Can do multiple in order of preference. + % [!t] ! 
= try harder + \begin{table}[b] + \begin{center} + \caption{An Example Table} + \label{tab:a} + + \begin{tabular}{lr} + Name & Value \\ + \hline + Foo & 1.0 \\ + Bar & 15.9 \\ + Baz & 6.2 + \end{tabular} + % \caption{Caption at the end works too.} + \end{center} + \end{table} + +Sections: + + \section{Some section} % includes numbering + \subsection{Some subsection} + + \section*{Some section} % no numbering + \subsection*{Some subsection} + +Quotation marks (hilarious): + + `Single quoted' + ``Double quoted'' + +Change overall text size (simple): + + \documentclass[11pt]{article} % only valid sizes are 10/11/12. + +Palatino instead of Computer Modern: + + \usepackage{mathpazo} + \linespread{1.05} % needs more leading (space between lines) + \usepackage[T1]{fontenc} + +Vimtex stuff: + +* Close thing with `]]` in insert mode. + +Got about that far, which was enough to start my BIOSTAT-521 homework. Will dig +in again later, but it's nice to be able to use it for something real to +practice. + +Puttered around a bit looking at other fonts, but didn't find anything new or +interesting. + +## 2023-09-03 + +Spent most of today getting the reading room in my apartment ready. Went to +IKEA, looked around a lot and got some ideas, picked up a chair and bookshelf +for my still-boxed unread books. Not the most productive Sunday, but sitting in +my chair and reading at night felt good. + +## 2023-09-04 + +Continuing the EdX genetics course over breakfast. + +## 2023-09-05 + +BIOSTAT-521 and BIOINF-500 classes this morning. + +Going to spend my non-class time looking into Unicycler today (and taking care +of paperwork if anything that needs my attention crops up) and Bandage to +visualize the results. + +Grabbed the container from StaPH-B and tried to figure out where the executable +is. I think it's `/unicycler/unicycler-runner.py`. 
Grabbed the sample data +from the Unicycler repo and got something running: + + singularity exec containers/unicycler.sif \ + /unicycler/unicycler-runner.py \ + --short1 sample_data/short_reads_1.fastq.gz \ + --short2 sample_data/short_reads_2.fastq.gz \ + --out assembly/ + +Pretty straightforward so far. The result directory contains a log and a bunch +of GFA files. GFA apparently stands for [Graphical Fragment +Assembly](https://gfa-spec.github.io/GFA-spec/GFA1.html), but I think it's +"graphical" in the "directed/undirected graph" sense and not in the "pixels" +sense. Text-based format which seems pretty straightforward to parse (maybe +I should have a go at parsing it for fun). + +Installed Bandage and viewed the results. Not sure what exactly I'm looking at, +but it works and looks pretty enough I guess. [This +example](https://github.com/rrwick/Bandage/wiki/Simple-example) was helpful to +get a sense of what it's trying to show me. + +Unicycler has some configuration knobs to tweak: + +> Unicycler can be run in three modes: conservative, normal (the default) and +> bold, set with the --mode option. Conservative mode is least likely to produce +> a complete assembly but has a very low risk of misassembly. Bold mode is most +> likely to produce a complete assembly but carries greater risk of misassembly. +> Normal mode is intermediate regarding both completeness and misassembly risk. + +Reran with both conservative and bold modes and looked at the difference in the +results for the sample data. They're not the same, but I can't visually detect +any major obvious differences. Maybe it's not a big deal on this sample. + +Once I got that running in a shell, I got it ported into Snakemake, shaving +a bunch of yaks along the way. + +I noticed that Snakemake can take a JSON config file instead of YAML. 
Fantastic, +switched over to that right away: + + { + "containers": { + "fastqc": "docker://staphb/fastqc:0.12.1", + "unicycler": "docker://staphb/unicycler:0.5.0" + }, + "samples": { + "short_read_example": [ + "sample_data/short_reads_1.fastq.gz", + "sample_data/short_reads_2.fastq.gz" + ] + } + } + +Got the containers downloading via snakemake as well, so it's snakes all the way +down: + + rule containers: + input: + expand("containers/{name}.sif", name=config["containers"].keys()), + + rule container: + output: + "containers/{name}.sif", + params: + source=lambda wc: config["containers"][wc.name], + shell: + "singularity pull {output} {params.source}" + +Got `snakefmt` working with Neoformat so I can `F6` in Vim to reformat. Had to +fuck around with the config because it was just emptying out the file — I think +the key was that: + +* `snakefmt` edits in-place by default (gross). +* Need to use `replace`: `1` in the Neoformat config to deal with this. + +And finally we can assemble: + + def get_input_fastqs(wildcards): + return config["samples"][wildcards.sample] + + rule unicycler_assemble: + log: + "logs/unicycler_assemble/{sample}.log", + container: + "containers/unicycler.sif" + threads: 8 + input: + "containers/unicycler.sif", + get_input_fastqs, + output: + "assemblies/{sample}/assembly.gfa", + "assemblies/{sample}/assembly.fasta", + shell: + logged( + "/unicycler/unicycler-runner.py" + " --threads {threads}" + " --short1 {input[0]}" + " --short2 {input[1]}" + " --out assemblies/{wildcards.sample}/" + ) + +Puttered around changing my StumpWM and terminal colors/borders/etc a bit to +make them a little easier on my eyes. We'll see if it sticks. + +## 2023-09-06 + +HG545 in the morning. Mostly understood things. + +Asked the professor after about a question I had while reading the paper. One +of the things the paper did to confirm the region of interest was to use a PAC +(P1-derived artificial chromosome) to "rescue" the golden embryos. 
The +resulting fish showed mosaic rescue, confirming that the wild-type gene was +likely on that PAC, i.e. in the region of interest. + +What I didn't understand is how injecting the plasmid into the embryos resulted +in the expression of the genes farther down the developmental line, e.g. does it +somehow get incorporated into the cells' genomes? It turns out to be messy. + +First: you don't inject "a PAC" into the embryos, you inject "a shitload of +copies of the PAC" into the embryo. So all of the embryo's cells will have +copies of the PAC floating around inside. As the cell divides, those will get +diluted in daughter cells over time. Some of these copies will, by chance, make +it into the nucleus of their cells. And some of those (rarely) will get +randomly incorporated into the cell's genome, and from then on mitosis takes +over and the gene gets propagated normally. So the mosaic region from that +point forward will have the gene (and if the region happens to contain some +melanophores, also the rescued wild-type phenotype). + +Got my Armis access at some point during class, so it's time to figure out how +to log into the various HPC clusters today. + +Double-checked the exam schedule to make sure nothing conflicts. I think it's +fine. + +Changed my school password after the network clusterfuck last week. Sigh. + +Wanted to print something in the lab, realized I never installed any printing +support on this laptop, lol. `apt install cups` will hopefully Just Work. CUPS +interface is at `http://localhost:631/`. It did not Just Work. Surely 2024 +will be The Year of Linux on the Desktop. Printer wouldn't configure itself, +driver didn't appear in the list when I tried to manually configure it through +CUPS. `apt install printer-driver-all` got me more drivers but not this one. +Tracked it down on the Brother site and downloaded some `.deb` packages but +they're 32-bit instead of 64. Gave up at this point, what a janky shitshow of +an OS.
If only everything else didn't suck in even worse ways. + +Tried getting the VPN running. Installed with the script into `/opt/cisco` +(good). Got mysterious errors when trying to connect. Tried the GUI connection +manager, which runs, but gave a more informative error message I could search +for. Looks like I need to install `libwebkit2gtk-4.0-37`. Installed that, now +I get the login screen, but I can't 2FA because I only have Yubikeys set up but +that requires a real browser, not this jank webkitgtk thing, so now I need to +set up *another* 2FA method just for this. Good god. Tried to add a new 2FA +device via Duo, but that requires 2FAing *again*, but *this* time it's through +the Duo site which doesn't actually fucking work on Linux, so I can't add it +here, I'll need to use the Windows shitbox at home. God, I hate two-factor +authentication so much. It's *always* miserable. + +Tried to do the homework for BIOINF-500 (creating a pubmed search alert). To do +this you need an NCBI account. Tried to log in via my UM account and managed to +500 the NCBI site. Incredible. Poked around and eventually got it working (I +think)? Created the alert, took a screenshot. Uploaded the PNG into Canvas for +the assignment. Canvas shows an error trying to retrieve it. Uploaded it to +the Canvas "My Files" and used *that* to submit the assignment, instead of +uploading the file directly, and that worked. *Incredible*. Why does *nothing* +ever work correctly? + +Came home, tried to add the 2FA with the Windows box, but it failed in the same +way (hanging on the popup after successfully touching the yubikey). But +I finally figured out a workaround (in retrospect, I vaguely recall having to do +this when I added the extra yubikeys originally): log out of everything, then go +to log back into something (e.g. Wolverine Access), but **before** you 2FA in +*that* login process there will be a link to add a new device on the left side +of the screen. 
That will then require you to 2FA, but doing it *here* does +actually work. It seems absolutely wild to me that you need to *not* be logged +in if you want to manage 2FA, but here we are. + +Now that I have the Duo app thing on my phone and connected, logging into the +VPN seems to work great. Yak shaved successfully. + +## 2023-09-07 + +BS521 and its lab this morning. + +Got my dotfiles synced to GL. One tricky thing: my remote `.bash_profile` +sources `/etc/profile` if it exists, but that causes problems on the cluster +because there's some read-only variable set in there that it doesn't like. +Commented out that line and everything looks okay. `ControlMaster` does work, +so I won't have to auth a billion times a day (thank god). + +BS521 lab. Of course the AC isn't working so it's a billion degrees, lovely. + +Installing tidyverse failed with inscrutable errors. After some googling it +[looks like](https://blog.zenggyu.com/posts/en/2018-01-29-installing-r-r-packages-e-g-tidyverse-and-rstudio-on-ubuntu-linux/index.html) +there's *extra* dependencies you need to install on Linux (C programming is +wonderful): + + sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev \ + libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev \ + libpng-dev libtiff5-dev libjpeg-dev + +Still fucked, so I manually trawled through the tons of log output, googled +around more, and found that I need to configure an env variable: + + Sys.setenv(PKG_CONFIG_PATH="/usr/lib/x86_64-linux-gnu/pkgconfig") + +And finally it works. Started going through the lab stuff but then time was up +— will come back to it later. + +Moving on to rotation lab work. Working from home today, so I wanted to get my +laptop working with my external monitor and keyboard and such. 
Had to swap out +the USB-C cable I was using previously because apparently that one doesn't work +for video ("universal" serial bus my ass) and then adjust my StumpWM `xrandr` +commands, but I did get it working, so now I can use the laptop at my desk, +which is nice. + +Note to self: I can probably use the text-mode VPN UI now that I've got the 2FA +sorted out, and maybe remove the webkit crap I installed for it originally. + +Finished BS521 lab 0. Wasn't too bad once I shaved the tidyverse yak. Getting +it written up with Latex required more Latex derusting, this time for code +listings and images (pulled some of this from the MS thesis). First, preamble +stuff: + + % Listing package for code listings. + \RequirePackage{listings} + \lstdefinestyle{default}{ + basicstyle=\footnotesize\ttfamily, + showtabs=true, + frame=lines, + aboveskip=10pt, + } + \lstset{ + language=, + style=default, + } + + % Used to embed plots. + \usepackage{graphicx} + + % No paragraph indentation for homework, just looks awkward. + \setlength{\parindent}{0pt} + + % Inline code. + \def\code#1{\small\texttt{#1}} + +I really need to split these up into actual files I can include instead of +copypasting them a million times, but maybe I'll wait for one more practice +round first. + +Usage: + + % Code listings. + \begin{lstlisting} + prop.table(table(DATA$Race)) + + Black MexicanAmerican Other OtherHispanic White + 0.22921790 0.23954747 0.04230202 0.03885883 0.45007378 + \end{lstlisting} + + + % Graphics. + \includegraphics[]{figures/bmi-hist} + + \begin{center} + \includegraphics[width=0.45\textwidth]{figures/hist-age} + \includegraphics[width=0.45\textwidth]{figures/hist-log-age} + \end{center} + +To actually save PDF plots with R: + + pdf("figures/foo.pdf", height=6, width=6) + hist(DATA$Foo, main = "Distribution of Foo", xlab="Foo") + dev.off() + +Went to the poster session. Lots of stuff I don't understand, and a tiny bit +that I do. + +## 2023-09-08 + +HG545 this morning. 
+ +Papers never say what *could* have gone wrong with what they did — you have to +just read between the lines and actively think about that (and what it would +have meant, and what you would have done if it did). + +Learned about nonsense-mediated decay: a mechanism where mRNA with premature +stop codons is degraded, instead of expressing a (probably truncated) protein. +Without this, if you have a mutation that creates a stop codon in the middle of +the gene, you would see truncated protein expressed. But because of NMD, the +mRNA is degraded and doesn't express the broken protein (as much). This is good +not only to reduce wasted translation, but because the truncated proteins can be +actively bad. + +One important control that was left out of the study where they wanted to find +where in the organism the target gene is being expressed: inject a probe with +GFP that intentionally shouldn't match *anything*, and expect it to show up +vaguely all over (or not at all). + +Another example covered during class: if you suspected a phenotype was caused by +a mutation in a promoter (instead of in an exon), how would you test this? +There were a couple of things folks came up with: + +* Could sequence the region in the mutant and wild-type population, compare to + see if the mutation segregates the two reliably. +* Old school: "reporter genes". I'm a bit fuzzy on this, but I think you insert + the promoter into a vector with some easily observable gene (e.g. luciferase, + a bioluminescent protein). Then you see if that product is expressed more or + less with the different variants of the promoter. This is a bit janky because + just yanking the promoter completely out of context can be problematic (e.g. + loses the chromatin structure around it, nearby enhancers/repressors, etc). +* Could use RNAseq to see if the mutants with the variants are producing more of + the RNA for that gene. 
+* Could use CHIPseq, if you know the transcription factors that bind to that + promoter. Fix, fragment, attach antibodies to the TFs, precipitate them out, + unfix, extract the DNA (all the remaining is whatever was bound to the + transcription factors), and then do the sequencing. You would expect to see + a larger signal if the mutation in the promoter is causing transcription + factors to be more likely to bind. + + +Got back and tried VPN'ing with the command-line client. It seemed to hang +after entering my password, but then I realized it had just silently tried to +2FA with my phone and I didn't notice. Trying again and being ready with Duo +let me log in, so I think I can probably ditch the webkit crap I installed for +the graphical thing. + +Desktop machine wouldn't take input from my USB hub all of a sudden. Found some +bullshit in the logs, probably not worth debugging Yet More Linux Jank if I'm +just going to wipe this machine and install Debian on it soon anyway. Tried to +reboot and systemd hung at the end, so I just powercycled the damn thing. If +I could just have one single day where no computer broke for me, that would be +so nice. + +Flu shots are available, need to get one so PI doesn't get pinged all the time. + +Read for BS521 class. All still pretty basic. Cleaned up and turned in lab 0. +Finished homework 2 as well, just to get it out of the way. Or at least +I thought I did, except there are apparently Surprise Questions™ not in the book +to do with R. I'll do that this weekend. + +## 2023-09-09 + +Actually finished BS521 homework 2. Realized my Latex `\code` shortcut was +broken: + + % Broken, doesn't scope the \small so later text is changed. + \def\code#1{\small\texttt{#1}} + + % Fixed v v + \def\code#1{{\small\texttt{#1}}} + +Did a first draft of the HG545 assignment 1. This one is a lot harder than the +stats homework. Need to polish it up and submit it tomorrow. + +## 2023-09-10 + +Polished and submitted the HG545 homework. 
We'll see how it goes, I guess. + +## 2023-09-11 + +HG545 discussion. This paper was pretty straightforward. + +Met with PIBS peer mentor. + +HG545 second paper was posted, need to do an initial read of that tonight. + +## 2023-09-12 + +BS521 again. Mostly basic linear regression stuff, but got a few interesting +tidbits out of it, mostly about the coefficient of determination, also called +`R²` or `r²`. This is the square of the correlation coefficient `r`, and it is +said to mean "the fraction of the variability in the data that is explained by +the linear model". So an `r²` of `0.7` would mean "70% of the variation in the +data is explained by the model". + +*Intuitively* what this means would be to look at the total variability in the +data, i.e.: + + (- (reduce #'max y) (reduce #'min y)) + +Then convert the data to residuals by subtracting out the model: + + (mapcar #'- (mapcar #'model x) y) + +and look at how much variability remains: + + (- (reduce #'max residuals) (reduce #'min residuals)) + +Compare the two to see the fraction that remains after accounting for the model; +`r²` is roughly one minus that. The formal definition uses sums of squared +deviations rather than max-minus-min ranges, but the idea is the same. + +Looked into some "R for actual programmers" resources so maybe I can feel like +I'm flailing less: + +* +* +* +* +* +* + +Lunch at a place called Maizie's. Was actually pretty good! + +Doing Yet Another Round of Paperwork for the VA. So much red tape. Did what +I could here, but there's a bunch I can't do until I get home after class today. + +So far I'm loving the look of the stumpwm config changes I made the other day. +Shouldn't have waited this long to clean things up. TODO: use +`select-from-menu` to implement a better screen-switching shortcut in stump. + +Figured out how to print. Use to find +a reasonable printer nearby, then Print Here to use it. You upload a PDF or +whatever through the web UI. Good enough, it works. One color paper cost +$3.22 of my (apparent) $24 print budget. Welp. + +PIBS800. Getting… another lecture about how to use the library?
Didn't we +already do this in the other class? + +Spent a bit more time tracking down my white whale font from that 1979 Science +issue. Identifont came to the rescue and I think I finally have an answer, or +at least something very close: "Rotation" by Arthur Ritzel from 1971. +Unfortunately a 50-year old font still has ghastly licensing options, so I'll +probably never be able to *use* it, but at least I have peace of mind, I guess. + +## 2023-09-13 + +HG545. This module is focusing on how to create physical maps of chromosomes, +especially the bizarre human Y chromosome. + +There's a difference between a genetic map and a physical map. A genetic map +can be created with e.g. linkage analysis, and can tell you relative distances +but not necessarily the exact locations of things. A physical map shows the +actual locations. Note that physically linked genes might not necessarily be +genetically linked if they're far enough apart that the recombination chance is +50%. + +We can't use genetic mapping for the Y chromosome because there's not +recombination with another chromosome. + +In the paper they used hierarchical shotgun sequencing to sequence the +Y chromosome, which goes roughly like this: + +1. Fragmented the human genome into ~200kb fragments. +2. Cloned those into BACs +3. You want to retrieve *only* the fragments from the Y chromosome, not from the others. +4. Start with a known gene on Y (e.g. a well-known gene like Sry, the + sex-determining gene) and you PCR that to amplify the fragment(s) that + contain it. +5. Sequence those fragments (split into 20kb and shotgun sequence). +6. Design more PCR primers that *start* at the ends of *those* fragments, use + those to amplify things next to it. +7. Repeat to get overlapping tiles. 
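The walking steps above can be sketched as a toy model. This is only an illustration under loud assumptions: clones are reduced to made-up `(start, end)` coordinates instead of real sequence, the names `Clone` and `walk` are hypothetical, and the walk only extends rightwards from the seed gene.

```python
from typing import List, Tuple

# A clone is just a (start, end) interval on the chromosome in this toy model.
Clone = Tuple[int, int]


def walk(clones: List[Clone], seed_pos: int) -> List[Clone]:
    """Greedy rightward walk: begin at a clone containing seed_pos, then keep
    hopping to the overlapping clone that reaches furthest past the current end."""
    containing = [c for c in clones if c[0] <= seed_pos < c[1]]
    if not containing:
        return []
    current = max(containing, key=lambda c: c[1])
    path = [current]
    while True:
        # Candidates must straddle the current clone's right end, which stands
        # in for "primers designed at the end of the sequenced fragment".
        candidates = [c for c in clones if c[0] < current[1] < c[1]]
        if not candidates:
            break
        current = max(candidates, key=lambda c: c[1])
        path.append(current)
    return path


# Made-up library: four clones tile one region, one clone is unreachable.
clones = [(0, 200), (150, 350), (300, 500), (320, 420), (700, 900)]
tiling = walk(clones, seed_pos=50)
print(tiling)
```

The walk stops at the first gap it cannot bridge, so the clone at `(700, 900)` is never reached in this example.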
+ +You end up with overlapping tiles: + + ---------Sry---------- ----------Zry-------------- + >>> <<< + ----------------- + >>> + ------------------- + +Nowadays we can take advantage of long read tech to eliminate a lot of the grunt +work in the process, e.g.: + +* Oxford Nanopore: 50-500kb, 90% accuracy. +* PacBio: 20kb, >99% accuracy. + +Oxford Nanopore still has pretty bad accuracy, but is useful to resolve things +when PacBio still runs into trouble with some of the crazy-long repeats. + +Also learned about some kind of "bionano" thing that was glossed over very +quickly. Looks like it's a company? Need to ask someone about this. + +Next talked about content of the human genome: + + Human Genome + Unique DNA (1/3) + Repetitive DNA (2/3) + Dispersed Repeats + Transposable Elements (e.g. LINEs, Alu) + Retrogenes (e.g. CDY) + Transposed Genes (e.g. DAZ) + tDNA + Local Repeats + Segmental Duplication (e.g. palindromes) + Satellite Duplication + rDNA + +Repeats are challenging to assemble, e.g. if you have: + + Unique A | LINE1 | Unique B | LINE1 | Unique C + +You might get reads like: + + A1 + 1B1 + 1C + +It's hard to tell which direction the `1B1` should go, or whether `A` should go +directly to `C`. `LINE1` specifically can be resolved with PacBio because it's +only ~6.5kb, far less than the 20kb you get from PacBio, but other segments +still cause problems. + +Examples of problematic regions are the large palindromes from the paper: + + 1.45mb arm + <------------------------ Unique --------------------------> + arms have ~99.97% nucleotide identity + +Even if there are a few SNPs on the arms, if the segments right around the +unique part happen to be identical it's hard to tell which arm goes where. + +Looked into the PACCAR thing from yesterday, but the application form is +extremely long and I already have enough red tape to deal with through the VA, +so I'm not going to add more paperwork for myself. Oh well.
+ +Met with John Prensner about possibly rotating in his lab. Next steps for +rotations are pretty clearly to set up some chats with his students and some +from Boyle/Parker labs to make a choice for the next 1-2 slots. I'll try to do +that for next week I think. Also want to talk to Shavit again; I really liked +chatting with him, and I think if I wanted to rotate there I would need him to +join the department as an affiliate of some kind, so I'd need to see if he's +okay with doing that. + +## 2023-09-14 + +I am going to `mark` every time I have to log in and/or 2FA for school for at +least a week, so I can graph it and be sad. Adjusted my `marks` thing to go to +my Syncthing dir. + +Sped up my shell prompt by wrapping the Mercurial prompt in a basic `.hg` +existence check. Had to relearn how to write a fish function. + +BS521. Chatted with the professor at office hours a bit to ask a couple of +things. + +Made a TODO list with all the homework/exam/lab stuff for school. Hopefully +this will make it easier to see what's coming up since Canvas is barely usable. + +Started reaching out to set up chats with folks in a few labs I might be +interested in for my next rotation. + +## 2023-09-15 + +HG545 this morning. + +## 2023-09-16 + +BS521 reading. + +Z-score means "number of standard deviations above the mean". + +Successes-based distributions: + +* Geometric: number of trials before observing a success. +* Binomial: number of successes in a fixed number of trials. + +The chapters of this book are getting sloppier as they go on; I'm noticing +a lot more typos now than in the first couple of chapters. + +Went back to John D Cook's R for Programmers post when the `pnorm` function was +mentioned. R has several of these functions with very confusing names: + + [prefix][distribution] + + prefix: d: PDF ("density") + p: CDF ("probability") + q: Quantile, i.e. CDF⁻¹ + r: Random sample + + distribution: norm: Normal aka gaussian + unif: Uniform + +So `pnorm` is "the CDF of a normal distribution".
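For comparison, the same d/p/q/r quartet can be sketched in Python with the standard library's `statistics.NormalDist`. The wrapper names below just mirror the R names for illustration and are not part of any library; the standard normal (mean 0, standard deviation 1) is assumed.

```python
from statistics import NormalDist

# Standard normal, assumed here to match R's defaults for the *norm functions.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def dnorm(x: float) -> float:
    """d: probability density function at x."""
    return std_normal.pdf(x)


def pnorm(x: float) -> float:
    """p: cumulative distribution function, P(X <= x)."""
    return std_normal.cdf(x)


def qnorm(p: float) -> float:
    """q: quantile function, the inverse of the CDF."""
    return std_normal.inv_cdf(p)


def rnorm(n: int, seed=None) -> list:
    """r: n random draws (a seed makes the draws repeatable)."""
    return std_normal.samples(n, seed=seed)


print(pnorm(0.0))         # CDF at the mean
print(qnorm(pnorm(1.3)))  # q undoes p
print(rnorm(3, seed=1))
```

The round trip `qnorm(pnorm(x))` returning `x` is a quick way to remember that q is the inverse of p.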
+ +Found a way to view which fonts a PDF file embeds and/or references: `pdffonts`. +Nice. + +Tired of CACL crashing on my laptop because I don't have CCL, so I'll just +install CCL. + + git clone https://github.com/Clozure/ccl.git ccl + curl -L -O https://github.com/Clozure/ccl/releases/download/v1.12.2/linuxx86.tar.gz + cd ccl + tar xf ../linuxx86.tar.gz + ./lx86cl64 + (rebuild-ccl :full t) + + sudo ln -s /home/sjl/src/ccl/lx86cl64 /usr/local/bin/ccl64 + +Finally discovered the reason my bash prompt gets mangled sometimes: +non-printing characters in `PS1` have to be wrapped in `\[…\]`. So I need to do +something ugly like this: + + export PS1='\n\[${PINK}\]\u \[${D}\]at \[${HOST_COLOR}\]\h \[${D}\]in \[${GREEN}\]\w\[${D}\] $(last_return_value)$ ' + +But at least it works properly now and won't drive me crazy. + +Did a bit more font hunting. Looking for something to use for figures that +looks plotter-esque, but isn't something with a couple of scattered glyphs and +no weights like the plotter fonts I've found. Licensing is a minefield, but +Google Fonts has a bunch of stuff that's under the open font license, and +I think I found a couple that might work: Quicksand and Nunito. Of the two, +Nunito seems a little nicer to me. Will need to try it in some graphs and see +how it works. + +## 2023-09-17 + +Trying to get ahead of classwork for the next couple of weeks, since I've got so +many other things going on. + +Did the reading for BS521 for the next two weeks. + +Finished BS521 homework 3. + +## 2023-09-18 + +HG545. Feeling better about this module than the last, which is surprising +because I enjoyed this paper less. + +Chatted with someone about one of the labs I'm thinking about rotating in. + +Cleaned up HW2 for HG545 a bit. Still not done, but at least I'm getting it +into shape. + +## 2023-09-19 + +BS521. + +Meeting with two more grad students to chat about their labs. + +BI500. + +DCMB has full time IT staff: `DCMB-IT-Services@umich.edu`. 
Might email them +about Ethernet connection? + +Chat widget on is a decent way to get help. +Also walk-in help in THSL 4020. ARC support: . + +Slurm tutorial. Learned a couple of interesting things: + +* `sq` is an alias for `squeue --me`. Nice. +* `my_job_header` can help debug weird Slurm shit, handy. +* Emails will include core/mem high-water marks. Need to figure out if I can + get this programmatically, might be more accurate than the Snakemake benchmarks + (or at least worth comparing). + +Chatted about Boyle lab with a current grad student. + +PIBS 800. + +Finished HG545 homework 2. + +## 2023-09-20 + +HG545 discussion. Talked a lot about the Y chromosome paper. + +## 2023-09-21 + +BS521. Went over the binomial distribution. Seeing this yet another time gave +me an actual intuitive understanding this time, which is nice. + +BISTRO seminar. + +## 2023-09-22 + +Retreat. + +Lightning talks. + +Breakout panel with current grad students. Lots of stuff, probably not going to +write it all down here. + +## 2023-09-23 + +Finished HW 4 for BS521. + +Got some random Latex shit to remember for next time. Aligned equations: + + \begin{eqnarray*} + foo &=& bar \\ + meow &=& wow \\ + \end{eqnarray*} + +References to figures: + + (See figure~\ref{fig:g-a}) + + … + + \begin{figure}[H] + \centering + \includegraphics[width=0.65\textwidth]{figures/g-a} + \caption{Graph for exercise 4.1 part a.} + \label{fig:g-a} + \end{figure} + +Units: + + \usepackage{units} + + Drink 500 \unit{ml} of water at lunch. + +And some random R shit to remember for next time: + + dbinom = Binomial PDF + pbinom = Binomial CDF + +Came up with some absolutely cursed code to make shaded normal graphs. +Surprised that's not already a thing. + +## 2023-09-25 + +HG545. Need to retype all my notes for this module here when I get some time so +I don't lose them. + +Today started with a description of RNAseq. Something vaguely familiar was +a nice change for this class. 
Then reviewed STARR-seq which I think I mostly +understand now. + +Talked about the similarity between enhancers and promoters. Polymerase can +sometimes actually sit down at enhancers and produce small RNAs, but +transcription doesn't ever elongate. But this might be an example of how genes +could evolve. + +Then talked about heat shock proteins and heat shock factor as an example of how +rapid transcription can happen. + +* HSE: "Heat Shock Element", an enhancer sequence located upstream of a gene, + e.g. hsp90. +* hsp90: "Heat Shock Protein 90", a protein that's used in cells to help other + proteins fold in the presence of heat that might otherwise prevent it. The 90 + is from its weight in kilodaltons (lol). +* HSF1: "Heat Shock Factor 1", a transcription factor that trimerizes, binds to + HSE, and recruits another thing to activate the transcription of hsp90. + +There's a self-regulation loop here where, when things are cold, hsp90 binds to +HSF1 outside the nucleus and prevents it from enhancing transcription of hsp90 +(i.e. of itself). But when heat is applied, other proteins unfold and hsp90 +starts chaperoning them more, which leaves HSF1 free to enter the nucleus and +enhance transcription of hsp90. + +Remembering how to create a local Postgres DB for testing: + + sudo -u postgres psql + + CREATE DATABASE example; + CREATE USER testuser WITH PASSWORD 'pass'; + GRANT ALL PRIVILEGES ON DATABASE example TO testuser; + + \c example + GRANT ALL ON SCHEMA public TO testuser; + + \q + + psql postgresql://testuser:pass@localhost:5432/example + +## 2023-09-26 + +BS521. Exam is on Thursday. Today is about sampling distributions and +statistical inference. + +BIOINF500. Fire alarm for the first half of class, nice. Rest of the class +will be recorded, need to remember to watch it later. + +## 2023-09-27 + +HG545 this morning. Did an initial pass on the homework, then met up with some +other grad students later to chat about it and now I'm even less confident, lol. 
+Welp. + +## 2023-09-28 + +BS521 exam. Did okay, though I really should have had a couple of more things +on my note sheet than I did. Next time I need to go through the slides too, not +just the book — there were things on the test from class only, not in the book. +I think I did alright though. + +Finished HG545 homework. I think I did alright, but my brain is now fried. + +## 2023-09-29 + +HG545 discussion this morning. + +Sent a few emails to try to nail down my next three rotations. I think at this +point I have a pretty good idea of where I want to try, so if I can just get +them all nailed down now it'll be less stuff to deal with later. + +Signed up for the 503 discussion sections. What a painful process to get +registered. I should have waited til I was home on my large monitor because +trying to flip back and forth between the 90%-whitespace-filled list of sessions +and my calendar/TODO list was extremely tedious. I think I've got it all mapped +out now though. + diff -r f76b2070bb71 -r 8de6c03950e7 README.markdown --- a/README.markdown Sun Oct 22 20:15:21 2023 -0400 +++ b/README.markdown Sun Oct 22 20:15:48 2023 -0400 @@ -5,1095 +5,6 @@ [TOC] -# September 2023 - -## 2023-09-01 - -HG545. Looked over the slides last night and was a little worried, but felt -okay after the lecture for the most part. Still a few things I need to look up -and I do still need to get my fleeting notes into this, but I feel okay. - -Continuing the Snakemake tutorial. - -Threads can be specified for a given job with `threads: 8`, and you need to -propagate that to the command yourself with `{threads}`. Will be scaled down if -run with fewer cores than threads, otherwise will wait until that many are -available. - -Snakemake has some support for noticing log files, but it seems like you have to -manually create them yourself? This seems… tedious? 
- - rule bwa_map: - input: - "data/genome.fa", - lambda wc: SAMPLES[wc.sample] - output: - "mapped_reads/{sample}.bam" - threads: 8 - params: - rg=r"@RG\tID:{sample}\tSM:{sample}" - log: "logs/bwa_map/{sample}.log" - shell: - "(" - "bwa mem -R '{params.rg}' -t {threads} {input}" - " | samtools view -Sb - > {output}" - ") >{log} 2>&1" - -Do I really have to wrap everything in `(…) >{log} 2>&1` by hand myself? - -You can get a summary of file provenance with `snakemake --summary`. The output -is a TSV, so I went down a rathole of pretty-printing TSVs and eventually found -that `| column -s $'\t' -t` works (mnemonic: `s$tt`). I love how every UNIX -program gets to invent its own bespoke command line interface for specifying -special characters. Really great. - -Can mark outputs as `temp()` and `protected()`, which is nice. - -Need to install singularity *inside* my VM: - - # Ensure repositories are up-to-date - sudo apt-get update - - # Install debian packages for dependencies - sudo apt-get install -y \ - wget \ - build-essential \ - libseccomp-dev \ - libglib2.0-dev \ - pkg-config \ - squashfs-tools \ - cryptsetup \ - runc - - # Install Golang - export VERSION=1.21.0 OS=linux ARCH=amd64 && \ - wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz && \ - sudo tar -C /usr/local -xzvf go$VERSION.$OS-$ARCH.tar.gz && \ - rm go$VERSION.$OS-$ARCH.tar.gz - - echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc && \ - source ~/.bashrc - - # Install Singularity - export VERSION=3.11.4 && \ - wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz && \ - tar -xzf singularity-ce-${VERSION}.tar.gz && \ - cd singularity-ce-${VERSION} - - ./mconfig && \ - make -C builddir && \ - sudo make -C builddir install - -## 2023-09-02 - -It is time to shave the LaTeX yak again. Installed it with `texlive-latex-base` -to start, we'll see if I need to add some more crud in later. Going to go -through some guides for now. 
- -Going to note some things to remember. Skeleton of document: - - \documentclass{article} - \begin{document} - - Basic text. - - \end{document} - -Math: - - Inline math $y = 3 \sin x$ example. - - Block equation: - \[ - y = 3 \sin x - \] - - With reference: - \begin{equation}\label{equa} - y' = 3 \cos x - \end{equation} - refer to it by label, e.g. equation (\ref{equa}). - - More complicated: $x^2$ and $x^{2+\alpha}$ and $y_{n+1}$. - -Verbatim: - - Verbatim text: \verb"$x^{2+\alpha}$". Delimiter can be anything ala sed, - \verb_%%&_ or \verb+$$+. - - Must escape special characters \&, \$, \%, \_, \{, \}, and \#. - - \begin{verbatim} - A whole verbatim region. - - (defun square (x) - (* x x)) - \end{verbatim} - -Comments: - - Comments exist. % This is a comment. - -Type styles: - - Shapes: - \textup{Upright} - \textit{Italic} - \textsl{Slanted} - \textsc{Small} - - Series (weight): - \textmd{Medium} - \textbf{Boldface} - - Families: - \textrm{Roman} - \textsf{Sans} - \texttt{Typewriter} - -Emphasis: - - \emph{Never} do Foo! - -"Environments" are sections that are treated differently, made with `\begin{…}` -and `\end{…}`. - -Lists: - - Unordered list: - \begin{itemize} - \item Foo - \item Bar - \item Baz - \end{itemize} - - Ordered list: - \begin{enumerate} - \item One - \item Two - \item Three - \end{enumerate} - - Customizable labels: - \begin{description} - \item[Rule 1.] Foo - \item[Rule 2.] Bar - \item[Rule 3.] Baz - \end{description} - -Sizes (note the brace comes BEFORE the command!): - - {\Huge Huge} - {\huge huge} - {\LARGE LARGE} - {\Large Large} - {\large large} - {\normalsize normalsize} - {\small small} - {\footnotesize footnotesize} - {\scriptsize scriptsize} - {\tiny tiny} - -Centering: - - \begin{center} - {\large\textbf{Assignment 1}}\\% The \\ linebreaks. - Steve Losh\\ - BS521 - \end{center} - -Example table. 
- - \begin{tabular}{l|rc} % lrc = cols should be left, right, centered, pipe for vertical line - Name & Mark & Grade \\ - \hline\hline - Foo & 99 & A+ \\ - Bar & 51 & C \\ - Baz & 5 & F - \end{tabular} - -Colspan with multicolumn command. - - \begin{tabular}{|l||r|r|} - \hline - & \multicolumn{2}{c|}{Grades} \\ - \cline{2-3} - Name & Class 1 & Class 2 \\ - \hline\hline - Foo & 99 & 88 \\ - Bar & 51 & 65 \\ - Baz & 5 & 58 \\ - \hline - \end{tabular} - -Full example, with referencing and caption, e.g. `Table~\ref{tab:a} on page~\pageref{tab:a}`. - - % b = try to put at Bottom. Also t top, h here, p separate page. - % Can do multiple in order of preference. - % [!t] ! = try harder - \begin{table}[b] - \begin{center} - \caption{An Example Table} - \label{tab:a} - - \begin{tabular}{lr} - Name & Value \\ - \hline - Foo & 1.0 \\ - Bar & 15.9 \\ - Baz & 6.2 - \end{tabular} - % \caption{Caption at the end works too.} - \end{center} - \end{table} - -Sections: - - \section{Some section} % includes numbering - \subsection{Some subsection} - - \section*{Some section} % no numbering - \subsection*{Some subsection} - -Quotation marks (hilarious): - - `Single quoted' - ``Double quoted'' - -Change overall text size (simple): - - \documentclass[11pt]{article} % only valid sizes are 10/11/12. - -Palatino instead of Computer Modern: - - \usepackage{mathpazo} - \linespread{1.05} % needs more leading (space between lines) - \usepackage[T1]{fontenc} - -Vimtex stuff: - -* Close thing with `]]` in insert mode. - -Got about that far, which was enough to start my BIOSTAT-521 homework. Will dig -in again later, but it's nice to be able to use it for something real to -practice. - -Puttered around a bit looking at other fonts, but didn't find anything new or -interesting. - -## 2023-09-03 - -Spent most of today getting the reading room in my apartment ready. Went to -IKEA, looked around a lot and got some ideas, picked up a chair and bookshelf -for my still-boxed unread books. 
Not the most productive Sunday, but sitting in -my chair and reading at night felt good. - -## 2023-09-04 - -Continuing the EdX genetics course over breakfast. - -## 2023-09-05 - -BIOSTAT-521 and BIOINF-500 classes this morning. - -Going to spend my non-class time looking into Unicycler today (and taking care -of paperwork if anything that needs my attention crops up) and Bandage to -visualize the results. - -Grabbed the container from StaPH-B and tried to figure out where the executable -is. I think it's `/unicycler/unicycler-runner.py`. Grabbed the sample data -from the Unicycler repo and got something running: - - singularity exec containers/unicycler.sif \ - /unicycler/unicycler-runner.py \ - --short1 sample_data/short_reads_1.fastq.gz \ - --short2 sample_data/short_reads_2.fastq.gz \ - --out assembly/ - -Pretty straightforward so far. The result directory contains a log and a bunch -of GFA files. GFA apparently stands for [Graphical Fragment -Assembly](https://gfa-spec.github.io/GFA-spec/GFA1.html), but I think it's -"graphical" in the "directed/undirected graph" sense and not in the "pixels" -sense. Text-based format which seems pretty straightforward to parse (maybe -I should have a go at parsing it for fun). - -Installed Bandage and viewed the results. Not sure what exactly I'm looking at, -but it works and looks pretty enough I guess. [This -example](https://github.com/rrwick/Bandage/wiki/Simple-example) was helpful to -get a sense of what it's trying to show me. - -Unicycler has some configuration knobs to tweak: - -> Unicycler can be run in three modes: conservative, normal (the default) and -> bold, set with the --mode option. Conservative mode is least likely to produce -> a complete assembly but has a very low risk of misassembly. Bold mode is most -> likely to produce a complete assembly but carries greater risk of misassembly. -> Normal mode is intermediate regarding both completeness and misassembly risk. 
- -Reran with both conservative and bold modes and looked at the difference in the -results for the sample data. They're not the same, but I can't visually detect -any major obvious differences. Maybe it's not a big deal on this sample. - -Once I got that running in a shell, I got it ported into Snakemake, shaving -a bunch of yaks along the way. - -I noticed that Snakemake can take a JSON config file instead of YAML. Fantastic, -switched over to that right away: - - { - "containers": { - "fastqc": "docker://staphb/fastqc:0.12.1", - "unicycler": "docker://staphb/unicycler:0.5.0" - }, - "samples": { - "short_read_example": [ - "sample_data/short_reads_1.fastq.gz", - "sample_data/short_reads_2.fastq.gz" - ] - } - } - -Got the containers downloading via snakemake as well, so it's snakes all the way -down: - - rule containers: - input: - expand("containers/{name}.sif", name=config["containers"].keys()), - - rule container: - output: - "containers/{name}.sif", - params: - source=lambda wc: config["containers"][wc.name], - shell: - "singularity pull {output} {params.source}" - -Got `snakefmt` working with Neoformat so I can `F6` in Vim to reformat. Had to -fuck around with the config because it was just emptying out the file — I think -the key was that: - -* `snakefmt` edits in-place by default (gross). -* Need to use `replace`: `1` in the Neoformat config to deal with this. 
- -And finally we can assemble: - - def get_input_fastqs(wildcards): - return config["samples"][wildcards.sample] - - rule unicycler_assemble: - log: - "logs/unicycler_assemble/{sample}.log", - container: - "containers/unicycler.sif" - threads: 8 - input: - "containers/unicycler.sif", - get_input_fastqs, - output: - "assemblies/{sample}/assembly.gfa", - "assemblies/{sample}/assembly.fasta", - shell: - logged( - "/unicycler/unicycler-runner.py" - " --threads {threads}" - " --short1 {input[0]}" - " --short2 {input[1]}" - " --out assemblies/{wildcards.sample}/" - ) - -Puttered around changing my StumpWM and terminal colors/borders/etc a bit to -make them a little easier on my eyes. We'll see if it sticks. - -## 2023-09-06 - -HG545 in the morning. Mostly understood things. - -Asked the professor after about a question I had while reading the paper. One -of the things the paper did to confirm the region of interest was to use a PAC -(P1-derived artificial chromosome) to "rescue" the golden embryos. The -resulting fish showed mosaic rescue, confirming that the wild-type gene was -likely on that PAC, i.e. in the region of interest. - -What I didn't understand is how injecting the plasmid into the embryos resulted -in the expression of the genes farther down the developmental line, e.g. does it -somehow get incorporated into the cells' genomes? It turns out to be messy. - -First: you don't inject "a PAC" into the embryos, you inject "a shitload of -copies of the PAC" into the embryo. So all of the embryo's cells will have -copies of the PAC floating around inside. As the cell divides, those will get -diluted in daughter cells over time. Some of these copies will, by chance, make -it into the nucleus of their cells. And some of those (rarely) will get -randomly incorporated into the cell's genome, and from then on mitosis takes -over and the gene gets propagated normally. 
So the mosaic region from that -point forward will have the gene (and if the region happens to contain some -melanophores, also the rescued wild-type phenotype). - -Got my Armis access at some point during class, so it's time to figure out how -to log into the various HPC clusters today. - -Doubled checked exam schedule to make sure nothing conflicts. I think it's -fine. - -Changed my school password after the network clusterfuck last week. Sigh. - -Wanted to print something in the lab, realized I never installed any printing -support on this laptop, lol. `apt install cups` will hopefully Just Work. CUPS -interface is at `http://localhost:631/`. It did not Just Work. Surely 2024 -will be The Year of Linux on the Desktop. Printer wouldn't configure itself, -driver didn't appear in the list when I tried to manually configure it through -CUPS. `apt install printer-driver-all` got me more drivers but not this one. -Tracked it down on the brother site and downloaded some `.deb` packages but -they're 32-bit instead of 64. Gave up at this point, what a janky shitshow of -an OS. If only everything else didn't suck in even worse ways. - -Tried getting the VPN running. Installed with the script into `/opt/cisco` -(good). Got mysterious errors when trying to connect. Tried the GUI connection -manager, which runs, but gave a more informative error message I could search -for. Looks like I need to install `libwebkit2gtk-4.0-37`. Installed that, now -I get the login screen, but I can't 2FA because I only have Yubikeys set up but -that requires a real browser, not this jank webkitgtk thing, so now I need to -set up *another* 2FA method just for this. Good god. Tried to add a new 2FA -device via Duo, but that requires 2FAing *again*, but *this* time it's through -the Duo site which doesn't actually fucking work on Linux, so I can't add it -here, I'll need to use the Windows shitbox at home. God, I hate two-factor -authentication so much. It's *always* miserable. 
- -Tried to do the homework for BIOINF-500 (creating a pubmed search alert). To do -this you need an NCBI account. Tried to log in via my UM account and managed to -500 the NCBI site. Incredible. Poked around and eventually got it working (I -think)? Created the alert, took a screenshot. Uploaded the PNG into Canvas for -the assignment. Canvas shows an error trying to retrieve it. Uploaded it to -the Canvas "My Files" and used *that* to submit the assignment, instead of -uploading the file directly, and that worked. *Incredible*. Why does *nothing* -ever work correctly? - -Came home, tried to add the 2FA with the Windows box, but it failed in the same -way (hanging on the popup after successfully touching the yubikey). But -I finally figured out a workaround (in retrospect, I vaguely recall having to do -this when I added the extra yubikeys originally): log out of everything, then go -to log back into something (e.g. Wolverine Access), but **before** you 2FA in -*that* login process there will be a link to add a new device on the left side -of the screen. That will then require you to 2FA, but doing it *here* does -actually work. It seems absolutely wild to me that you need to *not* be logged -in if you want to manage 2FA, but here we are. - -Now that I have the Duo app thing on my phone and connected, logging into the -VPN seems to work great. Yak shaved successfully. - -## 2023-09-07 - -BS521 and its lab this morning. - -Got my dotfiles synced to GL. One tricky thing: my remote `.bash_profile` -sources `/etc/profile` if it exists, but that causes problems on the cluster -because there's some read-only variable set in there that it doesn't like. -Commented out that line and everything looks okay. `ControlMaster` does work, -so I won't have to auth a billion times a day (thank god). - -BS521 lab. Of course the AC isn't working so it's a billion degrees, lovely. - -Installing tidyverse failed with inscrutable errors. 
After some googling it -[looks like](https://blog.zenggyu.com/posts/en/2018-01-29-installing-r-r-packages-e-g-tidyverse-and-rstudio-on-ubuntu-linux/index.html) -there's *extra* dependencies you need to install on Linux (C programming is -wonderful): - - sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev \ - libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev \ - libpng-dev libtiff5-dev libjpeg-dev - -Still fucked, so I manually trawled through the tons of log output, googled -around more, and found that I need to configure an env variable: - - Sys.setenv(PKG_CONFIG_PATH="/usr/lib/x86_64-linux-gnu/pkgconfig") - -And finally it works. Started going through the lab stuff but then time was up -— will come back to it later. - -Moving on to rotation lab work. Working from home today, so I wanted to get my -laptop working with my external monitor and keyboard and such. Had to swap out -the USB-C cable I was using previously because apparently that one doesn't work -for video ("universal" serial bus my ass) and then adjust my StumpWM `xrandr` -commands, but I did get it working, so now I can use the laptop at my desk, -which is nice. - -Note to self: I can probably use the text-mode VPN UI now that I've got the 2FA -sorted out, and maybe remove the webkit crap I installed for it originally. - -Finished BS521 lab 0. Wasn't too bad once I shaved the tidyverse yak. Getting -it written up with Latex required more Latex derusting, this time for code -listings and images (pulled some of this from the MS thesis). First, preamble -stuff: - - % Listing package for code listings. - \RequirePackage{listings} - \lstdefinestyle{default}{ - basicstyle=\footnotesize\ttfamily, - showtabs=true, - frame=lines, - aboveskip=10pt, - } - \lstset{ - language=, - style=default, - } - - % Used to embed plots. - \usepackage{graphicx} - - % No paragraph indentation for homework, just looks awkward. - \setlength{\parindent}{0pt} - - % Inline code. 
- \def\code#1{\small\texttt{#1}} - -I really need to split these up into actual files I can include instead of -copypasting them a million times, but maybe I'll wait for one more practice -round first. - -Usage: - - % Code listings. - \begin{lstlisting} - prop.table(table(DATA$Race)) - - Black MexicanAmerican Other OtherHispanic White - 0.22921790 0.23954747 0.04230202 0.03885883 0.45007378 - \end{lstlisting} - - - % Graphics. - \includegraphics[]{figures/bmi-hist} - - \begin{center} - \includegraphics[width=0.45\textwidth]{figures/hist-age} - \includegraphics[width=0.45\textwidth]{figures/hist-log-age} - \end{center} - -To actually save PDF plots with R: - - pdf("figures/foo.pdf", height=6, width=6) - hist(DATA$Foo, main = "Distribution of Foo", xlab="Foo") - dev.off() - -Went to the poster session. Lots of stuff I don't understand, and a tiny bit -that I do. - -## 2023-09-08 - -HG545 this morning. - -Papers never say what *could* have gone wrong with what they did — you have to -just read between the lines and actively think about that (and what it would -have meant, and what you would have done if it did). - -Learned about nonsense-mediated decay: a mechanism where mRNA with premature -stop codons is degraded, instead of expressing a (probably truncated) protein. -Without this, if you have a mutation that creates a stop codon in the middle of -the gene, you would see truncated protein expressed. But because of NMD, the -mRNA is degraded and doesn't express the broken protein (as much). This is good -not only to reduce wasted translation, but because the truncated proteins can be -actively bad. - -One important control that was left out of the study where they wanted to find -where in the organism the target gene is being expressed: inject a probe with -GFP that intentionally shouldn't match *anything*, and expect it to show up -vaguely all over (or not at all). 
- -Another example covered during class: if you suspected a phenotype was caused by -a mutation in a promoter (instead of in an exon), how would you test this? -There were a couple of things folks came up with: - -* Could sequence the region in the mutant and wild-type population, compare to - see if the mutation segregates the two reliably. -* Old school: "reporter genes". I'm a bit fuzzy on this, but I think you insert - the promoter into a vector with some easily observable gene (e.g. luciferase, - a bioluminescent protein). Then you see if that product is expressed more or - less with the different variants of the promoter. This is a bit janky because - just yanking the promoter completely out of context can be problematic (e.g. - loses the chromatin structure around it, nearby enhancers/repressors, etc). -* Could use RNAseq to see if the mutants with the variants are producing more of - the RNA for that gene. -* Could use CHIPseq, if you know the transcription factors that bind to that - promoter. Fix, fragment, attach antibodies to the TFs, precipitate them out, - unfix, extract the DNA (all the remaining is whatever was bound to the - transcription factors), and then do the sequencing. You would expect to see - a larger signal if the mutation in the promoter is causing transcription - factors to be more likely to bind. - - -Got back and tried VPN'ing with the command-line client. It seemed to hang -after entering my password, but then I realized it had just silently tried to -2FA with my phone and I didn't notice. Trying again and being ready with Duo -let me log in, so I think I can probably ditch the webkit crap I installed for -the graphical thing. - -Desktop machine wouldn't take input from my USB hub all of a sudden. Found some -bullshit in the logs, probably not worth debugging Yet More Linux Jank if I'm -just going to wipe this machine and install Debian on it soon anyway. 
Tried to -reboot and systemd hung at the end, so I just powercycled the damn thing. If -I could just have one single day where no computer broke for me, that would be -so nice. - -Flu shots are available, need to get one so PI doesn't get pinged all the time. - -Read for BS521 class. All still pretty basic. Cleaned up and turned in lab 0. -Finished homework 2 as well, just to get it out of the way. Or at least -I thought I did, except there are apparently Surprise Questions™ not in the book -to do with R. I'll do that this weekend. - -## 2023-09-09 - -Actually finished BS521 homework 2. Realized my Latex `\code` shortcut was -broken: - - % Broken, doesn't scope the \small so later text is changed. - \def\code#1{\small\texttt{#1}} - - % Fixed v v - \def\code#1{{\small\texttt{#1}}} - -Did a first draft of the HG545 assignment 1. This one is a lot harder than the -stats homework. Need to polish it up and submit it tomorrow. - -## 2023-09-10 - -Polished and submitted the HG545 homework. We'll see how it goes, I guess. - -## 2023-09-11 - -HG545 discussion. This paper was pretty straightforward. - -Met with PIBS peer mentor. - -HG545 second paper was posted, need to do an initial read of that tonight. - -## 2023-09-12 - -BS521 again. Mostly basic linear regression stuff, but got a few interesting -tidbits out of it, mostly about the coefficient of determination, also called -`R²` or `r²`. This is the square of the correlation coefficient `r`, and it is -said to mean "the fraction of the variability in the data that is explained by -the linear model". So an `r²` of `0.7` would mean "70% of the variation in the -data is explained by the model". 
- -*Intuitively* what this means would be to look at the total variability in the -data, i.e.: - - (- (reduce #'max y) (reduce #'min y)) - -Then convert the data to residuals by subtracting out the model: - - (mapcar #'- (mapcar #'model x) y) - -and look at home much variability remains: - - (- (reduce #'max residuals) (reduce #'min residuals)) - -Compare the two to see the fraction that remains after accounting for the model. -₂ -Looked into some "R for actual programmers" resources so maybe I can feel like -I'm flailing less: - -* -* -* -* -* -* - -Lunch at a place called Maizie's. Was actually pretty good! - -Doing Yet Another Round of Paperwork for the VA. So much red tape. Did what -I could here, but there's a bunch I can't do until I get home after class today. - -So far I'm loving the look of the stumpwm config changes I made the other day. -Shouldn't have waited this long to clean things up. TODO: use -`select-from-menu` to implement a better screen-switching shortcut in stump. - -Figured out how to print. Use to find -a reasonable printer nearby, then Print Here to use it. You upload a PDF or -whatever through the web UI. Good enough, it works. One color paper cost -$3.22 of my (apparent) $24 print budget. Welp. - -PIBS800. Getting… another lecture about how to use the library? Didn't we -already do this in the other class? - -Spent a bit more time tracking down my white whale font from that 1979 Science -issue. Identifont came to the rescue and I think I finally have an answer, or -at least something very close: "Rotation" by Arthur Ritzel from 1971. -Unfortunately a 50-year old font still has ghastly licensing options, so I'll -probably never be able to *use* it, but at least I have peace of mind, I guess. - -## 2023-09-13 - -HG545. This module is focusing on how to create physical maps of chromosomes, -especially the bizarre human Y chromosome. - -There's a difference between a genetic map and a physical map. A genetic map -can be created with e.g. 
linkage analysis, and can tell you relative distances -but not necessarily the exact locations of things. A physical map shows the -actual locations. Note that physically linked genes might not necessarily be -genetically linked if they're far enough apart that the recombination chance is -50%. - -We can't use genetic mapping for the Y chromosome because there's not -recombination with another chromosome. - -In the paper they used hierarchical shotgun sequencing to sequence the -Y chromosome, which goes roughly like this: - -1. Fragmented the human genome into ~200kb fragments. -2. Cloned those into BACs -3. You want to retrieve *only* the fragments from the Y chromosome, not from the others. -4. Start with a known gene on Y (e.g. a well-known gene like Sry, the - sex-determining gene) and you PCR that to amplify the fragment(s) that - contain it. -5. Sequence those fragments (split into 20kb and shotgun sequence). -6. Design more PCR primers that *start* at the ends of *those* fragments, use - those to amplify things next to it. -7. Repeat to get overlapping tiles. - -You end up with overlapping tiles: - - ---------Sry---------- ----------Zry-------------- - >>> <<< - ----------------- - >>> - ------------------- - -Nowadays we can take advantage of long read tech to eliminate a lot of the grunt -work in the process, e.g.: - -* Oxford Nanopore: 50-500kb, 90% accuracy. -* PacBio: 20kb, >99% accuracy. - -Oxford is still pretty bad accuracy, but is useful to resolve things when PacBio -still runs into trouble with some of the crazy-long repeats. - -Also learned about some kind of "bionano" thing that was glossed over very -quickly. Looks like it's a company? Need to ask someone about this. - -Next talked about content of the human genome: - - Human Genome - Unique DNA (1/3) - Repetative DNA (2/3) - Dispersed Repeats - Transposable Elements (e.g. LINEs, Alu) - Retrogenes (e.g. CDY) - Transposed Genes (e.g. DAZ) - tDNA - Local Repeats - Segmental Duplication (e.g. 
palindromes) - Satellite Duplication - rDNA - -Repeats are challenging to assemble, e.g. if you have: - - Unique A | LINE1 | Unique B | LINE 1 | Unique C - -You might get reads like: - - A1 - 1B1 - 1C - -It's hard to tell which direction the `1B1` should go, or whether `A` should go -directly to `C`. `LINE1` specifically can be resolved with PacBio because it's -only ~6.5kb, far less than the 20kb you get from PacBio, but other segments -still cause problems. - -Example of problematic things are the large palindromes from the paper: - - 1.45mb arm - <------------------------ Unique --------------------------> - arms have ~99.97% nucleotide identity - -Even if there are a few SNPs on the arms, if the segments right around the -unique part happen to be identical it's hard to tell which arm goes where. - -Looked into the PACCAR thing from yesterday, but the application form is -extremely long and I already have enough red tape to deal with through the VA, -so I'm not going to add more paperwork for myself. Oh well. - -Met with John Prensner about possibly rotating in his lab. Next steps for -rotations are pretty clearly to set up some chats with his students and some -from Boyle/Parker labs to make a choice for the next 1-2 slots. I'll try to do -that for next week I think. Also want to talk to Shavit again — I really liked -chatting with him, and I think if I wanted to rotate there I would need him to -join the department as an affiliate of some kind, so I'd need to see if he's -okay with doing that. - -## 2023-09-14 - -I am going to `mark` every time I have to log in and/or 2FA for school for at -least a week, so I can graph it and be sad. Adjusted my `marks` thing to go to -my Syncthing dir. - -Sped up my shell prompt by wrapping the Mercurial prompt in a basic `.hg` -existence check. Had to relearn how to write a fish function. - -BS521. Chatted with the professor at office hours a bit to ask a couple of -things. 
- -Made a TODO list with all the homework/exam/lab stuff for school. Hopefully -this will make it easier to see what's coming up since Canvas is barely usable. - -Started reaching out to set up chats with folks in a few labs I might be -interested in for my next rotation. - -## 2023-09-15 - -HG545 this morning. - -## 2023-09-16 - -BS521 reading. - -Z-score means "number of standard deviations above the mean". - -Successes-based distributions: - -* Geometric: number of trials before observing a success. -* Binomial: number of successes in a fixed number of trials. - -The chapters of this book are getting sloppier as they go on: I'm noticing -a lot more typos now than in the first couple of chapters. - -Went back to John D Cook's R for Programmers post when the `pnorm` function was -mentioned. R has several of these functions with very confusing names: - - - - : d: PDF ("density") - p: CDF ("probability") - q: Quantile, i.e. CDF⁻¹ - r: Random sample - - : norm: Normal aka gaussian - unif: Uniform - -So `pnorm` is "the CDF of a normal distribution". - -Found a way to view which fonts a PDF file embeds and/or references: `pdffonts`. -Nice. - -Tired of CACL crashing on my laptop because I don't have CCL, so I'll just -install CCL. - - git clone https://github.com/Clozure/ccl.git ccl - curl -L -O https://github.com/Clozure/ccl/releases/download/v1.12.2/linuxx86.tar.gz - cd ccl - tar xf ../linuxx86.tar.gz - ./lx86cl64 - (rebuild-ccl :full t) - - sudo ln -s /home/sjl/src/ccl/lx86cl64 /usr/local/bin/ccl64 - -Finally discovered the reason my bash prompt gets mangled sometimes: -non-printing characters in `PS1` have to be wrapped in `\[…\]`. So I need to do -something ugly like this: - - export PS1='\n\[${PINK}\]\u \[${D}\]at \[${HOST_COLOR}\]\h \[${D}\]in \[${GREEN}\]\w\[${D}\] $(last_return_value)$ ' - -But at least it works properly now and won't drive me crazy. - -Did a bit more font hunting.
Looking for something to use for figures that -looks plotter-esque, but isn't something with a couple of scattered glyphs and -no weights like the plotter fonts I've found. Licensing is a minefield, but -Google Fonts has a bunch of stuff that's under the Open Font License, and -I think I found a couple that might work: Quicksand and Nunito. Of the two, -Nunito seems a little nicer to me. Will need to try it in some graphs and see -how it works. - -## 2023-09-17 - -Trying to get ahead of classwork for the next couple of weeks, since I've got so -many other things going on. - -Did the reading for BS521 for the next two weeks. - -Finished BS521 homework 3. - -## 2023-09-18 - -HG545. Feeling better about this module than the last, which is surprising -because I enjoyed this paper less. - -Chatted with someone about one of the labs I'm thinking about rotating in. - -Cleaned up HW2 for HG545 a bit. Still not done, but at least I'm getting it -into shape. - -## 2023-09-19 - -BS521. - -Meeting with two more grad students to chat about their labs. - -BI500. - -DCMB has full-time IT staff: `DCMB-IT-Services@umich.edu`. Might email them -about Ethernet connection? - -Chat widget on is a decent way to get help. -Also walk-in help in THSL 4020. ARC support: . - -Slurm tutorial. Learned a couple of interesting things: - -* `sq` is an alias for `squeue --me`. Nice. -* `my_job_header` can help debug weird Slurm shit, handy. -* Emails will include core/mem high-water marks. Need to figure out if I can - get this programmatically, might be more accurate than the Snakemake benchmarks - (or at least worth comparing). - -Chatted about Boyle lab with a current grad student. - -PIBS 800. - -Finished HG545 homework 2. - -## 2023-09-20 - -HG545 discussion. Talked a lot about the Y chromosome paper. - -## 2023-09-21 - -BS521. Went over the binomial distribution. Seeing it yet again finally gave -me an actual intuitive understanding, which is nice. - -BISTRO seminar.
- -## 2023-09-22 - -Retreat. - -Lightning talks. - -Breakout panel with current grad students. Lots of stuff, probably not going to -write it all down here. - -## 2023-09-23 - -Finished HW 4 for BS521. - -Got some random LaTeX shit to remember for next time. Aligned equations: - - \begin{eqnarray*} - foo &=& bar \\ - meow &=& wow \\ - \end{eqnarray*} - -References to figures: - - (See figure~\ref{fig:g-a}) - - … - - \begin{figure}[H] - \centering - \includegraphics[width=0.65\textwidth]{figures/g-a} - \caption{Graph for exercise 4.1 part a.} - \label{fig:g-a} - \end{figure} - -Units: - - \usepackage{units} - - Drink 500 \unit{ml} of water at lunch. - -And some random R shit to remember for next time: - - dbinom = Binomial PDF - pbinom = Binomial CDF - -Came up with some absolutely cursed code to make shaded normal graphs. -Surprised that's not already a thing. - -## 2023-09-25 - -HG545. Need to retype all my notes for this module here when I get some time so -I don't lose them. - -Today started with a description of RNAseq. Something vaguely familiar was -a nice change for this class. Then reviewed STARR-seq which I think I mostly -understand now. - -Talked about the similarity between enhancers and promoters. Polymerase can -sometimes actually sit down at enhancers and produce small RNAs, but -transcription doesn't ever elongate. But this might be an example of how genes -could evolve. - -Then talked about heat shock proteins and heat shock factor as an example of how -rapid transcription can happen. - -* HSE: "Heat Shock Element", an enhancer sequence located upstream of a gene, - e.g. hsp90. -* hsp90: "Heat Shock Protein 90", a protein that's used in cells to help other - proteins fold in the presence of heat that might otherwise prevent it. The 90 - is from its weight in kilodaltons (lol). -* HSF1: "Heat Shock Factor 1", a transcription factor that trimerizes, binds to - HSE, and recruits another thing to activate the transcription of hsp90.
- -There's a self-regulation loop here where, when things are cold, hsp90 binds to -HSF1 outside the nucleus and prevents it from enhancing transcription of hsp90 -(i.e. of itself). But when heat is applied, other proteins unfold and hsp90 -starts chaperoning them more, which leaves HSF1 free to enter the nucleus and -enhance transcription of hsp90. - -Remembering how to create a local Postgres DB for testing: - - sudo -u postgres psql - - CREATE DATABASE example; - CREATE USER testuser WITH PASSWORD 'pass'; - GRANT ALL PRIVILEGES ON DATABASE example TO testuser; - - \c example - GRANT ALL ON SCHEMA public TO testuser; - - \q - - psql postgresql://testuser:pass@localhost:5432/example - -## 2023-09-26 - -BS521. Exam is on Thursday. Today is about sampling distributions and -statistical inference. - -BIOINF500. Fire alarm for the first half of class, nice. Rest of the class -will be recorded, need to remember to watch it later. - -## 2023-09-27 - -HG545 this morning. Did an initial pass on the homework, then met up with some -other grad students later to chat about it and now I'm even less confident, lol. -Welp. - -## 2023-09-28 - -BS521 exam. Did okay, though I really should have had a couple of more things -on my note sheet than I did. Next time I need to go through the slides too, not -just the book — there were things on the test from class only, not in the book. -I think I did alright though. - -Finished HG545 homework. I think I did alright, but my brain is now fried. - -## 2023-09-29 - -HG545 discussion this morning. - -Sent a few emails to try to nail down my next three rotations. I think at this -point I have a pretty good idea of where I want to try, so if I can just get -them all nailed down now it'll be less stuff to deal with later. - -Signed up for the 503 discussion sections. What a painful process to get -registered. 
I should have waited til I was home on my large monitor because -trying to flip back and forth between the 90%-whitespace-filled list of sessions -and my calendar/TODO list was extremely tedious. I think I've got it all mapped -out now though. - # October 2023 ## 2023-10-01