# HG changeset patch # User Steve Losh # Date 1695041292 14400 # Node ID e20df975295585c3d67ee776e1bbf92b0d5389bd # Parent 8c2e86015e80da3f65256e1a96f44d2cb8a51982 Update diff -r 8c2e86015e80 -r e20df9752955 2023.markdown --- a/2023.markdown Sun Sep 17 21:01:09 2023 -0400 +++ b/2023.markdown Mon Sep 18 08:48:12 2023 -0400 @@ -184,3 +184,614 @@ View > Docked Playlist to unbreak the playlist Figure out a graphical file manager solution (double commander?). + +# August 2023 + +## 2023-08-21 + +First day of orientation as a PhD student. Here we go again, back to school one +final time. + +Figured out the campus wifi despite the Linux jankery. Had to: + +1. Register as a special device, the laptop registration link redirected to + nothing useful. +2. Use `nm-connection-editor` from `gnome-network-manager` to edit the + connection and manually set up the WPA2+PEAP+user/pass for the connection. + +Finally fixed the dulwich errors from hg-git. Since I'm using Debian's hg now +instead of building from source, I needed to install dulwich somewhere that the +system python can find it. I almost installed `python3-pip` to do that, but +then realized I could just install `python3-dulwich` and be done. Cool. + +## 2023-08-22 + +Day 2. Lots of getting talked at, meeting with advisors, etc. Still trying to +get all the moving parts settled down — hopefully it'll be a lot clearer once +classes and rotations start (though I'm sure it'll be busy in a different way). + +First time trying the Ann Arbor bus system, the bus was 20 minutes late. This +bodes well. + +## 2023-08-23 + +Got a presentation about things to do/see/eat in Ann Arbor. Lots of things to +try. + +Advice from existing student panel: + +* Keep in touch and make connections with folks in your cohort/community who are + going through the same stuff as you. +* Rotate with anyone your want, don't let anyone tell you not to. +* Don't forget to have a life. Don't spend every weekend in the lab. +* Don't compare yourself to each other. Lots of variety in the incoming people. +* Ask for help. + +What to ask/think about when looking for a lab: + +* Do you have funding to support me? +* Do you have plans to stay at the university? +* Talk to current and previous students (and where those ended up). +* Talk to multiple students (people have different experiences). +* Work ethic. +* Mentoring style. +* What are your expectations of me? +* Priorities toward students (high/low?). +* How did they handle difficult situations in the lab? +* Find a good PI, because they're the one that will be there the entire time + (students/etc can and will leave). +* Tell them what you want to do (e.g. go to conferences), and based on their + responses you'll know whether they're a good fit. +* Switching labs is possible (not ideal though). +* Listen when people tell you a place sucks. +* Alumni are great because they don't have the power dynamic to worry about and + can actually be honest. +* You can change rotations if you really need to, even if it's the second week + in. +* Ask how many slots there are vs how many rotations they're taking to judge how + competitive joining the lab will be. + +## 2023-08-24 + +Did the Python pre-test so I don't have to take the introductory programming +class. + +Did more online training. + +More talks. Lots of meeting people after. + +## 2023-08-25 + +Lost power last night because of the storms and DTE says I won't get it back til +tomorrow. This is… not great. + +Doing more trainings and paperwork at a school building so I can plug in my +laptop. + +## 2023-08-28 + +Turns out I didn't get power until *last night*, i.e. three of God's own full +days after it went out. Ann Arbor is… not feeling great right now. + +Also, yesterday around 14:00 the university's network went down, entirely. +Unable to get online from the wifi, and any services hosted on the university +network are down, which is… not great. Of course, many of the services used +(e.g. Canvas) are third-party services not hosted on the university network. +Except they all use the university SSO (because Security™), which *is* hosted on +the network. So you can only use them if you're already logged in. And of +course sessions expire super quickly (because Security™) so, in summary: lol. + +Starting to use my HP-15C clone(s) from Swissmicros again. Was getting +misleading results when trying to do some stuff with logarithms: it seemed like +it only had two digits of precision. But after some digging around I realized +I must have configured the *display* precision to 2 SD's at some point in the +past and forgotten about it. The full precision is there, it was only rounding +to display. I set it to 6 to avoid this, but also the `PREFIX` key will show +the full precision temporarily. + +Figured out (again) how to program the calculator to do logs of arbitrary bases. +Essentially: + + logₓ(y) = ln(y)/ln(x) + + stack + LBL C x y + LN x ln(y) + x<->y ln(y) x + LN ln(y) ln(x) + ÷ ln(y)/ln(x) = logₓ(y) + RTN + +Trying out syncthing as a Dropbox replacement. Installing: + + sudo apt install syncthing + systemctl --user enable syncthing.service + systemctl --user start syncthing.service + +Then to access the admin. + +Got my budget scripts working and synced via syncthing (also shaved a couple of +yaks by making scripts to archive/create new hosts while I was at it). Seems to +work okay at the moment. Will gradually transition other stuff over time. + +Going to spend some time learning about Nextflow while I wait to hear from +rotation folks. Nextflow is basically a DAG, where: + +* Edges are FIFO queues (Nextflow calls them "channels") +* Vertices are things that consume input from their channels and produce output (Nextflow calls them "processes"). + +There are two types of channels. First: queue channels: asynchronous FIFO queues. Examples: + + # emits sequence of given values + ch = Channel.of(1, 1, 2, 3, 5, 8) + + # emits a single file path (queue of size 1) + ch = Channel.fromPath('data/one-single-file.txt') + + # emits multiple file paths + ch = Channel.fromPath('data/*.txt') + +Value channels: like queue channels, but just emit the same value over and over. +Basically `(constantly val)`. + +Processes: basically stages of a pipeline. Take input and output definitions, +plus something to run (e.g. `shell`). Example after a bit of poking around: + + + // allows you to define processes to be used for modular libraries + nextflow.enable.dsl = 2 + + workflow { + ids = Channel.fromPath('data/ids.txt') // single-item channel + chunksize = Channel.value(1000) // (constantly 1000) but will only ever be used once here + + // The process below produces a list of outputs. It will only ever be + // run once, but nextflow doesn't know that -- you could potentially + // have a process run multiple times that each produces a list. So by + // default it groups all the outputs into a single emitted value. But + // here we want to flatted [[aa ab ac ad ae]] into [aa ab ac ad ae]. + batched_ids = split_ids(ids, chunksize) | flatten + batched_ids.view() // .view() doesn't consume, good for debugging + + result = reverse(batched_ids) + result.view() + } + + process split_ids { + input: + path(ids) + val(chunksize) + + output: + file('batch-*') + + shell: + """ + split -l !{chunksize} !{ids} batch- + """ + } + + process reverse { + input: + path(batch_file) + + output: + file('result') + + shell: + """ + tac !{batch_file} > result + """ + } + +Nextflow seems to have the concept of a "run name", i.e. an identifier for +a particular run. It creates a `work/` directory with the output files, but +*also* seems to splat out a bunch of hidden `.nextflow/` and `.nextflow.log.*` +files in the current directory. `nextflow clean` removes `work` but not the +hidden files. + +Can run with some basic reporting with some extra flags: + + nextflow run example.nf -with-report report.html -with-dag graph.svg + +Of course there appears to be some backwards-incompatible jank in the language +already. Reading through +shows minor syntax changes I guess I'll need to be aware of when looking at +things a few years old. + +Putting some things on the TODO list for learning mroe about nextflow: + +* (uses old DSL) +* + +I also need to get some basic scratch VM infrastructure set up with qemu/vagrant +and Ansible so I can test out Nextflow/pipeline stuff without polluting my own +machines and/or any servers I eventually get access to with random testing +garbage. Maybe I'll put that on tomorrow's list too. + +## 2023-08-29 + +First BIOSTAT-521 class. This one seems like it's going to be quite easy, but +given I have a crazily-hard class and lab rotation work to do, I think that's +fine with me. It *does* use R, which will be a good excuse to poke around at +that since it's used so heavily in the industry. + +First lab meeting as well. It was good to meet everyone in person. Talked +about scheduling stuff and overall gist of my project, though we can't move +forward directly right now while the university network is still hosed. But +I did get some more information about things I'll want to look into: + +* Snakemake (not Nextflow, oh well). +* Singularity containers (should probably read the UNIX book section on + containerization first). +* I still want to get Vagrant/libvirt/qemu/Ansible working to make scratch envs. + +Going to meet again on Thursday to see if the network is available. If so I'll +dive in more then — until then I'll just poke around learning that stuff because +I know I'll need it eventually. + +Don't have a lot of time before next class, but I went ahead and installed +Snakemake highlighting/etc for Vim. + +## 2023-08-30 + +HG545 this morning. Class was mostly about how to succeed in the class. + +Still trying to decide on how I want to take notes while I'm here. I read the +Zettelkasten book and was considering trying that. But after poking around at +it I'm not sure I like the overhead of having to link everything up all the +time. I tried creating some notes while studying and it was a pain to have to +try to link everything to something else. Sometimes I want to just jot down +something without worrying about its place in the graph. I think I might end up +going with this current format (stream-of-consciousness .plan file style notes) +for everything, but taking a few things from Zettelkasten than might help: + +* Take fleeting notes as I read. +* Turn fleeting notes into permanent ones, but as text in my .plan instead of + linked entries in some other system. + +That seems like something I might actually *do*, and hopefully with grep it'll +be good enough for what I need. + +Installing JabRef to try as a reference manager. Zotero looks nicer (no stupid +flat UI) but syncing the DB requires sending it to their web thing. JabRef +seems to use a plain text file so I can probably just sync it with syncthing and +deal with conflicts manually. Spent some time adding a couple of papers to it. +Not sure it's great (it got the info wrong for 2/3 papers) but I guess that's +just typical open source jankery. + +Apparently you can just `C-v` a DOI into JabRef and it'll import it. Hard to +discover, but seems to work okay. JabRef is complaining about capital letters +in the titles but I'll figure out that jankery later. At least I've got +something for now. + +Had some wonkiness with my Syncthing budget stuff, but I think I just forgot to +reeval the location on my desktop. Will poke around more if anything else seems +to break. + +Watched some snakemake videos and read through their paper. This smells a lot +more academic than Nextflow did, which is a little worrying. I'm sure it'll be +fine in the end though. + +Send off the rest of my VA paperwork so things can get moving on that side. + +Read the ULSAH section on containers to get a high-level overview. Started +looking into Singularity and it's already looking spicy. Apparently the project +forked a couple of years ago and there are now two competing versions? Great. +Also you have to install it from source, which requires installing Golang. +I thought I was free of Rob Pike's Googly Tendrils but I guess I never will be. +Installed Go, built Singularity. At least it installs to a prefix +(`/opt/singularity`), so I can remove it easily if I want. + +Poked around a little to make sure it's working, e.g.: + + singularity pull docker://debian:bookwork-slim + singularity shell debian_bookworm-slim.sif + +Seems to be working as far as I can tell. + +Also installing snakemake. Using pip with a venv for now even though the +documentation tries to convince you not to. If anything breaks I can revisit +it, but for now it's probably fine to go through some tutorials without pulling +in some giant slab of junk. + +Started going through the Snakemake tutorial. + +> Since the rule has multiple input files, Snakemake will concatenate them, +> separated by a whitespace [sic] + +Oh boy. + +Realized I'd need to install a pile of stuff to get through the tutorial, +decided to pause and shave the qemu yak first so I can do this without dumping +a ton of stuff on my laptop. So many yaks. + +Shaved the qemu yak, now I've got a reliable VM setup. Committed the +instructions and a tiny script to a `vms` repo so I don't have to relearn this +again. + +With that out of the way, installed Snakemake and all the prereqs from their +tutorial on the VM with wild abandon. Now I can *actually* do the tutorial. +The simple tutorial was straightforward for the most part, but for this: + + rule bcftools_call: + input: + fa="data/genome.fa", + bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES), + bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES) + output: + "calls/all.vcf" + shell: + "bcftools mpileup -f {input.fa} {input.bam}" + " | bcftools call -mv - > {output}" + +It's not clear how the expanded input lists are ordered. Are they guaranteed to +always produce the same order given the same input list? + +Finally realized I could put my pubkeys online so I can just `curl +https://stevelosh.com/pubkeys >> ~/.ssh/authorized_keys`. No idea why I never +though of this before now. Also updated my site now that I've moved. Of course +everything just worked even though I haven't touched the site in months, because +it's written in Common Lisp which never changes. It's so nice to not have to +work with constantly-breaking shit. + +Installed R and RStudio for tomorrow's class. Base R is `r-base` in Debian. +Unfortunately Debian 12 has only been out for a few months and the RStudio +Debian package only supports 11, but apparently the `deb` one for Ubuntu 22 +works so I guess I'll just yolo it with that one for now. + +Reading for tomorrow's BS521 class: chapter 1 of OpenIntro Statistics (probably +also want to real Holm's stats book again as a refresher). All pretty basic so +far. + +## 2023-08-31 + +BS521 this morning. Still pretty straightforward. + +Going to create a separate note repo for lab notes, so I can take those +privately. I'll still put public notes here, but for certain things (e.g. DB +table names, etc) I don't want to have to worry about which are okay to make +public. + +Going to port the abortive attempt at a Zettelkasten into here in no particular +order (thank god for `grep`). Should have done this earlier but it's been +a hectic couple of days. Should I use heading more in this .plan for folding +purposes? Maybe. I still need to finish going through my fleeting notes that +I've got so far and getting them in here. I also should have already done this, +but I'll do that this weekend when I have some free time and try not to let it +pile up so much in the future. + +### Short-Read Sequencing + +Short-read sequencing aims to sequence long stretches of DNA (or RNA) by first +fragmenting them into small pieces, sequencing those pieces separately (usually +in parallel, for speed), then reassembling the fragments into contigs +computationally. + +There are several forms of short-read sequencing, but the most popular by far +today is Illumina's sequencing by synthesis approach. + +Short-read sequencing is often used for whole-genome sequencing. + +The cost of sequencing, especially short-read sequencing, has fallen +*dramatically* in the past 20 years, which has started to enable its use as +a clinical tool. It was impractical to sequence someone's genome to learn about +their health when it cost $1 billion, but it is much more reasonable to do it +when it costs $1,000. + +### Whole-Genome Sequencing + +Whole-genome sequencing refers to sequencing an organism's entire genome, +instead of just a limited part of it. + +This has the benefit of letting you get data on *all* of the genome at once, +without having to know what you're looking for in advance. But if you *do* know +what you're interested in, it can be much faster/cheaper to sequence just that +part (or can give you deeper coverage for the same cost). Like many things in +science and engineering, it's a tradeoff. + +### Fragmentation + +When sequencing DNA or RNA with short-read sequencing like Illumina, the strands +of nucleic acid need to be fragmented into shorter pieces. This usually happens +in one of a few ways: + +* Mechanical fragmentation by forcing it through a small passage +* Enzymatic fragmentation +* Sonication + +### Sequencing Alternatives + +Although sequencing costs have fallen dramatically in the past 20 years, they +are not completely free. There are several alternatives to sequencing that are +still used for a number of reasons: + +* They can provide the information more cheaply than sequencing. +* They can provide much deeper coverage than would be practical to get by sequencing. +* They can be done much faster than sequencing, allowing for rapid diagnostic use. + +Some examples of alternatives are: + +* Microarrays +* Nanostring +* qPCR +* Optical mapping + +### Post-Transcriptional Modification + +After RNA is transcribed from DNA, but before it goes on to do its actual job +(being translated into protein, or doing something on its own), it is sometimes +modified. There are a number of modifications that happen, some of the most +common are: + +* Splicing out introns. +* Polyadenalation of the 3' tail. +* Adding a 5' cap. + +### Post-Translational Modification + +After RNA is translated into protein, but before the protein begins to do its +intended work, it is sometimes modified. Modifications come in many forms and +can affect the terminal ends or one of the amino acid side chains. One common +modification is phosphorylation. + +### Chromatin + +Chromatin structure varies in the genome. Chromatin that is uncoiled and +available for work is euchromatin (eu = useful). Chromatin that is tightly +coiled and not accessible for transcription is called heterochromatin. +Heterochromatin has two sub-types of its own: + +* Facultative heterochromatin can temporarily become euchromatic. +* Constitutive heterochromatin is always condensed, and is usually found in + repetitive sections of the genome. + +[ATAC-Seq](https://en.wikipedia.org/wiki/ATAC-seq): "Assay for +Transposase-Accessible Chromatin using sequencing" is used to determine which +parts of the genome's chromatin are accessible (and, implicitly, which are not). +It uses a mutant, hyperactive transposase to insert sequencing adapters +*everywhere* it can inside the genome. Those tagged fragments are then +extracted and sequenced. + +[ChIP-seq](https://en.wikipedia.org/wiki/ChIP_sequencing): +"Chromatin-immunoprecipitation sequencing" is used to determine which parts of +genome particular proteins are interacting with, e.g. where a particular +transcription factor is binding to in the genome. The steps are roughly: + +1. Crosslink the DNA and proteins to lock them together, so the protein can't + unbind from the DNA (usually using formaldehyde). +2. [Fragment](n/fragmentation.md) the DNA into small (~500bp) fragments. +3. Use immunoprecipitation to extract only the proteins you are interested in + (antibodies that bind to your protein of interest). +4. Unlink the DNA and proteins, extract the DNA, and sequence it (e.g. with + Illumina). + +The resulting reads will be regions where the protein of interest was originally +bound to the DNA, allowing you to see (roughly) where a particular protein +interacts with the genome. + +### Homopolymer + +"Homopolymer" when used in the context of sequencing means a sequence of the +same base repeated, e.g. `AAAAA` or `CCC`. + +Many sequencing technologies have trouble with these kinds of sequences. For +example, Oxford Nanopore sequencing uses the properties of the bases currently +inside the pore to determine the bases that enter/leave. But for homopolymers +it's hard to tell when one enters and another leaves if they're completely +uniform along the entire pore. + +### Long-read sequencing + +Long-read sequencing is a relatively new branch of technology that allows for +sequencing much longer reads than e.g. Illumina. This is useful because many +parts of many genomes have regions that are difficult for short-read sequencing +to deal with, e.g. repetitive elements and high-level structural variation. +Long-read sequencing can help assemble those sequences which would be difficult +to piece together from short reads. + +### Oxford Nanopore Technologies + +A long-read sequencing technology that can be used to directly read a strand of +DNA using a molecular pore that produces current as DNA passes through it. The +variations in current can be recorded and processed to determine the DNA +sequence being passed through. + + ---, + \ + \ + =============== DNA helix + / + / ssDNA (unwound) + / + ( | ) + ( | ) motor protein + ( | ) + [ | ] + [ | ] pore for sensing DNA + [ | ] + [ | ] + [ | ] + ______[ | ]__________________________ + | + | + vvvvv DNA is fed through the pore by the motor protein and current (picoamps) measured + +ONT units are available as small devices that connect to a USB port on a laptop, +which makes them much more portable than other sequencers (though the library +prep equipment is not as portable, yet). + +### 10X Genomics Linked Reads + +A form of synthetic long-read sequencing from 10X. Long DNA fragments were +loaded into gel beads with barcodes. They were then fragmented and barcoded, +then sequenced on a normal sequencer. The barcodes allowed you to know which +short reads came from the same longer fragment (i.e. were "linked"), which helps +reconstruct the longer fragments from the short ones. + +We discontinued this product while I was working at 10X. RIP. + +### Non-uniform genotypes + +An organism is a **mosaic** if it has cells with different genomes that stem from +a somatic mutation in one of the ancestral cells, which then divided and +resulted in a chunk of the organism having the mutation while the rest does not. + +An organism is a **chimera** when it has cells with different genomes that are +caused by multiple zygotes combining early in development. The resulting +organism will have cells with completely different genomes depending on which +zygote they trace their ancestry back to. + +### Kaelin 2017 + +Read this paper from intro materials. Main thrust was about broad vs narrow +claims in papers. + +Scientific papers make and justify claims using evidence. Papers vary on how +many claims they make, and how well-supported each of those claims is with +evidence in the paper. + +Because scientists have limited time and papers must be a finite size, there is +almost always a tradeoff between making a broad set or a narrow set of claims: +more claims will generally mean less evidence to support each claim. + +The author claims that in biomedical research there is a growing trend toward +requiring papers with broader and more sweeping claims if you want to get +published. Reviewers want more and more claims, more "translatability" to the +real world. It is no longer enough to notice and document something is +happening, now you must also propose *why* it happens and experiment to +demonstrate this mechanism. + +Requiring authors of scientific papers to make broader and broader claims in +order to get published has several (likely unintended) effects. + +First, by adding more and more claims to a paper, each claim will likely have +less individual support behind it. Instead of a few claims supported by many +pieces of evidence: + + ------- ----- + ||||| ||| + +papers will end up with a collection of claims, each balancing precariously on +narrow support: + + --------- ----- --- --- ----- ---- + | | | | | | || + +This harms reproducibility because claims that are not supported by a robust set +of evidence are less likely to reproduce successfully. + +Broad claims are also often harder for peer reviewers to deal will -- they often +need to be an expert in multiple fields just to be able to evaluate the paper. + +### Recombination + +Recombination is a step that happens during meiosis where homologous chromosomes +cross over pieces of themselves, effectively swapping random chunks of their +sequences. This provides much more genetic variation than random segregation of +chromosomes alone, allowing sexual reproduction to explore more of the genetic +space more quickly. + +The number of recombination events (crossovers) per chromosome is random, but is +usually relatively low (3-5 per chromosome). + diff -r 8c2e86015e80 -r e20df9752955 README.markdown --- a/README.markdown Sun Sep 17 21:01:09 2023 -0400 +++ b/README.markdown Mon Sep 18 08:48:12 2023 -0400 @@ -5,616 +5,6 @@ [TOC] -# August 2023 - -## 2023-08-21 - -First day of orientation as a PhD student. Here we go again, back to school one -final time. - -Figured out the campus wifi despite the Linux jankery. Had to: - -1. Register as a special device, the laptop registration link redirected to - nothing useful. -2. Use `nm-connection-editor` from `gnome-network-manager` to edit the - connection and manually set up the WPA2+PEAP+user/pass for the connection. - -Finally fixed the dulwich errors from hg-git. Since I'm using Debian's hg now -instead of building from source, I needed to install dulwich somewhere that the -system python can find it. I almost installed `python3-pip` to do that, but -then realized I could just install `python3-dulwich` and be done. Cool. - -## 2023-08-22 - -Day 2. Lots of getting talked at, meeting with advisors, etc. Still trying to -get all the moving parts settled down — hopefully it'll be a lot clearer once -classes and rotations start (though I'm sure it'll be busy in a different way). - -First time trying the Ann Arbor bus system, the bus was 20 minutes late. This -bodes well. - -## 2023-08-23 - -Got a presentation about things to do/see/eat in Ann Arbor. Lots of things to -try. - -Advice from existing student panel: - -* Keep in touch and make connections with folks in your cohort/community who are - going through the same stuff as you. -* Rotate with anyone your want, don't let anyone tell you not to. -* Don't forget to have a life. Don't spend every weekend in the lab. -* Don't compare yourself to each other. Lots of variety in the incoming people. -* Ask for help. - -What to ask/think about when looking for a lab: - -* Do you have funding to support me? -* Do you have plans to stay at the university? -* Talk to current and previous students (and where those ended up). -* Talk to multiple students (people have different experiences). -* Work ethic. -* Mentoring style. -* What are your expectations of me? -* Priorities toward students (high/low?). -* How did they handle difficult situations in the lab? -* Find a good PI, because they're the one that will be there the entire time - (students/etc can and will leave). -* Tell them what you want to do (e.g. go to conferences), and based on their - responses you'll know whether they're a good fit. -* Switching labs is possible (not ideal though). -* Listen when people tell you a place sucks. -* Alumni are great because they don't have the power dynamic to worry about and - can actually be honest. -* You can change rotations if you really need to, even if it's the second week - in. -* Ask how many slots there are vs how many rotations they're taking to judge how - competitive joining the lab will be. - -## 2023-08-24 - -Did the Python pre-test so I don't have to take the introductory programming -class. - -Did more online training. - -More talks. Lots of meeting people after. - -## 2023-08-25 - -Lost power last night because of the storms and DTE says I won't get it back til -tomorrow. This is… not great. - -Doing more trainings and paperwork at a school building so I can plug in my -laptop. - -## 2023-08-28 - -Turns out I didn't get power until *last night*, i.e. three of God's own full -days after it went out. Ann Arbor is… not feeling great right now. - -Also, yesterday around 14:00 the university's network went down, entirely. -Unable to get online from the wifi, and any services hosted on the university -network are down, which is… not great. Of course, many of the services used -(e.g. Canvas) are third-party services not hosted on the university network. -Except they all use the university SSO (because Security™), which *is* hosted on -the network. So you can only use them if you're already logged in. And of -course sessions expire super quickly (because Security™) so, in summary: lol. - -Starting to use my HP-15C clone(s) from Swissmicros again. Was getting -misleading results when trying to do some stuff with logarithms: it seemed like -it only had two digits of precision. But after some digging around I realized -I must have configured the *display* precision to 2 SD's at some point in the -past and forgotten about it. The full precision is there, it was only rounding -to display. I set it to 6 to avoid this, but also the `PREFIX` key will show -the full precision temporarily. - -Figured out (again) how to program the calculator to do logs of arbitrary bases. -Essentially: - - logₓ(y) = ln(y)/ln(x) - - stack - LBL C x y - LN x ln(y) - x<->y ln(y) x - LN ln(y) ln(x) - ÷ ln(y)/ln(x) = logₓ(y) - RTN - -Trying out syncthing as a Dropbox replacement. Installing: - - sudo apt install syncthing - systemctl --user enable syncthing.service - systemctl --user start syncthing.service - -Then to access the admin. - -Got my budget scripts working and synced via syncthing (also shaved a couple of -yaks by making scripts to archive/create new hosts while I was at it). Seems to -work okay at the moment. Will gradually transition other stuff over time. - -Going to spend some time learning about Nextflow while I wait to hear from -rotation folks. Nextflow is basically a DAG, where: - -* Edges are FIFO queues (Nextflow calls them "channels") -* Vertices are things that consume input from their channels and produce output (Nextflow calls them "processes"). - -There are two types of channels. First: queue channels: asynchronous FIFO queues. Examples: - - # emits sequence of given values - ch = Channel.of(1, 1, 2, 3, 5, 8) - - # emits a single file path (queue of size 1) - ch = Channel.fromPath('data/one-single-file.txt') - - # emits multiple file paths - ch = Channel.fromPath('data/*.txt') - -Value channels: like queue channels, but just emit the same value over and over. -Basically `(constantly val)`. - -Processes: basically stages of a pipeline. Take input and output definitions, -plus something to run (e.g. `shell`). Example after a bit of poking around: - - - // allows you to define processes to be used for modular libraries - nextflow.enable.dsl = 2 - - workflow { - ids = Channel.fromPath('data/ids.txt') // single-item channel - chunksize = Channel.value(1000) // (constantly 1000) but will only ever be used once here - - // The process below produces a list of outputs. It will only ever be - // run once, but nextflow doesn't know that -- you could potentially - // have a process run multiple times that each produces a list. So by - // default it groups all the outputs into a single emitted value. But - // here we want to flatted [[aa ab ac ad ae]] into [aa ab ac ad ae]. - batched_ids = split_ids(ids, chunksize) | flatten - batched_ids.view() // .view() doesn't consume, good for debugging - - result = reverse(batched_ids) - result.view() - } - - process split_ids { - input: - path(ids) - val(chunksize) - - output: - file('batch-*') - - shell: - """ - split -l !{chunksize} !{ids} batch- - """ - } - - process reverse { - input: - path(batch_file) - - output: - file('result') - - shell: - """ - tac !{batch_file} > result - """ - } - -Nextflow seems to have the concept of a "run name", i.e. an identifier for -a particular run. It creates a `work/` directory with the output files, but -*also* seems to splat out a bunch of hidden `.nextflow/` and `.nextflow.log.*` -files in the current directory. `nextflow clean` removes `work` but not the -hidden files. - -Can run with some basic reporting with some extra flags: - - nextflow run example.nf -with-report report.html -with-dag graph.svg - -Of course there appears to be some backwards-incompatible jank in the language -already. Reading through -shows minor syntax changes I guess I'll need to be aware of when looking at -things a few years old. - -Putting some things on the TODO list for learning mroe about nextflow: - -* (uses old DSL) -* - -I also need to get some basic scratch VM infrastructure set up with qemu/vagrant -and Ansible so I can test out Nextflow/pipeline stuff without polluting my own -machines and/or any servers I eventually get access to with random testing -garbage. Maybe I'll put that on tomorrow's list too. - -## 2023-08-29 - -First BIOSTAT-521 class. This one seems like it's going to be quite easy, but -given I have a crazily-hard class and lab rotation work to do, I think that's -fine with me. It *does* use R, which will be a good excuse to poke around at -that since it's used so heavily in the industry. - -First lab meeting as well. It was good to meet everyone in person. Talked -about scheduling stuff and overall gist of my project, though we can't move -forward directly right now while the university network is still hosed. But -I did get some more information about things I'll want to look into: - -* Snakemake (not Nextflow, oh well). -* Singularity containers (should probably read the UNIX book section on - containerization first). -* I still want to get Vagrant/libvirt/qemu/Ansible working to make scratch envs. - -Going to meet again on Thursday to see if the network is available. If so I'll -dive in more then — until then I'll just poke around learning that stuff because -I know I'll need it eventually. - -Don't have a lot of time before next class, but I went ahead and installed -Snakemake highlighting/etc for Vim. - -## 2023-08-30 - -HG545 this morning. Class was mostly about how to succeed in the class. - -Still trying to decide on how I want to take notes while I'm here. I read the -Zettelkasten book and was considering trying that. But after poking around at -it I'm not sure I like the overhead of having to link everything up all the -time. I tried creating some notes while studying and it was a pain to have to -try to link everything to something else. Sometimes I want to just jot down -something without worrying about its place in the graph. I think I might end up -going with this current format (stream-of-consciousness .plan file style notes) -for everything, but taking a few things from Zettelkasten than might help: - -* Take fleeting notes as I read. -* Turn fleeting notes into permanent ones, but as text in my .plan instead of - linked entries in some other system. - -That seems like something I might actually *do*, and hopefully with grep it'll -be good enough for what I need. - -Installing JabRef to try as a reference manager. Zotero looks nicer (no stupid -flat UI) but syncing the DB requires sending it to their web thing. JabRef -seems to use a plain text file so I can probably just sync it with syncthing and -deal with conflicts manually. Spent some time adding a couple of papers to it. -Not sure it's great (it got the info wrong for 2/3 papers) but I guess that's -just typical open source jankery. - -Apparently you can just `C-v` a DOI into JabRef and it'll import it. Hard to -discover, but seems to work okay. JabRef is complaining about capital letters -in the titles but I'll figure out that jankery later. At least I've got -something for now. - -Had some wonkiness with my Syncthing budget stuff, but I think I just forgot to -reeval the location on my desktop. Will poke around more if anything else seems -to break. - -Watched some snakemake videos and read through their paper. This smells a lot -more academic than Nextflow did, which is a little worrying. I'm sure it'll be -fine in the end though. - -Send off the rest of my VA paperwork so things can get moving on that side. - -Read the ULSAH section on containers to get a high-level overview. Started -looking into Singularity and it's already looking spicy. Apparently the project -forked a couple of years ago and there are now two competing versions? Great. -Also you have to install it from source, which requires installing Golang. -I thought I was free of Rob Pike's Googly Tendrils but I guess I never will be. -Installed Go, built Singularity. At least it installs to a prefix -(`/opt/singularity`), so I can remove it easily if I want. - -Poked around a little to make sure it's working, e.g.: - - singularity pull docker://debian:bookwork-slim - singularity shell debian_bookworm-slim.sif - -Seems to be working as far as I can tell. - -Also installing snakemake. Using pip with a venv for now even though the -documentation tries to convince you not to. If anything breaks I can revisit -it, but for now it's probably fine to go through some tutorials without pulling -in some giant slab of junk. - -Started going through the Snakemake tutorial. - -> Since the rule has multiple input files, Snakemake will concatenate them, -> separated by a whitespace [sic] - -Oh boy. - -Realized I'd need to install a pile of stuff to get through the tutorial, -decided to pause and shave the qemu yak first so I can do this without dumping -a ton of stuff on my laptop. So many yaks. - -Shaved the qemu yak, now I've got a reliable VM setup. Committed the -instructions and a tiny script to a `vms` repo so I don't have to relearn this -again. - -With that out of the way, installed Snakemake and all the prereqs from their -tutorial on the VM with wild abandon. Now I can *actually* do the tutorial. -The simple tutorial was straightforward for the most part, but for this: - - rule bcftools_call: - input: - fa="data/genome.fa", - bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES), - bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES) - output: - "calls/all.vcf" - shell: - "bcftools mpileup -f {input.fa} {input.bam}" - " | bcftools call -mv - > {output}" - -It's not clear how the expanded input lists are ordered. Are they guaranteed to -always produce the same order given the same input list? - -Finally realized I could put my pubkeys online so I can just `curl -https://stevelosh.com/pubkeys >> ~/.ssh/authorized_keys`. No idea why I never -though of this before now. Also updated my site now that I've moved. Of course -everything just worked even though I haven't touched the site in months, because -it's written in Common Lisp which never changes. It's so nice to not have to -work with constantly-breaking shit. - -Installed R and RStudio for tomorrow's class. Base R is `r-base` in Debian. -Unfortunately Debian 12 has only been out for a few months and the RStudio -Debian package only supports 11, but apparently the `deb` one for Ubuntu 22 -works so I guess I'll just yolo it with that one for now. - -Reading for tomorrow's BS521 class: chapter 1 of OpenIntro Statistics (probably -also want to real Holm's stats book again as a refresher). All pretty basic so -far. - -## 2023-08-31 - -BS521 this morning. Still pretty straightforward. - -Going to create a separate note repo for lab notes, so I can take those -privately. I'll still put public notes here, but for certain things (e.g. DB -table names, etc) I don't want to have to worry about which are okay to make -public. - -Going to port the abortive attempt at a Zettelkasten into here in no particular -order (thank god for `grep`). Should have done this earlier but it's been -a hectic couple of days. Should I use heading more in this .plan for folding -purposes? Maybe. I still need to finish going through my fleeting notes that -I've got so far and getting them in here. I also should have already done this, -but I'll do that this weekend when I have some free time and try not to let it -pile up so much in the future. - -### Short-Read Sequencing - -Short-read sequencing aims to sequence long stretches of DNA (or RNA) by first -fragmenting them into small pieces, sequencing those pieces separately (usually -in parallel, for speed), then reassembling the fragments into contigs -computationally. - -There are several forms of short-read sequencing, but the most popular by far -today is Illumina's sequencing by synthesis approach. - -Short-read sequencing is often used for whole-genome sequencing. - -The cost of sequencing, especially short-read sequencing, has fallen -*dramatically* in the past 20 years, which has started to enable its use as -a clinical tool. It was impractical to sequence someone's genome to learn about -their health when it cost $1 billion, but it is much more reasonable to do it -when it costs $1,000. - -### Whole-Genome Sequencing - -Whole-genome sequencing refers to sequencing an organism's entire genome, -instead of just a limited part of it. - -This has the benefit of letting you get data on *all* of the genome at once, -without having to know what you're looking for in advance. But if you *do* know -what you're interested in, it can be much faster/cheaper to sequence just that -part (or can give you deeper coverage for the same cost). Like many things in -science and engineering, it's a tradeoff. - -### Fragmentation - -When sequencing DNA or RNA with short-read sequencing like Illumina, the strands -of nucleic acid need to be fragmented into shorter pieces. This usually happens -in one of a few ways: - -* Mechanical fragmentation by forcing it through a small passage -* Enzymatic fragmentation -* Sonication - -### Sequencing Alternatives - -Although sequencing costs have fallen dramatically in the past 20 years, they -are not completely free. There are several alternatives to sequencing that are -still used for a number of reasons: - -* They can provide the information more cheaply than sequencing. -* They can provide much deeper coverage than would be practical to get by sequencing. -* They can be done much faster than sequencing, allowing for rapid diagnostic use. - -Some examples of alternatives are: - -* Microarrays -* Nanostring -* qPCR -* Optical mapping - -### Post-Transcriptional Modification - -After RNA is transcribed from DNA, but before it goes on to do its actual job -(being translated into protein, or doing something on its own), it is sometimes -modified. There are a number of modifications that happen, some of the most -common are: - -* Splicing out introns. -* Polyadenalation of the 3' tail. -* Adding a 5' cap. - -### Post-Translational Modification - -After RNA is translated into protein, but before the protein begins to do its -intended work, it is sometimes modified. Modifications come in many forms and -can affect the terminal ends or one of the amino acid side chains. One common -modification is phosphorylation. - -### Chromatin - -Chromatin structure varies in the genome. Chromatin that is uncoiled and -available for work is euchromatin (eu = useful). Chromatin that is tightly -coiled and not accessible for transcription is called heterochromatin. -Heterochromatin has two sub-types of its own: - -* Facultative heterochromatin can temporarily become euchromatic. -* Constitutive heterochromatin is always condensed, and is usually found in - repetitive sections of the genome. - -[ATAC-Seq](https://en.wikipedia.org/wiki/ATAC-seq): "Assay for -Transposase-Accessible Chromatin using sequencing" is used to determine which -parts of the genome's chromatin are accessible (and, implicitly, which are not). -It uses a mutant, hyperactive transposase to insert sequencing adapters -*everywhere* it can inside the genome. Those tagged fragments are then -extracted and sequenced. - -[ChIP-seq](https://en.wikipedia.org/wiki/ChIP_sequencing): -"Chromatin-immunoprecipitation sequencing" is used to determine which parts of -genome particular proteins are interacting with, e.g. where a particular -transcription factor is binding to in the genome. The steps are roughly: - -1. Crosslink the DNA and proteins to lock them together, so the protein can't - unbind from the DNA (usually using formaldehyde). -2. [Fragment](n/fragmentation.md) the DNA into small (~500bp) fragments. -3. Use immunoprecipitation to extract only the proteins you are interested in - (antibodies that bind to your protein of interest). -4. Unlink the DNA and proteins, extract the DNA, and sequence it (e.g. with - Illumina). - -The resulting reads will be regions where the protein of interest was originally -bound to the DNA, allowing you to see (roughly) where a particular protein -interacts with the genome. - -### Homopolymer - -"Homopolymer" when used in the context of sequencing means a sequence of the -same base repeated, e.g. `AAAAA` or `CCC`. - -Many sequencing technologies have trouble with these kinds of sequences. For -example, Oxford Nanopore sequencing uses the properties of the bases currently -inside the pore to determine the bases that enter/leave. But for homopolymers -it's hard to tell when one enters and another leaves if they're completely -uniform along the entire pore. - -### Long-read sequencing - -Long-read sequencing is a relatively new branch of technology that allows for -sequencing much longer reads than e.g. Illumina. This is useful because many -parts of many genomes have regions that are difficult for short-read sequencing -to deal with, e.g. repetitive elements and high-level structural variation. -Long-read sequencing can help assemble those sequences which would be difficult -to piece together from short reads. - -### Oxford Nanopore Technologies - -A long-read sequencing technology that can be used to directly read a strand of -DNA using a molecular pore that produces current as DNA passes through it. The -variations in current can be recorded and processed to determine the DNA -sequence being passed through. - - ---, - \ - \ - =============== DNA helix - / - / ssDNA (unwound) - / - ( | ) - ( | ) motor protein - ( | ) - [ | ] - [ | ] pore for sensing DNA - [ | ] - [ | ] - [ | ] - ______[ | ]__________________________ - | - | - vvvvv DNA is fed through the pore by the motor protein and current (picoamps) measured - -ONT units are available as small devices that connect to a USB port on a laptop, -which makes them much more portable than other sequencers (though the library -prep equipment is not as portable, yet). - -### 10X Genomics Linked Reads - -A form of synthetic long-read sequencing from 10X. Long DNA fragments were -loaded into gel beads with barcodes. They were then fragmented and barcoded, -then sequenced on a normal sequencer. The barcodes allowed you to know which -short reads came from the same longer fragment (i.e. were "linked"), which helps -reconstruct the longer fragments from the short ones. - -We discontinued this product while I was working at 10X. RIP. - -### Non-uniform genotypes - -An organism is a **mosaic** if it has cells with different genomes that stem from -a somatic mutation in one of the ancestral cells, which then divided and -resulted in a chunk of the organism having the mutation while the rest does not. - -An organism is a **chimera** when it has cells with different genomes that are -caused by multiple zygotes combining early in development. The resulting -organism will have cells with completely different genomes depending on which -zygote they trace their ancestry back to. - -### Kaelin 2017 - -Read this paper from intro materials. Main thrust was about broad vs narrow -claims in papers. - -Scientific papers make and justify claims using evidence. Papers vary on how -many claims they make, and how well-supported each of those claims is with -evidence in the paper. - -Because scientists have limited time and papers must be a finite size, there is -almost always a tradeoff between making a broad set or a narrow set of claims: -more claims will generally mean less evidence to support each claim. - -The author claims that in biomedical research there is a growing trend toward -requiring papers with broader and more sweeping claims if you want to get -published. Reviewers want more and more claims, more "translatability" to the -real world. It is no longer enough to notice and document something is -happening, now you must also propose *why* it happens and experiment to -demonstrate this mechanism. - -Requiring authors of scientific papers to make broader and broader claims in -order to get published has several (likely unintended) effects. - -First, by adding more and more claims to a paper, each claim will likely have -less individual support behind it. Instead of a few claims supported by many -pieces of evidence: - - ------- ----- - ||||| ||| - -papers will end up with a collection of claims, each balancing precariously on -narrow support: - - --------- ----- --- --- ----- ---- - | | | | | | || - -This harms reproducibility because claims that are not supported by a robust set -of evidence are less likely to reproduce successfully. - -Broad claims are also often harder for peer reviewers to deal will -- they often -need to be an expert in multiple fields just to be able to evaluate the paper. - -### Recombination - -Recombination is a step that happens during meiosis where homologous chromosomes -cross over pieces of themselves, effectively swapping random chunks of their -sequences. This provides much more genetic variation than random segregation of -chromosomes alone, allowing sexual reproduction to explore more of the genetic -space more quickly. - -The number of recombination events (crossovers) per chromosome is random, but is -usually relatively low (3-5 per chromosome). - # September 2023 ## 2023-09-01 @@ -1542,3 +932,7 @@ See lab notebook. Finished BS521 homework 3. + +## 2023-09-18 + +HG545.