# HG changeset patch
# User Steve Losh
# Date 1714501687 14400
# Node ID ed6c7d93c3ff1ff2f4ec49bf53056ed967b5a4ce
# Parent  fb895759f5047802f0c482a5ab1e46816bdc2a93
Update

diff -r fb895759f504 -r ed6c7d93c3ff README.markdown
--- a/README.markdown	Mon Apr 22 11:41:21 2024 -0400
+++ b/README.markdown	Tue Apr 30 14:28:07 2024 -0400
@@ -454,3 +454,210 @@

## 2024-04-22

Added a `--slurp` kludge to CACL so I can use it for easy math. For some reason
when I use `/dev/tty` as the input I end up having to `c-d` twice to exit the
REPL, but I can't be bothered to dig that rabbit hole apart today, so whatever,
it's fine for now I guess. The joys of personal software.

Starting all the paperwork to join the lab for the rest of my PhD. Exciting!

## 2024-04-23

Practiced the 545 presentation a couple more times and got it down to under
2 minutes. I wish I could update the presenter notes, but sadly that's not an
option. Oh well. The presentation went fine.

Did the BS522 final. 3/25 questions were just… wrong on the test. Got an email
hours later acknowledging that, but yeah, that fits with this course overall.
At least I'm done with it now, thankfully.

## 2024-04-24

Did the second (easier) half of the 529 homework. I'm realizing now that we
were probably supposed to pass a locals dict in the lab instead of creating
local variables and rawdogging the rules with `eval` to mutate the state,
whoops.

## 2024-04-25

Realized UM students can get a free Yubikey from the university tech shop, so
I picked one up today. Nice.

Almost had a heart attack when the GSI for 529 emailed about the homework being
due in 1 hour. The assignment page had "Available until (later date)" as the
first words under the assignment, and my brain parsed that as the due date when
the actual due date was listed later, so I almost completely skipped it.
## 2024-04-29

Helped someone debug how to upload stuff to an external service via Great
Lakes, which was an interesting rabbit hole to go down. Gonna port the notes
here for my future self, because I'm sure I'm going to have to do something
like this at some point.

The first layer of the problem is that the Great Lakes slurm nodes can't access
the internet by default. [ARC's internet-from-nodes
document](https://arc.umich.edu/document/internet-from-nodes/) has some notes
on why this is the case. So we need to use one of the workarounds on that page,
but which one you need starts bumping up against the next layer of problems.

The recommended upload method for what they were looking at was `ncftpput`,
which uses FTP, which is notorious for being a weird pain in the ass of
a protocol because it uses multiple TCP connections on different ports to
transfer shit instead of just using a single connection like… everything else.
So using FTP here would be even more painful because we'd need to deal with
wrangling multiple ports/connections on the slurm nodes, which is not great.

Luckily SFTP is also an option in this particular case. SFTP is probably going
to be easier here because it only uses a single connection/port.

Side note: there's `/etc/profile.d/http_proxy.sh` on the cluster, but that just
sets some environment variables for HTTP-based proxies. `ncftpput` explicitly
says it cannot use these in its `firewall` file, so that doesn't help.

So step one is to use the final workaround in that ARC document to set up
a tunnel to the server on the slurm node, then point SFTP at that. According to
that document you can do that with something like this:

    ssh -o HostKeyAlgorithms=ssh-ed25519 \
        -NAL 9922:sftp-private.ncbi.nlm.nih.gov:22 \
        greatlakes-xfer.arc-ts.umich.edu \
        &

But if you actually *try* that you immediately hit a problem that they just
kinda… ignore? in the document.
SSH prompts you to check the host key
fingerprint the first time you log in, which happens interactively and can't
happen when it's backgrounded with `&` like that. So we need to run `ssh -o
HostKeyAlgorithms=ssh-ed25519 -NAL 9922:sftp-private.ncbi.nlm.nih.gov:22
greatlakes-xfer.arc-ts.umich.edu` once interactively and accept the host key by
hand, which will store it in `~/.ssh/known_hosts`, then just `^C` it. Then (as
long as they don't change the keys) we never need to do that again.

As for the `HostKeyAlgorithms` crap, that's yet another snag. ARC also keeps
a list of known host fingerprints for the Great Lakes hosts in something like
`/etc/ssh/ssh_known_hosts` (which is good). SSH will check that these
fingerprints are valid when you connect (also good). But SSH supports multiple
key formats, and they've configured the SSH server on the `greatlakes-xfer`
host to prefer to send one format while listing a different one in their
`ssh_known_hosts` (I have no idea why they would do this; it's probably an
oversight?). So `-o HostKeyAlgorithms=ssh-ed25519` tells SSH to prefer the key
format that matches what's in `/etc/ssh/ssh_known_hosts`.

Okay, so at this point we need something like this:

    #!/bin/bash

    #SBATCH lines here…

    set -euo pipefail

    # https://arc.umich.edu/document/internet-from-nodes/

    ssh -o HostKeyAlgorithms=ssh-ed25519 \
        -NAL 9922:sftp-private.ncbi.nlm.nih.gov:22 \
        greatlakes-xfer.arc-ts.umich.edu \
        &

    sleep 5

    # Actually upload the files here

    kill %1

And now we can finally get to the internet from the slurm node.

We want to upload with `sftp`, logging in and running something like:

    cd uploads/somedir
    mkdir new_thing
    cd new_thing
    put /path/to/data

First snag: we want to upload an entire directory, so we probably need `put -r`
to do that recursively.
It's probably also worth adding `-a` to make it
resumable:

    cd uploads/somedir
    mkdir new_thing
    cd new_thing
    put -ar /path/to/data

But we don't want to have to interactively type the commands here, since it's
in a script. `sftp` has a `-b batchfile` option where you can put those
commands into a file and read it in, e.g.:

    sftp -b commands.txt user@host

We could do it that way, or use bash process substitution to fake a file
without actually creating one:

    sftp -b <(echo "
        cd uploads/somedir
        mkdir new_thing
        cd new_thing
        put -ar /path/to/data
    ") user@host

Since we're using a tunnel instead of connecting directly to the server, we
point `sftp` at the forwarded local port:

    sftp -P 9922 \
        -b <(echo "
            cd uploads/somedir
            mkdir new_thing
            cd new_thing
            put -ar /path/to/data
        ") \
        someuser@localhost

But there's one more snag: entering the password for the SFTP server. For that
there's a handy utility called `sshpass` that's already installed on the
cluster:

    SSHPASS="mypassword" sshpass -e \
        sftp -P 9922 \
        -b <(echo "
            cd uploads/somedir
            mkdir new_thing
            cd new_thing
            put -ar /path/to/data
        ") \
        someuser@localhost

Could also stick the password into a text file (probably with `600` permissions
for safety) and use `sshpass -fpassword.txt` instead of an environment
variable, to avoid including it in the script.
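A minimal sketch of that file-based variant (the filename and password are made
up, and the `sshpass` line at the end is just the one from above with `-f`
swapped in for `-e`):

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical sketch: create the password file once by hand, readable only
# by us, so the slurm script itself never contains the password.
pass_file=sftp_pass
umask 077                  # files created from here on get mode 600
printf '%s\n' 'mypassword' > "$pass_file"
chmod 600 "$pass_file"     # belt and suspenders

# Then in the slurm script, instead of SSHPASS=… sshpass -e …:
#   sshpass -f sftp_pass sftp -P 9922 -b commands.txt someuser@localhost
```

The `umask 077` means even the brief window before `chmod` runs never leaves
the file world-readable.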
I think that's all the layers of the onion, so the final script would look
something like this:

    #!/bin/bash

    #SBATCH …

    set -euo pipefail

    # https://arc.umich.edu/document/internet-from-nodes/

    ssh -o HostKeyAlgorithms=ssh-ed25519 \
        -NAL 9922:sftp-private.ncbi.nlm.nih.gov:22 \
        greatlakes-xfer.arc-ts.umich.edu \
        &

    sleep 5

    # Might be safer to use the file-based mode for sshpass with sshpass -fsomefile
    SSHPASS="mypassword" sshpass -e \
        sftp -P 9922 \
        -b <(echo "
            cd uploads/somedir
            mkdir new_thing
            cd new_thing
            put -ar /path/to/data
        ") \
        someuser@localhost

    kill %1

Whew.
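One thing I might tweak later: the `sleep 5` is a race, because if the tunnel
takes longer than five seconds to come up the `sftp` just fails. An untested
sketch (not from the ARC docs) of polling the forwarded port instead, using
bash's `/dev/tcp` pseudo-device; the port matches the tunnel above and the
retry cap is arbitrary:

```shell
#!/bin/bash
set -euo pipefail

# Untested sketch: wait until something is actually listening on the
# forwarded port instead of hoping `sleep 5` was long enough.
wait_for_port () {
    local port=$1 tries=${2:-30}
    for _ in $(seq 1 "$tries"); do
        # Opening a /dev/tcp redirection succeeds once the port accepts
        # connections; the subshell closes the fd again immediately.
        if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    echo "tunnel on port $port never came up" >&2
    return 1
}

# In the script above, replace `sleep 5` with:
#   wait_for_port 9922
```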