The motivation is to use resources on a Slurm cluster that scale up and down dynamically, while keeping the convenience of interactive development.

The implementation idea is that the Slurm “job” consists of a netcat to port 22, and the rest flows from that. Here is an SSH config file (~/.ssh/config):

Host headnode
    Hostname i-head
    User myuser
    ProxyCommand ssh-aws-ssm.sh %h %p
    IdentityFile ~/.ssh/id_ed25519
    ForwardAgent yes
    ControlMaster auto
    ControlPath /tmp/ssh-%r@%h:%p

Host workernode
    ForwardAgent yes
    User myuser
    ProxyCommand ssh headnode /opt/slurm/bin/srun --unbuffered nc -q 1 localhost 22
    IdentityFile ~/.ssh/id_ed25519
    StrictHostKeyChecking no
    ControlMaster auto
    ControlPath /tmp/ssh-%r@%h:%p

And an explanation:

  1. The head node connection goes through an AWS Session Manager helper script (the example uses AWS ParallelCluster).
  2. ForwardAgent is enabled for easier access to the workers.
  3. ControlMaster is set on both the head node and the worker node for easy connection multiplexing.
  4. The connection to the worker node is a ProxyCommand that runs ssh to the head node and then uses srun to run a netcat (nc) job on the worker.
  5. The option "--unbuffered" is required for srun, so that it passes the SSH stream through without buffering.
  6. The option "-q 1" makes netcat exit, and so cleans up the job, after the original ssh connection drops.
  7. "StrictHostKeyChecking no" is needed because the worker host is dynamic, so its host key is expected to change over time.
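
Because the worker is just whatever node srun lands the netcat job on, the same pattern can in principle request specific resources by passing ordinary srun options in the ProxyCommand. A minimal sketch, assuming a partition called "gpu" exists on the cluster and that a four-hour limit is wanted; the host alias "workernode-gpu", the partition name and the time limit are placeholders, not part of the original setup:

```
# Hypothetical variant of the workernode entry.
# "gpu" and "04:00:00" are placeholder values; adjust to the cluster's partitions and limits.
Host workernode-gpu
    ForwardAgent yes
    User myuser
    # Same netcat job as before, but srun is asked for a given partition and wall time
    ProxyCommand ssh headnode /opt/slurm/bin/srun --unbuffered --partition=gpu --time=04:00:00 nc -q 1 localhost 22
    IdentityFile ~/.ssh/id_ed25519
    StrictHostKeyChecking no
    ControlMaster auto
    ControlPath /tmp/ssh-%r@%h:%p
```

With an entry like this, "ssh workernode-gpu" would submit the netcat job to that partition and the session would end when the job hits its time limit.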

With this setup, VS Code can connect directly to a worker node, which will be spun up automatically if needed and shut down when there are no more connections.
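
If the worker going away as soon as the last session closes is too aggressive (for example when reconnecting often), one possible tweak is OpenSSH's ControlPersist option, which keeps the multiplexing master, and with it the ProxyCommand and its Slurm job, alive for an idle period. This is a sketch of an optional addition, not part of the original configuration, and the ten-minute value is an arbitrary assumption:

```
# Optional addition (assumption, not in the original setup):
# keep the master connection open for 10 idle minutes after the last client exits,
# so the netcat job and the worker stay up for quick reconnects.
Host workernode
    ControlPersist 10m
```

The trade-off is that the node keeps running, and is billed, for that idle period before the job is cleaned up.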