modes-slurm leaves SLURM job running on crash of master
When modes-slurm.sh is running and the master (?) encounters an exception, the associated SLURM job keeps running and occupying resources:
- Encounter error:
    token-wireless.modest: error: Could not connect to host 188.8.131.52.
    token-wireless.modest: error: Could not connect to host 184.108.40.206.
    [ERROR] FATAL UNHANDLED EXCEPTION: System.ObjectDisposedException: The CancellationTokenSource has been disposed.
      at System.Threading.CancellationTokenSource.Cancel (System.Boolean throwOnFirstException) [0x00000] in <cc0368638257483f94f364ec47500332>:0
    [...]
- squeue still lists the job (and another one that had also crashed):
    JOBID   PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
    157516  r415       bh_modes  hartmann  R   4:55  20     ctit[001-020]
    157517  r415       6h_modes  hartmann  R   3:55  20     ctit[021-040]
The jobs then have to be cancelled manually with scancel. The expected behaviour is that the script cancels the jobs itself, even when a crash occurs.
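
One possible way to get the expected behaviour is sketched below. It is not taken from modes-slurm.sh, whose internals are not shown here. It assumes the script knows the ID of the job it started and launches the master as an ordinary command. The function name run_with_job_cleanup and its arguments are hypothetical. scancel is called with a plain job ID, and an EXIT trap in a subshell makes sure it runs whether the master exits normally or crashes.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_job_cleanup is a hypothetical helper, not part of
# modes-slurm.sh. It assumes the caller already knows the SLURM job ID.
#
# Usage: run_with_job_cleanup JOBID COMMAND [ARGS...]
run_with_job_cleanup() {
    local job_id=$1
    shift
    (
        # The EXIT trap fires when this subshell ends, whether the command
        # succeeded, failed, or the subshell was told to exit early.
        # "|| true" ignores the scancel error for a job that is already gone.
        trap 'scancel "$job_id" || true' EXIT

        "$@"
        status=$?

        # Pass the command's exit status on to the caller.
        exit "$status"
    )
}

# Example call (the job ID and the master command are placeholders):
#   run_with_job_cleanup "$job_id" ./start-master.sh
```

The trap is set in a subshell so that it covers only the master and leaves any traps of the surrounding script alone. Note that this sketch cancels the job after a successful run as well, on the assumption that the allocation is no longer needed once the master has finished.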