modes-slurm leaves SLURM job running on crash of master
When modes-slurm.sh is running and the master (?) encounters an exception, the associated SLURM job keeps running and continues to occupy resources:
- Encounter error:
token-wireless.modest: error: Could not connect to host 130.89.6.205.
token-wireless.modest: error: Could not connect to host 130.89.6.215.
[ERROR] FATAL UNHANDLED EXCEPTION: System.ObjectDisposedException: The CancellationTokenSource has been disposed.
at System.Threading.CancellationTokenSource.Cancel (System.Boolean throwOnFirstException) [0x00000] in <cc0368638257483f94f364ec47500332>:0
[...]
- squeue still lists the job (and another one that had also crashed):
JOBID   PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
157516  r415       bh_modes  hartmann  R   4:55  20     ctit[001-020]
157517  r415       6h_modes  hartmann  R   3:55  20     ctit[021-040]
The jobs then have to be cancelled manually with scancel. The expected behaviour is that the script cancels its own jobs, even when a crash occurs.
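
One possible way to get this behaviour would be an exit trap in the wrapper script. The following is only a rough sketch, not taken from modes-slurm.sh: it assumes the script knows the ID of the job it submitted and keeps it in a variable, here called JOB_ID. The names JOB_ID, CANCEL_CMD and cleanup are made up for illustration. CANCEL_CMD defaults to scancel and is only overridable so that the sketch can be dry-run without a cluster.

```shell
#!/bin/sh
# Hypothetical sketch: cancel the SLURM job whenever the wrapper script exits,
# including when the master process crashes and the script terminates early.
# JOB_ID, CANCEL_CMD and cleanup are illustrative names, not from modes-slurm.sh.

CANCEL_CMD="${CANCEL_CMD:-scancel}"  # override with e.g. CANCEL_CMD=echo for a dry run
JOB_ID=""                            # assumed to be set once the job has been submitted

cleanup() {
    # Only try to cancel if a job was actually submitted.
    if [ -n "$JOB_ID" ]; then
        "$CANCEL_CMD" "$JOB_ID"
    fi
}

# Run cleanup on every exit path of the script.
trap cleanup EXIT
# Turn common signals into a regular exit so that the EXIT trap fires as well.
trap 'exit 130' INT
trap 'exit 143' TERM
```

With something like this in place, the job would be cancelled regardless of whether the script ends normally, fails, or is interrupted with Ctrl-C. Calling scancel on a job that has already finished should be harmless apart from an error message.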