template: inverse class: middle, center, inverse
A task list manager for hpc-clusters. Among others it supports monitoring, automatic resubmission, profiling, and reporting of job lists
https://github.com/holgerbrandl/joblist
Holger Brandl
14.12.2015 MPI-CBG
What --> Why --> Usecases + Live Examples
Conceptually JobList jl
is just managing lists of job-ids as reported by the underlying queuing system
>cat .some_jobs
860671
860681
860686
860688
860691
- Processing states: Did they finish? Exit codes?
- Runtime statistics: How long? How much pending When?
- Stderr/out Logs?
- Metrics?
-
Workflow-graph not always well defined or contains just a single node
-
Snakemake: own programming language
-
Queue: Too much tuned to GATK
-
Not tools for direct flow control, but for workflow execution engines
-
DRMAA: Just (bloated) API not a tool; reference v2 has not been touched since 3y
Needed:
- Simplistic cross-cluster job-control toolchain, with reporting, convenient API access, and support for different HPC platforms
???
Other motivations
- increase robustness and quality of existing hpc-workflows
- deepen understanding of HPC in general
- get away from bash
- hone scala skills
- do some nice API design
http://www.drmaa.org/ogf34.pdf
- Submit, monitor and wait until an entire a list of clusters jobs has finished
- Report average runtime statistics, and predict the remaining runtime of a joblist based on cluster load and job complexities
- Recover crashed jobs and resubmit them again using a customizable set of resubmission strategies.
> jl --help
Usage: jl <command> [options] [<joblist_file>]
Supported commands are
submit Submits a named job including automatic stream redirection and adds
it to the list
add Extract job-ids from stdin and add them to the list
wait Wait for a list of tasks to finish
status Prints various statistics and allows to create an html report for the
list
kill Removes all queued jobs of this list from the scheduler
up Moves a list of jobs to the top of a queue if supported by the
underlying scheduler
If no <joblist_file> is provided, jl will use '.jobs' as default
- Cover all essential manipulation tasks for a given joblist
- Cross-platform, just requires Java (and optionally R+pandoc) for reporting
- Written in Scala but assembled into self-contained jar-bundle
- Use
jl
for blocking and monitoring and final status handling
bsub "echo foo" | jl add
bsub "sleep 10; echo bar" | jl add
bsub "exit 1" | jl add
jl wait
if [ -n "$(jl status --failed)" ]; then
echo "some jobs failed"
fi
jl status --report
- Does not change existing workflows but complements them
jl submit "sleep 10" ## add a job
jl submit "sleep 1000" ## add another which won't fit in our default queue
## wait and resubmit failed/killed jobs to another queue
jl wait
jl resub --queue long
- Decouple workflows from underlying queuing system
- Automatic resubmission using different patterns (diff queue, walltime, threads, ...)
--
--
Who cares?
jl
provides (some) abstraction from the underlying queuing system:
--
- LSF
jl submit "echo 'hello lsf'"
--
- SLURM
jl submit "echo 'hello slurm'"
--
- Any Local Computer
jl submit "echo 'hello computer'"
jl
provides abstraction for common properties but also supports scheduler specific settings- Total abstraction not feasible and also not desired
???
like custom memory options for
- Takes into account threading settings
- Interruptable
- Scheduler-Jumping: change scheduler for existing joblist
jl submit "sleep 10"
jl submit "sleep 20"
jl wait --report # ctrl-c
## fix something, and try again
jl wait --report
- Sidenote: Its implementation a lovely piece of (multi-threading) art
bash
sub-optimal for workflow development, thus use something decent
import joblist._
val jl = JobList()
jl.run(JobConfiguration("echo foo"))
jl.run(JobConfiguration("echo bar"))
// block execution until are jobs are done
jl.waitUntilDone()
// optionally we could investigate jobs that were killed by the queuing system
val killedInfo: List[RunInfo] = jl.killed.map(_.info)
// resubmit to other queue
jl.resubmit(new OtherQueue("long"))
- Also supports Java (or any other JVM language)
- Improve robustness and failure tolerance
- Fix Travis CI integration to let test suite shine
- Improve Java interoperability
- Reporting Improvements (resubmission graph, cpu consumption)
- For complete list see issue tracker
???
- Failure tolerance: Occasional 30s plus waiting time for squeue
- More general support for slurm (and not just taurus)
--
Questions, comments? Thank you for your attention!