Scheduler
pipen can send jobs to different scheduler systems to run. To specify the scheduler, use the scheduler and scheduler_opts configurations.
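As a minimal sketch of what these two configurations might look like together, the dictionary below selects the slurm scheduler and passes it two options. The option names (partition, time) are assumptions chosen because they are sbatch options; the values are hypothetical. The commented-out last line assumes that Pipen accepts configuration items as keyword arguments.

```python
# Hypothetical pipeline-level settings; only the two configuration names
# ("scheduler" and "scheduler_opts") come from this page.
pipeline_config = {
    "scheduler": "slurm",
    "scheduler_opts": {
        "partition": "short",   # assumed to map onto sbatch's --partition
        "time": "01:00:00",     # assumed to map onto sbatch's --time
    },
}

# Assumed usage (requires pipen to be installed):
# from pipen import Pipen
# pipeline = Pipen(**pipeline_config)
```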
Default supported schedulers
pipen uses xqute for scheduler backend support. The following schedulers are supported by pipen:
local
This is the default scheduler used by pipen. The jobs will be run on the local machine.
No scheduler-specific options are available.
sge
Send the jobs to run on the sge scheduler.
The scheduler_opts are the options supported by qsub.
slurm
Send the jobs to run on the slurm scheduler.
The scheduler_opts are the options supported by sbatch.
ssh
Send the jobs to run on a remote machine via ssh.
The scheduler_opts are the options supported by ssh.
See also xqute.
container
Send the jobs to run in a container (Docker/Podman/Apptainer).
The scheduler_opts will be used to construct the container command.
They include:
- image: The container image to use.
- entrypoint: The entrypoint of the container to run the wrapped job script. If not specified, the default entrypoint /bin/sh will be used.
- bin: The container command to use. If not specified, it will use docker.
- volumes: A list of volumes to mount to the container. The default volumes are:
    - workdir: The working directory of the pipeline, mounted to /mnt/disks/pipen-pipeline/workdir.
    - outdir: The output directory of the pipeline, mounted to /mnt/disks/pipen-pipeline/outdir.
- envs: A dictionary of environment variables to set in the container.
- remove: Whether to remove the container after the job is done. Default is True. Only supported by Docker and Podman.
- user: The user to run the container as. Default is the current user. Only supported by Docker and Podman.
- bin_args: Additional arguments to pass to the container command. For example, {"bin_args": ["--privileged"]} will run the container in privileged mode. Only supported by Docker and Podman.
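To make the role of these options concrete, here is an illustrative sketch of how such options could be turned into a `docker run`-style command. It is not xqute's actual implementation: the function name sketch_container_command, the "host:container" volume string format, and the exact argument order are all assumptions. Only the option names and the stated defaults (docker, /bin/sh, remove is True) come from the list above.

```python
def sketch_container_command(opts, script):
    """Illustrative only: build a `<bin> run ...` argv list from scheduler_opts."""
    cmd = [opts.get("bin", "docker"), "run"]       # bin defaults to docker
    if opts.get("remove", True):                   # remove defaults to True
        cmd.append("--rm")
    if "user" in opts:
        cmd += ["--user", str(opts["user"])]
    for key, value in opts.get("envs", {}).items():
        cmd += ["-e", f"{key}={value}"]
    for volume in opts.get("volumes", []):         # assumed "host:container" strings
        cmd += ["-v", volume]
    cmd += opts.get("bin_args", [])
    # The entrypoint (default /bin/sh) runs the wrapped job script.
    cmd += [opts["image"], opts.get("entrypoint", "/bin/sh"), script]
    return cmd


example_opts = {
    "image": "ubuntu:22.04",            # hypothetical image
    "envs": {"A": "1"},
    "bin_args": ["--privileged"],
}
print(sketch_container_command(example_opts, "/tmp/job.sh"))
```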
gbatch
Send the jobs to run using Google Cloud Batch.
The scheduler_opts will be used to construct the job configuration. This scheduler requires that the pipeline's outdir is a Google Cloud Storage path (e.g., gs://bucket/path).
The scheduler options include:
- project: Google Cloud project ID
- location: Google Cloud region or zone
- mount: GCS path to mount (e.g. gs://my-bucket:/mnt/my-bucket). You can pass a list of mounts.
- service_account: GCP service account email (e.g. test-account@example.com)
- network: GCP network (e.g. default-network)
- subnetwork: GCP subnetwork (e.g. regions/us-central1/subnetworks/default)
- no_external_ip_address: Whether to disable external IP address
- machine_type: GCP machine type (e.g. e2-standard-4)
- provisioning_model: GCP provisioning model (e.g. SPOT)
- image_uri: Container image URI (e.g. ubuntu-2004-lts)
- entrypoint: Container entrypoint (e.g. /bin/bash)
- commands: The command list to run in the container.
    There are three ways to specify the commands:
    1. If no entrypoint is specified, the final command will be [commands, wrapped_script], where the entrypoint is the wrapper script interpreter determined by JOBCMD_WRAPPER_LANG (e.g. /bin/bash), commands is the list you provided, and wrapped_script is the path to the wrapped job script.
    2. You can specify something like "-c"; the final command will then be ["-c", "wrapper_script_interpreter wrapper_script"].
    3. You can use the placeholders {lang} and {script} in the commands list, where {lang} is replaced with the interpreter (e.g. /bin/bash) and {script} is replaced with the path to the wrapped job script. For example, if you specify ["{lang} {script}"], the final command will be ["wrapper_interpreter wrapper_script"].
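The placeholder form (the third way above) can be sketched as a plain string substitution. This is an illustration of the described behavior, not the scheduler's actual code; the function name fill_placeholders and the example script path are hypothetical.

```python
def fill_placeholders(commands, lang, script):
    """Replace {lang} and {script} in every element of the commands list."""
    return [
        item.replace("{lang}", lang).replace("{script}", script)
        for item in commands
    ]


# Hypothetical interpreter and wrapped-script path.
print(fill_placeholders(["{lang} {script}"], "/bin/bash", "/path/to/job.wrapped.sh"))
```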
Additional keyword arguments can be used for job configuration (e.g. taskGroups). See the Google Cloud Batch documentation for details.
By default, the pipeline's workdir is mounted to /mnt/disks/pipen-pipeline/workdir and the outdir is mounted to /mnt/disks/pipen-pipeline/outdir on the VM.
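A set of gbatch options might look like the following sketch. The option names and most example values are taken from the list above; the project ID, the location value, and the output path are hypothetical. The small helper is_gcs_path is also hypothetical and only illustrates the requirement that the outdir be a Google Cloud Storage path.

```python
# Hypothetical scheduler_opts for the gbatch scheduler.
gbatch_opts = {
    "project": "my-project",                     # hypothetical project ID
    "location": "us-central1",                   # hypothetical region
    "mount": ["gs://my-bucket:/mnt/my-bucket"],  # a list of mounts
    "machine_type": "e2-standard-4",
    "provisioning_model": "SPOT",
    "no_external_ip_address": True,
}

outdir = "gs://my-bucket/pipen-out"              # hypothetical output path


def is_gcs_path(path):
    """Return True if the path uses the gs:// scheme."""
    return path.startswith("gs://")


if not is_gcs_path(outdir):
    raise ValueError("The gbatch scheduler requires a gs:// outdir")
```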
Writing your own scheduler plugin
To write a scheduler plugin, you need to subclass both xqute.schedulers.scheduler.Scheduler and pipen.scheduler.SchedulerPostInit.
For examples of scheduler plugins, see local_scheduler, sge_scheduler, slurm_scheduler, ssh_scheduler, and gbatch_scheduler, as well as pipen.scheduler.
A scheduler class can be passed directly to the scheduler configuration. You can also register it with entry points:
For setup.py, you will need:
from setuptools import setup

setup(
    # ...
    # Register the pipen_mysched module under the name "mysched"
    entry_points={"pipen_sched": ["mysched = pipen_mysched"]},
    # ...
)
For pyproject.toml:
[tool.poetry.plugins.pipen_sched]
mysched = "pipen_mysched"
Once the package is installed, you can switch the scheduler to mysched by setting scheduler="mysched".
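To check that a plugin is registered after installation, you can list the entry points in the pipen_sched group with the standard library. This is a sketch using importlib.metadata; how pipen itself loads the plugins may differ. The function name registered_schedulers is hypothetical.

```python
from importlib.metadata import entry_points


def registered_schedulers(group="pipen_sched"):
    """Return the sorted names of entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: a dict mapping group names to entry points
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Should include "mysched" once the plugin package is installed.
print(registered_schedulers())
```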