Running the pipeline¶
Run the pipeline locally via CLI¶
Once the pipeline is installed, you can run it via CLI:
$ immunopipe --help
You can specify the options directly in the CLI. For example:
$ immunopipe --forks 4 --TopExpressingGenes.envs.n 100 ...
It's recommended to use a configuration file to specify all the options. For example:
$ immunopipe @config.toml
You can also use both ways together. The options specified in the CLI will override the ones in the configuration file.
$ immunopipe @config.toml --forks 4 --TopExpressingGenes.envs.n 100 ...
For configuration items, see configurations for more details.
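For instance, the CLI options shown above could instead be written in config.toml (a minimal sketch; the option values are illustrative):
forks = 4

[TopExpressingGenes.envs]
n = 100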
Tip
If you want to run the pipeline on a cluster, see How to run the pipeline on a cluster? for more details.
Attention
For settings that determine the routes of the pipeline, you should define them in the configuration file. For example, if you want to perform supervised clustering, you need to add a [SeuratMap2Ref] section to the configuration file with the necessary parameters. If you just pass the section as a command line argument (--SeuratMap2Ref), it will not trigger the corresponding processes.
To indicate whether the scTCR-/scBCR-seq data is available, you also need to specify the sample information file in the configuration file ([SampleInfo.in.infile]). Passing the sample information file as a command line argument (--SampleInfo.in.infile) does not trigger the corresponding processes.
See Routes of the pipeline for more details.
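As a sketch, such route-determining settings would be placed in the configuration file like this (the file names and the envs.ref parameter here are hypothetical placeholders, not verified defaults; see the process documentation for the actual parameters):
[SampleInfo.in]
infile = [ "samples.txt" ]  # hypothetical sample information file

[SeuratMap2Ref.envs]
ref = "ref.h5seurat"  # hypothetical reference for supervised clustering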
Run the pipeline via pipen-board¶
pipen-board is a web-based dashboard for pipen. It provides a user-friendly interface to configure and run the pipeline, as well as a way to monitor its running progress. pipen-board is installed by default with immunopipe. You can run it via CLI:
$ pipen board immunopipe:Immunopipe
*
* __ __ __. . __ __ + __ __
* |__)||__)|_ |\ | __ |__)/ \ /\ |__)| \
* | || |__| \| |__)\__//--\| \ |__/
*
* version: 0.11.1
*
* Configure and run pipen pipelines from the web
*
* Serving Quart app 'pipen_board'
* Environment: development
* Debug mode: True
* Running on http://0.0.0.0:18521 (CTRL + C to quit)
[07/31/23 21:23:27] INFO Running on http://0.0.0.0:18521 (CTRL + C to quit)
Then you can open the dashboard in your browser at http://localhost:18521.
In the Configuration tab, you can configure the pipeline and the processes. Then you can use the Generate Configuration button to generate the configuration file, and use the generated configuration file to run the pipeline via CLI.
If you want to run the pipeline via pipen-board, you need an additional configuration file to tell pipen-board how to run the pipeline:
$ pipen board immunopipe:Immunopipe -a gh:pwwang/immunopipe/board.toml@dev
The additional file is available at immunopipe's GitHub repo. You can also download it and modify it to fit your needs, but in most cases, you don't have to. With the additional file, you can find four running options, LOCAL, DOCKER, SINGULARITY and APPTAINER, on the left side of the Configuration tab. You can choose one of them to run the pipeline.
Take LOCAL as an example. When clicking the Run the command button, a configuration file specified by configfile is saved and used to run the pipeline via CLI. Then the Previous Run tab is replaced by the Running tab to track the progress of the pipeline.
Run the pipeline using the docker image¶
Choose the right tag of the docker image¶
The docker image is tagged with the version of immunopipe, together with master and dev. They are listed here: https://hub.docker.com/repository/docker/justold/immunopipe/tags.
dev is the latest development version of immunopipe and may have unstable features. If you want to use a more stable version, please try master, or a specific semantic version.
Any tags with a -full suffix are the full version of the image, which contains all the dependencies of the pipeline, especially keras and tensorflow, required by the embedding procedure of TESSA. Those packages take up quite a lot of space in the image. If you don't need the TESSA process, you can use the minimal version of the image.
Any tags without the -full suffix are the minimal version of the image. The TESSA process is NOT supported in the minimal version, and keras and tensorflow are NOT included.
Please also keep in mind that there is no GPU support with either type of image.
You can pull the images in advance using docker, singularity or apptainer. See the help options of docker pull, singularity pull or apptainer pull for more details.
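For example, to pull an image in advance (with <tag> being one of the tags described above, e.g. master, dev, or a semantic version, optionally with the -full suffix):
$ docker pull justold/immunopipe:<tag>
$ singularity pull docker://justold/immunopipe:<tag>
$ apptainer pull docker://justold/immunopipe:<tag>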
You can also specify the tag when running the pipeline. See the following sections for more details.
To run the pipeline using the docker image with docker, you need to mount the current working directory to the /workdir directory in the container. You also need to specify the configuration file via the @<configfile> option. For example:
$ docker run \
--rm -w /workdir -v .:/workdir -v /tmp:/tmp \
justold/immunopipe:<tag> \
@config.toml
Similarly, if you are using singularity, you need to mount the current working directory to the /workdir directory in the container and specify the configuration file via the @<configfile> option. For example:
$ singularity run \
--pwd /workdir -B .:/workdir,/tmp -c -e --writable-tmpfs \
docker://justold/immunopipe:<tag> \
@config.toml
The same applies to apptainer: mount the current working directory to the /workdir directory in the container and specify the configuration file via the @<configfile> option. For example:
$ apptainer run \
--pwd /workdir -B .:/workdir,/tmp -c -e --unsquash --writable-tmpfs \
docker://justold/immunopipe:<tag> \
@config.toml
Run the pipeline via pipen-board using the docker image¶
You can also run the pipeline via pipen-board using the docker image with docker:
$ docker run -p 18521:18521 \
--rm -w /workdir -v .:/workdir -v /tmp:/tmp \
justold/immunopipe:<tag> board \
immunopipe:Immunopipe \
-a /immunopipe/board.toml
Then, under the running options, choose LOCAL to run the pipeline.
Note
You should use LOCAL instead of DOCKER to run the pipeline. Otherwise, the pipeline will be run in a docker container inside the docker container.
You can also run the pipeline via pipen-board using the docker image with singularity:
$ singularity run \
--pwd /workdir -B .:/workdir,/tmp -c -e --writable-tmpfs \
docker://justold/immunopipe:<tag> board \
immunopipe:Immunopipe \
-a /immunopipe/board.toml
Then, under the running options, choose LOCAL to run the pipeline.
Similarly, you should use LOCAL instead of SINGULARITY to run the pipeline. Otherwise, the pipeline will be run in a docker container inside the singularity container.
You can also run the pipeline via pipen-board using the docker image with apptainer:
$ apptainer run \
--pwd /workdir -B .:/workdir,/tmp -c -e --unsquash --writable-tmpfs \
docker://justold/immunopipe:<tag> board \
immunopipe:Immunopipe \
-a /immunopipe/board.toml
Also similarly, you should use LOCAL instead of APPTAINER to run the pipeline. Otherwise, the pipeline will be run in a docker container inside the apptainer container.
When the command is running, you will see a startup message similar to the one shown above. Then you can open the dashboard in your browser at http://localhost:18521.
Run the pipeline using Google Cloud Batch Jobs¶
There are two recommended ways to run the pipeline using Google Cloud Batch Jobs:
Use the gbatch scheduler of pipen¶
When using the gbatch scheduler, the metadata of the processes (job status, job output, etc.) are managed locally. Even though the files are on the cloud, they are manipulated locally (using the API provided by cloudpathlib). The processes are submitted to Google Cloud Batch Jobs using gcloud batch jobs submit; they run on Google Cloud Compute Engine VMs and need to be submitted one after another.
See the documentation of cloud support of pipen.
Use pipen-cli-gbatch¶
You need to install the dependencies via pip install -U immunopipe[cli-gbatch] to use this feature.
immunopipe has integrated with pipen-cli-gbatch to provide a seamless way to run the pipeline using Google Cloud Batch Jobs. The entire pipeline is wrapped (as if it were running locally) and submitted as a single job to Google Cloud Batch Jobs. You just need to run the following:
> immunopipe gbatch @config.toml
To provide the scheduler options for running the wrapped job (daemon) on Google Cloud Batch Jobs, you can specify them via --cli-gbatch.machine-type, --cli-gbatch.provisioning-model, --cli-gbatch.disk-size-gb, etc. See the help message of immunopipe gbatch --help for more details.
> immunopipe gbatch --help
# (pipeline options omitted)
Options For Pipen-cli-gbatch (extra Options):
--cli-gbatch.profile PROFILE
Use the `scheduler_opts` as the Scheduler Options of a given profile from pipen configuration files,
including ~/.pipen.toml and ./pipen.toml.
Note that if not provided, nothing will be loaded from the configuration files.
--cli-gbatch.loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,debug,info,warning,error,critical}
Set the logging level for the daemon process. [default: INFO]
--cli-gbatch.error-strategy {retry,halt}
The strategy when there is error happened [default: halt]
--cli-gbatch.num-retries NUM_RETRIES
The number of retries when there is error happened. Only valid when --error-strategy is 'retry'. [default: 0]
--cli-gbatch.prescript PRESCRIPT
The prescript to run before the main command.
--cli-gbatch.postscript POSTSCRIPT
The postscript to run after the main command.
--cli-gbatch.recheck-interval RECHECK_INTERVAL
The interval to recheck the job status, each takes about 0.1 seconds. [default: 600]
--cli-gbatch.project PROJECT
The Google Cloud project to run the job.
--cli-gbatch.location LOCATION
The location to run the job.
--cli-gbatch.mount MOUNT
The list of mounts to mount to the VM, each in the format of SOURCE:TARGET, where SOURCE must be either a
Google Storage Bucket path (gs://...).
You can also use named mounts like `INDIR=gs://my-bucket/inputs` and the directory will be mounted to
`/mnt/disks/INDIR` in the VM;
then you can use environment variable `$INDIR` in the command/script to refer to the mounted path.
You can also mount a file like `INFILE=gs://my-bucket/inputs/file.txt`. The parent directory will be mounted
to `/mnt/disks/INFILE/inputs` in the VM,
and the file will be available at `/mnt/disks/INFILE/inputs/file.txt` in the VM. `$INFILE` can also be used
in the command/script to refer to the mounted path.
[default: []]
--cli-gbatch.service-account SERVICE_ACCOUNT
The service account to run the job.
--cli-gbatch.network NETWORK
The network to run the job.
--cli-gbatch.subnetwork SUBNETWORK
The subnetwork to run the job.
--cli-gbatch.no-external-ip-address
Whether to disable external IP address for the VM.
--cli-gbatch.machine-type MACHINE_TYPE
The machine type of the VM.
--cli-gbatch.provisioning-model {STANDARD,SPOT}
The provisioning model of the VM.
--cli-gbatch.image-uri IMAGE_URI
The custom image URI of the VM.
--cli-gbatch.runnables RUNNABLES
The JSON string of extra settings of runnables add to the job.json.
Refer to https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#Runnable for details.
You can have an extra key 'order' for each runnable, where negative values mean to run before the main
command,
and positive values mean to run after the main command.
--cli-gbatch.allocationPolicy ALLOCATIONPOLICY
The JSON string of extra settings of allocationPolicy add to the job.json. Refer to
https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#AllocationPolicy for details.
[default: {}]
--cli-gbatch.taskGroups TASKGROUPS
The JSON string of extra settings of taskGroups add to the job.json. Refer to
https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#TaskGroup for details.
[default: []]
--cli-gbatch.labels LABELS
The strings of labels to add to the job (key=value). Refer to
https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#Job.FIELDS.labels for details.
[default: []]
--cli-gbatch.gcloud GCLOUD
The path to the gcloud command. [default: gcloud]
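For example, a run that sets some of these scheduler options directly on the command line might look like this (the values are illustrative):
> immunopipe gbatch @config.toml \
    --cli-gbatch.project my-project \
    --cli-gbatch.location us-central1 \
    --cli-gbatch.machine-type n2d-standard-4 \
    --cli-gbatch.provisioning-model SPOT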
When --cli-gbatch.profile is provided, the default scheduler options will be loaded from ~/.pipen.toml and ./pipen.toml. For example, you can add the following to ~/.pipen.toml:
[gbatch.scheduler_opts]
project = "my-project"
location = "us-central1"
Then you can run the pipeline using the following command to pick up the default project and location:
> immunopipe gbatch @config.toml --cli-gbatch.profile gbatch
You can also specify these options directly on the command line or under a cli-gbatch section in the configuration file. The options specified on the command line override the ones in the configuration file, which in turn override the ones in the profile.
For example, you may have the following in config.toml:
name = "Immunopipe"
workdir = "gs://my-bucket/immunopipe_workdir"
outdir = "gs://my-bucket/immunopipe_outdir"
[cli-gbatch]
project = "my-project"
location = "us-central1"
machine-type = "n2d-standard-4"
provisioning-model = "SPOT"
...
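Given that precedence, a hypothetical override on the command line would replace the machine type set in config.toml:
> immunopipe gbatch @config.toml --cli-gbatch.machine-type n2d-standard-8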
There are other actions you can do with gbatch:
- immunopipe gbatch @config.toml --nowait: submit the job and exit without waiting for it to finish.
- immunopipe gbatch @config.toml --view-logs: view the logs of the detached job.
- immunopipe gbatch @config.toml --version: show the versions of immunopipe, pipen-cli-gbatch and pipen.
Here is a diagram showing the difference between using the gbatch scheduler and using pipen-cli-gbatch:
Use Google Cloud Batch Jobs directly (not recommended)¶
You can also run the pipeline using Google Cloud Batch Jobs directly. In this case, you need to create a job definition file and submit the job using gcloud batch jobs submit. The job definition file should specify the container image, the command to run the pipeline, and the resources required for the job.
Here is an example of a job definition file (job.json):
{
"allocationPolicy": {
"serviceAccount": {
"email": "..."
},
"network": "...",
"instances": [
{
"policy": {
"machineType": "n2d-standard-4",
"provisioningModel": "SPOT"
}
}
]
},
"taskGroups": [
{
"taskSpec": {
"runnables": [
{
"container": {
"image_uri": "docker.io/justold/immunopipe:dev",
"entrypoint": "/usr/local/bin/_entrypoint.sh",
"commands": [
"immunopipe",
"@/mnt/disks/workdir/Immunopipe.config.toml"
]
}
}
],
"volumes": [
{
"gcs": {
"remotePath": "<bucket>/path/to/workdir"
},
"mountPath": "/mnt/disks/workdir"
}
]
}
}
],
"logsPolicy": {
"destination": "CLOUD_LOGGING"
},
"labels": "..."
}
Then you can submit the job using the following command:
$ gcloud batch jobs submit <job-name> --location <location> --project <project-id> --config job.json
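After submission, you can check the status of the job with gcloud, for example:
$ gcloud batch jobs describe <job-name> --location <location> --project <project-id>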