# Defining a process
A pipeline consists of multiple processes, each of which can own multiple jobs that run in parallel.
## Defining/Creating processes

`pipen` has two (preferred) ways to define processes:
### Subclassing `pipen.Proc`

```python
from pipen import Proc

class MyProcess(Proc):
    ...  # process configurations
```
The configurations are specified as class variables of the class.
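For example, a minimal start process might look like the following sketch (the process name, keys, file path, and script here are illustrative, not required names):

```python
from pipen import Proc

class SortFile(Proc):
    """Sort an input file (hypothetical example)"""
    input = "infile:file"                   # input key and type
    input_data = ["/tmp/data.txt"]          # data for a start process
    output = "outfile:file:sorted.txt"      # key:type:template
    script = "sort {{in.infile}} > {{out.outfile}}"
```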
### Using the class method `Proc.from_proc()`
If you want to reuse a defined process, you can either subclass it:
```python
class MyOtherProcess(MyProcess):
    ...  # configurations inherited from MyProcess
```
Or use `Proc.from_proc()`:

```python
# You can also pass the configurations you want to override
MyOtherProcess = Proc.from_proc(MyProcess, ...)
```
Note that `Proc.from_proc()` cannot override all configurations/class variables, because we assume that some configurations are shared when you "copy" from another process. These shared configurations are:
- Template engine and its options (`template` and `template_opts`)
- Script template (`script`)
- Input keys (`input`)
- Language/interpreter of the script (`lang`)
- Output keys (`output`)
All other configurations can be passed to `Proc.from_proc()` to override the old ones.
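For instance, a sketch overriding a couple of job-level settings (the values here are illustrative):

```python
# Reuses MyProcess's template, script, input, output and lang,
# but runs more jobs at once and disables cache detection:
MyFasterProcess = Proc.from_proc(MyProcess, forks=4, cache=False)
```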
For all configurations/class variables of a process, see the next section.
You don't need to specify the name of the new process; the variable name on the left-hand side will be used if the `name` argument is not provided to `Proc.from_proc()`. For example:
```python
NewProc = Proc.from_proc(OldProc)
# NewProc.name == "NewProc"
```
But you can assign a different name to the new process if you want. For example:
```python
NewProc = Proc.from_proc(OldProc, name="NewProc2")
# NewProc.name == "NewProc2"
```
### How about instantiating `Proc` directly?
You are not allowed to do that. `Proc` is an abstract class, which is designed to be subclassed.
### How about instantiating a `Proc` subclass?
Nope. In `pipen`, a process is a `Proc` subclass itself. The instances of the subclasses are used internally, and they are singletons (see the sketch after this list). In most cases, you don't need to use the instances, unless you want to access their computed properties, including:
- `pipeline`: The pipeline, which is a `Pipen` object
- `pbar`: The progress bar for the process, indicating the job status of this process
- `jobs`: The jobs of this process
- `xqute`: The `Xqute` object to manage the job running
- `template`: The template engine (a `pipen.template.Template` object)
- `template_opts`: The template options (overwritten from config by the `template_opts` class variable)
- `input`: The sanitized input keys and types
- `output`: The compiled output template, ready for the jobs to render with their own data
- `scheduler`: The scheduler object (inferred from the name or scheduler object given by the `scheduler` class variable)
- `script`: The compiled script template, ready for the jobs to render with their own data
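As a sketch of the singleton behavior described above (constructing the instance yourself is rarely needed, and most of these properties are only populated once the pipeline initializes the process):

```python
# Assuming MyProcess is the Proc subclass defined earlier.
# Constructing a Proc subclass returns the same (singleton) instance:
p1 = MyProcess()
p2 = MyProcess()
assert p1 is p2
```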
### How about copying/deep-copying a `Proc` subclass?
Nope. Copying or deep-copying a `Proc` subclass won't trigger `__init_subclass__()`, which consolidates the process name from the class name (if not specified) and connects the required processes to the current one. A copy or deep copy keeps all properties, but the relationships between the current process and its dependency processes are disconnected, even with a separate assignment such as `MyProcess.requires = ...`.
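A sketch of this caveat, with `Copied`, `Renamed`, and `SomeOtherProcess` as hypothetical names for `Proc` subclasses:

```python
import copy
from pipen import Proc

# A deep copy bypasses __init_subclass__(): the name is not derived
# from the class name and dependency wiring is not established, even
# if you assign `requires` afterwards:
Copied = copy.deepcopy(MyProcess)
Copied.requires = SomeOtherProcess  # NOT connected into the dependency graph

# Prefer subclassing or Proc.from_proc(), both of which go through
# __init_subclass__():
Renamed = Proc.from_proc(MyProcess, name="Renamed")
```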
## Process configurations and `Proc` class variables

The configurations of a process are specified as class variables of the subclasses of `Proc`.
| Name | Meaning | Can be overwritten by `Proc.from_proc()` |
| --- | --- | --- |
| `name` | The name of the process. Will use the class name by default. | Yes |
| `desc` | The description of the process. Will use the summary from the docstring by default. | Yes |
| `envs` | The env variables that are job-independent, useful for common options across jobs. | Yes, and the old ones will be inherited |
| `cache` | Should we detect whether the jobs are cached? | Yes |
| `dirsig` | When checking the signature for caching, the depth we should walk through the content of the directory. This can be time-consuming if the directory and the depth are big. | Yes |
| `export` | When `True`, the results will be exported to `<pipeline.outdir>`. Defaults to `None`, meaning only end processes will export. You can set it to `True`/`False` to enable or disable exporting for a process. | Yes |
| `error_strategy` | How to deal with job errors: `retry`, `ignore`, or `halt`. | Yes |
| `num_retries` | How many times to retry the jobs once an error occurs. | Yes |
| `template` | The template engine to use. | No |
| `template_opts` | Options to initialize the template engine. | No |
| `forks` | How many jobs to run simultaneously. | Yes |
| `input` | The keys and types for the input channel. | No |
| `input_data` | The input data (will be computed for dependent processes). | Yes |
| `lang` | The language/interpreter for the script to run. | No |
| `order` | The execution order for processes at the same dependency level. | Yes |
| `output` | The output keys for the output channel. | No |
| `plugin_opts` | Options for process-level plugins. | Yes |
| `requires` | The dependency processes. | Yes |
| `scheduler` | The scheduler to run the jobs. | Yes |
| `scheduler_opts` | The options for the scheduler. | Yes |
| `script` | The script template for the process. | No |
| `submission_batch` | How many jobs to submit simultaneously. | Yes |
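As a closing sketch, here is a process that combines several of these configurations; the process name, the `align` command, and the values are illustrative, not pipen APIs:

```python
from pipen import Proc

class Align(Proc):
    """Hypothetical process combining several configurations"""
    input = "fastq:file"               # input key and type
    output = "bam:file:aligned.bam"    # key:type:template
    lang = "bash"                      # interpreter for the script
    forks = 4                          # run up to 4 jobs simultaneously
    cache = True                       # skip jobs detected as cached
    error_strategy = "retry"           # retry failed jobs ...
    num_retries = 2                    # ... up to 2 times
    envs = {"threads": 8}              # job-independent env variables
    script = "align --threads {{envs.threads}} {{in.fastq}} > {{out.bam}}"
```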