module

biopipen.ns.rnaseq

RNA-seq data analysis

Classes
class

biopipen.ns.rnaseq.UnitConversion(*args, **kwds)Proc

Bases
biopipen.core.proc.Proc pipen.proc.Proc

Convert expression value units back and forth

See https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ and https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#fpkm.

Following converstions are supported -

  • * count -> cpm, fpkm/rpkm, fpkmuq/rpkmrq, tpm, tmm
  • * fpkm/rpkm -> count, tpm, cpm
  • * tpm -> count, fpkm/rpkm, cpm
  • * cpm -> count, fpkm/rpkm, tpm
NOTE that during some conversions, sum(counts/effLen) is approximated tosum(counts)/sum(effLen) * length(effLen))

You can also use this process to just transform the expression values, e.g., take log2 of the expression values. In this case, you can set inunit and outunit to count and log2(count + 1) respectively.

Attributes
  • cache Should we detect whether the jobs are cached?
  • desc The description of the process. Will use the summary fromthe docstring by default.
  • dirsig When checking the signature for caching, whether should we walkthrough the content of the directory? This is sometimes time-consuming if the directory is big.
  • envs The arguments that are job-independent, useful for common optionsacross jobs.
  • envs_depth How deep to update the envs when subclassed.
  • error_strategy How to deal with the errors
    • - retry, ignore, halt
    • - halt to halt the whole pipeline, no submitting new jobs
    • - terminate to just terminate the job itself
  • export When True, the results will be exported to <pipeline.outdir>Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes
  • forks How many jobs to run simultaneously?
  • input The keys for the input channel
  • input_data The input data (will be computed for dependent processes)
  • lang The language for the script to run. Should be the path to theinterpreter if lang is not in $PATH.
  • name The name of the process. Will use the class name by default.
  • nexts Computed from requires to build the process relationships
  • num_retries How many times to retry to jobs once error occurs
  • order The execution order for this process. The bigger the numberis, the later the process will be executed. Default: 0. Note that the dependent processes will always be executed first. This doesn't work for start processes either, whose orders are determined by Pipen.set_starts()
  • output The output keys for the output channel(the data will be computed)
  • output_data The output data (to pass to the next processes)
  • plugin_opts Options for process-level plugins
  • requires The dependency processes
  • scheduler The scheduler to run the jobs
  • scheduler_opts The options for the scheduler
  • script The script template for the process
  • submission_batch How many jobs to be submited simultaneously
  • template Define the template engine to use.This could be either a template engine or a dict with key engine indicating the template engine and the rest the arguments passed to the constructor of the pipen.template.Template object. The template engine could be either the name of the engine, currently jinja2 and liquidpy are supported, or a subclass of pipen.template.Template. You can subclass pipen.template.Template to use your own template engine.
Input
  • infile Input file containing expression valuesThe file should be a matrix with rows representing genes and columns representing samples. It could be an RDS file containing a data frame or a matrix, or a text file containing a matrix with tab as the delimiter. The text file can be gzipped.
Output
  • outfile Output file containing the converted expression valuesThe file will be a matrix with rows representing genes and columns representing samples.
Envs
  • inunit The input unit of the expression values.You can also use an expression to indicate the input unit, e.g., log2(counts + 1). The expression should be like A * fn(B*X + C) + D, where A, B, C and D are constants, fn is a function, and X is the input unit. Currently only expr, sqrt, log2, log10 and log are supported as functions. Supported input units are:
    • * counts/count/rawcounts/rawcount: raw counts.
    • * cpm: counts per million.
    • * fpkm/rpkm: fragments per kilobase of transcript per million.
    • * fpkmuq/rpkmuq: upper quartile normalized FPKM/RPKM.
    • * tpm: transcripts per million.
    • * tmm: trimmed mean of M-values.
  • meanfl (type=auto) A file containing the mean fragment length for each sampleby rows (samples as rowname), without header. Or a fixed universal estimated number (1 used by TCGA).
  • nreads (type=auto) The estimatied total number of reads for each sample.or you can pass a file with the number for each sample by rows (samples as rowname), without header. When converting fpkm/rpkm -> count, it should be total reads of that sample. When converting cpm -> count: it should be total reads of that sample. When converting tpm -> count: it should be total reads of that sample. When converting tpm -> cpm: it should be total reads of that sample. When converting tpm -> fpkm/rpkm: it should be sum(fpkm) of that sample. It is not used when converting count -> cpm, fpkm/rpkm, tpm.
  • outunit The output unit of the expression values. An expression can also beused for transformation (e.g. log2(tpm + 1)). If inunit is count, then this means we are converting raw counts to tpm, and transforming it to log2(tpm + 1) as the output. Any expression supported by R can be used. Same units as inunit are supported.
  • refexon Path to the reference exon gff file.
Classes
Methods
  • __init_subclass__() Do the requirements inferring since we need them to build up theprocess relationship </>
  • from_proc(proc, name, desc, envs, envs_depth, cache, export, error_strategy, num_retries, forks, input_data, order, plugin_opts, requires, scheduler, scheduler_opts, submission_batch) (Type) Create a subclass of Proc using another Proc subclass or Proc itself</>
  • gc() GC process for the process to save memory after it's done</>
  • init() Init all other properties and jobs</>
  • log(level, msg, *args, logger) Log message for the process</>
  • run() Run the process</>
class

pipen.proc.ProcMeta(name, bases, namespace, **kwargs)

Bases
abc.ABCMeta

Meta class for Proc

Methods
  • __call__(cls, *args, **kwds) (Proc) Make sure Proc subclasses are singletons</>
  • __instancecheck__(cls, instance) Override for isinstance(instance, cls).</>
  • __repr__(cls) (str) Representation for the Proc subclasses</>
  • __subclasscheck__(cls, subclass) Override for issubclass(subclass, cls).</>
  • register(cls, subclass) Register a virtual subclass of an ABC.</>
staticmethod
register(cls, subclass)

Register a virtual subclass of an ABC.

Returns the subclass, to allow usage as a class decorator.

staticmethod
__instancecheck__(cls, instance)

Override for isinstance(instance, cls).

staticmethod
__subclasscheck__(cls, subclass)

Override for issubclass(subclass, cls).

staticmethod
__repr__(cls) → str

Representation for the Proc subclasses

staticmethod
__call__(cls, *args, **kwds)

Make sure Proc subclasses are singletons

Parameters
  • *args (Any) and
  • **kwds (Any) Arguments for the constructor
Returns (Proc)

The Proc instance

classmethod

from_proc(proc, name=None, desc=None, envs=None, envs_depth=None, cache=None, export=None, error_strategy=None, num_retries=None, forks=None, input_data=None, order=None, plugin_opts=None, requires=None, scheduler=None, scheduler_opts=None, submission_batch=None)

Create a subclass of Proc using another Proc subclass or Proc itself

Parameters
  • proc (Type) The Proc subclass
  • name (str, optional) The new name of the process
  • desc (str, optional) The new description of the process
  • envs (Mapping, optional) The arguments of the process, will overwrite parent oneThe items that are specified will be inherited
  • envs_depth (int, optional) How deep to update the envs when subclassed.
  • cache (bool, optional) Whether we should check the cache for the jobs
  • export (bool, optional) When True, the results will be exported to<pipeline.outdir> Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes
  • error_strategy (str, optional) How to deal with the errors
    • - retry, ignore, halt
    • - halt to halt the whole pipeline, no submitting new jobs
    • - terminate to just terminate the job itself
  • num_retries (int, optional) How many times to retry to jobs once error occurs
  • forks (int, optional) New forks for the new process
  • input_data (Any, optional) The input data for the process. Only when this processis a start process
  • order (int, optional) The order to execute the new process
  • plugin_opts (Mapping, optional) The new plugin options, unspecified items will beinherited.
  • requires (Sequence, optional) The required processes for the new process
  • scheduler (str, optional) The new shedular to run the new process
  • scheduler_opts (Mapping, optional) The new scheduler options, unspecified items willbe inherited.
  • submission_batch (int, optional) How many jobs to be submited simultaneously
Returns (Type)

The new process class

classmethod

__init_subclass__()

Do the requirements inferring since we need them to build up theprocess relationship

method

init()

Init all other properties and jobs

method

gc()

GC process for the process to save memory after it's done

method

log(level, msg, *args, logger=<LoggerAdapter pipen.core (WARNING)>)

Log message for the process

Parameters
  • level (int | str) The log level of the record
  • msg (str) The message to log
  • *args The arguments to format the message
  • logger (LoggerAdapter, optional) The logging logger
method

run()

Run the process

class

biopipen.ns.rnaseq.Simulation(*args, **kwds)Proc

Bases
biopipen.core.proc.Proc pipen.proc.Proc

Simulate RNA-seq data using ESCO/RUVcorr package

Attributes
  • cache Should we detect whether the jobs are cached?
  • desc The description of the process. Will use the summary fromthe docstring by default.
  • dirsig When checking the signature for caching, whether should we walkthrough the content of the directory? This is sometimes time-consuming if the directory is big.
  • envs The arguments that are job-independent, useful for common optionsacross jobs.
  • envs_depth How deep to update the envs when subclassed.
  • error_strategy How to deal with the errors
    • - retry, ignore, halt
    • - halt to halt the whole pipeline, no submitting new jobs
    • - terminate to just terminate the job itself
  • export When True, the results will be exported to <pipeline.outdir>Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes
  • forks How many jobs to run simultaneously?
  • input The keys for the input channel
  • input_data The input data (will be computed for dependent processes)
  • lang The language for the script to run. Should be the path to theinterpreter if lang is not in $PATH.
  • name The name of the process. Will use the class name by default.
  • nexts Computed from requires to build the process relationships
  • num_retries How many times to retry to jobs once error occurs
  • order The execution order for this process. The bigger the numberis, the later the process will be executed. Default: 0. Note that the dependent processes will always be executed first. This doesn't work for start processes either, whose orders are determined by Pipen.set_starts()
  • output The output keys for the output channel(the data will be computed)
  • output_data The output data (to pass to the next processes)
  • plugin_opts Options for process-level plugins
  • requires The dependency processes
  • scheduler The scheduler to run the jobs
  • scheduler_opts The options for the scheduler
  • script The script template for the process
  • submission_batch How many jobs to be submited simultaneously
  • template Define the template engine to use.This could be either a template engine or a dict with key engine indicating the template engine and the rest the arguments passed to the constructor of the pipen.template.Template object. The template engine could be either the name of the engine, currently jinja2 and liquidpy are supported, or a subclass of pipen.template.Template. You can subclass pipen.template.Template to use your own template engine.
Input
  • ngenes Number of genes to simulate
  • nsamples Number of samples to simulateIf you want to force the process to re-simulate for the same ngenes and nsamples, you can set a different value for envs.seed. Note that the samples will be shown as cells in the output (since the simulation is designed for single-cell RNA-seq data).
Output
  • outdir Output directory containing the simulated datasim.rds and True.rds will be generated. For ESCO, sim.rds contains the simulated data in a SingleCellExperiment object, and True.rds contains the matrix of true counts. For RUVcorr, sim.rds contains the simulated data in list with Truth, A matrix containing the values of Xβ; Y A matrix containing the values in Y; Noise A matrix containing the values in ; Sigma A matrix containing the true gene-gene correlations, as defined by Xβ; and Info A matrix containing some of the general information about the simulation. For all matrices, rows represent genes and columns represent samples.
  • outfile Output file containing the simulated data with rows representinggenes and columns representing samples.
Envs
  • esco_args (ns) Additional arguments to pass to the simulation function.
    • - save (choice): Which type of data to save to out.outfile.
        - simulated-truth: saves the simulated true counts.
        - zero-inflated: saves the zero-inflated counts.
        - down-sampled: saves the down-sampled counts.
    • - type (choice): Which type of heterogenounity to use.
        - single: produces a single population.
        - group: produces distinct groups.
        - tree: produces distinct groups but admits a tree structure.
        - traj: produces distinct groups but admits a smooth trajectory
        structure.
    • - : See https://rdrr.io/github/JINJINT/ESCO/man/escoParams.html.
  • index_start (type=int) The index to start from when naming the samples.Affects the sample names in out.outfile only.
  • ncores (type=int) Number of cores to use.
  • ruvcorr_args (ns) Additional arguments to pass to the simulationfunction.
  • seed (type=int) Random seed.If not set, seed will not be set.
  • tool (choice) Which tool to use for simulation.
    • - ESCO: uses the ESCO package.
    • - RUVcorr: uses the RUVcorr package.
  • transpose_output (flag) If set, the output will be transposed.
Classes
Methods
  • __init_subclass__() Do the requirements inferring since we need them to build up theprocess relationship </>
  • from_proc(proc, name, desc, envs, envs_depth, cache, export, error_strategy, num_retries, forks, input_data, order, plugin_opts, requires, scheduler, scheduler_opts, submission_batch) (Type) Create a subclass of Proc using another Proc subclass or Proc itself</>
  • gc() GC process for the process to save memory after it's done</>
  • init() Init all other properties and jobs</>
  • log(level, msg, *args, logger) Log message for the process</>
  • run() Run the process</>
class

pipen.proc.ProcMeta(name, bases, namespace, **kwargs)

Bases
abc.ABCMeta

Meta class for Proc

Methods
  • __call__(cls, *args, **kwds) (Proc) Make sure Proc subclasses are singletons</>
  • __instancecheck__(cls, instance) Override for isinstance(instance, cls).</>
  • __repr__(cls) (str) Representation for the Proc subclasses</>
  • __subclasscheck__(cls, subclass) Override for issubclass(subclass, cls).</>
  • register(cls, subclass) Register a virtual subclass of an ABC.</>
staticmethod
register(cls, subclass)

Register a virtual subclass of an ABC.

Returns the subclass, to allow usage as a class decorator.

staticmethod
__instancecheck__(cls, instance)

Override for isinstance(instance, cls).

staticmethod
__subclasscheck__(cls, subclass)

Override for issubclass(subclass, cls).

staticmethod
__repr__(cls) → str

Representation for the Proc subclasses

staticmethod
__call__(cls, *args, **kwds)

Make sure Proc subclasses are singletons

Parameters
  • *args (Any) and
  • **kwds (Any) Arguments for the constructor
Returns (Proc)

The Proc instance

classmethod

from_proc(proc, name=None, desc=None, envs=None, envs_depth=None, cache=None, export=None, error_strategy=None, num_retries=None, forks=None, input_data=None, order=None, plugin_opts=None, requires=None, scheduler=None, scheduler_opts=None, submission_batch=None)

Create a subclass of Proc using another Proc subclass or Proc itself

Parameters
  • proc (Type) The Proc subclass
  • name (str, optional) The new name of the process
  • desc (str, optional) The new description of the process
  • envs (Mapping, optional) The arguments of the process, will overwrite parent oneThe items that are specified will be inherited
  • envs_depth (int, optional) How deep to update the envs when subclassed.
  • cache (bool, optional) Whether we should check the cache for the jobs
  • export (bool, optional) When True, the results will be exported to<pipeline.outdir> Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes
  • error_strategy (str, optional) How to deal with the errors
    • - retry, ignore, halt
    • - halt to halt the whole pipeline, no submitting new jobs
    • - terminate to just terminate the job itself
  • num_retries (int, optional) How many times to retry to jobs once error occurs
  • forks (int, optional) New forks for the new process
  • input_data (Any, optional) The input data for the process. Only when this processis a start process
  • order (int, optional) The order to execute the new process
  • plugin_opts (Mapping, optional) The new plugin options, unspecified items will beinherited.
  • requires (Sequence, optional) The required processes for the new process
  • scheduler (str, optional) The new shedular to run the new process
  • scheduler_opts (Mapping, optional) The new scheduler options, unspecified items willbe inherited.
  • submission_batch (int, optional) How many jobs to be submited simultaneously
Returns (Type)

The new process class

classmethod

__init_subclass__()

Do the requirements inferring since we need them to build up theprocess relationship

method

init()

Init all other properties and jobs

method

gc()

GC process for the process to save memory after it's done

method

log(level, msg, *args, logger=<LoggerAdapter pipen.core (WARNING)>)

Log message for the process

Parameters
  • level (int | str) The log level of the record
  • msg (str) The message to log
  • *args The arguments to format the message
  • logger (LoggerAdapter, optional) The logging logger
method

run()

Run the process