biopipen.ns.rnaseq
RNA-seq data analysis

UnitConversion (Proc) — Convert expression value units back and forth
Simulation (Proc) — Simulate RNA-seq data using the ESCO/RUVcorr packages
biopipen.ns.rnaseq.UnitConversion(*args, **kwds) → Proc

Convert expression value units back and forth
See https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ and https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#fpkm.

The following conversions are supported:

- count -> cpm, fpkm/rpkm, fpkmuq/rpkmuq, tpm, tmm
- fpkm/rpkm -> count, tpm, cpm
- tpm -> count, fpkm/rpkm, cpm
- cpm -> count, fpkm/rpkm, tpm

Note that sum(counts/effLen) is approximated as sum(counts) / sum(effLen) * length(effLen).

You can also use this process to just transform the expression values, e.g., take log2 of the expression values. In this case, set inunit and outunit to count and log2(count + 1), respectively.
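For orientation, the textbook relations behind these units can be written as a small sketch. This is illustrative NumPy code, not the R implementation the process actually runs; counts and eff_len stand for a single sample's raw counts and effective gene lengths.

```python
import numpy as np

def counts_to_fpkm(counts: np.ndarray, eff_len: np.ndarray) -> np.ndarray:
    """FPKM: counts scaled by effective length (in kb) and library size (in millions)."""
    return counts / (eff_len / 1e3) / (counts.sum() / 1e6)

def counts_to_tpm(counts: np.ndarray, eff_len: np.ndarray) -> np.ndarray:
    """TPM: length-normalize first, then rescale so the sample sums to one million."""
    rate = counts / eff_len
    return rate / rate.sum() * 1e6

def fpkm_to_tpm(fpkm: np.ndarray) -> np.ndarray:
    """FPKM -> TPM: rescale so the sample sums to one million."""
    return fpkm / fpkm.sum() * 1e6
```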
cache — Should we detect whether the jobs are cached?
desc — The description of the process. Will use the summary from the docstring by default.
dirsig — When checking the signature for caching, whether we should walk through the content of the directory. This can be time-consuming if the directory is big.
envs — The arguments that are job-independent, useful for common options across jobs.
envs_depth — How deep to update the envs when subclassed.
error_strategy — How to deal with the errors:
  - retry, ignore, halt
  - halt: halt the whole pipeline, no submitting new jobs
  - terminate: just terminate the job itself
export — When True, the results will be exported to <pipeline.outdir>. Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes.
forks — How many jobs to run simultaneously?
input — The keys for the input channel
input_data — The input data (will be computed for dependent processes)
lang — The language for the script to run. Should be the path to the interpreter if lang is not in $PATH.
name — The name of the process. Will use the class name by default.
nexts — Computed from requires to build the process relationships
num_retries — How many times to retry the jobs once an error occurs
order — The execution order for this process. The bigger the number is, the later the process will be executed. Default: 0. Note that the dependent processes will always be executed first. This doesn't work for start processes either, whose orders are determined by Pipen.set_starts().
output — The output keys for the output channel (the data will be computed)
output_data — The output data (to pass to the next processes)
plugin_opts — Options for process-level plugins
requires — The dependency processes
scheduler — The scheduler to run the jobs
scheduler_opts — The options for the scheduler
script — The script template for the process
submission_batch — How many jobs to be submitted simultaneously
template — Define the template engine to use. This could be either a template engine or a dict with key engine indicating the template engine and the rest the arguments passed to the constructor of the pipen.template.Template object. The template engine could be either the name of the engine (currently jinja2 and liquidpy are supported) or a subclass of pipen.template.Template. You can subclass pipen.template.Template to use your own template engine.
infile — Input file containing expression values. The file should be a matrix with rows representing genes and columns representing samples. It could be an RDS file containing a data frame or a matrix, or a tab-delimited text file containing a matrix. The text file can be gzipped.
outfile — Output file containing the converted expression values. The file will be a matrix with rows representing genes and columns representing samples.
inunit — The input unit of the expression values. You can also use an expression to indicate the input unit, e.g., log2(counts + 1). The expression should be like A * fn(B*X + C) + D, where A, B, C and D are constants, fn is a function, and X is the input unit. Currently only expr, sqrt, log2, log10 and log are supported as functions. Supported input units are:
  - counts/count/rawcounts/rawcount: raw counts
  - cpm: counts per million
  - fpkm/rpkm: fragments/reads per kilobase of transcript per million
  - fpkmuq/rpkmuq: upper-quartile normalized FPKM/RPKM
  - tpm: transcripts per million
  - tmm: trimmed mean of M-values
meanfl (type=auto) — A file containing the mean fragment length for each sample by rows (samples as row names), without header. Or a fixed universal estimated number (1 used by TCGA).
nreads (type=auto) — The estimated total number of reads for each sample, or a file with the number for each sample by rows (samples as row names), without header. When converting fpkm/rpkm -> count, cpm -> count, tpm -> count or tpm -> cpm, it should be the total reads of that sample. When converting tpm -> fpkm/rpkm, it should be sum(fpkm) of that sample. It is not used when converting count -> cpm, fpkm/rpkm, tpm.
outunit — The output unit of the expression values. An expression can also be used for transformation (e.g. log2(tpm + 1)). If inunit is count, this means we convert raw counts to TPM and then transform to log2(tpm + 1) as the output. Any expression supported by R can be used. The same units as inunit are supported.
refexon — Path to the reference exon GFF file.
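As a usage sketch, assuming a pipen pipeline: the envs keys inunit, outunit and refexon are the ones documented above, while the process name, file paths and values are purely illustrative.

```python
from pipen import Pipen
from biopipen.ns.rnaseq import UnitConversion


class CountsToLogTPM(UnitConversion):
    """Convert raw counts to log2(TPM + 1)."""
    envs = {
        "inunit": "count",
        "outunit": "log2(tpm + 1)",
        "refexon": "/path/to/exons.gff",  # placeholder reference exon GFF
    }


# Run as a single-step pipeline; the input channel takes the expression matrix file
Pipen(name="unit_conversion").set_starts(CountsToLogTPM).set_data(
    ["/path/to/counts.txt.gz"]  # placeholder counts matrix (genes x samples)
).run()
```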
__init_subclass__() — Do the requirements inferring since we need them to build up the process relationship
from_proc(proc, name, desc, envs, envs_depth, cache, export, error_strategy, num_retries, forks, input_data, order, plugin_opts, requires, scheduler, scheduler_opts, submission_batch) (Type) — Create a subclass of Proc using another Proc subclass or Proc itself
gc() — GC process for the process to save memory after it's done
init() — Init all other properties and jobs
log(level, msg, *args, logger) — Log message for the process
run() — Run the process
pipen.proc.ProcMeta(name, bases, namespace, **kwargs)

Meta class for Proc
__call__(cls, *args, **kwds) (Proc) — Make sure Proc subclasses are singletons
__instancecheck__(cls, instance) — Override for isinstance(instance, cls).
__repr__(cls) (str) — Representation for the Proc subclasses
__subclasscheck__(cls, subclass) — Override for issubclass(subclass, cls).
register(cls, subclass) — Register a virtual subclass of an ABC.
register(cls, subclass)

Register a virtual subclass of an ABC.
Returns the subclass, to allow usage as a class decorator.

__instancecheck__(cls, instance)

Override for isinstance(instance, cls).

__subclasscheck__(cls, subclass)

Override for issubclass(subclass, cls).

__repr__(cls) → str

Representation for the Proc subclasses
__call__(cls, *args, **kwds)

Make sure Proc subclasses are singletons

*args (Any) and **kwds (Any) — Arguments for the constructor
The Proc instance
from_proc(proc, name=None, desc=None, envs=None, envs_depth=None, cache=None, export=None, error_strategy=None, num_retries=None, forks=None, input_data=None, order=None, plugin_opts=None, requires=None, scheduler=None, scheduler_opts=None, submission_batch=None)

Create a subclass of Proc using another Proc subclass or Proc itself
proc (Type) — The Proc subclass
name (str, optional) — The new name of the process
desc (str, optional) — The new description of the process
envs (Mapping, optional) — The arguments of the process, will overwrite the parent ones. Items that are not specified will be inherited.
envs_depth (int, optional) — How deep to update the envs when subclassed.
cache (bool, optional) — Whether we should check the cache for the jobs
export (bool, optional) — When True, the results will be exported to <pipeline.outdir>. Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes.
error_strategy (str, optional) — How to deal with the errors:
  - retry, ignore, halt
  - halt: halt the whole pipeline, no submitting new jobs
  - terminate: just terminate the job itself
num_retries (int, optional) — How many times to retry the jobs once an error occurs
forks (int, optional) — New forks for the new process
input_data (Any, optional) — The input data for the process. Only used when this process is a start process.
order (int, optional) — The order to execute the new process
plugin_opts (Mapping, optional) — The new plugin options; unspecified items will be inherited.
requires (Sequence, optional) — The required processes for the new process
scheduler (str, optional) — The new scheduler to run the new process
scheduler_opts (Mapping, optional) — The new scheduler options; unspecified items will be inherited.
submission_batch (int, optional) — How many jobs to be submitted simultaneously
The new process class
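For example, a sketch of deriving a configured process with from_proc, assuming it is called on Proc with the source process as the first argument, as the signature above indicates; the new name and envs values are illustrative.

```python
from pipen import Proc
from biopipen.ns.rnaseq import UnitConversion

# Hypothetical derived process: same script as UnitConversion, different envs
CountsToCPM = Proc.from_proc(
    UnitConversion,
    name="CountsToCPM",
    envs={"inunit": "count", "outunit": "cpm"},
)
```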
__init_subclass__()

Do the requirements inferring since we need them to build up the process relationship

init()

Init all other properties and jobs

gc()

GC process for the process to save memory after it's done

log(level, msg, *args, logger=<LoggerAdapter pipen.core (WARNING)>)

Log message for the process

level (int | str) — The log level of the record
msg (str) — The message to log
*args — The arguments to format the message
logger (LoggerAdapter, optional) — The logging logger

run()

Run the process
biopipen.ns.rnaseq.Simulation(*args, **kwds) → Proc

Simulate RNA-seq data using the ESCO/RUVcorr packages
cache — Should we detect whether the jobs are cached?
desc — The description of the process. Will use the summary from the docstring by default.
dirsig — When checking the signature for caching, whether we should walk through the content of the directory. This can be time-consuming if the directory is big.
envs — The arguments that are job-independent, useful for common options across jobs.
envs_depth — How deep to update the envs when subclassed.
error_strategy — How to deal with the errors:
  - retry, ignore, halt
  - halt: halt the whole pipeline, no submitting new jobs
  - terminate: just terminate the job itself
export — When True, the results will be exported to <pipeline.outdir>. Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes.
forks — How many jobs to run simultaneously?
input — The keys for the input channel
input_data — The input data (will be computed for dependent processes)
lang — The language for the script to run. Should be the path to the interpreter if lang is not in $PATH.
name — The name of the process. Will use the class name by default.
nexts — Computed from requires to build the process relationships
num_retries — How many times to retry the jobs once an error occurs
order — The execution order for this process. The bigger the number is, the later the process will be executed. Default: 0. Note that the dependent processes will always be executed first. This doesn't work for start processes either, whose orders are determined by Pipen.set_starts().
output — The output keys for the output channel (the data will be computed)
output_data — The output data (to pass to the next processes)
plugin_opts — Options for process-level plugins
requires — The dependency processes
scheduler — The scheduler to run the jobs
scheduler_opts — The options for the scheduler
script — The script template for the process
submission_batch — How many jobs to be submitted simultaneously
template — Define the template engine to use. This could be either a template engine or a dict with key engine indicating the template engine and the rest the arguments passed to the constructor of the pipen.template.Template object. The template engine could be either the name of the engine (currently jinja2 and liquidpy are supported) or a subclass of pipen.template.Template. You can subclass pipen.template.Template to use your own template engine.
ngenes — Number of genes to simulate
nsamples — Number of samples to simulate. If you want to force the process to re-simulate for the same ngenes and nsamples, you can set a different value for envs.seed. Note that the samples will be shown as cells in the output (since the simulation is designed for single-cell RNA-seq data).
outdir — Output directory containing the simulated data. sim.rds and True.rds will be generated. For ESCO, sim.rds contains the simulated data in a SingleCellExperiment object, and True.rds contains the matrix of true counts. For RUVcorr, sim.rds contains the simulated data in a list with: Truth, a matrix containing the values of Xβ; Y, a matrix containing the values in Y; Noise, a matrix containing the values in Wα; Sigma, a matrix containing the true gene-gene correlations, as defined by Xβ; and Info, a matrix containing some general information about the simulation. For all matrices, rows represent genes and columns represent samples.
outfile — Output file containing the simulated data, with rows representing genes and columns representing samples.
esco_args (ns) — Additional arguments to pass to the ESCO simulation function.
  - save (choice): Which type of data to save to out.outfile.
    - simulated-truth: saves the simulated true counts.
    - zero-inflated: saves the zero-inflated counts.
    - down-sampled: saves the down-sampled counts.
  - type (choice): Which type of heterogeneity to use.
    - single: produces a single population.
    - group: produces distinct groups.
    - tree: produces distinct groups but admits a tree structure.
    - traj: produces distinct groups but admits a smooth trajectory structure.
  - Other arguments: see https://rdrr.io/github/JINJINT/ESCO/man/escoParams.html.
index_start (type=int) — The index to start from when naming the samples. Affects the sample names in out.outfile only.
ncores (type=int) — Number of cores to use.
ruvcorr_args (ns) — Additional arguments to pass to the RUVcorr simulation function.
seed (type=int) — Random seed. If not set, the seed will not be set.
tool (choice) — Which tool to use for simulation.
transpose_output (flag) — If set, the output will be transposed.
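As a rough usage sketch: the input keys ngenes/nsamples and the envs names come from the documentation above, while the tool value, seed and pipeline wiring are illustrative assumptions.

```python
from pipen import Pipen
from biopipen.ns.rnaseq import Simulation


class SimulateCounts(Simulation):
    """Simulate a small expression matrix."""
    envs = {
        "tool": "esco",                  # assumed choice, matching esco_args above
        "seed": 8525,                    # fixed seed so identical runs stay cached
        "esco_args": {"type": "group"},  # simulate distinct groups
    }


# Each input row provides ngenes and nsamples for one simulation job
Pipen(name="rnaseq_simulation").set_starts(SimulateCounts).set_data(
    [(1000, 100)]  # 1000 genes, 100 samples
).run()
```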
__init_subclass__() — Do the requirements inferring since we need them to build up the process relationship
from_proc(proc, name, desc, envs, envs_depth, cache, export, error_strategy, num_retries, forks, input_data, order, plugin_opts, requires, scheduler, scheduler_opts, submission_batch) (Type) — Create a subclass of Proc using another Proc subclass or Proc itself
gc() — GC process for the process to save memory after it's done
init() — Init all other properties and jobs
log(level, msg, *args, logger) — Log message for the process
run() — Run the process
pipen.proc.ProcMeta(name, bases, namespace, **kwargs)

Meta class for Proc
__call__(cls, *args, **kwds) (Proc) — Make sure Proc subclasses are singletons
__instancecheck__(cls, instance) — Override for isinstance(instance, cls).
__repr__(cls) (str) — Representation for the Proc subclasses
__subclasscheck__(cls, subclass) — Override for issubclass(subclass, cls).
register(cls, subclass) — Register a virtual subclass of an ABC.
register(cls, subclass)

Register a virtual subclass of an ABC.
Returns the subclass, to allow usage as a class decorator.

__instancecheck__(cls, instance)

Override for isinstance(instance, cls).

__subclasscheck__(cls, subclass)

Override for issubclass(subclass, cls).

__repr__(cls) → str

Representation for the Proc subclasses
__call__(cls, *args, **kwds)

Make sure Proc subclasses are singletons

*args (Any) and **kwds (Any) — Arguments for the constructor
The Proc instance
from_proc(proc, name=None, desc=None, envs=None, envs_depth=None, cache=None, export=None, error_strategy=None, num_retries=None, forks=None, input_data=None, order=None, plugin_opts=None, requires=None, scheduler=None, scheduler_opts=None, submission_batch=None)

Create a subclass of Proc using another Proc subclass or Proc itself
proc (Type) — The Proc subclass
name (str, optional) — The new name of the process
desc (str, optional) — The new description of the process
envs (Mapping, optional) — The arguments of the process, will overwrite the parent ones. Items that are not specified will be inherited.
envs_depth (int, optional) — How deep to update the envs when subclassed.
cache (bool, optional) — Whether we should check the cache for the jobs
export (bool, optional) — When True, the results will be exported to <pipeline.outdir>. Defaults to None, meaning only end processes will export. You can set it to True/False to enable or disable exporting for processes.
error_strategy (str, optional) — How to deal with the errors:
  - retry, ignore, halt
  - halt: halt the whole pipeline, no submitting new jobs
  - terminate: just terminate the job itself
num_retries (int, optional) — How many times to retry the jobs once an error occurs
forks (int, optional) — New forks for the new process
input_data (Any, optional) — The input data for the process. Only used when this process is a start process.
order (int, optional) — The order to execute the new process
plugin_opts (Mapping, optional) — The new plugin options; unspecified items will be inherited.
requires (Sequence, optional) — The required processes for the new process
scheduler (str, optional) — The new scheduler to run the new process
scheduler_opts (Mapping, optional) — The new scheduler options; unspecified items will be inherited.
submission_batch (int, optional) — How many jobs to be submitted simultaneously
The new process class
__init_subclass__()

Do the requirements inferring since we need them to build up the process relationship

init()

Init all other properties and jobs

gc()

GC process for the process to save memory after it's done

log(level, msg, *args, logger=<LoggerAdapter pipen.core (WARNING)>)

Log message for the process

level (int | str) — The log level of the record
msg (str) — The message to log
*args — The arguments to format the message
logger (LoggerAdapter, optional) — The logging logger

run()

Run the process