Introduction¶
The pipeline architecture¶
immunopipe is built upon pipen. It is recommended to read the pipen docs first to get a better understanding of the pipeline.
Here, we highlight only the concepts that help you use the pipeline as a user.
A process is a unit of work in the pipeline. immunopipe includes a set of processes. Some of them are reused from biopipen and some are written specifically for immunopipe.
The input of a process is typically a pandas DataFrame, which serves as the channel passing data between processes. The rows of the data frame are distributed to the jobs of the process, and the columns are spread across the input variables of each job. See more illustration here. In our case, most processes are single-job processes. Other than the start processes, the input of a process is the output of other process(es), so users don't need to worry about the input of the processes in the configurations.
envs of a process is the most important part of immunopipe that a user needs to configure. It defines the environment variables of the process. The environment variables are shared by all the jobs of the process.
Attention
These environment variables are not the same as the environment variables of the system. They are just variables that are used in the process across its jobs.
See individual process pages for more details about the envs of each process.
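To make the row/column convention and the shared envs concrete, here is a minimal plain-Python sketch. It does not use pipen or pandas: the channel is modeled as a list of rows, and the names `make_jobs`, `infile`, `ncores` and the file names are hypothetical, chosen only for illustration.

```python
from typing import Any


def make_jobs(
    columns: list[str],
    rows: list[tuple],
    envs: dict[str, Any],
) -> list[dict[str, Any]]:
    """Model how a channel is distributed to the jobs of one process.

    Each row becomes one job; each column becomes one input variable.
    The same envs object is attached to every job.
    """
    jobs = []
    for index, row in enumerate(rows):
        jobs.append({
            "index": index,
            "in": dict(zip(columns, row)),  # column name -> value for this job
            "envs": envs,  # shared across jobs, not copied per job
        })
    return jobs


# A single-row channel gives a single-job process, the common case here.
single = make_jobs(["infile"], [("samples.txt",)], envs={"ncores": 1})

# A two-row channel would give two jobs that share the same envs.
double = make_jobs(
    ["infile"],
    [("a.txt",), ("b.txt",)],
    envs={"ncores": 4},
)

print(len(single), single[0]["in"])
print(len(double), [job["in"]["infile"] for job in double])
```

Note that `envs` here is an ordinary dictionary handed to every job. It is unrelated to the system environment (`os.environ`), which matches the distinction made in the note above.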
Analyses and processes¶
As shown in the figure above, immunopipe includes a set of processes for scTCR- and scRNA-seq data analysis. The processes are grouped into categories below:
Data input and QC¶
- SampleInfo: Read sample information from a CSV file and list the sample information in the report.
- ImmunarchLoading: Load the data into immunarch objects.
- SeuratPreparing: Read the data into Seurat objects and perform QC.
T cell selection¶
- SeuratClusteringOfAllCells: Perform clustering on all cells if non-T cells are present in the data.
- ClusterMarkersOfAllCells: Find markers for each cluster of all the cells and perform enrichment analysis.
- TopExpressingGenesOfAllCells: Find top expressing genes for each cluster of all the cells and perform enrichment analysis.
- TCellSelection: Select T cells from all cells.
Clustering of T cells¶
- SeuratClustering: Perform clustering on all cells or on the T cells selected above.
- SeuratMap2Ref: Map the cells to a reference dataset.
- CellTypeAnnotation: Annotate cell types for each T-cell cluster.
- SeuratSubClustering: Perform sub-clustering on subsets of cells.
- ClusterMarkers: Find markers for each T-cell cluster and perform enrichment analysis.
- TopExpressingGenes: Find top expressing genes for each T-cell cluster and perform enrichment analysis.
- ModuleScoreCalculator: Calculate module scores or cell cycle scores for each cell.
TCR data analyses¶
- CloneResidency: Explore the residency of TCR clones for paired samples (e.g. tumor vs blood) from the same patient.
- TCRClustering: Perform clustering on TCR clones based on CDR3 amino acid sequences.
- TCRClusterStats: Investigate statistics for TCR clusters (e.g. TCR cluster size distribution, shared TCR clusters among samples, revisited sample diversity using TCR clusters instead of clonotypes, etc.)
- Immunarch: Perform TCR clonotype analyses using the immunarch package.
Integrative analyses¶
- TESSA: Perform integrative analyses using Tessa.
- SeuratClusterStats: Investigate statistics for each T-cell cluster (e.g. the number of cells in each cluster, the number of cells in each sample for each cluster, feature/gene expression visualization, dimension reduction plots, etc.). It's also possible to perform stats on TCR clones/clusters for each T-cell cluster.
- IntegratingTCRClusters: Attach TCR clusters to Seurat objects.
- IntegratingTCR: Integrate TCR data into Seurat objects.
- RadarPlots: Visualize the proportion of cells in different groups for each cluster.
- CellsDistribution: Investigate the distribution of cells in different groups for each T-cell cluster.
- CDR3AAPhyschem: Investigate the physicochemical properties of CDR3 amino acid sequences of one cell type over another (e.g. Treg vs Tconv).
- ScFGSEA: Perform GSEA analysis for comparisons between two groups of cells, for example between two cell types, clone groups, TCR clusters or clinical groups.
- MarkersFinder: Find markers (differentially expressed genes) for any two groups, including clones or clone groups.
- MetaMarkers: Find meta markers for more than 2 clones or clone groups and perform enrichment analysis.
Metabolic landscape analyses¶
- ScrnaMetabolicLandscape: A group of the following processes to perform metabolic landscape analyses.
  - MetabolicInput: Prepare the input files for metabolic landscape analyses.
  - MetabolicExprImpution: Impute the dropout values in the expression matrix.
  - MetabolicPathwayActivity: Investigate the metabolic pathways of the cells in different groups and subsets.
  - MetabolicPathwayHeterogeneity: Show metabolic pathways enriched in genes with the highest contribution to the metabolic heterogeneities.
  - MetabolicFeatures: Perform gene set enrichment analysis against the metabolic pathways for groups in different subsets.
  - MetabolicFeaturesIntraSubset: Perform gene set enrichment analysis against the metabolic pathways for subsets based on the designed comparison in different groups.
Routes of the pipeline¶
immunopipe is designed to be flexible. It can be used in different ways. Here we list some common routes of the pipeline:
Both scRNA-seq and scTCR-seq data available¶
To enable this route, you need to:
- tell the pipeline that scTCR-seq data is available by adding a column named TCRData in the sample information file.
- put the path of the sample information file in the configuration file [SampleInfo.in.infile], instead of passing it as a command line argument (--Sample.in.infile).
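The first condition can be checked before launching the pipeline. The sketch below is only an illustration, not immunopipe's own loading code: it assumes a delimited text file with a header row (comma-delimited by default, which is an assumption), and the `Sample` column, the paths and the helper name `has_tcr_data` are hypothetical. Only the `TCRData` column name comes from the text above.

```python
import csv
import io


def has_tcr_data(handle, delimiter: str = ",") -> bool:
    """Return True if the header row of a sample information file
    contains a column named TCRData."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])  # empty file -> empty header
    return "TCRData" in header


# Hypothetical file contents; only the TCRData column name is from the docs.
with_tcr = io.StringIO("Sample,TCRData\nS1,/path/to/S1/tcr\n")
rna_only = io.StringIO("Sample\nS1\n")

print(has_tcr_data(with_tcr))
print(has_tcr_data(rna_only))
```

With a real file, pass an open file handle instead of `io.StringIO`, and set `delimiter` to match the file.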
Unsupervised clustering [SeuratClustering] on selected T cells is the default setting. If you want to perform supervised clustering, you need to add [SeuratMap2Ref] in the configuration file with the necessary parameters. If so, SeuratClustering will be replaced by SeuratMap2Ref in the pipeline.
If you need to select T cells from all cells available for later analyses, you need to add [TCellSelection] in the configuration file. If so, the processes annotated as For T cell selection will be added to the pipeline.
This is the most common route of the pipeline:
The optional processes are enabled only when the corresponding sections are added in the configuration file. For example, if you want to add module scores (e.g. cell activation score) to the Seurat object, you need to add [ModuleScoreCalculator] in the configuration file.
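The section-driven rules described in this route can be summarized in a short sketch. This is not immunopipe's implementation: `plan_route` is a hypothetical helper, it covers only the processes named in this section, it assumes the T cell selection group is exactly the four processes listed under "T cell selection" above, and the order of the returned names is not meant to reflect execution order.

```python
# Assumption: the processes "annotated as For T cell selection" are the
# four listed under the "T cell selection" category of this page.
T_CELL_SELECTION = [
    "SeuratClusteringOfAllCells",
    "ClusterMarkersOfAllCells",
    "TopExpressingGenesOfAllCells",
    "TCellSelection",
]


def plan_route(config_sections):
    """Return the processes selected by the rules in this section,
    given the section names present in the configuration file."""
    sections = set(config_sections)
    selected = []

    # Adding [TCellSelection] brings in the whole T cell selection group.
    if "TCellSelection" in sections:
        selected.extend(T_CELL_SELECTION)

    # Supervised clustering replaces the default unsupervised clustering.
    if "SeuratMap2Ref" in sections:
        selected.append("SeuratMap2Ref")
    else:
        selected.append("SeuratClustering")

    # Optional processes run only when their section is present.
    if "ModuleScoreCalculator" in sections:
        selected.append("ModuleScoreCalculator")

    return selected


print(plan_route([]))
print(plan_route(["SeuratMap2Ref", "ModuleScoreCalculator"]))
```

An empty configuration yields only the default clustering, while adding a section switches the corresponding process on or swaps it in.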
Only scRNA-seq data available¶
When you have only scRNA-seq data, simply leave the TCRData column out of the sample information file. The pipeline will automatically skip the processes related to scTCR-seq data analysis.
Attention
You need to specify the sample information file in the configuration file [SampleInfo.in.infile] to enable this route. Passing the sample information file as a command line argument (--Sample.in.infile) does not trigger this route.
Unsupervised clustering [SeuratClustering] on selected T cells is the default setting. If you want to perform supervised clustering, you need to add [SeuratMap2Ref] in the configuration file with the necessary parameters. If so, SeuratClustering will be replaced by SeuratMap2Ref in the pipeline.
If you need to select T cells from all cells available for later analyses, you need to add [TCellSelection] in the configuration file. If so, the processes annotated as For T cell selection will be added to the pipeline.