Preparing the input data
Single-cell RNA-seq (scRNA-seq) data
The following data formats are currently supported.
For each sample, provide the path to the data file or to a directory containing the files. A directory must be readable by Seurat::Read10X(); for example, it should contain matrix.mtx, barcodes.tsv and features.tsv. These files can also be gzipped.
For 10X Genomics data, you can also provide the h5 file generated by CellRanger.
With v2+, expression data can also be loaded directly from a Seurat object stored in an RDS or qs/qs2 file.
To do so, enable the LoadRNAFromSeurat process in the configuration file and provide the path to the RDS or qs/qs2 file:
[LoadRNAFromSeurat.in]
infile = "path/to/seurat_object.rds"
[LoadRNAFromSeurat.envs]
# Set to true if the Seurat object is already prepared;
# SeuratPreparing will then be skipped
prepared = false
# Set to true (together with prepared = true) if the Seurat object already
# contains the clustering information;
# SeuratClusteringOfAllCells/SeuratClustering will then be skipped
clustered = false
Single-cell TCR-/BCR-seq (scTCR-seq/scBCR-seq) data
The scTCR-/scBCR-seq data is optional for the pipeline, whereas the scRNA-seq data is required.
The scTCR-/scBCR-seq data, if available, should be paired with the scRNA-seq data. Theoretically, any data that can be loaded by scRepertoire::loadContigs() should work. The following formats are supported:
- 10X: 10X Genomics data, usually a directory containing a filtered_contig_annotations.csv file.
- AIRR: AIRR format, usually a file named airr_rearrangement.tsv.
- BD: Becton Dickinson data, usually a file named Contigs_AIRR.tsv.
- Dandelion: Dandelion data, usually a file named all_contig_dandelion.tsv.
- Immcantation: Immcantation data, usually a file named data.tsv.
- JSON: JSON format, usually a file with the .json extension.
- ParseBio: ParseBio data, usually a file named barcode_report.tsv.
- MiXCR: MiXCR data, usually a file named clones.tsv.
- Omniscope: Omniscope data, usually a file with the .csv extension.
- TRUST4: TRUST4 data, usually a file named barcode_report.tsv.
- WAT3R: WAT3R data, usually a file named barcode_results.csv.
See also: https://rdrr.io/github/ncborcherding/scRepertoire/man/loadContigs.html
Metadata
A metadata file is required as an input file for the pipeline. It should be a TAB-delimited file with the following columns (Sample and RNAData are required):
- Sample: A unique id for each sample.
- RNAData: The directory or h5 file for single-cell RNA data for this sample, as described above.
- TCRData (optional): The directory for single-cell TCR data for this sample, as described above.
- BCRData (optional): The directory for single-cell BCR data for this sample, as described above.
When TCRData/BCRData is not provided, the pipeline will skip the processes related to scTCR-/scBCR-seq data (see Routes of the pipeline for more details).
You can also add other columns to the metadata file. These columns will be added to the Seurat object as metadata and can be used for downstream analysis. For example, you can add a column Condition to indicate the condition of each sample, or Batch to indicate the batch each sample belongs to.
This file should be provided to the SampleInfo process. See SampleInfo for more details.
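As a minimal sketch, the metadata file could be wired into the configuration as shown here. The path is a placeholder, and passing the file as a one-element list is an assumption about how the process receives its input; the SampleInfo documentation has the authoritative form.

```toml
[SampleInfo.in]
# Placeholder path to the TAB-delimited metadata file.
# The list form is an assumption; check the SampleInfo docs.
infile = ["path/to/metadata.txt"]
```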
An example metadata file can be found here.
You can also use SampleInfo with envs.save_mutated = true and/or SeuratPreparing to add columns to the metadata by configuration. These columns are persisted for downstream analysis. The difference is that SampleInfo cannot pass along factor (categorical) columns, while SeuratPreparing can.
Other optional files
Genes/Features to visualize for Seurat object
If you have a set of genes/features of interest, you can provide a file listing them, one gene per line, to SeuratClusterStats.envs.features.features to visualize the feature values. The plotting is ultimately done by scplotter::FeatureStatPlot().
Note
The genes should exist in the RNA-seq data (i.e., in features.tsv or the h5 file from CellRanger).
See SeuratClusterStats for more details.
Pathways for Gene Set Enrichment Analysis (GSEA)
If you want to perform GSEA, you need to provide a file containing the pathways. The file should be in the GMT format. You can provide the file to ScFGSEA.envs.gmtfile. Similarly, the genes in the pathways should exist, in the same format, in the features.tsv file.
See ScFGSEA for more details.
You can also find an example here: https://github.com/pwwang/immunopipe-example/blob/master/data/MSigDB_Hallmark_v7.5.1.gmt
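Following the section-per-process layout of the LoadRNAFromSeurat fragment earlier on this page, the GMT file could be supplied as in this sketch. The path is a placeholder; only the gmtfile key is taken from the text above.

```toml
[ScFGSEA.envs]
# Placeholder path to a pathway file in GMT format
gmtfile = "path/to/MSigDB_Hallmark_v7.5.1.gmt"
```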
Cell type database for cell type annotation by sctype or hitype
If you want to perform cell type annotation with sctype or hitype, you need to provide a file containing the cell type database. The database file should be fed to CellTypeAnnotation.envs.sctype_db if you are using sctype, or CellTypeAnnotation.envs.hitype_db if you are using hitype. Again, the markers in the database should exist, in the same format, in the features.tsv file or the h5 file.
See CellTypeAnnotation for more details.
Examples can be found here: ScTypeDB_short.xlsx and ScTypeDB_full.xlsx.
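A configuration sketch for the database file might look like this. The paths are placeholders, and any other option needed to select the annotation tool is left out because it is not covered on this page; see CellTypeAnnotation for the full set of options.

```toml
[CellTypeAnnotation.envs]
# When annotating with sctype (placeholder path)
sctype_db = "path/to/ScTypeDB_full.xlsx"
# When annotating with hitype instead, set hitype_db (placeholder path)
# hitype_db = "path/to/hitype_database_file"
```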
Model for cell type annotation by celltypist
If you want to perform cell type annotation by celltypist, you need to provide a model file. The model file should be fed to CellTypeAnnotation.envs.celltypist_args.model. Information about the available models can be found here. Download the one you want to use and provide the path to the file.
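In TOML, the dotted path envs.celltypist_args.model maps to a nested table, as in this sketch. The file name is a placeholder for whichever model you downloaded.

```toml
[CellTypeAnnotation.envs.celltypist_args]
# Placeholder path to a downloaded celltypist model file
model = "path/to/celltypist_model.pkl"
```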
Metabolic pathway for Metabolic Landscape Analysis
Similarly, if you want to perform metabolic landscape analysis, you need to provide a file containing the metabolic pathways. The file should be in the GMT format. You can provide the file to ScrnaMetabolicLandscape.envs.gmtfile. This file can also be used for GSEA. A pathway file for KEGG metabolism is provided here.
See ScrnaMetabolicLandscape for more details.
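A sketch of the corresponding configuration, using the process name as it appears in the "See ScrnaMetabolicLandscape" reference. The path is a placeholder.

```toml
[ScrnaMetabolicLandscape.envs]
# Placeholder path to a metabolic pathway file in GMT format
gmtfile = "path/to/KEGG_metabolism.gmt"
```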
Reference for Seurat mapping if you want to perform supervised clustering
If you want to perform supervised clustering, you need to provide a reference for SeuratMap2Ref. The reference should be a Seurat object in an RDS or h5seurat file. You can provide the reference to SeuratMap2Ref.envs.ref.
See SeuratMap2Ref for more details.
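The reference could be supplied as in this sketch. The path is a placeholder, and other SeuratMap2Ref options that a real run may need are not shown because they are not described on this page.

```toml
[SeuratMap2Ref.envs]
# Placeholder path to a reference Seurat object (RDS or h5seurat)
ref = "path/to/reference.rds"
```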