Preparing the input data

Single-cell RNA-seq (scRNA-seq) data

Currently supported data formats:

For each sample, provide the path to the data file or to a directory containing the files. The directory must be readable by Seurat::Read10X(); for example, it should contain matrix.mtx, barcodes.tsv and features.tsv. These files can also be gzipped.
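Before running the pipeline, it can help to check that each sample directory has the expected files. The following is a minimal sketch, not part of the pipeline: the helper name missing_10x_files is hypothetical, and it only checks the three file names mentioned above (plain or gzipped). Seurat itself may accept other layouts.

```python
from pathlib import Path

# File names taken from the text above; each may also be gzipped.
REQUIRED_10X_FILES = ("matrix.mtx", "barcodes.tsv", "features.tsv")


def missing_10x_files(directory):
    """Return the expected 10X files that are absent from `directory`.

    A file counts as present if it exists either plain or with a .gz suffix.
    (Hypothetical helper for illustration only.)
    """
    directory = Path(directory)
    missing = []
    for name in REQUIRED_10X_FILES:
        plain = directory / name
        gzipped = directory / (name + ".gz")
        if not plain.is_file() and not gzipped.is_file():
            missing.append(name)
    return missing
```

An empty list means all three files were found; otherwise the list names the files to look for.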

For 10X Genomics data, you can also provide the h5 file generated by CellRanger.

With v2+, we also support loading expression data directly from a Seurat object stored in an RDS or qs/qs2 file. To do so, enable the LoadRNAFromSeurat process in the configuration file and provide the path to the RDS or qs/qs2 file:

[LoadRNAFromSeurat.in]
infile = "path/to/seurat_object.rds"

[LoadRNAFromSeurat.envs]
# When true, SeuratPreparing will be skipped.
# Use this when the Seurat object is already prepared.
prepared = false

# When true (and prepared is also true), SeuratClusteringOfAllCells/SeuratClustering will be skipped.
# Use this when the Seurat object already contains the clustering information.
clustered = false

Single-cell TCR-/BCR-seq (scTCR-seq/scBCR-seq) data

The scTCR-/scBCR-seq data is optional for the pipeline, whereas the scRNA-seq data is required.

The scTCR-/scBCR-seq data, if available, should be paired with the scRNA-seq data. Theoretically, any data that can be loaded by scRepertoire::loadContigs() should work. The following formats are supported:

  • 10X: 10X Genomics data, usually a directory containing a filtered_contig_annotations.csv file.
  • AIRR: AIRR format, usually provided as an airr_rearrangement.tsv file.
  • BD: Becton Dickinson data, usually provided as a Contigs_AIRR.tsv file.
  • Dandelion: Dandelion data, usually provided as an all_contig_dandelion.tsv file.
  • Immcantation: Immcantation data, usually provided as a data.tsv file.
  • JSON: JSON format, usually a file with a .json extension.
  • ParseBio: ParseBio data, usually provided as a barcode_report.tsv file.
  • MiXCR: MiXCR data, usually provided as a clones.tsv file.
  • Omniscope: Omniscope data, usually a file with a .csv extension.
  • TRUST4: TRUST4 data, usually provided as a barcode_report.tsv file.
  • WAT3R: WAT3R data, usually provided as a barcode_results.csv file.

See also: https://rdrr.io/github/ncborcherding/scRepertoire/man/loadContigs.html
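To illustrate how the typical file names above map to formats, here is a rough sketch that guesses candidate formats for a directory. It is only a heuristic and not how the pipeline or scRepertoire detects formats: the helper name candidate_formats is hypothetical, matching is done on file-name endings (so a short name such as data.tsv can match unrelated files), and formats identified only by extension (JSON, Omniscope) are left out.

```python
from pathlib import Path

# Typical file names per format, as listed above. ParseBio and TRUST4 share
# the same typical file name, so both are reported for it.
TYPICAL_FILES = {
    "filtered_contig_annotations.csv": ["10X"],
    "airr_rearrangement.tsv": ["AIRR"],
    "Contigs_AIRR.tsv": ["BD"],
    "all_contig_dandelion.tsv": ["Dandelion"],
    "data.tsv": ["Immcantation"],
    "barcode_report.tsv": ["ParseBio", "TRUST4"],
    "clones.tsv": ["MiXCR"],
    "barcode_results.csv": ["WAT3R"],
}


def candidate_formats(directory):
    """Return formats whose typical file name ends a file name in `directory`.

    (Hypothetical helper; a heuristic for illustration only.)
    """
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    found = []
    for suffix, formats in TYPICAL_FILES.items():
        if any(name.endswith(suffix) for name in names):
            found.extend(formats)
    return found
```

When more than one format is returned, as for barcode_report.tsv, the file name alone cannot tell them apart and the format has to be known from how the data was produced.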

Metadata

A metadata file is required as an input file for the pipeline. It should be a TAB-delimited file with 2 required columns (Sample and RNAData) and 2 optional columns (TCRData and BCRData):

  • Sample: A unique id for each sample
  • RNAData: The directory or h5 file for single-cell RNA data for this sample, as described above.
  • TCRData (optional): The directory for single-cell TCR data for this sample as described above.
  • BCRData (optional): The directory for single-cell BCR data for this sample as described above.

When TCRData/BCRData is not provided, the pipeline will skip the processes related to scTCR-/scBCR-seq data (see Routes of the pipeline for more details).

You can also add other columns to the metadata file. These columns will be added to the Seurat object as metadata and can be used for downstream analysis. For example, you can add a column Condition to indicate the condition of each sample, or Batch to indicate the batch each sample belongs to.
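A quick sanity check of the metadata file can catch problems early. The sketch below is an illustration only, not the validation the pipeline performs: the helper name check_metadata is hypothetical, and it checks just the two required columns and the uniqueness of the Sample ids.

```python
import csv

# Required columns as described above; TCRData and BCRData are optional.
REQUIRED_COLUMNS = ("Sample", "RNAData")


def check_metadata(path):
    """Return a list of problems found in a TAB-delimited metadata file.

    (Hypothetical helper for illustration only.)
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        columns = reader.fieldnames or []
        rows = list(reader)

    problems = [
        f"missing required column: {column}"
        for column in REQUIRED_COLUMNS
        if column not in columns
    ]
    if "Sample" in columns:
        samples = [row["Sample"] or "" for row in rows]
        duplicates = sorted({s for s in samples if samples.count(s) > 1})
        problems += [f"duplicated Sample id: {s}" for s in duplicates]
    return problems
```

An empty list means that both required columns are present and every Sample id is unique. Extra columns such as Condition or Batch are ignored by this check.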

This file should be provided to the SampleInfo process. See SampleInfo for more details.

An example metadata file can be found here.

You can also use SampleInfo with envs.save_mutated = true and/or SeuratPreparing to add columns to the metadata by configuration. These columns are persisted for downstream analysis. The difference is that SampleInfo cannot pass along factor (categorical) columns, whereas SeuratPreparing can.

Other optional files

Genes/Features to visualize for Seurat object

If you have a set of genes/features of interest, you can provide a file listing them, one gene per line, to SeuratClusterStats.envs.features.features to visualize the feature values. The plotting is ultimately done by scplotter::FeatureStatPlot().

Note

The genes should exist in the RNA-seq data (i.e., in features.tsv or in the h5 file from CellRanger).
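One way to verify this for 10X directories is to compare the gene file against features.tsv. The sketch below is an illustration under assumptions: the helper names are hypothetical, and gene names are assumed to be in the second column of features.tsv (the default gene.column used by Seurat::Read10X()). Adjust gene_column if your data differs.

```python
import gzip


def read_lines(path):
    """Read a text file into a list of lines, handling a .gz suffix."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\r\n") for line in handle]


def genes_not_in_features(gene_file, features_file, gene_column=1):
    """Return genes from `gene_file` (one per line) absent from `features_file`.

    `gene_column` is the zero-based column of features.tsv holding gene names;
    the default of 1 (second column) is an assumption, see the text above.
    (Hypothetical helper for illustration only.)
    """
    known = set()
    for line in read_lines(features_file):
        fields = line.split("\t")
        if len(fields) > gene_column:
            known.add(fields[gene_column])
    wanted = [gene.strip() for gene in read_lines(gene_file) if gene.strip()]
    return [gene for gene in wanted if gene not in known]
```

Any gene returned by genes_not_in_features() is either misspelled or named differently in the data (for example, a gene symbol versus an Ensembl id).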

See SeuratClusterStats for more details.

Pathways for Gene Set Enrichment Analysis (GSEA)

If you want to perform GSEA, you need to provide a file containing the pathways in the GMT format. You can provide the file to ScFGSEA.envs.gmtfile. As with the genes to visualize, the genes in the pathways should use the same naming format as those in the features.tsv file.
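In a GMT file, each line is TAB-separated: the pathway name, a description, then the genes. To see how well a GMT file matches your data, you can compute the fraction of each pathway's genes that are present. This is a minimal sketch with hypothetical helper names (parse_gmt, gene_coverage); it is not what ScFGSEA does internally.

```python
def parse_gmt(lines):
    """Parse GMT lines into a dict of {pathway name: [genes]}.

    Each GMT line is TAB-separated: name, description, then the genes.
    (Hypothetical helper for illustration only.)
    """
    pathways = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pathways[fields[0]] = [gene for gene in fields[2:] if gene]
    return pathways


def gene_coverage(pathways, known_genes):
    """Return {pathway: fraction of its genes found in `known_genes`}."""
    known = set(known_genes)
    return {
        name: sum(gene in known for gene in genes) / len(genes)
        for name, genes in pathways.items()
        if genes
    }
```

parse_gmt() accepts any iterable of lines, so an open file handle can be passed directly. A pathway with coverage close to 0 usually points to a naming mismatch between the GMT file and the expression data.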

See ScFGSEA for more details.

You can also find an example here: https://github.com/pwwang/immunopipe-example/blob/master/data/MSigDB_Hallmark_v7.5.1.gmt

Cell type database for cell type annotation by sctype or hitype

If you want to perform cell type annotation using sctype or hitype, you need to provide a file containing the cell type database. Pass the database file to CellTypeAnnotation.envs.sctype_db for sctype, or to CellTypeAnnotation.envs.hitype_db for hitype. Again, the markers in the database should use the same naming format as the genes in the features.tsv file or the h5 file.

See CellTypeAnnotation for more details.

Examples can be found here: ScTypeDB_short.xlsx and ScTypeDB_full.xlsx.

Model for cell type annotation by celltypist

If you want to perform cell type annotation by celltypist, you need to provide a model file. Pass the model file to CellTypeAnnotation.envs.celltypist_args.model. Information about the available models can be found here. Download the one you want to use and provide the path to the file.

Metabolic pathway for Metabolic Landscape Analysis

Similarly, if you want to perform metabolic landscape analysis, you need to provide a file containing the metabolic pathways in the GMT format. You can provide the file to ScrnaMetabolicLandscape.envs.gmtfile. This file can also be used for GSEA. A pathway file for KEGG metabolism is provided here.

See ScrnaMetabolicLandscape for more details.

Reference for Seurat mapping if you want to perform supervised clustering

If you want to perform supervised clustering, you need to provide a reference for SeuratMap2Ref. The reference should be a Seurat object stored in an RDS or h5seurat file. You can provide the reference to SeuratMap2Ref.envs.ref.

See SeuratMap2Ref for more details.