
List sample information and perform statistics

This process is the entry point of the pipeline. It simply passes the input file through and lists the sample information in the report.

To specify the input file in the configuration file, use the following:

[SampleInfo.in]
infile = [ "path/to/sample_info.txt" ]

Or, with pipen-board, find the SampleInfo process and click the Edit button.
You can then specify the input file there.


Theoretically, multiple input files are possible. However, this has not been tested yet.
If you have multiple input files to run, please run each one with a separate pipeline instance (configuration file).

For the content of the input file, please see details here.

Once the pipeline has finished, you can see the sample information in the report.


Note that the required RNAData and TCRData columns are not shown in the report.
They are used to specify the paths of the scRNA-seq and scTCR-seq data, respectively.

You may also perform some statistics on the sample information, for example, number of samples per group. See next section for details.


This is the start process of the pipeline. Once you change the parameters for this process, the whole pipeline will be re-run.

If you only want to change the parameters for the statistics and reuse the cached (previous) results of the other processes, set cache to "force" at the pipeline level so that the pipeline uses the cached results, and set cache of SampleInfo to false so that only the SampleInfo process is re-run.

cache = "force"

[SampleInfo]
cache = false


Input

  • infile (required): The input file to list sample information. It should be a csv/tsv file with a header.
    The input file should have the following columns.
    • Sample: A unique id for each sample.
    • TCRData: The directory for single-cell TCR data for this sample.
      Specifically, it should contain filtered_contig_annotations.csv or all_contig_annotations.csv from cellranger.
    • RNAData: The directory for single-cell RNA data for this sample.
      Specifically, it should be able to be read by Seurat::Read10X().
    • Other columns are optional and will be treated as metadata for each sample.


Output

  • outfile: Default: {{in.infile | basename}}.
    The output file with sample information, with mutated columns if envs.save_mutated is True.

Environment Variables

  • sep: Default: \t (tab).
    The separator of the input file.
  • mutaters (type=json): Default: {}.
    A dict of mutaters to mutate the data frame.
    The key is the column name and the value is the R expression to mutate the column. The dict will be transformed to a list in R and passed to dplyr::mutate.
    You may also use paired() to identify paired samples. The function takes the following arguments:
    • df: The data frame. Use . if the function is called in a dplyr pipe.
    • id_col: The column name in df for the ids to be returned in the final output.
    • compare_col: The column name in df to compare the values for each id in id_col.
    • idents: The values in compare_col to compare. It can be either an integer or a vector. If it is an integer, the number of values in compare_col must equal that integer for the id to be regarded as paired. If it is a vector, the values in compare_col must be the same as the values in idents for the id to be regarded as paired.
    • uniq: Whether to return unique ids or not. Default is TRUE.
      If FALSE, you can mutate the meta data frame with the returned ids. Non-paired ids will be NA.
  • save_mutated (flag): Default: False.
    Whether to save the mutated columns.
  • exclude_cols: Default: TCRData,RNAData.
    The columns to exclude in the table in the report.
    Could be a list or a string separated by comma.
  • defaults (ns): The default parameters for envs.stats.
    • on: Default: Sample.
      The column name in the data for the stats.
      The column can be either continuous or not.
    • subset: An R expression to subset the data.
      If you want to keep the distinct records, you can use !duplicated(<col>).
    • group: The column name in the data for the group ids.
      If not provided, all records will be regarded as one group.
    • na_group (flag): Default: False.
      Whether to include NAs in the group.
    • each: The column in the data to split the analysis in different plots.
    • ncol (type=int): Default: 2.
      The number of columns in the plot when each is not NULL.
    • na_each (flag): Default: False.
      Whether to include NAs in the each column.
    • plot: Type of plot. If on is continuous, it could be boxplot (default), violin, violin+boxplot or histogram.
      If on is not continuous, it could be barplot or pie (default).
    • devpars (ns): The device parameters for the plot.
      • width (type=int): Default: 800.
        The width of the plot.
      • height (type=int): Default: 600.
        The height of the plot.
      • res (type=int): Default: 100.
        The resolution of the plot.
  • stats (type=json): Default: {}.
    The statistics to perform.
    The keys are the case names and the values are the parameters inherited from envs.defaults.
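
The mutaters option described above can be sketched as follows. This is a minimal illustration, not a tested configuration. It assumes a numeric Age column like the one in the example data in this document, and the column names Subject and Timepoint, the values 'Pre' and 'Post', and the new column names AgeGroup and PairedSubject are all hypothetical. It also assumes that mutated columns are available to the cases under envs.stats.

```toml
[SampleInfo.envs]
# Keep the mutated columns in the output file
save_mutated = true

[SampleInfo.envs.mutaters]
# Key = new column name, value = R expression passed to dplyr::mutate
# (AgeGroup is an illustrative name; Age comes from the example data)
AgeGroup = "ifelse(Age >= 60, '60+', '<60')"
# Hypothetical columns: Subject (id) and Timepoint (with values Pre/Post).
# With uniq = FALSE, the returned ids can be used as a column; non-paired ids are NA.
PairedSubject = "paired(., id_col = 'Subject', compare_col = 'Timepoint', idents = c('Pre', 'Post'), uniq = FALSE)"

# Assumption: mutated columns can be used by the statistics
[SampleInfo.envs.stats."N_Samples_per_AgeGroup (bar)"]
on = "Sample"
group = "AgeGroup"
plot = "barplot"
```

The named arguments to paired() follow the argument list given above. Whether your sample information has columns suitable for pairing depends on your own metadata.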


Example data

Sample Age Sex Diagnosis
C1 62 F Colitis
C2 71.2 F Colitis
C3 56.2 M Colitis
C4 61.5 M Colitis
C5 72.8 M Colitis
C6 78.4 M Colitis
C7 61.6 F Colitis
C8 49.5 F Colitis
NC1 43.6 M NoColitis
NC2 68.1 M NoColitis
NC3 70.5 F NoColitis
NC4 63.7 M NoColitis
NC5 58.5 M NoColitis
NC6 49.3 F NoColitis
CT1 21.4 F Control
CT2 61.7 M Control
CT3 50.5 M Control
CT4 43.4 M Control
CT5 70.6 F Control
CT6 44.3 M Control
CT7 50.2 M Control
CT8 61.5 F Control

Count the number of samples per Diagnosis

[SampleInfo.envs.stats."N_Samples_per_Diagnosis (pie)"]
on = "Sample"
group = "Diagnosis"


What if we want a bar plot instead of a pie chart?

[SampleInfo.envs.stats."N_Samples_per_Diagnosis (bar)"]
on = "Sample"
group = "Diagnosis"
plot = "barplot"


Explore Age distribution

The distribution of Age of all samples

[SampleInfo.envs.stats."Age_distribution (boxplot)"]
on = "Age"


How about the distribution of Age in each Diagnosis, and make it violin + boxplot?

[SampleInfo.envs.stats."Age_distribution_per_Diagnosis (violin + boxplot)"]
on = "Age"
group = "Diagnosis"
plot = "violin+boxplot"


How about Age distribution per Sex in each Diagnosis?

[SampleInfo.envs.stats."Age_distribution_per_Sex_in_each_Diagnosis (boxplot)"]
on = "Age"
group = "Sex"
each = "Diagnosis"
plot = "boxplot"
ncol = 3
devpars = {height = 450}
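
When several cases share the same parameters, those parameters can be moved to envs.defaults, since each case inherits from it. The following is a sketch of how two of the cases above might be rewritten that way. It is an alternative to the earlier snippets, not an addition to them, because defining the same case name twice in one TOML file is an error.

```toml
# Shared by every case under envs.stats unless a case overrides it
[SampleInfo.envs.defaults]
group = "Diagnosis"

# "on" is omitted here, so it falls back to the default (Sample)
[SampleInfo.envs.stats."N_Samples_per_Diagnosis (bar)"]
plot = "barplot"

[SampleInfo.envs.stats."Age_distribution_per_Diagnosis (violin + boxplot)"]
on = "Age"
plot = "violin+boxplot"
```

A case that should not be grouped by Diagnosis would need to set its own group value to override the shared default.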
