SampleInfo

List sample information and perform statistics

This process is the entry point of the pipeline. It simply passes the input file through and lists the sample information in the report.

To specify the input file in the configuration file, use the following:

[SampleInfo.in]
infile = [ "path/to/sample_info.txt" ]

Or with pipen-board, find the SampleInfo process and click the Edit button.
Then you can specify the input file here

infile

Theoretically, multiple input files are possible, but this has not been tested yet.
If you have multiple input files, run each of them with a separate pipeline instance (configuration file).

For the content of the input file, please see details here.

Once the pipeline is finished, you can see the sample information in the report

report

Note that the required RNAData and TCRData columns are not shown in the report.
They are used to specify the paths of the scRNA-seq and scTCR-seq data, respectively.
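
Which columns are hidden is controlled by the exclude_cols option described under Environment Variables below. As a minimal sketch, assuming you also want to hide one more column from the report table (Age is used here purely for illustration), the configuration could look like this:

```toml
[SampleInfo.envs]
# The default is "TCRData,RNAData"; a list works as well as a comma-separated string.
# "Age" is only an illustrative extra column to hide.
exclude_cols = ["TCRData", "RNAData", "Age"]
```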

You may also perform some statistics on the sample information, for example, the number of samples per group. See the Environment Variables and Examples sections below for details.

Tip

This is the start process of the pipeline. Once you change the parameters for this process, the whole pipeline will be re-run.

If you only want to change the parameters for the statistics and reuse the cached (previous) results of the other processes, set cache to "force" at the pipeline level, which forces the pipeline to use the cached results, and set cache of SampleInfo to false, which forces only the SampleInfo process to re-run:

cache = "force"

[SampleInfo]
cache = false

Environment Variables

  • sep: Default: .
    The separator of the input file.
  • mutaters (type=json): Default: {}.
    A dict of mutaters to mutate the data frame.
    The key is the column name and the value is the R expression to mutate the column. The dict will be transformed to a list in R and passed to dplyr::mutate.
    You may also use paired() to identify paired samples. The function takes the following arguments:
    • df: The data frame. Use . if the function is called in a dplyr pipe.
    • id_col: The column name in df for the ids to be returned in the final output.
    • compare_col: The column name in df to compare the values for each id in id_col.
    • idents: The values in compare_col to compare. It can be either an integer or a vector. If it is an integer, an id is regarded as paired only when the number of values in compare_col for that id equals the integer. If it is a vector, an id is regarded as paired only when its values in compare_col are the same as the values in idents.
    • uniq: Whether to return unique ids or not. Default is TRUE.
      If FALSE, you can mutate the meta data frame with the returned ids. Non-paired ids will be NA.
  • save_mutated (flag): Default: False.
    Whether to save the mutated columns.
  • exclude_cols: Default: TCRData,RNAData.
    The columns to exclude in the table in the report.
    Could be a list or a string separated by comma.
  • defaults (ns): The default parameters for envs.stats.
    • on: Default: Sample.
      The column name in the data for the stats.
      The column can be either continuous or not.
    • subset: An R expression to subset the data.
      If you want to keep the distinct records, you can use !duplicated(<col>).
    • group: The column name in the data for the group ids.
      If not provided, all records will be regarded as one group.
    • na_group (flag): Default: False.
      Whether to include NAs in the group.
    • each: The column in the data to split the analysis in different plots.
    • ncol (type=int): Default: 2.
      The number of columns in the plot when each is not NULL.
    • na_each (flag): Default: False.
      Whether to include NAs in the each column.
    • plot: Type of plot. If on is continuous, it could be boxplot (default), violin, violin+boxplot or histogram.
      If on is not continuous, it could be barplot or pie (default).
    • devpars (ns): The device parameters for the plot.
      • width (type=int): Default: 800.
        The width of the plot.
      • height (type=int): Default: 600.
        The height of the plot.
      • res (type=int): Default: 100.
        The resolution of the plot.
  • stats (type=json): Default: {}.
    The statistics to perform.
    The keys are the case names and the values are the parameters inherited from envs.defaults.
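
To make the mutaters and paired() options above more concrete, here is a minimal sketch. It assumes the sample sheet has a numeric Age column, and it uses two hypothetical columns, Patient and Timepoint, that are not part of the example data below. The new column names (AgeGroup, PairedPatient) are likewise made up for illustration.

```toml
[SampleInfo.envs]
# Keep the mutated columns (see save_mutated above)
save_mutated = true

[SampleInfo.envs.mutaters]
# Key = new column name, value = R expression passed to dplyr::mutate
AgeGroup = "ifelse(Age >= 60, '60+', '<60')"
# Hypothetical columns: Patient (id_col) and Timepoint (compare_col).
# A patient counts as paired only if it has both 'Pre' and 'Post' records.
# With uniq = FALSE, non-paired ids become NA in the new column.
PairedPatient = "paired(., 'Patient', 'Timepoint', c('Pre', 'Post'), uniq = FALSE)"
```

Passing an integer as idents instead, for example paired(., 'Patient', 'Timepoint', 2, uniq = FALSE), would follow the integer rule described above: an id is paired when it has exactly that number of values in compare_col.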

Examples

Example data

Sample Age Sex Diagnosis
C1 62 F Colitis
C2 71.2 F Colitis
C3 56.2 M Colitis
C4 61.5 M Colitis
C5 72.8 M Colitis
C6 78.4 M Colitis
C7 61.6 F Colitis
C8 49.5 F Colitis
NC1 43.6 M NoColitis
NC2 68.1 M NoColitis
NC3 70.5 F NoColitis
NC4 63.7 M NoColitis
NC5 58.5 M NoColitis
NC6 49.3 F NoColitis
CT1 21.4 F Control
CT2 61.7 M Control
CT3 50.5 M Control
CT4 43.4 M Control
CT5 70.6 F Control
CT6 44.3 M Control
CT7 50.2 M Control
CT8 61.5 F Control

Count the number of samples per Diagnosis

[SampleInfo.envs.stats."N_Samples_per_Diagnosis (pie)"]
on = "Sample"
group = "Diagnosis"

Samples_Diagnosis

What if we want a bar plot instead of a pie chart?

[SampleInfo.envs.stats."N_Samples_per_Diagnosis (bar)"]
on = "Sample"
group = "Diagnosis"
plot = "barplot"

Samples_Diagnosis_bar

Explore Age distribution

The distribution of Age of all samples

[SampleInfo.envs.stats."Age_distribution (boxplot)"]
on = "Age"

Age_distribution

How about the distribution of Age in each Diagnosis, and make it violin + boxplot?

[SampleInfo.envs.stats."Age_distribution_per_Diagnosis (violin + boxplot)"]
on = "Age"
group = "Diagnosis"
plot = "violin+boxplot"

Age_distribution_per_Diagnosis

How about Age distribution per Sex in each Diagnosis?

[SampleInfo.envs.stats."Age_distribution_per_Sex_in_each_Diagnosis (boxplot)"]
on = "Age"
group = "Sex"
each = "Diagnosis"
plot = "boxplot"
ncol = 3
devpars = {height = 450}

Age_distribution_per_Sex_in_each_Diagnosis
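
Share settings across cases and subset the data

The cases above repeat some settings. As a sketch of how envs.defaults and subset could be combined (the case names are made up, and the resulting plots are not shown here), settings placed in envs.defaults are inherited by every case in envs.stats, and each case may override them or add its own:

```toml
# Inherited by all cases in envs.stats
[SampleInfo.envs.defaults]
each = "Diagnosis"
ncol = 3

# Uses each and ncol from envs.defaults
[SampleInfo.envs.stats."N_Samples_per_Sex_in_each_Diagnosis (bar)"]
on = "Sample"
group = "Sex"
plot = "barplot"

# Also uses the defaults, but keeps only the female samples;
# subset is an R expression evaluated on the sample information
[SampleInfo.envs.stats."Age_distribution_of_females_in_each_Diagnosis (histogram)"]
on = "Age"
subset = "Sex == 'F'"
plot = "histogram"
```

Note that envs.defaults applies to all cases, including the ones defined earlier, so only put settings there that every case should share.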