SampleInfo¶
List sample information and perform statistics
This process is the entry point of the pipeline. It simply passes the input file through and lists
the sample information in the report.
To specify the input file in the configuration file, use the following:
[SampleInfo.in]
infile = [ "path/to/sample_info.txt" ]
Or, with pipen-board, find the SampleInfo process and click the Edit button.
You can then specify the input file there.
Theoretically, multiple input files are possible. However, this has not been tested yet.
If you have multiple input files to run, please run each with a separate pipeline
instance (configuration file).
For the content of the input file, please see details here.
Once the pipeline is finished, you can see the sample information in the report.
Note that the required RNAData and TCRData columns are not shown in the report.
They are used to specify the paths of the scRNA-seq and scTCR-seq data, respectively.
You may also compute statistics on the sample information, for example,
the number of samples per group. See the next section for details.
Tip
This is the start process of the pipeline. Once you change the parameters for
this process, the whole pipeline will be re-run.
If you only want to change the parameters for the statistics and reuse the
cached (previous) results for the other processes, set cache at the pipeline
level to "force" so that the pipeline uses the cached results, and set cache
of SampleInfo to false so that only the SampleInfo process is re-run:
cache = "force"
[SampleInfo]
cache = false
Environment Variables¶
- sep: Default: \t (a tab). The separator of the input file.
- mutaters (type=json): Default: {}. A dict of mutaters to mutate the data frame. The key is the column name and the value is the R expression to mutate the column. The dict will be transformed to a list in R and passed to dplyr::mutate. You may also use paired() to identify paired samples. The function takes the following arguments:
    - df: The data frame. Use . if the function is called in a dplyr pipe.
    - id_col: The column name in df for the ids to be returned in the final output.
    - compare_col: The column name in df to compare the values for each id in id_col.
    - idents: The values in compare_col to compare. It could be either an integer or a vector. If it is an integer, the number of values in compare_col must be the same as the integer for the id to be regarded as paired. If it is a vector, the values in compare_col must be the same as the values in idents for the id to be regarded as paired.
    - uniq: Whether to return unique ids or not. Default is TRUE. If FALSE, you can mutate the meta data frame with the returned ids. Non-paired ids will be NA.
- save_mutated (flag): Default: False. Whether to save the mutated columns.
- exclude_cols: Default: TCRData,RNAData. The columns to exclude from the table in the report. Could be a list or a comma-separated string.
- defaults (ns): The default parameters for envs.stats.
    - on: Default: Sample. The column name in the data for the stats. The column could be either continuous or not.
    - subset: An R expression to subset the data. If you want to keep the distinct records, you can use !duplicated(<col>).
    - group: The column name in the data for the group ids. If not provided, all records will be regarded as one group.
    - na_group (flag): Default: False. Whether to include NAs in the group.
    - each: The column in the data used to split the analysis into different plots.
    - ncol (type=int): Default: 2. The number of columns in the plot when each is not NULL.
    - na_each (flag): Default: False. Whether to include NAs in the each column.
    - plot: Type of plot. If on is continuous, it could be boxplot (default), violin, violin+boxplot or histogram. If on is not continuous, it could be barplot or pie (default).
    - devpars (ns): The device parameters for the plot.
        - width (type=int): Default: 800. The width of the plot.
        - height (type=int): Default: 600. The height of the plot.
        - res (type=int): Default: 100. The resolution of the plot.
- stats (type=json): Default: {}. The statistics to perform. The keys are the case names and the values are the parameters inherited from envs.defaults.
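As a minimal sketch of how mutaters might look in the configuration file (an illustration, not taken from the original documentation): the AgeGroup entry assumes a numeric Age column such as the one in the example data, while Patient and Timepoint (with values Pre and Post) are hypothetical columns used only to illustrate paired(), called with the argument names listed for it in this section.

```toml
[SampleInfo.envs.mutaters]
# key = new column name, value = R expression passed to dplyr::mutate
AgeGroup = "ifelse(Age >= 60, '>=60', '<60')"

# Hypothetical columns: Patient (id) and Timepoint (Pre/Post).
# With uniq = FALSE the result has one id per row; non-paired ids are NA.
PairedPatient = "paired(., id_col = 'Patient', compare_col = 'Timepoint', idents = c('Pre', 'Post'), uniq = FALSE)"
```

The intent is that a mutated column such as AgeGroup can then be referenced by on, group or each in a stats case; this is an assumption worth checking against your installed version.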
Examples¶
Example data¶
Sample | Age | Sex | Diagnosis |
---|---|---|---|
C1 | 62 | F | Colitis |
C2 | 71.2 | F | Colitis |
C3 | 56.2 | M | Colitis |
C4 | 61.5 | M | Colitis |
C5 | 72.8 | M | Colitis |
C6 | 78.4 | M | Colitis |
C7 | 61.6 | F | Colitis |
C8 | 49.5 | F | Colitis |
NC1 | 43.6 | M | NoColitis |
NC2 | 68.1 | M | NoColitis |
NC3 | 70.5 | F | NoColitis |
NC4 | 63.7 | M | NoColitis |
NC5 | 58.5 | M | NoColitis |
NC6 | 49.3 | F | NoColitis |
CT1 | 21.4 | F | Control |
CT2 | 61.7 | M | Control |
CT3 | 50.5 | M | Control |
CT4 | 43.4 | M | Control |
CT5 | 70.6 | F | Control |
CT6 | 44.3 | M | Control |
CT7 | 50.2 | M | Control |
CT8 | 61.5 | F | Control |
Count the number of samples per Diagnosis¶
[SampleInfo.envs.stats."N_Samples_per_Diagnosis (pie)"]
on = "Sample"
group = "Diagnosis"
What if we want a bar plot instead of a pie chart?
[SampleInfo.envs.stats."N_Samples_per_Diagnosis (bar)"]
on = "Sample"
group = "Diagnosis"
plot = "barplot"
Explore Age distribution¶
The distribution of Age across all samples:
[SampleInfo.envs.stats."Age_distribution (boxplot)"]
on = "Age"
How about the distribution of Age in each Diagnosis, and make it violin + boxplot?
[SampleInfo.envs.stats."Age_distribution_per_Diagnosis (violin + boxplot)"]
on = "Age"
group = "Diagnosis"
plot = "violin+boxplot"
How about Age distribution per Sex in each Diagnosis?
[SampleInfo.envs.stats."Age_distribution_per_Sex_in_each_Diagnosis (boxplot)"]
on = "Age"
group = "Sex"
each = "Diagnosis"
plot = "boxplot"
ncol = 3
devpars = {height = 450}
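How about restricting a plot to a subset of samples, and sharing device settings across all cases? The following is an illustrative sketch on the same example data, not one of the original examples: it assumes that values placed under envs.defaults are inherited by every stats case (as described for envs.stats) and that subset accepts any R expression over the columns. The case name and the subset expression are chosen here for illustration only.

```toml
# Shared device settings, inherited by every case under SampleInfo.envs.stats
[SampleInfo.envs.defaults.devpars]
width = 1000
height = 450
res = 100

# Age histogram restricted to the non-control samples
[SampleInfo.envs.stats."Age_distribution_without_Control (histogram)"]
on = "Age"
subset = "Diagnosis != 'Control'"
plot = "histogram"
```

A case can still override the shared settings by specifying its own devpars, as the previous example does with height.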