ScFGSEA¶
Gene set enrichment analysis for cells in different groups using fgsea
This process allows us to do Gene Set Enrichment Analysis (GSEA) on the expression data,
but based on variaties of grouping, including the from the meta data and the
scTCR-seq data as well.
The GSEA is done using the
fgsea package,
which allows to quickly and accurately calculate arbitrarily low GSEA P-values
for a collection of gene sets.
The fgsea package is based on the fast algorithm for preranked GSEA described in
Subramanian et al. 2005.
For each case, the process will generate a table with the enrichment scores for
each gene set, and GSEA plots for the top gene sets.
Environment Variables¶
ncores
(type=int
): Default:1
.
Number of cores for parallelization Passed tonproc
offgseaMultilevel()
.-
mutaters
(type=json
): Default:{}
.
The mutaters to mutate the metadata.
The key-value pairs will be passed thedplyr::mutate()
to mutate the metadata.
There are also also 4 helper functions,expanded
,collapsed
,emerged
andvanished
, which can be used to identify the expanded/collpased/emerged/vanished groups (i.e. TCR clones).
See also https://pwwang.github.io/immunopipe/configurations/#mutater-helpers.
For example, you can use{"Patient1_Tumor_Collapsed_Clones": "expanded(., Source, 'Tumor', subset = Patent == 'Patient1', uniq = FALSE)"}
to create a new column in metadata namedPatient1_Tumor_Collapsed_Clones
with the collapsed clones in the tumor sample (compared to the normal sample) of patient 1.
The values in this columns for other clones will beNA
.
Those functions take following arguments:df
: The metadata data frame. You can use the.
to refer to it.group.by
: The column name in metadata to group the cells.idents
: The first group or both groups of cells to compare (value ingroup.by
column). If only the first group is given, the rest of the cells (with non-NA ingroup.by
column) will be used as the second group.subset
: An expression to subset the cells, will be passed todplyr::filter()
. Default isTRUE
(no filtering).each
: A column name (without quotes) in metadata to split the cells.
Each comparison will be done for each value in this column (typically each patient or subject).id
: The column name in metadata for the group ids (i.e.CDR3.aa
).compare
: Either a (numeric) column name (i.e.Clones
) in metadata to compare between groups, or.n
to compare the number of cells in each group.
If numeric column is given, the values should be the same for all cells in the same group.
This will not be checked (only the first value is used).
It is helpful to useClones
to use the raw clone size from TCR data, in case the cells are not completely mapped to RNA data.
Also if you havesubset
set orNA
s ingroup.by
column, you should use.n
to compare the number of cells in each group.uniq
: Whether to return unique ids or not. Default isTRUE
. IfFALSE
, you can mutate the meta data frame with the returned ids. For example,df |> mutate(expanded = expanded(...))
.debug
: Return the data frame with intermediate columns instead of the ids. Default isFALSE
.order
: The expression passed todplyr::arrange()
to order intermediate dataframe and get the ids in order accordingly.
The intermediate dataframe includes the following columns:<id>
: The ids of clones (i.e.CDR3.aa
).<each>
: The values ineach
column.ident_1
: The size of clones in the first group.ident_2
: The size of clones in the second group..diff
: The difference between the sizes of clones in the first and second groups..sum
: The sum of the sizes of clones in the first and second groups..predicate
: Showing whether the clone is expanded/collapsed/emerged/vanished.include_emerged
: Whether to include the emerged group forexpanded
(only works forexpanded
). Default isFALSE
.include_vanished
: Whether to include the vanished group forcollapsed
(only works forcollapsed
). Default isFALSE
.
You can also use
top()
to get the top clones (i.e. the clones with the largest size) in each group.
For example, you can use{"Patient1_Top10_Clones": "top(subset = Patent == 'Patient1', uniq = FALSE)"}
to create a new column in metadata namedPatient1_Top10_Clones
.
The values in this columns for other clones will beNA
.
This function takes following arguments:
*df
: The metadata data frame. You can use the.
to refer to it.
*id
: The column name in metadata for the group ids (i.e.CDR3.aa
).
*n
: The number of top clones to return. Default is10
.
If n < 1, it will be treated as the percentage of the size of the group.
Specify0
to get all clones.
*compare
: Either a (numeric) column name (i.e.Clones
) in metadata to compare between groups, or.n
to compare the number of cells in each group.
If numeric column is given, the values should be the same for all cells in the same group.
This will not be checked (only the first value is used).
It is helpful to useClones
to use the raw clone size from TCR data, in case the cells are not completely mapped to RNA data.
Also if you havesubset
set orNA
s ingroup.by
column, you should use.n
to compare the number of cells in each group.
*subset
: An expression to subset the cells, will be passed todplyr::filter()
. Default isTRUE
(no filtering).
*each
: A column name (without quotes) in metadata to split the cells.
Each comparison will be done for each value in this column (typically each patient or subject).
*uniq
: Whether to return unique ids or not. Default isTRUE
. IfFALSE
, you can mutate the meta data frame with the returned ids. For example,df |> mutate(expanded = expanded(...))
.
*debug
: Return the data frame with intermediate columns instead of the ids. Default isFALSE
.
*with_ties
: Whether to include ties (i.e. clones with the same size as the last clone) or not. Default isFALSE
. -
group-by
: The column name in metadata to group the cells. ident-1
: The first group of cells to compareident-2
: The second group of cells to compare, if not provided, the rest of the cells that are notNA
s ingroup-by
column are used forident-2
.each
: The column name in metadata to separate the cells into different subsets to do the analysis.prefix_each
(flag
): Default:True
.
Whether to prefix theeach
column name to the values as the case/section name.subset
: An expression to subset the cells.section
: Default:DEFAULT
.
The section name for the report. Worked only wheneach
is not specified. Otherwise, the section name will be constructed fromeach
and its value.
This allows different cases to be put into the same section in the report.
Thesection
is used to collect cases and put the results under the same directory and the same section in report.
Wheneach
for a case is specified, thesection
will be ignored and case name will be used assection
.
The cases will be the expanded values ineach
column. Whenprefix_each
is True, the column name specified byeach
will be prefixed to each value as directory name and expanded case name.gmtfile
: Default:""
.
The pathways in GMT format, with the gene names/ids in the same format as the seurat object.
One could also use a URL to a GMT file. For example, from https://download.baderlab.org/EM_Genesets/current_release/Human/symbol/Pathways/.method
(choice
): Default:s2n
.
The method to do the preranking.signal_to_noise
: Signal to noise.
The larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a "class marker".s2n
: Alias of signal_to_noise.abs_signal_to_noise
: The absolute value of signal_to_noise.abs_s2n
: Alias of abs_signal_to_noise.t_test
: T test.
Uses the difference of means scaled by the standard deviation and number of samples.ratio_of_classes
: Also referred to as fold change.
Uses the ratio of class means to calculate fold change for natural scale data.diff_of_classes
: Difference of class means.
Uses the difference of class means to calculate fold change for nature scale datalog2_ratio_of_classes
: Log2 ratio of class means.
Uses the log2 ratio of class means to calculate fold change for natural scale data.
This is the recommended statistic for calculating fold change for log scale data.
top
(type=auto
): Default:20
.
Do gsea table and enrich plot for top N pathways.
If it is < 1, will apply it topadj
, selecting pathways withpadj
<top
.eps
(type=float
): Default:0
.
This parameter sets the boundary for calculating the p value.
See https://rdrr.io/bioc/fgsea/man/fgseaMultilevel.htmlminsize
(type=int
): Default:10
.
Minimal size of a gene set to test. All pathways below the threshold are excluded.maxsize
(type=int
): Default:100
.
Maximal size of a gene set to test. All pathways above the threshold are excluded.rest
(type=json;order=98
): Default:{}
.
Rest arguments forfgsea()
See also https://rdrr.io/bioc/fgsea/man/fgseaMultilevel.htmlcases
(type=json;order=99
): Default:{}
.
If you have multiple cases, you can specify them here.
The keys are the names of the cases and the values are the above options exceptmutaters
.
If some options are not specified, the default values specified above will be used.
If no cases are specified, the default case will be added with the nameDEFAULT
.