MarkersFinder¶
Find markers between different groups of cells
`MarkersFinder` is a process that wraps the `Seurat::FindMarkers()` function and performs enrichment analysis for the markers found.
Environment Variables¶
- `ncores` (type=int): Default: `1`.
  Number of cores to use for parallel computing for some Seurat procedures.
  - Used in `future::plan(strategy = "multicore", workers = <ncores>)` to parallelize some Seurat procedures.
  - See also: https://satijalab.org/seurat/articles/future_vignette.html
- `mutaters` (type=json): Default: `{}`.
  The mutaters to mutate the metadata.
  There are also 4 helper functions, `expanded`, `collapsed`, `emerged` and `vanished`, which can be used to identify the expanded/collapsed/emerged/vanished groups (e.g. TCR clones).
  See also https://pwwang.github.io/immunopipe/configurations/#mutater-helpers.
  For example, you can use `{"Patient1_Tumor_Collapsed_Clones": "collapsed(., Source, 'Tumor', subset = Patient == 'Patient1', uniq = FALSE)"}` to create a new column in metadata named `Patient1_Tumor_Collapsed_Clones` with the collapsed clones in the tumor sample (compared to the normal sample) of patient 1.
  The values in this column for other clones will be `NA`.
  Those functions take the following arguments:
  - `df`: The metadata data frame. You can use `.` to refer to it.
  - `group.by`: The column name in metadata to group the cells.
  - `idents`: The first group or both groups of cells to compare (values in the `group.by` column). If only the first group is given, the rest of the cells (with non-NA values in the `group.by` column) will be used as the second group.
  - `subset`: An expression to subset the cells; it will be passed to `dplyr::filter()`. Default is `TRUE` (no filtering).
  - `each`: A column name (without quotes) in metadata to split the cells.
    Each comparison will be done for each value in this column (typically each patient or subject).
  - `id`: The column name in metadata for the group ids (e.g. `CDR3.aa`).
  - `compare`: Either a (numeric) column name (e.g. `Clones`) in metadata to compare between groups, or `.n` to compare the number of cells in each group.
    If a numeric column is given, the values should be the same for all cells in the same group.
    This will not be checked (only the first value is used).
    It is helpful to use `Clones` to use the raw clone size from TCR data, in case the cells are not completely mapped to RNA data.
    Also, if you have `subset` set or `NA`s in the `group.by` column, you should use `.n` to compare the number of cells in each group.
  - `uniq`: Whether to return unique ids or not. Default is `TRUE`. If `FALSE`, you can mutate the metadata data frame with the returned ids. For example, `df |> mutate(expanded = expanded(...))`.
  - `debug`: Return the data frame with intermediate columns instead of the ids. Default is `FALSE`.
  - `order`: The expression passed to `dplyr::arrange()` to order the intermediate data frame and get the ids in order accordingly.
    The intermediate data frame includes the following columns:
    - `<id>`: The ids of clones (e.g. `CDR3.aa`).
    - `<each>`: The values in the `each` column.
    - `ident_1`: The size of clones in the first group.
    - `ident_2`: The size of clones in the second group.
    - `.diff`: The difference between the sizes of clones in the first and second groups.
    - `.sum`: The sum of the sizes of clones in the first and second groups.
    - `.predicate`: Showing whether the clone is expanded/collapsed/emerged/vanished.
  - `include_emerged`: Whether to include the emerged group for `expanded` (only works for `expanded`). Default is `FALSE`.
  - `include_vanished`: Whether to include the vanished group for `collapsed` (only works for `collapsed`). Default is `FALSE`.
  You can also use `top()` to get the top clones (i.e. the clones with the largest size) in each group.
  For example, you can use `{"Patient1_Top10_Clones": "top(subset = Patient == 'Patient1', uniq = FALSE)"}` to create a new column in metadata named `Patient1_Top10_Clones`.
  The values in this column for other clones will be `NA`.
  This function takes the following arguments:
  - `df`: The metadata data frame. You can use `.` to refer to it.
  - `id`: The column name in metadata for the group ids (e.g. `CDR3.aa`).
  - `n`: The number of top clones to return. Default is `10`.
    If n < 1, it will be treated as the percentage of the size of the group.
    Specify `0` to get all clones.
  - `compare`: Either a (numeric) column name (e.g. `Clones`) in metadata to compare between groups, or `.n` to compare the number of cells in each group.
    If a numeric column is given, the values should be the same for all cells in the same group.
    This will not be checked (only the first value is used).
    It is helpful to use `Clones` to use the raw clone size from TCR data, in case the cells are not completely mapped to RNA data.
    Also, if you have `subset` set or `NA`s in the `group.by` column, you should use `.n` to compare the number of cells in each group.
  - `subset`: An expression to subset the cells; it will be passed to `dplyr::filter()`. Default is `TRUE` (no filtering).
  - `each`: A column name (without quotes) in metadata to split the cells.
    Each comparison will be done for each value in this column (typically each patient or subject).
  - `uniq`: Whether to return unique ids or not. Default is `TRUE`. If `FALSE`, you can mutate the metadata data frame with the returned ids. For example, `df |> mutate(expanded = expanded(...))`.
  - `debug`: Return the data frame with intermediate columns instead of the ids. Default is `FALSE`.
  - `with_ties`: Whether to include ties (i.e. clones with the same size as the last clone) or not. Default is `FALSE`.
  See also mutating the metadata.
- `ident-1`: The first group of cells to compare.
- `ident-2`: The second group of cells to compare. If not provided, the rest of the cells are used for `ident-2`.
- `group-by`: Default: `seurat_clusters`.
  The column name in metadata to group the cells.
  If only `group-by` is specified, and `ident-1` and `ident-2` are not specified, markers will be found for all groups in this column in the manner of "group vs rest" comparison.
  The `NA` group will be ignored.
- `each`: The column name in metadata to separate the cells into different cases.
- `prefix_each` (flag): Default: `True`.
  Whether to prefix the `each` column name to the value as the case/section name.
- `prefix_group` (flag): Default: `True`.
  When neither `ident-1` nor `ident-2` is specified, whether to prefix the group name to the section name.
- `dbs` (list): Default: `['KEGG_2021_Human', 'MSigDB_Hallmark_2020']`.
  The databases to use for enrichment analysis of the significant markers. See below for all libraries.
  https://maayanlab.cloud/Enrichr/#libraries
- `sigmarkers`: Default: `p_val_adj < 0.05`.
  An expression passed to `dplyr::filter()` to filter the significant markers for enrichment analysis.
  Available variables are `p_val`, `avg_log2FC`, `pct.1`, `pct.2` and `p_val_adj`. For example, `"p_val_adj < 0.05 & abs(avg_log2FC) > 1"` selects markers with adjusted p-value < 0.05 and absolute log2 fold change > 1.
- `assay`: The assay to use.
- `volcano_genes` (type=auto): Default: `True`.
  The genes to label in the volcano plot if they are significant markers.
  If `True`, all significant markers will be labeled. If `False`, no genes will be labeled. Otherwise, specify the genes to label.
  It could be either a string with comma-separated genes, or a list of genes.
- `section`: Default: `DEFAULT`.
  The section name for the report. It must not contain a colon (`:`).
  Ignored when `each` is not specified and `ident-1` is specified.
  When neither `each` nor `ident-1` is specified, the case name will be used as the section name.
  If `each` is specified, the section name will be constructed from `each` and the case name.
  The `section` is used to collect cases and put the results under the same directory and the same section in the report.
  When `each` for a case is specified, the `section` will be ignored and the case name will be used as the `section`.
  The cases will be the expanded values in the `each` column. When `prefix_each` is `True`, the column name specified by `each` will be prefixed to each value as the directory name and expanded case name.
- `subset`: An expression to subset the cells for each case.
- `rest` (ns): Rest arguments for `Seurat::FindMarkers()`.
  Use `-` to replace `.` in the argument name. For example, use `min-pct` instead of `min.pct`.
  This only works when `use_presto` is `False`.
- `dotplot` (ns): Arguments for `Seurat::DotPlot()`.
  Use `-` to replace `.` in the argument name. For example, use `dot-scale` instead of `dot.scale`.
  Note that `object`, `features`, and `group-by` are already specified by this process, so you don't need to specify them here.
  - `maxgenes` (type=int): Default: `20`.
    The maximum number of genes to plot.
  - `devpars` (ns): The device parameters for the plots.
    - `res` (type=int): The resolution of the plots.
    - `height` (type=int): The height of the plots.
    - `width` (type=int): The width of the plots.
  - `<more>`: See https://satijalab.org/seurat/reference/dotplot
- `cases` (type=json): Default: `{}`.
  If you have multiple cases, you can specify them here. The keys are the names of the cases and the values are the above options except `ncores` and `mutaters`. If some options are not specified, the default values specified above will be used.
  If no cases are specified, the default case will be added with the default values under `envs` with the name `DEFAULT`.
- `overlap_defaults` (ns): The default options for overlapping analysis.
  - `venn` (ns): The options for the Venn diagram.
    A Venn diagram can only be plotted for sections with no more than 4 cases.
    - `devpars` (ns): The device parameters for the plots.
      - `res` (type=int): Default: `100`.
        The resolution of the plots.
      - `height` (type=int): Default: `600`.
        The height of the plots.
      - `width` (type=int): Default: `1000`.
        The width of the plots.
  - `upset` (ns): The options for the UpSet plot.
    - `devpars` (ns): The device parameters for the plots.
      - `res` (type=int): Default: `100`.
        The resolution of the plots.
      - `height` (type=int): Default: `600`.
        The height of the plots.
      - `width` (type=int): Default: `800`.
        The width of the plots.
- `overlap` (json): Default: `{}`.
  The sections to do overlapping analysis for, including the Venn diagram and the UpSet plot. Both are plotted for the overlap of significant markers between different cases.
  The keys of this option are the names of the sections. The values are dicts of options with keys `venn` and `upset`; values will be inherited from `envs.overlap_defaults`, recursively.
  You can set `envs.overlap.<section>.venn` to `False`/`None` to disable the Venn diagram for the section.
  It works when `each` is specified. In such a case, the sections will be the case names.
  This does not work for the cases where `ident-1` is not specified. If you want to do such analysis for those cases, you should enumerate the idents in different cases and specify them here.
- `cache` (type=auto): Default: `/tmp`.
  Where to cache the `FindAllMarkers` results.
  If `True`, cache to the `outdir` of the job. If `False`, don't cache.
  Otherwise, specify the directory to cache to.
  Only works when `use_presto` is `False` (presto works fast enough).
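The options above can be combined in a single configuration. The following is only a sketch under stated assumptions: the process is configured under the name `MarkersFinder`, the metadata has `Source` and `Patient` columns (hypothetical names), the mutated column is available when `subset` is evaluated, and `subset` accepts an R expression over the metadata columns. The `min-pct` value is illustrative, not a recommendation.

```toml
[MarkersFinder.envs]
# Compare clusters ("cluster vs rest") ...
group-by = "seurat_clusters"
# ... but only among cells whose clone is flagged by the mutater below
# (assumes the mutated column exists when `subset` is evaluated).
subset = "!is.na(Patient1_Tumor_Expanded_Clones)"

[MarkersFinder.envs.mutaters]
# `Source` and `Patient` are hypothetical metadata columns.
Patient1_Tumor_Expanded_Clones = "expanded(., Source, 'Tumor', subset = Patient == 'Patient1', uniq = FALSE)"

[MarkersFinder.envs.rest]
# Passed to Seurat::FindMarkers() as `min.pct` (note `-` in place of `.`).
min-pct = 0.1
```

With `uniq = FALSE`, the helper returns one value per cell, which is what makes it usable as a new metadata column.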
Examples¶
The examples are for more general use of `MarkersFinder`, in order to demonstrate how the final cases are constructed.
Suppose we have a metadata like this:
id | seurat_clusters | Group |
---|---|---|
1 | 1 | A |
2 | 1 | A |
3 | 2 | A |
4 | 2 | A |
5 | 3 | B |
6 | 3 | B |
7 | 4 | B |
8 | 4 | B |
Default¶
By default, `group-by` is `seurat_clusters`, and `ident-1` and `ident-2` are not specified. So markers will be found for all clusters in the manner of "cluster vs rest" comparison.
- Cluster
  - 1 (vs 2, 3, 4)
  - 2 (vs 1, 3, 4)
  - 3 (vs 1, 2, 4)
  - 4 (vs 1, 2, 3)
Each case will have the markers and the enrichment analysis for the
markers as the results.
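As a sketch of how the default comparison might be tuned, the fragment below keeps the "cluster vs rest" setup and only tightens which markers go into enrichment analysis. It assumes the process is configured under the name `MarkersFinder`; the option names and values are the ones documented under Environment Variables, written with TOML's lowercase booleans.

```toml
[MarkersFinder.envs]
group-by = "seurat_clusters"
# Only markers passing this filter are used for enrichment analysis.
sigmarkers = "p_val_adj < 0.05 & abs(avg_log2FC) > 1"
# Enrichr libraries to test against (these are the documented defaults).
dbs = ["KEGG_2021_Human", "MSigDB_Hallmark_2020"]
# Do not label any genes in the volcano plot.
volcano_genes = false
```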
With `each` group¶
`each` is used to separate the cells into different cases. `group-by` is still `seurat_clusters`.
[<Proc>.envs]
group-by = "seurat_clusters"
each = "Group"
- A:Cluster
  - 1 (vs 2)
  - 2 (vs 1)
- B:Cluster
  - 3 (vs 4)
  - 4 (vs 3)
With `ident-1` only¶
`ident-1` is used to specify the first group of cells to compare. Then the rest of the cells in the case are used for `ident-2`.
[<Proc>.envs]
group-by = "seurat_clusters"
ident-1 = "1"
- Cluster
  - 1 (vs 2, 3, 4)
With both `ident-1` and `ident-2`¶
`ident-1` and `ident-2` are used to specify the two groups of cells to compare.
[<Proc>.envs]
group-by = "seurat_clusters"
ident-1 = "1"
ident-2 = "2"
- Cluster
  - 1 (vs 2)
Multiple cases¶
[<Proc>.envs.cases]
c1_vs_c2 = {ident-1 = "1", ident-2 = "2"}
c3_vs_c4 = {ident-1 = "3", ident-2 = "4"}
- DEFAULT:c1_vs_c2
  - 1 (vs 2)
- DEFAULT:c3_vs_c4
  - 3 (vs 4)
The `DEFAULT` section name will be ignored in the report. You can specify a section name other than `DEFAULT` for each case to group them in the report.
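A possible way to group the two cases under a shared section and request overlap analysis for it is sketched below. The section name `Pairwise` is hypothetical, the process is assumed to be configured under the name `MarkersFinder`, and the fragment assumes that `section` is honored for cases defined this way and that options left unset under `overlap` fall back to `envs.overlap_defaults`.

```toml
[MarkersFinder.envs.cases]
# Both cases share the (hypothetical) section "Pairwise".
c1_vs_c2 = {ident-1 = "1", ident-2 = "2", section = "Pairwise"}
c3_vs_c4 = {ident-1 = "3", ident-2 = "4", section = "Pairwise"}

[MarkersFinder.envs.overlap.Pairwise]
# Disable the Venn diagram for this section; the UpSet plot keeps its defaults.
venn = false
```

Because the section holds only two cases, it stays within the documented limit of four cases for a Venn diagram, so `venn = false` here is a choice rather than a requirement.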