MetaMarkers

Find markers between three or more groups of cells, using one-way ANOVA or Kruskal-Wallis test.

Sometimes, you may want to find the markers for cells from more than 2 groups.
In this case, you can use this process to find the markers for the groups and do enrichment analysis for the markers. Each marker is examined using either one-way ANOVA or Kruskal-Wallis test.
The p values are adjusted using the specified method. The significant markers are then used for enrichment analysis using enrichr api.

Other than the markers and the enrichment analysis as outputs, this process also generates violin plots for the top 10 markers.

Environment Variables

  • ncores (type=int): Default: 1.
    Number of cores to use to parallelize for genes
  • mutaters (type=json): Default: {}.
    The mutaters to mutate the metadata The key-value pairs will be passed the dplyr::mutate() to mutate the metadata.
    There are also also 4 helper functions, expanded, collapsed, emerged and vanished, which can be used to identify the expanded/collpased/emerged/vanished groups (i.e. TCR clones).
    See also https://pwwang.github.io/immunopipe/configurations/#mutater-helpers.
    For example, you can use {"Patient1_Tumor_Collapsed_Clones": "expanded(., Source, 'Tumor', subset = Patent == 'Patient1', uniq = FALSE)"} to create a new column in metadata named Patient1_Tumor_Collapsed_Clones with the collapsed clones in the tumor sample (compared to the normal sample) of patient 1.
    The values in this columns for other clones will be NA.
    Those functions take following arguments:

    • df: The metadata data frame. You can use the . to refer to it.
    • group.by: The column name in metadata to group the cells.
    • idents: The first group or both groups of cells to compare (value in group.by column). If only the first group is given, the rest of the cells (with non-NA in group.by column) will be used as the second group.
    • subset: An expression to subset the cells, will be passed to dplyr::filter(). Default is TRUE (no filtering).
    • each: A column name (without quotes) in metadata to split the cells.
      Each comparison will be done for each value in this column (typically each patient or subject).
    • id: The column name in metadata for the group ids (i.e. CDR3.aa).
    • compare: Either a (numeric) column name (i.e. Clones) in metadata to compare between groups, or .n to compare the number of cells in each group.
      If numeric column is given, the values should be the same for all cells in the same group.
      This will not be checked (only the first value is used).
      It is helpful to use Clones to use the raw clone size from TCR data, in case the cells are not completely mapped to RNA data.
      Also if you have subset set or NAs in group.by column, you should use .n to compare the number of cells in each group.
    • uniq: Whether to return unique ids or not. Default is TRUE. If FALSE, you can mutate the meta data frame with the returned ids. For example, df |> mutate(expanded = expanded(...)).
    • debug: Return the data frame with intermediate columns instead of the ids. Default is FALSE.
    • order: The expression passed to dplyr::arrange() to order intermediate dataframe and get the ids in order accordingly.
      The intermediate dataframe includes the following columns:
    • <id>: The ids of clones (i.e. CDR3.aa).
    • <each>: The values in each column.
    • ident_1: The size of clones in the first group.
    • ident_2: The size of clones in the second group.
    • .diff: The difference between the sizes of clones in the first and second groups.
    • .sum: The sum of the sizes of clones in the first and second groups.
    • .predicate: Showing whether the clone is expanded/collapsed/emerged/vanished.
    • include_emerged: Whether to include the emerged group for expanded (only works for expanded). Default is FALSE.
    • include_vanished: Whether to include the vanished group for collapsed (only works for collapsed). Default is FALSE.

    You can also use top() to get the top clones (i.e. the clones with the largest size) in each group.
    For example, you can use {"Patient1_Top10_Clones": "top(subset = Patent == 'Patient1', uniq = FALSE)"} to create a new column in metadata named Patient1_Top10_Clones.
    The values in this columns for other clones will be NA.
    This function takes following arguments:
    * df: The metadata data frame. You can use the . to refer to it.
    * id: The column name in metadata for the group ids (i.e. CDR3.aa).
    * n: The number of top clones to return. Default is 10.
    If n < 1, it will be treated as the percentage of the size of the group.
    Specify 0 to get all clones.
    * compare: Either a (numeric) column name (i.e. Clones) in metadata to compare between groups, or .n to compare the number of cells in each group.
    If numeric column is given, the values should be the same for all cells in the same group.
    This will not be checked (only the first value is used).
    It is helpful to use Clones to use the raw clone size from TCR data, in case the cells are not completely mapped to RNA data.
    Also if you have subset set or NAs in group.by column, you should use .n to compare the number of cells in each group.
    * subset: An expression to subset the cells, will be passed to dplyr::filter(). Default is TRUE (no filtering).
    * each: A column name (without quotes) in metadata to split the cells.
    Each comparison will be done for each value in this column (typically each patient or subject).
    * uniq: Whether to return unique ids or not. Default is TRUE. If FALSE, you can mutate the meta data frame with the returned ids. For example, df |> mutate(expanded = expanded(...)).
    * debug: Return the data frame with intermediate columns instead of the ids. Default is FALSE.
    * with_ties: Whether to include ties (i.e. clones with the same size as the last clone) or not. Default is FALSE.

  • group-by: The column name in metadata to group the cells.
    If only group-by is specified, and idents are not specified, markers will be found for all groups in this column.
    NA group will be ignored.

  • idents: The groups of cells to compare, values should be in the group-by column.
  • each: The column name in metadata to separate the cells into different cases.
  • prefix_each (flag): Default: True.
    Whether to add the each value as prefix to the case name.
  • dbs (list): Default: ['KEGG_2021_Human', 'MSigDB_Hallmark_2020'].
    The dbs to do enrichment analysis for significant markers See below for all libraries.
    https://maayanlab.cloud/Enrichr/#libraries
  • subset: The subset of the cells to do the analysis.
    An expression passed to dplyr::filter().
  • p_adjust (choice): Default: BH.
    The method to adjust the p values, which can be used to filter the significant markers.
    See also https://rdrr.io/r/stats/p.adjust.html
    • holm: Holm-Bonferroni method
    • hochberg: Hochberg method
    • hommel: Hommel method
    • bonferroni: Bonferroni method
    • BH: Benjamini-Hochberg method
    • BY: Benjamini-Yekutieli method
    • fdr: FDR method of Benjamini-Hochberg
    • none: No adjustment
  • sigmarkers: Default: p_adjust < 0.05.
    An expression passed to dplyr::filter() to filter the significant markers for enrichment analysis. The default is p.value < 0.05.
    If method = 'anova', the variables that can be used for filtering are:
    sumsq, meansq, statistic, p.value and p_adjust.
    If method = 'kruskal', the variables that can be used for filtering are:
    statistic, p.value and p_adjust.
  • section: Default: DEFAULT.
    The section name for the report.
    Worked only when each is not specified.
    Otherwise, the section name will be constructed from each and group-by.
    If DEFAULT, and it's the only section, it not included in the case/section names.
    The section is used to collect cases and put the results under the same directory and the same section in report.
    When each for a case is specified, the section will be ignored and case name will be used as section.
    The cases will be the expanded values in each column. When prefix_each is True, the column name specified by each will be prefixed to each value as directory name and expanded case name.
  • method (choice): Default: anova.
    The method for the test.
    • anova: One-way ANOVA
    • kruskal: Kruskal-Wallis test
  • cases (type=json): Default: {}.
    If you have multiple cases, you can specify them here. The keys are the names of the cases and the values are the above options except ncores and mutaters. If some options are not specified, the default values specified above will be used.
    If no cases are specified, the default case will be added with the default values under envs with the name DEFAULT.