
Programmatic clone selection for TCR/BCR repertoire analysis
Source: R/clone_selectors.R
clone_selectors.Rd
Clone selectors provide a programmatic, expression-based system for filtering and selecting T cell and B cell clones from immune repertoire data. They are the foundation of clone-level analysis in scplotter, enabling flexible clone selection without manual specification of clone IDs.
Clone selectors operate on data frames containing clone abundance information (clone IDs paired with group-level counts or fractions). They evaluate selection criteria — such as abundance thresholds, group comparisons, or shared presence across conditions — and return either the selected clone IDs, a logical indicator vector, or a filtered data frame. The system is context-aware: it automatically detects whether it is being called from within a dplyr pipeline, a scplotter function, or standalone code, and adjusts its default behavior accordingly.
The following selector functions are available:
- top(): select the n largest clones by abundance
- sel(): select clones matching a custom logical expression
- uniq(): select clones unique to a specified group
- shared(): select clones present in all specified groups
- gt(), ge(), lt(), le(), eq(), ne(): comparison-based selection
- and(), or(): combine multiple selector results
Usage
top(
n,
group_by = NULL,
data = NULL,
order = NULL,
id = NULL,
in_form = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
sel(
expr,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
uniq(
group1,
group2,
...,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
shared(
group1,
group2,
...,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
gt(
group1,
group2,
include_zeros = TRUE,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
ge(
group1,
group2,
include_zeros = TRUE,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
lt(
group1,
group2,
include_zeros = TRUE,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
le(
group1,
group2,
include_zeros = TRUE,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
eq(
group1,
group2,
group_by = NULL,
data = NULL,
id = NULL,
top = NULL,
order = NULL,
in_form = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
ne(
group1,
group2,
include_zeros = TRUE,
group_by = NULL,
data = NULL,
id = NULL,
in_form = NULL,
top = NULL,
order = NULL,
within = NULL,
output_within = NULL,
output = NULL
)
and(x, y, ...)
or(x, y, ...)
Arguments
- n
The number of top clones to select or the threshold size.
- group_by
The column names in the metadata used to group the cells. When used in scplotter functions, it defaults to facet_by and split_by from the parent frame. When used in dplyr verbs, it should be a character vector of grouping columns: the first column is used to extract the values (counts) for group1, group2, and ..., and the remaining columns are kept as groupings.
- data
The data frame containing clone information. Default is NULL. If NULL inside a scplotter function, the data is taken from the parent frame. When used in dplyr verbs, it should be the parent data frame passed to the dplyr verb. A typical data should have a column named CloneID and other columns for the groupings. It is expected to be a grouped data frame with the grouping columns; under each grouping column, the value should be the size of the clone.
- order
The order of the clones to select. It can be an expression to order the clones by a specific column. Used in top() and, in the other selectors, only together with the top argument.
- id
The column name that contains the clone ID. Default is "CTaa".
- in_form
The format of the input data, either "long" or "wide". If "long", the data should have one column for the clone IDs and one column for the size. If "wide", the data should have a column for the clone IDs and one size column per group. The default is "long" when used in dplyr verbs and "wide" when used in scplotter functions.
- within
An expression passed to subset the data before applying the selection criteria. Only works for long format. It is useful when you want to select clones based on the criteria within a specific subset of the data. Note that this subsetting is only applied to determine the selection of clones, not to the final output. So if a cell belongs to a clone that is selected based on the subsetted data, it will be included in the final output, even if it does not meet the within criteria. If you want the clones returned to also meet the within criteria, you can set output_within to TRUE, which will return the clones that meet both the selection criteria and the within criteria.
- output_within
An expression passed to subset the data after applying the selection criteria. Works with both long and wide format. It is useful when you want to return clones that meet both the selection criteria and this additional criterion. If set to TRUE (only works when within is specified), the within criteria will be applied to filter the final output to include only the clones that meet both the selection criteria and the within criteria. If FALSE or NULL (default), the within criteria will only be applied to determine the selection of clones, not to the final output.
- output
There are three options for the output: "id" (or "ids"), "logical" (or "bool", "boolean", "indicator"), and "data" (or "df", "data.frame").
"id" (or "ids"): return a vector with the same length as the input data, with the selected clones' CTaa values (clone IDs) and NA for others. It is useful for adding a new column to the data frame.
"logical" (or "bool", "boolean", "indicator"): return a logical vector indicating whether each clone is selected or not. Same as "id" but with TRUE for selected clones and FALSE for others.
"data" (or "df", "data.frame"): return a subset of the data frame with only the selected clones. This is useful for filtering the data frame to only include the clones that meet the criteria. It is used internally in some other scplotter functions, such as ClonalStatPlot, to select clones. The groupings are also applied, defaulting to facet_by and split_by in the parent frame.
By default, it is NULL, which will return "id" when used in dplyr verbs and "data" when used in scplotter functions.
- expr
The expression (in characters) to filter the clones (e.g. "group1 > group2" to select clones where group1 is larger than group2).
- top
The number of top clones to select based on the expression. If specified, it will select the top N clones that meet the criteria defined by expr and ordered by order. If order is not specified, it will select the top N clones based on the order they appear in the data after filtering by expr.
- group1
The first group to compare.
- group2
The second group to compare.
- ...
Additional groups to compare in uniq() and shared(), or additional vectors to combine in logical operations (and/or).
- include_zeros
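Whether to include clones with zero size in the comparison. If TRUE (default), a clone is selected whenever the comparison holds, even if its size in one of the groups is 0; for example, lt(group1, group2) selects a clone with sizes 0 and 7. If FALSE, both sizes must also be greater than 0, so a clone absent from either group is dropped (see the lt() calls in Examples).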
- x
The first vector to compare in logical operations (and/or).
- y
The second vector to compare in logical operations (and/or).
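The effect of include_zeros can be sketched in base R on the same toy values used in the Examples section. This is only an illustration of the documented behavior, not the package's implementation: lt_sketch() is a hypothetical helper, and it assumes wide-format data with size columns named group1 and group2.

```r
# Toy wide-format clone table (same values as in the Examples section).
clones <- data.frame(
  CTaa   = paste0("AA", 1:10),
  group1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
  group2 = c(7, 3, 8, 2, 1, 5, 9, 4, 6, 0)
)

# Hypothetical helper (not part of scplotter): mimics lt(group1, group2).
lt_sketch <- function(df, include_zeros = TRUE) {
  keep <- df$group1 < df$group2
  if (!include_zeros) {
    # Additionally require the clone to be present in both groups.
    keep <- keep & df$group1 > 0 & df$group2 > 0
  }
  df$CTaa[keep]
}

lt_sketch(clones)                         # "AA1" "AA2" "AA3" "AA7"
lt_sketch(clones, include_zeros = FALSE)  # "AA2" "AA3" "AA7"
```

AA1 (sizes 0 and 7) is kept by default and dropped with include_zeros = FALSE, which matches the lt() output shown in Examples.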
Note
Clone selectors are designed to work with data produced by scRepertoire::combineTCR() or scRepertoire::combineBCR(), where each row represents a cell and clone IDs are stored in a column (default "CTaa").
The within and output_within arguments enable powerful subset-then-select patterns. Use within to restrict the selection pool (e.g., select top clones only among CD8+ T cells), and use output_within to additionally filter the returned results.
When using selectors as expression strings in scplotter functions (e.g., clones = "top(10, group_by = 'Sample')"), the expression is parsed and evaluated in the function's data context, not in the global environment. Group names should be quoted within the string.
Selector functions detect their calling context via rlang::caller_env(). When calling selectors from within nested wrapper functions, the context detection may fail; pass data, in_form, and output explicitly in such cases.
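The subset-then-select pattern described in this note can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's code: top_sketch() is a hypothetical helper, it takes within and output_within as precomputed logical vectors rather than expressions, and it ranks clones by cell count in long-format data.

```r
# Long-format toy data: one row per cell.
cells <- data.frame(
  CTaa   = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"),
  subset = c("S1", "S1", "S2", "S1", "S2", "S2", "S2", "S1", "S1", "S1")
)

# Hypothetical helper (not part of scplotter).
top_sketch <- function(df, n, within = NULL, output_within = NULL) {
  # 1. Restrict the selection pool if `within` is given.
  pool <- if (is.null(within)) df else df[within, , drop = FALSE]
  # 2. Rank clones by size inside the pool and keep the n largest.
  sizes <- sort(table(pool$CTaa), decreasing = TRUE)
  selected <- names(sizes)[seq_len(min(n, length(sizes)))]
  # 3. Label every cell of a selected clone, in or out of the pool.
  hit <- df$CTaa %in% selected
  # 4. Optionally filter the labelled rows afterwards.
  if (!is.null(output_within)) hit <- hit & output_within
  ifelse(hit, df$CTaa, NA_character_)
}

in_s1 <- cells$subset == "S1"
# Within S1 the sizes are C = 3, A = 2, B = 1, so the top 2 are C and A.
sel_only <- top_sketch(cells, 2, within = in_s1)
sel_and_out <- top_sketch(cells, 2, within = in_s1, output_within = in_s1)
```

In sel_only the S2 cell of clone A (row 3) is still labelled "A", because within only shapes the selection; in sel_and_out that row becomes NA, which corresponds to setting output_within = TRUE.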
Usage contexts
Clone selectors adapt their behavior based on the calling environment:
In a dplyr pipeline (e.g., mutate(data, Top5 = top(5))): Returns a character vector of clone IDs (the same length as the input data), with NA for non-selected rows. This is ideal for adding a selection column to a data frame. For functions requiring group1, group2, etc., provide group_by to identify the grouping column; its values become the group names referenced in the selector. Additional grouping columns can be passed via c(group1, group2, ...) to perform selection within each combination of grouping values.
In a scplotter function (e.g., ClonalStatPlot): Returns a filtered data frame containing only the selected clones. Groupings default to facet_by and split_by from the parent frame. This is the mode used when clone selectors are passed as expression strings (e.g., clones = "top(10)").
Standalone: All arguments should be provided explicitly (group_by, output, data) to control behavior and return type.
Output formats
The output parameter controls what the selector returns:
"id" / "ids": A character vector of clone IDs for selected clones, with NA for non-selected rows. Same length as the input data. Default in dplyr pipeline context.
"logical" / "bool" / "boolean" / "indicator": A logical vector (TRUE/FALSE) indicating selection status. Same information as "id" but in Boolean form.
"data" / "df" / "data.frame": A subsetted data frame containing only the rows for selected clones. Default in scplotter function context.
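The relationship between the three output forms can be sketched in base R. The snippet assumes a set of selected clone IDs is already known (the selected vector is made up for illustration) and only shows how the same selection looks in each form; it does not call the selectors themselves.

```r
# Long-format toy data: one row per cell.
cells <- data.frame(
  CTaa = c("AA7", "AA7", "AA9", "AA2", "AA2", "AA2")
)
# Assumed result of some selection step (illustrative only).
selected <- c("AA7", "AA9")

# "logical": TRUE for rows whose clone is selected.
as_logical <- cells$CTaa %in% selected
# "id": the clone ID for selected rows, NA elsewhere (same length as input).
as_id <- ifelse(as_logical, cells$CTaa, NA_character_)
# "data": only the rows belonging to selected clones.
as_data <- cells[as_logical, , drop = FALSE]

as_logical  # TRUE TRUE TRUE FALSE FALSE FALSE
as_id       # "AA7" "AA7" "AA9" NA NA NA
nrow(as_data)  # 3
```

The "id" and "logical" forms keep the input length, so they suit mutate(); the "data" form changes the number of rows, so it suits filtering.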
Compound expressions
Multiple selector results can be combined using and() and or():
and(x, y, ...): Returns clones selected by ALL input selectors (intersection). All input vectors must have the same length and output type (all "id" or all "logical").
or(x, y, ...): Returns clones selected by ANY input selector (union). All input vectors must have the same length and output type.
For example, or(top(3), eq(group1, group2, group_by = "group")) selects
clones that are either in the top 3 OR have equal abundance in group1 and group2.
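The intersection and union semantics can be sketched in base R for two row-aligned selector results. and_sketch() and or_sketch() are hypothetical two-argument helpers written for illustration; they are not the package's and()/or(), and they assume both inputs come from the same data, so a row that is non-NA in both "id" vectors carries the same clone ID.

```r
# Hypothetical helpers (not part of scplotter).
and_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  if (is.logical(x) && is.logical(y)) return(x & y)
  # "id" form: keep the ID only where both selectors selected the row.
  ifelse(!is.na(x) & !is.na(y), x, NA_character_)
}
or_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  if (is.logical(x) && is.logical(y)) return(x | y)
  # "id" form: keep the ID where either selector selected the row.
  ifelse(!is.na(x), x, y)
}

# Two "id"-form results over the same four rows.
top_ids   <- c("AA7", "AA9", NA, NA)
equal_ids <- c(NA, "AA9", "AA6", NA)

and_sketch(top_ids, equal_ids)  # NA "AA9" NA NA
or_sketch(top_ids, equal_ids)   # "AA7" "AA9" "AA6" NA

# The same combination in "logical" form.
and_sketch(!is.na(top_ids), !is.na(equal_ids))  # FALSE TRUE FALSE FALSE
```

Mixing an "id" vector with a "logical" vector is not handled here, in line with the requirement that all inputs share one output type.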
See also
ClonalStatPlot for the primary consumer of clone selectors in expression-string form
ClonalCompositionPlot, ClonalDiversityPlot, ClonalGeneUsagePlot for other clonal analysis functions
scRepertoire::combineTCR() and scRepertoire::combineBCR() for producing compatible input data
Examples
set.seed(8525)
data <- data.frame(
CTaa = c("AA1", "AA2", "AA3", "AA4", "AA5", "AA6", "AA7", "AA8", "AA9", "AA10"),
group1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
group2 = c(7, 3, 8, 2, 1, 5, 9, 4, 6, 0),
groups = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
)
data <- data[order(data$group1 + data$group2, decreasing = TRUE), ]
top(3)
#> CTaa group1 group2 groups
#> 1 AA7 6 9 B
#> 2 AA9 8 6 B
#> 3 AA8 7 4 B
top(3, group_by = "groups")
#> # A tibble: 6 × 4
#> # Groups: groups [2]
#> CTaa group1 group2 groups
#> <chr> <dbl> <dbl> <chr>
#> 1 AA7 6 9 B
#> 2 AA9 8 6 B
#> 3 AA8 7 4 B
#> 4 AA3 2 8 A
#> 5 AA1 0 7 A
#> 6 AA4 3 2 A
sel(group1 == 0 | group2 == 0)
#> CTaa group1 group2 groups
#> 1 AA10 9 0 B
#> 2 AA1 0 7 A
uniq(group1, group2)
#> CTaa group1 group2 groups
#> 1 AA10 9 0 B
shared(group1, group2)
#> CTaa group1 group2 groups
#> 1 AA7 6 9 B
#> 2 AA9 8 6 B
#> 3 AA8 7 4 B
#> 4 AA3 2 8 A
#> 5 AA6 5 5 B
#> 6 AA4 3 2 A
#> 7 AA5 4 1 B
#> 8 AA2 1 3 A
gt(group1, group2)
#> CTaa group1 group2 groups
#> 1 AA9 8 6 B
#> 2 AA8 7 4 B
#> 3 AA10 9 0 B
#> 4 AA4 3 2 A
#> 5 AA5 4 1 B
lt(group1, group2)
#> CTaa group1 group2 groups
#> 1 AA7 6 9 B
#> 2 AA3 2 8 A
#> 3 AA1 0 7 A
#> 4 AA2 1 3 A
le(group1, group2)
#> CTaa group1 group2 groups
#> 1 AA7 6 9 B
#> 2 AA3 2 8 A
#> 3 AA6 5 5 B
#> 4 AA1 0 7 A
#> 5 AA2 1 3 A
lt(group1, group2, include_zeros = FALSE)
#> CTaa group1 group2 groups
#> 1 AA7 6 9 B
#> 2 AA3 2 8 A
#> 3 AA2 1 3 A
eq(group1, group2)
#> CTaa group1 group2 groups
#> 1 AA6 5 5 B
ne(group1, group2)
#> CTaa group1 group2 groups
#> 1 AA7 6 9 B
#> 2 AA9 8 6 B
#> 3 AA8 7 4 B
#> 4 AA3 2 8 A
#> 5 AA10 9 0 B
#> 6 AA1 0 7 A
#> 7 AA4 3 2 A
#> 8 AA5 4 1 B
#> 9 AA2 1 3 A
# Use them in a dplyr pipeline
data <- tidyr::pivot_longer(data,
cols = c("group1", "group2"),
names_to = "group", values_to = "value"
)
data <- tidyr::uncount(data, !!rlang::sym("value"))
data$subset <- sample(c("S1", "S2"), nrow(data), replace = TRUE)
# Take a glimpse of the data
data[sample(1:nrow(data), 10), ]
#> # A tibble: 10 × 4
#> CTaa groups group subset
#> <chr> <chr> <chr> <chr>
#> 1 AA8 B group2 S1
#> 2 AA8 B group1 S1
#> 3 AA5 B group1 S1
#> 4 AA10 B group1 S1
#> 5 AA7 B group1 S2
#> 6 AA7 B group1 S1
#> 7 AA1 A group2 S2
#> 8 AA1 A group2 S1
#> 9 AA5 B group1 S1
#> 10 AA9 B group1 S2
unique(dplyr::mutate(data, Top3 = top(3))$Top3)
#> [1] "AA7" "AA9" "AA8" NA
# Note that AA9 is also reported in S2, even though the selection is only applied within S1
dplyr::distinct(
dplyr::mutate(data, Top3 = top(3, within = subset == "S1")),
CTaa, subset, Top3
)
#> # A tibble: 19 × 3
#> CTaa subset Top3
#> <chr> <chr> <chr>
#> 1 AA7 S2 AA7
#> 2 AA7 S1 AA7
#> 3 AA9 S1 AA9
#> 4 AA9 S2 AA9
#> 5 AA8 S2 NA
#> 6 AA8 S1 NA
#> 7 AA3 S2 NA
#> 8 AA3 S1 NA
#> 9 AA6 S1 NA
#> 10 AA6 S2 NA
#> 11 AA10 S2 AA10
#> 12 AA10 S1 AA10
#> 13 AA1 S1 NA
#> 14 AA1 S2 NA
#> 15 AA4 S1 NA
#> 16 AA4 S2 NA
#> 17 AA5 S1 NA
#> 18 AA2 S1 NA
#> 19 AA2 S2 NA
# Note that the AA9 cells in S2 are now excluded
dplyr::distinct(
dplyr::mutate(data, Top3 = top(3, within = subset == "S1", output_within = TRUE)),
CTaa, subset, Top3
)
#> # A tibble: 19 × 3
#> CTaa subset Top3
#> <chr> <chr> <chr>
#> 1 AA7 S2 NA
#> 2 AA7 S1 AA7
#> 3 AA9 S1 AA9
#> 4 AA9 S2 NA
#> 5 AA8 S2 NA
#> 6 AA8 S1 NA
#> 7 AA3 S2 NA
#> 8 AA3 S1 NA
#> 9 AA6 S1 NA
#> 10 AA6 S2 NA
#> 11 AA10 S2 NA
#> 12 AA10 S1 AA10
#> 13 AA1 S1 NA
#> 14 AA1 S2 NA
#> 15 AA4 S1 NA
#> 16 AA4 S2 NA
#> 17 AA5 S1 NA
#> 18 AA2 S1 NA
#> 19 AA2 S2 NA
# We can also exclude S1 clones even when the comparison is applied within S1
dplyr::distinct(
dplyr::mutate(data, Top3 = top(3, within = subset == "S1", output_within = subset == "S2")),
CTaa, subset, Top3
)
#> # A tibble: 19 × 3
#> CTaa subset Top3
#> <chr> <chr> <chr>
#> 1 AA7 S2 AA7
#> 2 AA7 S1 NA
#> 3 AA9 S1 NA
#> 4 AA9 S2 AA9
#> 5 AA8 S2 NA
#> 6 AA8 S1 NA
#> 7 AA3 S2 NA
#> 8 AA3 S1 NA
#> 9 AA6 S1 NA
#> 10 AA6 S2 NA
#> 11 AA10 S2 AA10
#> 12 AA10 S1 NA
#> 13 AA1 S1 NA
#> 14 AA1 S2 NA
#> 15 AA4 S1 NA
#> 16 AA4 S2 NA
#> 17 AA5 S1 NA
#> 18 AA2 S1 NA
#> 19 AA2 S2 NA
unique(dplyr::mutate(data, Top3 = top(3, group_by = "groups"))$Top3)
#> [1] "AA7" "AA9" "AA8" "AA3" NA "AA1" "AA4"
unique(dplyr::mutate(data, Unique = sel(group1 == 0 | group2 == 0, group_by = "group"))$Unique)
#> [1] NA "AA10" "AA1"
unique(dplyr::mutate(data, UniqueInG1 = uniq(group1, group2, group_by = "group"))$UniqueInG1)
#> [1] NA "AA10"
unique(dplyr::mutate(data, Shared = shared(group1, group2, group_by = "group"))$Shared)
#> [1] "AA7" "AA9" "AA8" "AA3" "AA6" NA "AA4" "AA5" "AA2"
unique(dplyr::mutate(data, Greater = gt(group1, group2, group_by = "group"))$Greater)
#> [1] NA "AA9" "AA8" "AA10" "AA4" "AA5"
unique(dplyr::mutate(data, Less = lt(group1, group2, group_by = "group"))$Less)
#> [1] "AA7" NA "AA3" "AA1" "AA2"
unique(dplyr::mutate(data, LessEqual = le(group1, group2, group_by = "group"))$LessEqual)
#> [1] "AA7" NA "AA3" "AA6" "AA1" "AA2"
unique(dplyr::mutate(data, GreaterEqual = ge(group1, group2, group_by = "group"))$GreaterEqual)
#> [1] NA "AA9" "AA8" "AA6" "AA10" "AA4" "AA5"
unique(dplyr::mutate(data, Equal = eq(group1, group2, group_by = "group"))$Equal)
#> [1] NA "AA6"
unique(dplyr::mutate(data, NotEqual = ne(group1, group2, group_by = "group"))$NotEqual)
#> [1] "AA7" "AA9" "AA8" "AA3" NA "AA10" "AA1" "AA4" "AA5" "AA2"
# Compound expressions
unique(
dplyr::mutate(data,
Top3OrEqual = or(top(3), eq(group1, group2, group_by = "group"))
)$Top3OrEqual
)
#> [1] "AA7" "AA9" "AA8" NA "AA6"
unique(
dplyr::mutate(data,
SharedAndGreater = and(
shared(group1, group2, group_by = "group"),
gt(group1, group2, group_by = "group")
)
)$SharedAndGreater
)
#> [1] NA "AA9" "AA8" "AA4" "AA5"
dplyr::mutate(data,
SharedAndGreater = and(
shared(group1, group2, group_by = "group", output = "logical"),
gt(group1, group2, group_by = "group", output = "logical")
)
)$SharedAndGreater
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [37] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [73] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [85] TRUE TRUE FALSE FALSE FALSE FALSE