Get more_than, less_than, emerged, vanished, paired or top entities from a data frame.
mutate-helper-1.Rd
more_than
: Select entities that have more counts in the first group than in the second group.
less_than
: Select entities that have fewer counts in the first group than in the second group.
emerged
: Select entities that have counts in the first group but not in the second group.
vanished
: Select entities that have counts in the second group but not in the first group.
paired
: Select entities that have the same counts in both groups.
top
: Select the top entities from a data frame based on the number of entities in each group.
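The four two-group selectors can be pictured as predicates on per-entity counts. The following is a minimal base R sketch of that idea, not the package's implementation: the `toy` data frame and the `sketch` list are made up for illustration, and the `both` filter assumes the documented default of include_zeros = FALSE.

```r
# Toy data (hypothetical, not from the package): one row per observed entity.
toy <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B", "C", "D"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2")
)

# Count rows per id in each group; absent combinations become 0.
counts <- table(toy$id, factor(toy$group, levels = c("G1", "G2")))
n1 <- counts[, "G1"]  # counts in the first group
n2 <- counts[, "G2"]  # counts in the second group

# Assumption: with include_zeros = FALSE an entity must exist in both groups.
both <- n1 > 0 & n2 > 0

sketch <- list(
  more_than = names(n1)[both & n1 > n2],
  less_than = names(n1)[both & n1 < n2],
  emerged   = names(n1)[n1 > 0 & n2 == 0],
  vanished  = names(n1)[n1 == 0 & n2 > 0]
)
print(sketch)
```

In this toy data each entity falls under exactly one predicate, which makes it easy to see how the selectors partition entities whose counts differ between the groups.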
Usage
more_than(
df = ".",
group_by,
idents,
id,
subset = NULL,
split_by = NULL,
compare = ".n",
return_type = c("uids", "ids", "subdf", "df", "interdf"),
order = "desc(sum)",
include_zeros = FALSE
)
less_than(
df = ".",
group_by,
idents,
id,
subset = NULL,
split_by = NULL,
compare = ".n",
return_type = c("uids", "ids", "subdf", "df", "interdf"),
order = "desc(sum)",
include_zeros = FALSE
)
emerged(
df = ".",
group_by,
idents,
id,
subset = NULL,
split_by = NULL,
compare = ".n",
return_type = c("uids", "ids", "subdf", "df", "interdf"),
order = "desc(sum)",
include_zeros = FALSE
)
vanished(
df = ".",
group_by,
idents,
id,
subset = NULL,
split_by = NULL,
compare = ".n",
return_type = c("uids", "ids", "subdf", "df", "interdf"),
order = "desc(sum)",
include_zeros = FALSE
)
paired(df = ".", id, compare, idents = 2, uniq = TRUE)
top(
df = ".",
id,
n = 10,
compare = ".n",
subset = NULL,
with_ties = FALSE,
split_by = NULL,
return_type = c("uids", "ids", "subdf", "df", "interdf")
)
Arguments
- df
The data frame. Use "." if the function is called in a dplyr pipe.
- group_by
The column name in the data frame to group the entities. It could be a quoted string or a bare variable, and defines the groups of entities for comparison.
- idents
The values in compare to compare. It could be either an integer or a vector. If it is an integer, the number of values in compare must be the same as the integer for the id to be regarded as paired. If it is a vector, the values in compare must be the same as the values in idents for the id to be regarded as paired. In the examples below for more_than, less_than, emerged and vanished, idents holds the two values of group_by to compare, first group first.
- id
The column name in df holding the entity ids, which group the rows.
- subset
An expression to subset the entities, which will be passed to dplyr::filter(). Default is NULL (no filtering).
- split_by
A column name (without quotes) in df to split the entities by.
- compare
The column name in df to compare the values for each group. It could be either a numeric column or .n to compare the number of entities in each group. If a column is passed, the values in the column must be numeric and the same in each group. This won't be checked.
- return_type
The type of the returned value. Default is uids. It could be one of
uids
: return the unique ids of the selected entities
ids
: return the ids of all entities in the same order as in df, where the non-selected ids will be NA
subdf
: return a subset of df with the selected entities
df
: return the original df with a new logical column .selected to mark the selected entities
interdf
: return the intermediate data frame with the id column, <compare>, predicate and the split_by column if provided.
- order
An expression to order the intermediate data frame before returning the final result. Default is desc(sum). It has no effect when return_type is subdf or df.
- include_zeros
Whether more_than and less_than also select entities that have zero counts in the other group. Default is FALSE, meaning that an entity must exist in both groups to be selected.
- uniq
Whether to return unique ids or not. Default is TRUE. If FALSE, one value is returned per row of df, with non-paired ids set to NA, so the result can be used to mutate the data frame.
- n
The number of top entities to return. If n is between 0 and 1, it is treated as a fraction of the total number of entities in each group (after any subsetting or splitting has been applied). Specify 0 to return all entities.
- with_ties
Whether to return all entities with the same size as the last entity in the top list. Default is FALSE.
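How n and with_ties interact is easiest to see on a small vector of entity sizes. The sketch below is an illustration under stated assumptions, not the package's code: `top_sketch` is a hypothetical helper, and rounding a fractional n up with `ceiling()` is an assumption, since the rounding rule is not documented here.

```r
# Hypothetical helper: pick the top ids from a named vector of sizes.
top_sketch <- function(sizes, n, with_ties = FALSE) {
  if (n == 0) return(names(sizes))            # 0 means "all entities"
  if (n < 1) n <- ceiling(n * length(sizes))  # assumption: fractions round up
  n <- min(n, length(sizes))
  ranked <- sort(sizes, decreasing = TRUE)
  if (with_ties) {
    # Keep every entity at least as large as the n-th largest one.
    cutoff <- ranked[[n]]
    names(sizes)[sizes >= cutoff]
  } else {
    names(ranked)[seq_len(n)]
  }
}

ids <- c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D")
sizes <- lengths(split(ids, ids))  # A = 2, B = 3, C = 3, D = 4

top_sketch(sizes, n = 2, with_ties = TRUE)  # "B" "C" "D": B and C tie for 2nd
top_sketch(sizes, n = 0.25)                 # "D": 25% of 4 entities is 1
top_sketch(sizes, n = 0)                    # all four ids
```

With with_ties = FALSE and a tie at the cutoff, which of the tied entities is kept depends on sort order, so that case is not shown.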
Value
Depending on the return_type, the function will return different values.
uids
: a vector of unique ids of the selected entities
ids
: a vector of ids of all entities in the same order as in df, where the non-selected ids will be NA
subdf
: a subset of df with the selected entities
df
: the original df with a new logical column .selected to mark the selected entities
interdf
: the intermediate data frame with the id column, ident_1, ident_2, predicate, sum, and diff, and the split_by column if provided.
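The relationship between the first four return types can be sketched in base R, starting from a set of selected ids. This is only an illustration of the shapes described above: the `toy` data and the `selected` vector are made up, and the intermediate data frame (interdf) is left out because its construction is specific to each function.

```r
toy <- data.frame(
  id    = c("A", "A", "B", "C"),
  group = c("G1", "G2", "G1", "G2")
)
selected <- c("A")  # pretend a predicate selected only entity "A"

hit <- toy$id %in% selected

uids  <- unique(toy$id[hit])          # "uids": unique selected ids
ids   <- ifelse(hit, toy$id, NA)      # "ids": one value per row, NA if not selected
subdf <- toy[hit, , drop = FALSE]     # "subdf": only the selected rows
full  <- cbind(toy, .selected = hit)  # "df": original rows plus a logical flag
```

Because `ids` has one element per row, it is the form that fits inside dplyr::mutate(), as the examples further down do with return_type = "ids".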
For paired: a vector of paired ids (from the id column).
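The two forms of idents accepted by paired (an integer or a vector) can be sketched as follows. `paired_sketch` is a hypothetical base R stand-in written from the argument descriptions, not the package's function; treating any length-one idents as a count is an assumption.

```r
# Hypothetical stand-in for paired(), written from the argument descriptions.
paired_sketch <- function(ids, values, idents = 2, uniq = TRUE) {
  per_id <- split(values, ids)
  ok <- vapply(per_id, function(v) {
    if (length(idents) == 1) {
      length(v) == idents  # integer: the id must have exactly this many values
    } else {
      # vector: the id's values must match idents exactly
      length(v) == length(idents) && setequal(v, idents)
    }
  }, logical(1))
  keep <- names(ok)[ok]
  # uniq = FALSE returns one value per row, NA for non-paired ids
  if (uniq) keep else ifelse(ids %in% keep, ids, NA)
}

ids    <- c("A", "A", "B", "B", "C", "C", "D", "D")
values <- c(1, 2, 1, 1, 1, 2, 1, 2)

paired_sketch(ids, values, 2)                     # "A" "B" "C" "D"
paired_sketch(ids, values, c(1, 2))               # "A" "C" "D"
paired_sketch(ids, values, c(1, 2), uniq = FALSE) # "A" "A" NA NA "C" "C" "D" "D"
```

Entity B has two values, so it passes the integer form, but its values are 1 and 1 rather than 1 and 2, so it fails the vector form.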
Examples
df <- data.frame(
id = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2", "G1", "G2"),
count = rep(1, 10),
split = c("S1", "S2", "S1", "S1", "S2", "S1", "S1", "S2", "S1", "S2")
)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "uids")
#> character(0)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = ".n",
return_type = "uids")
#> [1] "A"
more_than(df, group_by = group, split_by = split, idents = c("G1", "G2"), id = id,
compare = count, return_type = "ids")
#> [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "subdf")
#> [1] id group count split
#> <0 rows> (or 0-length row.names)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "ids")
#> [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "interdf")
#> # A tibble: 4 × 6
#> id ident_1 ident_2 predicate sum diff
#> <chr> <dbl> <dbl> <lgl> <dbl> <dbl>
#> 1 A 1 1 FALSE 2 0
#> 2 B 1 1 FALSE 2 0
#> 3 C 1 1 FALSE 2 0
#> 4 D 1 1 FALSE 2 0
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "df")
#> id group count split .selected
#> 1 A G1 1 S1 FALSE
#> 2 A G1 1 S2 FALSE
#> 3 A G2 1 S1 FALSE
#> 4 B G1 1 S1 FALSE
#> 5 B G2 1 S2 FALSE
#> 6 B G2 1 S1 FALSE
#> 7 C G1 1 S1 FALSE
#> 8 C G2 1 S2 FALSE
#> 9 D G1 1 S1 FALSE
#> 10 D G2 1 S2 FALSE
more_than(df, group_by = group, idents = c("G1", "G2"), id = id,
return_type = "uids", subset = id %in% c("A", "B"))
#> [1] "A"
dplyr::mutate(df, selected = more_than(group_by = group, idents = c("G1", "G2"),
id = id, compare = count, return_type = "ids"))
#> id group count split selected
#> 1 A G1 1 S1 <NA>
#> 2 A G1 1 S2 <NA>
#> 3 A G2 1 S1 <NA>
#> 4 B G1 1 S1 <NA>
#> 5 B G2 1 S2 <NA>
#> 6 B G2 1 S1 <NA>
#> 7 C G1 1 S1 <NA>
#> 8 C G2 1 S2 <NA>
#> 9 D G1 1 S1 <NA>
#> 10 D G2 1 S2 <NA>
less_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "uids")
#> character(0)
emerged(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "uids", order = sum)
#> character(0)
vanished(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
return_type = "uids")
#> character(0)
df <- data.frame(
id = c("A", "A", "B", "B", "C", "C", "D", "D"),
compare = c(1, 2, 1, 1, 1, 2, 1, 2)
)
paired(df, id, compare, 2)
#> [1] "A" "B" "C" "D"
paired(df, id, compare, c(1, 2))
#> [1] "A" "C" "D"
paired(df, id, compare, c(1, 2), uniq = FALSE)
#> [1] "A" "A" NA NA "C" "C" "D" "D"
df <- data.frame(
id = c("A", "B", "C", "D", "E", "F", "G", "H"),
value = c(10, 20, 30, 40, 50, 60, 80, 80)
)
top(df, id, n = 1, compare = value, with_ties = TRUE, return_type = "uids")
#> [1] "G" "H"
top(df, "id", n = 2, compare = "value", return_type = "subdf")
#> id value
#> 1 G 80
#> 2 H 80
top(df, "id", n = 2, compare = "value", return_type = "df")
#> id value .selected
#> 1 A 10 FALSE
#> 2 B 20 FALSE
#> 3 C 30 FALSE
#> 4 D 40 FALSE
#> 5 E 50 FALSE
#> 6 F 60 FALSE
#> 7 G 80 TRUE
#> 8 H 80 TRUE
top(df, "id", n = 2, compare = "value", return_type = "interdf")
#> id value predicate
#> 1 A 10 FALSE
#> 2 B 20 FALSE
#> 3 C 30 FALSE
#> 4 D 40 FALSE
#> 5 E 50 FALSE
#> 6 F 60 FALSE
#> 7 G 80 TRUE
#> 8 H 80 TRUE
top(df, id, n = 0.25, compare = value, return_type = "uids")
#> [1] "G" "H"
top(df, id, n = 0, compare = value, return_type = "uids")
#> [1] "A" "B" "C" "D" "E" "F" "G" "H"
df <- data.frame(id = c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"))
top(df, id, n = 2, compare = ".n", return_type = "uids", with_ties = TRUE)
#> [1] "B" "C" "D"
dplyr::mutate(df, selected = top(id = id, n = 2, compare = ".n", return_type = "ids",
with_ties = TRUE))
#> id selected
#> 1 A <NA>
#> 2 A <NA>
#> 3 B B
#> 4 B B
#> 5 B B
#> 6 C C
#> 7 C C
#> 8 C C
#> 9 D D
#> 10 D D
#> 11 D D
#> 12 D D