Get more_than, less_than, emerged, vanished, paired or top entities from a data frame.

more_than: Select entities that have more counts in the first group than in the second group.
less_than: Select entities that have fewer counts in the first group than in the second group.
emerged: Select entities that have counts in the first group but not in the second group.
vanished: Select entities that have counts in the second group but not in the first group.
paired: Select entities that have the same counts in both groups.
top: Select the top entities from a data frame based on the number of entities in each group.

Usage

more_than(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

less_than(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

emerged(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

vanished(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

paired(df = ".", id, compare, idents = 2, uniq = TRUE)

top(
  df = ".",
  id,
  n = 10,
  compare = ".n",
  subset = NULL,
  with_ties = FALSE,
  split_by = NULL,
  return_type = c("uids", "ids", "subdf", "df", "interdf")
)

Arguments

df

The data frame. Use . if the function is called in a dplyr pipe.

group_by

The column name in the data frame to group the entities. It could be a quoted string or a bare variable, and defines the groups of entities for comparison.

idents

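The groups or values to compare. For more_than, less_than, emerged and vanished, these are the two values of group_by to compare, given as first group then second group (for example c("G1", "G2")). For paired, these are the values in compare to pair on, given either as an integer or as a vector. If it is an integer, the number of values in compare for an id must equal that integer for the id to be regarded as paired. If it is a vector, the values in compare for an id must be the same as the values in idents for the id to be regarded as paired.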

id

The column name in df that holds the entity ids. Rows that share an id are treated as one entity.

subset

An expression to subset the entities, passed to dplyr::filter(). Default is NULL (no filtering).

split_by

A column name (without quotes) in df to split the entities.

compare

The column name in df to compare the values for each group. It could be either a numeric column or .n to compare the number of entities in each group. If a column is passed, the values in the column must be numeric and the same in each group. This won't be checked.

return_type

The type of the returned value. Default is uids. It could be one of

uids: return the unique ids of the selected entities
ids: return the ids of all entities in the same order as in df, where the non-selected ids will be NA
subdf: return a subset of df with the selected entities
df: return the original df with a new logical column .selected to mark the selected entities
interdf: return the intermediate data frame with the id column, <compare>, predicate and the split_by column if provided.

order

An expression to order the intermediate data frame before returning the final result. Default is desc(sum). It has no effect when return_type is subdf or df.

include_zeros

Whether to include the zero entities in the other group for more_than and less_than comparisons. Default is FALSE. By default, the zero entities will be excluded, meaning that the entities must exist in both groups to be selected.
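
As a rough illustration of what compare = ".n" and include_zeros describe, the base R sketch below counts rows per entity and group with table() and applies the selection rules by hand. The data frame toy and the names counts, g1, g2, strict, loose and emerged_ids are made up for this illustration. It is one reading of the descriptions above, not the package's implementation.

```r
# Hypothetical toy data: "E" has rows only in G1, so its G2 count is zero.
toy <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B", "E", "E"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G1"),
  stringsAsFactors = FALSE
)

# Rows per entity and group; this mirrors what compare = ".n" counts.
counts <- table(toy$id, toy$group)
g1 <- counts[, "G1"]  # first group
g2 <- counts[, "G2"]  # second group

# more_than, include_zeros = FALSE: more rows in G1, present in both groups.
strict <- names(g1)[g1 > g2 & g1 > 0 & g2 > 0]

# more_than, include_zeros = TRUE: a zero count in G2 is allowed as well.
loose <- names(g1)[g1 > g2]

# emerged: counts in the first group but none in the second group.
emerged_ids <- names(g1)[g1 > 0 & g2 == 0]

print(strict)       # "A"
print(loose)        # "A" "E"
print(emerged_ids)  # "E"
```

Under this reading, an entity such as "E" is reported by emerged, and by more_than only when include_zeros = TRUE.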

uniq

Whether to return unique ids or not. Default is TRUE. If FALSE, one id is returned per row of df, with NA for non-paired ids, so the result can be used to mutate the data frame.
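
The two pairing rules described for idents, and the effect of uniq, can be sketched in base R. This is a hand-written approximation that assumes "the same values" means set equality. The names vals, by_count, by_values and per_row are illustrative and not part of the package.

```r
df <- data.frame(
  id = c("A", "A", "B", "B", "C", "C", "D", "D"),
  compare = c(1, 2, 1, 1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Collect the compare values of each id.
vals <- split(df$compare, df$id)

# Integer rule (idents = 2): an id is paired if it has exactly 2 values.
by_count <- names(vals)[lengths(vals) == 2]

# Vector rule (idents = c(1, 2)): an id is paired if its values match idents.
# Set equality is an assumption made for this sketch.
by_values <- names(vals)[
  vapply(vals, function(v) setequal(v, c(1, 2)), logical(1))
]

# uniq = FALSE: one entry per row of df, NA for non-paired ids.
per_row <- ifelse(df$id %in% by_values, df$id, NA)

print(by_count)   # "A" "B" "C" "D"
print(by_values)  # "A" "C" "D"
print(per_row)    # "A" "A" NA NA "C" "C" "D" "D"
```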

n

The number of top entities to return. If n < 1, it is treated as a fraction of the total number of entities in each group (after subsetting or each applied). Specify 0 to return all entities.

with_ties

Whether to return all entities with the same size as the last entity in the top list. Default is FALSE.
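
How n and with_ties interact can be sketched for a single group in base R. The helper pick_top below is hypothetical and only mirrors the descriptions above. In particular, rounding a fractional n up with ceiling() is an assumption, and split_by and subset are ignored.

```r
value <- c(A = 10, B = 20, C = 30, D = 40, E = 50, F = 60, G = 80, H = 80)

# Hypothetical helper: pick the top entries of a named numeric vector.
pick_top <- function(x, n, with_ties = FALSE) {
  if (n == 0) return(names(x))             # n = 0 returns every entity
  if (n < 1) n <- ceiling(n * length(x))   # fraction of entities (assumed rounding)
  ranked <- x[order(-x)]                   # largest first, ties keep input order
  if (with_ties) {
    # Keep everything at least as large as the n-th ranked value.
    names(ranked)[ranked >= ranked[[n]]]
  } else {
    names(ranked)[seq_len(n)]
  }
}

print(pick_top(value, n = 1))                    # "G"
print(pick_top(value, n = 1, with_ties = TRUE))  # "G" "H"
print(pick_top(value, n = 0.25))                 # "G" "H"
print(pick_top(value, n = 0))                    # all eight ids
```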

Value

For more_than, less_than, emerged and vanished, the returned value depends on return_type:

uids: a vector of unique ids of the selected entities
ids: a vector of ids of all entities in the same order as in df, where the non-selected ids will be NA
subdf: a subset of df with the selected entities
df: the original df with a new logical column .selected to mark the selected entities
interdf: the intermediate data frame with the id column, ident_1, ident_2, predicate, sum, and diff and the split_by column if provided.
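
To make the relationship between these shapes concrete, the base R sketch below derives the uids, ids, subdf and df forms from a fixed set of selected ids. The vector selected is hard-coded here as an assumption (it matches the "A" returned with compare = ".n" in the examples), and the code only imitates the described outputs rather than calling the package.

```r
df <- data.frame(
  id = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2", "G1", "G2"),
  stringsAsFactors = FALSE
)

# Assumed selection result, hard-coded for illustration.
selected <- "A"
keep <- df$id %in% selected

# uids: unique ids of the selected entities.
uids <- unique(df$id[keep])

# ids: one entry per row of df, NA where the row is not selected.
ids <- ifelse(keep, df$id, NA)

# subdf: only the rows of the selected entities.
subdf <- df[keep, , drop = FALSE]

# df: all rows plus a logical .selected column.
flagged <- df
flagged$.selected <- keep

print(uids)
print(ids)
print(nrow(subdf))
print(table(flagged$.selected))
```

The ids form has the same length as df, which is why it can be used inside dplyr::mutate().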

For paired, a vector of paired ids (from the id column).

For top, the returned value depends on return_type in the same way.

Examples

df <- data.frame(
    id = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
    group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2", "G1", "G2"),
    count = rep(1, 10),
    split = c("S1", "S2", "S1", "S1", "S2", "S1", "S1", "S2", "S1", "S2")
)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = ".n",
  return_type = "uids")
#> [1] "A"
more_than(df, group_by = group, split_by = split, idents = c("G1", "G2"), id = id,
  compare = count, return_type = "ids")
#>  [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "subdf")
#> [1] id    group count split
#> <0 rows> (or 0-length row.names)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "ids")
#>  [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "interdf")
#> # A tibble: 4 × 6
#>   id    ident_1 ident_2 predicate   sum  diff
#>   <chr>   <dbl>   <dbl> <lgl>     <dbl> <dbl>
#> 1 A           1       1 FALSE         2     0
#> 2 B           1       1 FALSE         2     0
#> 3 C           1       1 FALSE         2     0
#> 4 D           1       1 FALSE         2     0
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "df")
#>    id group count split .selected
#> 1   A    G1     1    S1     FALSE
#> 2   A    G1     1    S2     FALSE
#> 3   A    G2     1    S1     FALSE
#> 4   B    G1     1    S1     FALSE
#> 5   B    G2     1    S2     FALSE
#> 6   B    G2     1    S1     FALSE
#> 7   C    G1     1    S1     FALSE
#> 8   C    G2     1    S2     FALSE
#> 9   D    G1     1    S1     FALSE
#> 10  D    G2     1    S2     FALSE
more_than(df, group_by = group, idents = c("G1", "G2"), id = id,
  return_type = "uids", subset = id %in% c("A", "B"))
#> [1] "A"
dplyr::mutate(df, selected = more_than(group_by = group, idents = c("G1", "G2"),
  id = id, compare = count, return_type = "ids"))
#>    id group count split selected
#> 1   A    G1     1    S1     <NA>
#> 2   A    G1     1    S2     <NA>
#> 3   A    G2     1    S1     <NA>
#> 4   B    G1     1    S1     <NA>
#> 5   B    G2     1    S2     <NA>
#> 6   B    G2     1    S1     <NA>
#> 7   C    G1     1    S1     <NA>
#> 8   C    G2     1    S2     <NA>
#> 9   D    G1     1    S1     <NA>
#> 10  D    G2     1    S2     <NA>
less_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)
emerged(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids", order = sum)
#> character(0)
vanished(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)
df <- data.frame(
    id = c("A", "A", "B", "B", "C", "C", "D", "D"),
    compare = c(1, 2, 1, 1, 1, 2, 1, 2)
)
paired(df, id, compare, 2)
#> [1] "A" "B" "C" "D"
paired(df, id, compare, c(1, 2))
#> [1] "A" "C" "D"
paired(df, id, compare, c(1, 2), uniq = FALSE)
#> [1] "A" "A" NA  NA  "C" "C" "D" "D"
df <- data.frame(
    id = c("A", "B", "C", "D", "E", "F", "G", "H"),
    value = c(10, 20, 30, 40, 50, 60, 80, 80)
)
top(df, id, n = 1, compare = value, with_ties = TRUE, return_type = "uids")
#> [1] "G" "H"
top(df, "id", n = 2, compare = "value", return_type = "subdf")
#>   id value
#> 1  G    80
#> 2  H    80
top(df, "id", n = 2, compare = "value", return_type = "df")
#>   id value .selected
#> 1  A    10     FALSE
#> 2  B    20     FALSE
#> 3  C    30     FALSE
#> 4  D    40     FALSE
#> 5  E    50     FALSE
#> 6  F    60     FALSE
#> 7  G    80      TRUE
#> 8  H    80      TRUE
top(df, "id", n = 2, compare = "value", return_type = "interdf")
#>   id value predicate
#> 1  A    10     FALSE
#> 2  B    20     FALSE
#> 3  C    30     FALSE
#> 4  D    40     FALSE
#> 5  E    50     FALSE
#> 6  F    60     FALSE
#> 7  G    80      TRUE
#> 8  H    80      TRUE
top(df, id, n = 0.25, compare = value, return_type = "uids")
#> [1] "G" "H"
top(df, id, n = 0, compare = value, return_type = "uids")
#> [1] "A" "B" "C" "D" "E" "F" "G" "H"

df <- data.frame(id = c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"))
top(df, id, n = 2, compare = ".n", return_type = "uids", with_ties = TRUE)
#> [1] "B" "C" "D"
dplyr::mutate(df, selected = top(id = id, n = 2, compare = ".n", return_type = "ids",
  with_ties = TRUE))
#>    id selected
#> 1   A     <NA>
#> 2   A     <NA>
#> 3   B        B
#> 4   B        B
#> 5   B        B
#> 6   C        C
#> 7   C        C
#> 8   C        C
#> 9   D        D
#> 10  D        D
#> 11  D        D
#> 12  D        D