  • more_than: Select entities that have more counts in the first group than the second group.

  • less_than: Select entities that have fewer counts in the first group than the second group.

  • emerged: Select entities that have counts in the first group but not in the second group.

  • vanished: Select entities that have counts in the second group but not in the first group.
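The four selectors above can be read as logical predicates on a pair of per-entity counts. The following is a minimal base R sketch of that reading, not the package's implementation. The `counts` table is made-up illustration data, and zero counts are excluded from the `more_than` and `less_than` predicates on the assumption that this mirrors the documented default `include_zeros = FALSE`.

```r
# Hypothetical per-entity counts in two groups (illustration data only)
counts <- data.frame(
  id      = c("A", "B", "C", "D"),
  ident_1 = c(2, 1, 3, 0),
  ident_2 = c(1, 2, 0, 4),
  stringsAsFactors = FALSE
)

# Each selector becomes a logical predicate on the two counts.
# more/less require a non-zero count in both groups (assumed default).
more <- counts$ident_1 > counts$ident_2 & counts$ident_2 > 0
less <- counts$ident_1 < counts$ident_2 & counts$ident_1 > 0
emer <- counts$ident_1 > 0 & counts$ident_2 == 0
vani <- counts$ident_1 == 0 & counts$ident_2 > 0

selected <- list(
  more_than = counts$id[more],
  less_than = counts$id[less],
  emerged   = counts$id[emer],
  vanished  = counts$id[vani]
)
print(selected)
```

In this toy table each entity falls under exactly one selector, which makes the four definitions easy to compare side by side.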

Usage

.size_compare(
  df,
  group_by,
  idents,
  id,
  fun,
  compare = ".n",
  split_by = NULL,
  order = "desc(sum)",
  subset = NULL,
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  include_zeros = FALSE
)

more_than(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

less_than(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

emerged(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

vanished(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

Arguments

df

The data frame

group_by

The column name in the data frame to group the entities. It could be a quoted string or a bare variable, and defines the groups of entities for comparison.

idents

The groups of entities to compare (values in group_by column). Either length 1 (ident_1) or length 2 (ident_1 and ident_2). If length 1, the rest of the cells with non-NA values in group_by will be used as ident_2.
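As a rough illustration of how a length 1 or length 2 `idents` could map rows to the two sides of the comparison, here is a base R sketch. The helper `split_idents()` and the `group` vector are hypothetical and only follow the description above; the package may resolve groups differently internally.

```r
# Hypothetical helper: returns logical masks for ident_1 and ident_2 rows
split_idents <- function(group, idents) {
  not_na <- !is.na(group)
  in_1 <- not_na & group == idents[1]
  in_2 <- if (length(idents) == 1) {
    # length 1: every other non-NA group value acts as ident_2
    not_na & group != idents[1]
  } else {
    not_na & group == idents[2]
  }
  list(ident_1 = which(in_1), ident_2 = which(in_2))
}

group <- c("G1", "G2", "G3", NA, "G1", "G3")

one <- split_idents(group, "G1")            # G1 against the rest
two <- split_idents(group, c("G1", "G2"))   # G1 against G2 only
print(one)
print(two)
```

Rows with `NA` in the grouping column land in neither side in both cases.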

id

The column name in data frame to mark the entities for the same group.

fun

The way to compare between groups. Either "more_than", "less_than", "emerged" or "vanished".

compare

Either a (numeric) column name (e.g. Count) in the data frame to compare between groups, or .n to compare the number (count) of entities in each group. If a column name is given, only the first value among the rows that share the same id is used, so make sure the value is constant within each id and group.
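The difference between the two modes of `compare` can be sketched in base R. This is an approximation of the described behavior, not the package's code; the data frame and its `Count` values are invented for illustration.

```r
# Illustration data: Count is constant within each id and group
df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2"),
  Count = c(5, 5, 2, 1, 4, 4),
  stringsAsFactors = FALSE
)

# compare = ".n": the size is the number of rows per id and group
n_rows <- table(df$id, df$group)

# compare = Count: the size is the first Count value per id and group
first_count <- tapply(df$Count, list(df$id, df$group), function(x) x[1])

print(n_rows)
print(first_count)
```

Here entity `A` has two rows in `G1` but a `Count` of 5 there, so the two modes can rank the same entity very differently.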

split_by

A column name in data frame to split the entities. Each comparison will be done for each split in this column.
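One way to picture a per-split comparison is to count rows per id and group separately within each value of the split column. The base R sketch below uses the example data from this page; how the package treats ids that are absent from a split is not stated here, so filling them with zero counts is an assumption.

```r
df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2", "G1", "G2"),
  split = c("S1", "S2", "S1", "S1", "S2", "S1", "S1", "S2", "S1", "S2"),
  stringsAsFactors = FALSE
)

# Build one table of counts per split value
per_split <- lapply(split(df, df$split), function(part) {
  # fix the factor levels so both groups appear even when one is empty
  n <- table(part$id, factor(part$group, levels = c("G1", "G2")))
  data.frame(
    id      = rownames(n),
    ident_1 = as.vector(n[, "G1"]),
    ident_2 = as.vector(n[, "G2"]),
    stringsAsFactors = FALSE
  )
})
print(per_split)
```

Any of the four predicates can then be evaluated independently on each element of `per_split`.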

order

An expression to order the intermediate data frame before returning the final result. Default is "desc(sum)". It has no effect when return_type is subdf or df.

subset

An expression to subset the cells, will be passed to dplyr::filter(). Default is NULL (no filtering).

return_type

The type of the returned value. Default is uids. It could be one of

  • uids: return the unique ids of the selected entities

  • ids: return the ids of all entities in the same order as in df, where the non-selected ids will be NA

  • subdf: return a subset of df with the selected entities

  • df: return the original df with a new logical column .selected to mark the selected entities

  • interdf: return the intermediate data frame with the id column, ident_1, ident_2, predicate, sum, diff and the split_by column if provided.
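To make the first four return shapes concrete, the base R sketch below derives each of them from a set of selected ids. The `selected` vector is assumed to be the outcome of some comparison, and the code is an illustration of the descriptions above rather than the package's implementation.

```r
df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2"),
  stringsAsFactors = FALSE
)
selected <- "A"  # assumed result of a comparison

# uids: the unique selected ids
uids <- selected

# ids: one value per row of df, NA where the row's id is not selected
ids <- ifelse(df$id %in% selected, df$id, NA)

# subdf: only the rows belonging to selected ids
subdf <- df[df$id %in% selected, , drop = FALSE]

# df: all rows, plus a logical .selected column
full <- df
full$.selected <- df$id %in% selected

print(ids)
print(full)
```

The `ids` shape has the same length as `df`, which is what makes it suitable as a new column in a `dplyr::mutate()` call.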

include_zeros

Whether to include the zero entities in the other group for more_than and less_than comparisons. Default is FALSE. By default, the zero entities will be excluded, meaning that the entities must exist in both groups to be selected.
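The effect of `include_zeros` on a `more_than` comparison can be sketched as follows. `more_than_sketch()` is a hypothetical stand-in written for this illustration, and the counts are made up; it only encodes the rule described above that, by default, an entity must exist in both groups.

```r
# Illustration counts: B is absent from the second group
counts <- data.frame(
  id      = c("A", "B", "C"),
  ident_1 = c(3, 2, 1),
  ident_2 = c(1, 0, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not a package function
more_than_sketch <- function(counts, include_zeros = FALSE) {
  keep <- counts$ident_1 > counts$ident_2
  if (!include_zeros) {
    # require the entity to be present in both groups
    keep <- keep & counts$ident_1 > 0 & counts$ident_2 > 0
  }
  counts$id[keep]
}

strict  <- more_than_sketch(counts)
relaxed <- more_than_sketch(counts, include_zeros = TRUE)
print(strict)
print(relaxed)
```

Under this reading, entities that are entirely missing from the other group are left to `emerged` and `vanished` unless `include_zeros = TRUE` is requested.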

Value

Depending on the return_type, the function will return different values.

  • uids: a vector of unique ids of the selected entities

  • ids: a vector of ids of all entities in the same order as in df, where the non-selected ids will be NA

  • subdf: a subset of df with the selected entities

  • df: the original df with a new logical column .selected to mark the selected entities

  • interdf: the intermediate data frame with the id column, ident_1, ident_2, predicate, sum, diff and the split_by column if provided.
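The columns of the intermediate data frame can be approximated in base R as shown below, using the example data from this page with row counts as the size. The definitions of `sum` as `ident_1 + ident_2` and `diff` as `ident_1 - ident_2` are assumptions, as is the use of a simple greater-than predicate; the sketch orders rows by descending `sum` to mimic the default `order = "desc(sum)"`.

```r
df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2", "G1", "G2"),
  stringsAsFactors = FALSE
)

# Rows per id and group, as with compare = ".n"
n <- table(df$id, df$group)

inter <- data.frame(
  id      = rownames(n),
  ident_1 = as.vector(n[, "G1"]),
  ident_2 = as.vector(n[, "G2"]),
  stringsAsFactors = FALSE
)
inter$predicate <- inter$ident_1 > inter$ident_2  # more_than-style test
inter$sum  <- inter$ident_1 + inter$ident_2       # assumed definition
inter$diff <- inter$ident_1 - inter$ident_2       # assumed sign convention

# Order by descending sum; ties keep their original order
inter <- inter[order(-inter$sum), ]
print(inter)
```

Reading off the ids where `predicate` is `TRUE` gives the `uids` result for the same comparison.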

Examples

df <- data.frame(
    id = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
    group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2", "G1", "G2"),
    count = rep(1, 10),
    split = c("S1", "S2", "S1", "S1", "S2", "S1", "S1", "S2", "S1", "S2")
)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = ".n",
  return_type = "uids")
#> [1] "A"
more_than(df, group_by = group, split_by = split, idents = c("G1", "G2"), id = id,
  compare = count, return_type = "ids")
#>  [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "subdf")
#> [1] id    group count split
#> <0 rows> (or 0-length row.names)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "ids")
#>  [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "interdf")
#> # A tibble: 4 × 6
#>   id    ident_1 ident_2 predicate   sum  diff
#>   <chr>   <dbl>   <dbl> <lgl>     <dbl> <dbl>
#> 1 A           1       1 FALSE         2     0
#> 2 B           1       1 FALSE         2     0
#> 3 C           1       1 FALSE         2     0
#> 4 D           1       1 FALSE         2     0
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "df")
#>    id group count split .selected
#> 1   A    G1     1    S1     FALSE
#> 2   A    G1     1    S2     FALSE
#> 3   A    G2     1    S1     FALSE
#> 4   B    G1     1    S1     FALSE
#> 5   B    G2     1    S2     FALSE
#> 6   B    G2     1    S1     FALSE
#> 7   C    G1     1    S1     FALSE
#> 8   C    G2     1    S2     FALSE
#> 9   D    G1     1    S1     FALSE
#> 10  D    G2     1    S2     FALSE
more_than(df, group_by = group, idents = c("G1", "G2"), id = id,
  return_type = "uids", subset = id %in% c("A", "B"))
#> [1] "A"
dplyr::mutate(df, selected = more_than(group_by = group, idents = c("G1", "G2"),
  id = id, compare = count, return_type = "ids"))
#>    id group count split selected
#> 1   A    G1     1    S1     <NA>
#> 2   A    G1     1    S2     <NA>
#> 3   A    G2     1    S1     <NA>
#> 4   B    G1     1    S1     <NA>
#> 5   B    G2     1    S2     <NA>
#> 6   B    G2     1    S1     <NA>
#> 7   C    G1     1    S1     <NA>
#> 8   C    G2     1    S2     <NA>
#> 9   D    G1     1    S1     <NA>
#> 10  D    G2     1    S2     <NA>
less_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)
emerged(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids", order = sum)
#> character(0)
vanished(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)