Get top entities from a data frame based on the number of entities in each group

Usage

top(
  df = ".",
  id,
  n = 10,
  compare = ".n",
  subset = NULL,
  with_ties = FALSE,
  split_by = NULL,
  return_type = c("uids", "ids", "subdf", "df", "interdf")
)

Arguments

df

The data frame. Use "." (the default) if the function is called in a dplyr pipe.

id

The column name in df for the groups.

n

The number of top entities to return. If n < 1, it is treated as a fraction of the total number of entities in each group (after subsetting and splitting are applied). Specify 0 to return all entities.
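The three cases for n can be sketched in base R as follows. `resolve_n()` is a hypothetical helper written for illustration, not a function of the package, and both the rounding rule for fractional n (ceiling) and the cap at the total are assumptions.

```r
# Hypothetical helper, not part of the package: turn `n` into a count.
# Assumption: a fractional `n` is rounded up; the real rule may differ.
resolve_n <- function(n, total) {
  if (n == 0) return(total)              # 0 means "return everything"
  if (n < 1) return(ceiling(n * total))  # fraction of the total
  min(n, total)                          # plain count, capped at the total
}

resolve_n(0.25, 8)  # a quarter of 8 entities
resolve_n(0, 8)     # all 8 entities
resolve_n(10, 8)    # more requested than available
```

With 8 entities, a fraction of 0.25 resolves to 2, which agrees with the n = 0.25 call in the Examples section.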

compare

The column in df used to rank the groups. It can be either a numeric column or .n, which ranks the groups by their number of entities. If a column is passed, its values must be numeric and identical within each group; this is not checked.

subset

An expression to subset the entities, passed to dplyr::filter(). Default is NULL (no filtering).

with_ties

Whether to also return all entities that tie with the last entity in the top list on the compared value. Default is FALSE.
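The effect of with_ties can be illustrated with a small base R sketch. It does not call the function itself; it only mimics the assumed cutoff logic on group sizes, using the same ids as the second example.

```r
# Group sizes: A = 2, B = 3, C = 3, D = 4.
ids <- c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D")
sizes <- vapply(split(ids, ids), length, integer(1))
sizes <- sort(sizes, decreasing = TRUE)

n <- 2
cutoff <- sizes[[n]]  # size of the n-th ranked group

# Assumed behaviour of with_ties = FALSE: exactly n groups are kept.
strict <- names(sizes)[seq_len(n)]

# Assumed behaviour of with_ties = TRUE: every group at least as large
# as the n-th ranked one is kept, so B and C both survive.
tied <- sort(names(sizes)[sizes >= cutoff])
```

Here `tied` contains B, C and D, the same ids the with_ties = TRUE call in the Examples section returns for this data.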

split_by

A column name (without quotes) in df to split the entities by.

return_type

The type of the returned value. Default is uids. It must be one of:

uids: return the unique ids of the selected entities
ids: return the ids of all entities in the same order as in df, where the non-selected ids will be NA
subdf: return a subset of df with the selected entities
df: return the original df with a new logical column .selected to mark the selected entities
interdf: return the intermediate data frame with the id column, <compare>, predicate and the split_by column if provided.
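
The shapes of the first three return types can be mimicked in base R. This is only a sketch under an assumed selection rule (keep the two largest values); `keep` and the other variable names are illustrative and not part of the package.

```r
df <- data.frame(
  id = c("A", "B", "C", "D", "E", "F", "G", "H"),
  value = c(10, 20, 30, 40, 50, 60, 80, 80),
  stringsAsFactors = FALSE
)

# Assumed selection rule for illustration: rows holding the two largest values.
keep <- rank(-df$value, ties.method = "min") <= 2

uids  <- unique(df$id[keep])       # like "uids": only the selected ids
ids   <- ifelse(keep, df$id, NA)   # like "ids": one entry per row of df, NA if not selected
subdf <- df[keep, , drop = FALSE]  # like "subdf": only the rows of the selected entities
```

The ids form has the same length as df, which is what makes it usable as a new column inside dplyr::mutate().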

Value

Depends on return_type: a vector of unique ids for uids, a vector of ids aligned with the rows of df for ids, and a data frame for subdf, df and interdf.

Examples

df <- data.frame(
    id = c("A", "B", "C", "D", "E", "F", "G", "H"),
    value = c(10, 20, 30, 40, 50, 60, 80, 80)
)
top(df, id, n = 1, compare = value, with_ties = TRUE, return_type = "uids")
#> [1] "G" "H"
top(df, "id", n = 2, compare = "value", return_type = "subdf")
#>   id value
#> 1  G    80
#> 2  H    80
top(df, "id", n = 2, compare = "value", return_type = "df")
#>   id value .selected
#> 1  A    10     FALSE
#> 2  B    20     FALSE
#> 3  C    30     FALSE
#> 4  D    40     FALSE
#> 5  E    50     FALSE
#> 6  F    60     FALSE
#> 7  G    80      TRUE
#> 8  H    80      TRUE
top(df, "id", n = 2, compare = "value", return_type = "interdf")
#>   id value predicate
#> 1  A    10     FALSE
#> 2  B    20     FALSE
#> 3  C    30     FALSE
#> 4  D    40     FALSE
#> 5  E    50     FALSE
#> 6  F    60     FALSE
#> 7  G    80      TRUE
#> 8  H    80      TRUE
top(df, id, n = 0.25, compare = value, return_type = "uids")
#> [1] "G" "H"
top(df, id, n = 0, compare = value, return_type = "uids")
#> [1] "A" "B" "C" "D" "E" "F" "G" "H"

df <- data.frame(id = c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"))
top(df, id, n = 2, compare = ".n", return_type = "uids", with_ties = TRUE)
#> [1] "B" "C" "D"
dplyr::mutate(df, selected = top(id = id, n = 2, compare = ".n", return_type = "ids",
  with_ties = TRUE))
#>    id selected
#> 1   A     <NA>
#> 2   A     <NA>
#> 3   B        B
#> 4   B        B
#> 5   B        B
#> 6   C        C
#> 7   C        C
#> 8   C        C
#> 9   D        D
#> 10  D        D
#> 11  D        D
#> 12  D        D