Get more_than, less_than, emerged or vanished entities from a data frame.
more_than: Select entities that have more counts in the first group than the second group.
less_than: Select entities that have fewer counts in the first group than the second group.
emerged: Select entities that have counts in the first group but not in the second group.
vanished: Select entities that have counts in the second group but not in the first group.
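As a rough illustration of the four comparisons, the base R sketch below counts rows per entity in two groups and applies each predicate by hand. It is not the package's implementation: the data frame toy and the variables n1, n2 and both are hypothetical names introduced for this example, and requiring an entity to be present in both groups for the more/less comparisons is an assumption based on the default include_zeros = FALSE.

```r
# Toy data (hypothetical): one row per observation of an entity in a group
toy <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B", "C", "D"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2"),
  stringsAsFactors = FALSE
)

# Number of rows per entity (table rows) and group (table columns)
counts <- table(toy$id, toy$group)
n1 <- counts[, "G1"]  # counts in the first group
n2 <- counts[, "G2"]  # counts in the second group

# Hand-written versions of the four predicates
both <- n1 > 0 & n2 > 0                        # present in both groups
more_ids     <- names(n1)[both & n1 > n2]      # more in G1 than in G2
less_ids     <- names(n1)[both & n1 < n2]      # fewer in G1 than in G2
emerged_ids  <- names(n1)[n1 > 0 & n2 == 0]    # in G1 but not in G2
vanished_ids <- names(n1)[n1 == 0 & n2 > 0]    # in G2 but not in G1
```

With this toy data, each of the four entities falls under exactly one predicate: A under more, B under less, C under emerged and D under vanished.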
Usage
.size_compare(
  df,
  group_by,
  idents,
  id,
  fun,
  compare = ".n",
  split_by = NULL,
  order = "desc(sum)",
  subset = NULL,
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  include_zeros = FALSE
)

more_than(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

less_than(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

emerged(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)

vanished(
  df = ".",
  group_by,
  idents,
  id,
  subset = NULL,
  split_by = NULL,
  compare = ".n",
  return_type = c("uids", "ids", "subdf", "df", "interdf"),
  order = "desc(sum)",
  include_zeros = FALSE
)
Arguments
- df
The data frame
- group_by
The column name in the data frame to group the entities. It could be a quoted string or a bare variable, and defines the groups of entities for comparison.
- idents
The groups of entities to compare (values in the group_by column). Either length 1 (ident_1) or length 2 (ident_1 and ident_2). If length 1, the rest of the cells with non-NA values in group_by will be used as ident_2.
- id
The column name in data frame to mark the entities for the same group.
- fun
The way to compare between groups. Either "more_than", "less_than", "emerged" or "vanished".
- compare
Either a (numeric) column name (e.g. Count) in the data frame to compare between groups, or .n to compare the number (count) of entities in each group. If a column name is given, only the first value of the entities from the same id will be used, so make sure that the values are the same for each group (id).
- split_by
A column name in data frame to split the entities. Each comparison will be done for each split in this column.
- order
An expression to order the intermediate data frame before returning the final result. Default is "desc(sum)". It has no effect when return_type is subdf or df.
- subset
An expression to subset the cells; it will be passed to dplyr::filter(). Default is NULL (no filtering).
- return_type
The type of the returned value. Default is uids. It could be one of:
uids: return the unique ids of the selected entities
ids: return the ids of all entities in the same order as in df, where the non-selected ids will be NA
subdf: return a subset of df with the selected entities
df: return the original df with a new logical column .selected to mark the selected entities
interdf: return the intermediate data frame with the id column, ident_1, ident_2, predicate, sum, diff and the split_by column if provided.
- include_zeros
Whether to include the zero entities in the other group for more_than and less_than comparisons. Default is FALSE. By default, the zero entities will be excluded, meaning that the entities must exist in both groups to be selected.
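The effect of include_zeros can be pictured with a small base R sketch. This is one reading of the description above, not the package's code: the function more_than_sketch and the count vectors n1 and n2 are hypothetical, and the sketch assumes that "zero entities" means entities with a count of zero in one of the two groups.

```r
# Hypothetical per-entity counts in the first (n1) and second (n2) group
n1 <- c(A = 2, B = 1, C = 1, D = 0)
n2 <- c(A = 1, B = 2, C = 0, D = 1)

# Hand-written "more than" predicate with an include_zeros switch
more_than_sketch <- function(n1, n2, include_zeros = FALSE) {
  keep <- n1 > n2
  if (!include_zeros) {
    # Require the entity to be present in both groups
    keep <- keep & n1 > 0 & n2 > 0
  }
  names(n1)[keep]
}

strict  <- more_than_sketch(n1, n2)                        # "A"
relaxed <- more_than_sketch(n1, n2, include_zeros = TRUE)  # "A" "C"
```

Under this reading, entity C (present only in the first group) is picked up by the more-than comparison only when zeros are included; with the default it is left out.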
Value
Depending on the return_type, the function will return different values.
uids: a vector of unique ids of the selected entities
ids: a vector of ids of all entities in the same order as in df, where the non-selected ids will be NA
subdf: a subset of df with the selected entities
df: the original df with a new logical column .selected to mark the selected entities
interdf: the intermediate data frame with the id column, ident_1, ident_2, predicate, sum, diff and the split_by column if provided.
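To make the relationship between the return types concrete, the base R sketch below derives the ids, df and subdf shapes from a given set of selected unique ids. It only mimics the shapes described above and is not how the package computes them; toy, uids, with_flag and subdf are hypothetical names for this example.

```r
# Toy data (hypothetical) and a pretend selection result
toy <- data.frame(
  id    = c("A", "A", "B", "C"),
  group = c("G1", "G2", "G1", "G2"),
  stringsAsFactors = FALSE
)
uids <- "A"  # suppose entity A was selected

# "ids" shape: same length and order as the rows of toy, NA when not selected
ids <- ifelse(toy$id %in% uids, toy$id, NA)

# "df" shape: the original data frame plus a logical .selected column
with_flag <- cbind(toy, .selected = toy$id %in% uids)

# "subdf" shape: only the rows that belong to selected entities
subdf <- toy[toy$id %in% uids, ]
```

The ids shape is the one that fits inside dplyr::mutate(), since it has one element per row of the input.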
Examples
df <- data.frame(
  id = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  group = c("G1", "G1", "G2", "G1", "G2", "G2", "G1", "G2", "G1", "G2"),
  count = rep(1, 10),
  split = c("S1", "S2", "S1", "S1", "S2", "S1", "S1", "S2", "S1", "S2")
)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = ".n",
  return_type = "uids")
#> [1] "A"
more_than(df, group_by = group, split_by = split, idents = c("G1", "G2"), id = id,
  compare = count, return_type = "ids")
#> [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "subdf")
#> [1] id group count split
#> <0 rows> (or 0-length row.names)
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "ids")
#> [1] NA NA NA NA NA NA NA NA NA NA
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "interdf")
#> # A tibble: 4 × 6
#>   id    ident_1 ident_2 predicate   sum  diff
#>   <chr>   <dbl>   <dbl> <lgl>     <dbl> <dbl>
#> 1 A           1       1 FALSE         2     0
#> 2 B           1       1 FALSE         2     0
#> 3 C           1       1 FALSE         2     0
#> 4 D           1       1 FALSE         2     0
more_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "df")
#>    id group count split .selected
#> 1   A    G1     1    S1     FALSE
#> 2   A    G1     1    S2     FALSE
#> 3   A    G2     1    S1     FALSE
#> 4   B    G1     1    S1     FALSE
#> 5   B    G2     1    S2     FALSE
#> 6   B    G2     1    S1     FALSE
#> 7   C    G1     1    S1     FALSE
#> 8   C    G2     1    S2     FALSE
#> 9   D    G1     1    S1     FALSE
#> 10  D    G2     1    S2     FALSE
more_than(df, group_by = group, idents = c("G1", "G2"), id = id,
  return_type = "uids", subset = id %in% c("A", "B"))
#> [1] "A"
dplyr::mutate(df, selected = more_than(group_by = group, idents = c("G1", "G2"),
  id = id, compare = count, return_type = "ids"))
#>    id group count split selected
#> 1   A    G1     1    S1     <NA>
#> 2   A    G1     1    S2     <NA>
#> 3   A    G2     1    S1     <NA>
#> 4   B    G1     1    S1     <NA>
#> 5   B    G2     1    S2     <NA>
#> 6   B    G2     1    S1     <NA>
#> 7   C    G1     1    S1     <NA>
#> 8   C    G2     1    S2     <NA>
#> 9   D    G1     1    S1     <NA>
#> 10  D    G2     1    S2     <NA>
less_than(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)
emerged(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids", order = sum)
#> character(0)
vanished(df, group_by = group, idents = c("G1", "G2"), id = id, compare = count,
  return_type = "uids")
#> character(0)