chop
%run nb_helpers.py
from datar.all import *
nb_header(chop, unchop)
★ chop
★ unchop
Makes df longer by expanding list-columns so that each element of the list-column gets its own row in the output.
See https://tidyr.tidyverse.org/reference/chop.html
Recycling of size-1 elements may differ from tidyr:
>>> df = tibble(x=[1, [2,3]], y=[[2,3], 1])
>>> df >> unchop([f.x, f.y])
>>> # tibble(x=[1,2,3], y=[2,3,1])
>>> # instead of the following in tidyr
>>> # tibble(x=[1,1,2,3], y=[2,3,1,1])
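The difference can be made concrete with a plain-Python sketch. It does not call datar or tidyr, and `as_list`, `unchop_rowwise` and `unchop_columnwise` are hypothetical helpers. The first recycles size-1 elements within each row, which matches the tidyr result quoted above. The second flattens each column independently, which is only an assumption that happens to reproduce the datar result quoted above.

```python
def as_list(value):
    """Treat a scalar as a list of length 1."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def unchop_rowwise(columns):
    """Recycle size-1 elements within each row (tidyr-style result)."""
    names = list(columns)
    out = {name: [] for name in names}
    for row in zip(*(columns[name] for name in names)):
        cells = [as_list(cell) for cell in row]
        size = max(len(cell) for cell in cells)
        for name, cell in zip(names, cells):
            if len(cell) == 1:
                cell = cell * size  # recycle the single element
            if len(cell) != size:
                raise ValueError(f"cannot recycle {name!r} to size {size}")
            out[name].extend(cell)
    return out


def unchop_columnwise(columns):
    """Flatten each column independently (assumed, for illustration only)."""
    return {
        name: [item for cell in cells for item in as_list(cell)]
        for name, cells in columns.items()
    }


data = {"x": [1, [2, 3]], "y": [[2, 3], 1]}
rowwise = unchop_rowwise(data)        # {'x': [1, 1, 2, 3], 'y': [2, 3, 1, 1]}
columnwise = unchop_columnwise(data)  # {'x': [1, 2, 3], 'y': [2, 3, 1]}
```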
Args:
data
: A data frame.
cols
: Columns to unchop.
keep_empty
: By default, you get one row of output for each element
of the list you're unchopping/unnesting.
This means that if there's a size-0 element
(like NULL or an empty data frame), that entire row will be
dropped from the output.
If you want to preserve all rows, use keep_empty=True to
replace size-0 elements with a single row of missing values.
dtypes
: The dtypes for the output columns.
Either a single dtype, which is applied to all columns, or
a dictionary that maps column names to dtypes.
For nested data frames, use col$a as the key. If col is used as the key,
all columns of the nested data frame are cast to that dtype.
Returns:
A data frame with selected columns unchopped.
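A minimal plain-Python sketch of the `keep_empty` rule described above, assuming a single list-column. `unchop_column` is a hypothetical helper, not part of datar, and `None` stands in for a missing value.

```python
def unchop_column(rows, col, keep_empty=False):
    """Expand the list stored under `col` so each element gets its own row."""
    out = []
    for row in rows:
        items = list(row[col])
        if not items and keep_empty:
            items = [None]  # one placeholder row for a size-0 element
        for item in items:
            out.append({**row, col: item})
    return out


rows = [{"x": 1, "y": []}, {"x": 2, "y": [1]}, {"x": 3, "y": [1, 2]}]
dropped = unchop_column(rows, "y")                # the x == 1 row disappears
kept = unchop_column(rows, "y", keep_empty=True)  # the x == 1 row survives with y=None
```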
df = tibble(x = c(1, 1, 1, 2, 2, 3), y = c[1:6:1], z = c[6:1:-1])
df >> nest(data = c(f.y, f.z))
| | x | data |
|---|---|---|
| | <int64> | <object> |
| 0 | 1 | <DF 3x2> |
| 1 | 2 | <DF 2x2> |
| 2 | 3 | <DF 1x2> |
df >> chop(c(f.y, f.z))
| | x | y | z |
|---|---|---|---|
| | <int64> | <object> | <object> |
| 0 | 1 | [1, 2, 3] | [6, 5, 4] |
| 1 | 2 | [4, 5] | [3, 2] |
| 2 | 3 | [6] | [1] |
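Conceptually, `chop()` groups rows by the columns that are not chopped and collects the chopped columns into lists. The plain-Python sketch below illustrates that idea on the same data. `chop_rows` is a hypothetical helper and says nothing about how datar implements the operation.

```python
def chop_rows(rows, cols):
    """Group by the columns not in `cols`; collect `cols` into lists."""
    groups = {}
    for row in rows:
        # The grouping key is every column that is not being chopped.
        key = tuple((k, v) for k, v in row.items() if k not in cols)
        bucket = groups.setdefault(key, {c: [] for c in cols})
        for c in cols:
            bucket[c].append(row[c])
    return [{**dict(key), **lists} for key, lists in groups.items()]


rows = [
    {"x": x, "y": y, "z": z}
    for x, y, z in zip([1, 1, 1, 2, 2, 3], range(1, 7), range(6, 0, -1))
]
chopped = chop_rows(rows, ["y", "z"])
```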
# Unchop
df = tibble(x = c[1:5], y = [[], [1], [1,2], [1,2,3]])
df >> unchop(f.y)
| | x | y |
|---|---|---|
| | <int64> | <object> |
| 0 | 2 | 1.0 |
| 1 | 3 | 1.0 |
| 2 | 3 | 2.0 |
| 3 | 4 | 1.0 |
| 4 | 4 | 2.0 |
| 5 | 4 | 3.0 |
df >> unchop(f.y, keep_empty=True, dtypes=int)
| | x | y |
|---|---|---|
| | <int64> | <int64> |
| 0 | 2 | 1 |
| 1 | 3 | 1 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |
| 4 | 4 | 2 |
| 5 | 4 | 3 |
df = tibble(x = c[1:2], y = ["a", [1,2,3]])
df >> unchop(f.y)
| | x | y |
|---|---|---|
| | <int64> | <object> |
| 0 | 1 | a |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
with try_catch():
    df >> unchop(f.y, dtypes=int)
[ValueError] invalid literal for int() with base 10: 'a'
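The message is the one Python's built-in `int()` raises when it cannot parse `'a'`. The sketch below assumes the cast is applied element by element after unchopping. `cast_column` is a hypothetical helper, not a datar function.

```python
def cast_column(values, dtype):
    """Cast every value with `dtype`, propagating the first failure."""
    return [dtype(value) for value in values]


unchopped_y = ["a", 1, 2, 3]  # the y column after unchop(f.y)

try:
    cast_column(unchopped_y, int)
    error = None
except ValueError as exc:
    error = str(exc)  # "invalid literal for int() with base 10: 'a'"

as_text = cast_column(unchopped_y, str)  # a dtype that fits every value
```

Choosing a dtype that every element can be converted to, such as `str` here, avoids the error.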
df = tibble(x = c[1:4], y = [NULL, tibble(x = 1), tibble(y = c[1:3])])
df >> unchop(f.y)
| | x | y$x | y$y |
|---|---|---|---|
| | <int64> | <float64> | <float64> |
| 0 | 2 | 1.0 | NaN |
| 1 | 3 | NaN | 1.0 |
| 2 | 3 | NaN | 2.0 |
df >> unchop(f.y, keep_empty=True)
| | x | y$x | y$y |
|---|---|---|---|
| | <int64> | <float64> | <float64> |
| 0 | 1 | NaN | NaN |
| 1 | 2 | 1.0 | NaN |
| 2 | 3 | NaN | 1.0 |
| 3 | 3 | NaN | 2.0 |
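When the list-column holds data frames with different columns, the output carries the union of those columns and fills the gaps with missing values. The plain-Python sketch below models each nested frame as a list of record dicts. `unchop_frames` is a hypothetical helper, and `None` stands in for both `NULL` and `NaN`.

```python
def unchop_frames(rows, col, keep_empty=False):
    """Unchop a column of nested frames (lists of dicts) into `col$name` keys."""
    # Union of nested column names, in order of first appearance.
    names = []
    for row in rows:
        for record in row[col] or []:
            for name in record:
                if name not in names:
                    names.append(name)

    out = []
    for row in rows:
        records = list(row[col] or [])
        if not records and keep_empty:
            records = [{}]  # one all-missing row for a size-0 element
        outer = {k: v for k, v in row.items() if k != col}
        for record in records:
            nested = {f"{col}${name}": record.get(name) for name in names}
            out.append({**outer, **nested})
    return out


rows = [
    {"x": 1, "y": None},                  # NULL
    {"x": 2, "y": [{"x": 1}]},            # a one-row frame with column x
    {"x": 3, "y": [{"y": 1}, {"y": 2}]},  # a two-row frame with column y
]
dropped = unchop_frames(rows, "y")
kept = unchop_frames(rows, "y", keep_empty=True)
```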