separate
# https://tidyr.tidyverse.org/reference/separate.html
%run nb_helpers.py
from datar.all import *
nb_header(separate, separate_rows)
★ separate¶
Given either a regular expression or a vector of character positions,¶
turns a single character column into multiple columns.
Args:¶
data
: The dataframe
col
: Column name or position.
into
: Names of new variables to create as character vector.
Use None
/NA
/NULL
to omit the variable in the output.
sep
: Separator between columns.
If str, sep
is interpreted as a regular expression.
The default value is a regular expression that matches
any sequence of non-alphanumeric values.
If int, sep
is interpreted as character positions to split at.
remove
: If TRUE, remove input column from output data frame.
convert
: The universal type for the extracted columns or a dict for
individual ones
Note that when given TRUE
, DataFrame.convert_dtypes()
is called,
but it will not convert str
to other types
(For example, '1'
to 1
). You have to specify the dtype yourself.
extra
: If sep is a character vector, this controls what happens when
there are too many pieces. There are three valid options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only splits at most length(into) times
fill
: If sep is a character vector, this controls what happens when
there are not enough pieces. There are three valid options:
- "warn" (the default): emit a warning and fill from the right
- "right": fill with missing values on the right
- "left": fill with missing values on the left
Returns:¶
Dataframe with separated columns.
★ separate_rows¶
df = tibble(x=c(NA, "x.y", "x.z", "y.z"))
df >> separate(f.x, c("A", "B"))
A | B | |
---|---|---|
<object> | <object> | |
0 | NaN | NaN |
1 | x | y |
2 | x | z |
3 | y | z |
df >> separate(f.x, c(NA, "B"))
B | |
---|---|
<object> | |
0 | NaN |
1 | y |
2 | z |
3 | z |
df = tibble(x=c("x", "x y", "x y z", NA))
df >> separate(f.x, c("a", "b"))
[2022-12-02 14:25:28][datar][WARNING] Expected 2 pieces. Additional pieces discarded in 1 rows ['x y z']. [2022-12-02 14:25:28][datar][WARNING] Expected 2 pieces. Missing pieces filled with `NA` in 1 rows ['x'].
a | b | |
---|---|---|
<object> | <object> | |
0 | x | NaN |
1 | x | y |
2 | x | y |
3 | NaN | NaN |
df >> separate(f.x, c("a", "b"), extra="drop", fill="right")
a | b | |
---|---|---|
<object> | <object> | |
0 | x | NaN |
1 | x | y |
2 | x | y |
3 | NaN | NaN |
df >> separate(f.x, c("a", "b"), extra="merge", fill="left")
a | b | |
---|---|---|
<object> | <object> | |
0 | NaN | x |
1 | x | y |
2 | x | y z |
3 | NaN | NaN |
df >> separate(f.x, c("a", "b", "c"))
[2022-12-02 14:25:32][datar][WARNING] Expected 3 pieces. Missing pieces filled with `NA` in 2 rows ['x', 'x y'].
a | b | c | |
---|---|---|---|
<object> | <object> | <object> | |
0 | x | NaN | NaN |
1 | x | y | NaN |
2 | x | y | z |
3 | NaN | NaN | NaN |
df = tibble(x=c("x: 123", "y: error: 7"))
df >> separate(f.x, c("key", "value"), ": ", extra="merge")
key | value | |
---|---|---|
<object> | <object> | |
0 | x | 123 |
1 | y | error: 7 |
df = tibble(x=c(NA, "x?y", "x.z", "y:z"))
df >> separate(f.x, c("A","B"), sep=r"[.?:]")
A | B | |
---|---|---|
<object> | <object> | |
0 | NaN | NaN |
1 | x | y |
2 | x | z |
3 | y | z |
df = tibble(x=c("x:1", "x:2", "y:4", "z", NA))
df >> separate(f.x, c("key","value"), ":")
[2022-12-02 14:25:35][datar][WARNING] Expected 2 pieces. Missing pieces filled with `NA` in 1 rows ['z'].
key | value | |
---|---|---|
<object> | <object> | |
0 | x | 1 |
1 | x | 2 |
2 | y | 4 |
3 | z | NaN |
4 | NaN | NaN |
out = df >> separate(f.x, c("key","value"), ":", convert={'value': float})
out.dtypes
[2022-12-02 14:25:36][datar][WARNING] Expected 2 pieces. Missing pieces filled with `NA` in 1 rows ['z'].
key object value float64 dtype: object
df = tibble(
x=[1,2,3],
y=c("a", "d,e,f", "g,h"),
z=c("1", "2,3,4", "5,6")
)
df >> separate_rows(f.y, f.z, convert={'z': int})
x | y | z | |
---|---|---|---|
<int64> | <object> | <int64> | |
0 | 1 | a | 1 |
1 | 2 | d | 2 |
2 | 2 | e | 3 |
3 | 2 | f | 4 |
4 | 3 | g | 5 |
5 | 3 | h | 6 |