separate

In [1]:

# https://tidyr.tidyverse.org/reference/separate.html
%run nb_helpers.py

from datar.all import *

nb_header(separate, separate_rows)

Try this notebook on binder.

★ separate

Given either a regular expression or a vector of character positions, turns a single character column into multiple columns.

Args:

data: The dataframe
col: Column name or position.
into: Names of new variables to create as character vector.
Use None/NA/NULL to omit the variable in the output.

sep: Separator between columns.
If str, sep is interpreted as a regular expression.
The default value is a regular expression that matches
any sequence of non-alphanumeric values.
If int, sep is interpreted as character positions to split at.

remove: If True, remove the input column from the output data frame.
convert: A single dtype applied to all extracted columns, or a dict mapping
individual column names to dtypes.
Note that when given True, DataFrame.convert_dtypes() is called,
but it will not convert str to other types
(for example, '1' to 1); you have to specify the dtype yourself.

extra: If sep is a str, this controls what happens when
there are too many pieces. There are three valid options:

- "warn" (the default): emit a warning and drop extra values.

- "drop": drop any extra values without a warning.

- "merge": split into at most len(into) pieces, keeping the remainder in the last piece.

fill: If sep is a str, this controls what happens when
there are not enough pieces. There are three valid options:

- "warn" (the default): emit a warning and fill with missing values on the right.

- "right": fill with missing values on the right.

- "left": fill with missing values on the left.

Returns:

Dataframe with separated columns.
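
The two interpretations of sep described above can be illustrated with the standard library alone. This is a minimal sketch, not datar's implementation: split_value and DEFAULT_SEP are hypothetical names, and the default pattern shown is an assumption based on the description "any sequence of non-alphanumeric values".

```python
import re

# Assumed default pattern: one or more non-alphanumeric characters.
DEFAULT_SEP = r"[^0-9A-Za-z]+"


def split_value(value, sep=DEFAULT_SEP):
    """Split one string according to sep (illustrative helper only)."""
    if isinstance(sep, int):
        # An int is a character position: cut the string at that index.
        return [value[:sep], value[sep:]]
    # A str is treated as a regular expression.
    return re.split(sep, value)


print(split_value("x.y"))            # ['x', 'y']
print(split_value("x y z"))          # ['x', 'y', 'z']
print(split_value("x?y", r"[.?:]"))  # ['x', 'y']
print(split_value("ab123", 2))       # ['ab', '123']
```

Because a str is a regular expression, characters such as `.` or `?` need a character class or escaping when they are meant literally.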

★ separate_rows

Separates the values and places each one in its own row.

Args:

data: The dataframe
*columns: The columns to separate on
sep: Separator delimiting the collapsed values.
convert: A single dtype applied to all separated columns, or a dict mapping
individual column names to dtypes.

Returns:

Dataframe with rows separated and repeated.
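
To make "rows separated and repeated" concrete, here is a rough pure-Python sketch of the idea on a list of dicts. It is not datar's implementation: separate_rows_sketch is a hypothetical helper, and the "," separator is chosen for illustration rather than taken from the library's default.

```python
import re


def separate_rows_sketch(rows, columns, sep=",", convert=None):
    """rows: list of dicts; columns: names whose values are split in parallel."""
    convert = convert or {}
    out = []
    for row in rows:
        # Split every selected column of this row into pieces.
        pieces = [re.split(sep, row[col]) for col in columns]
        # zip pairs the i-th piece of each column (it stops at the shortest list).
        for combo in zip(*pieces):
            new_row = dict(row)  # untouched columns are repeated as-is
            for col, piece in zip(columns, combo):
                new_row[col] = convert.get(col, str)(piece)
            out.append(new_row)
    return out


rows = [
    {"x": 1, "y": "a", "z": "1"},
    {"x": 2, "y": "d,e", "z": "2,3"},
]
for r in separate_rows_sketch(rows, ["y", "z"], convert={"z": int}):
    print(r)
# {'x': 1, 'y': 'a', 'z': 1}
# {'x': 2, 'y': 'd', 'z': 2}
# {'x': 2, 'y': 'e', 'z': 3}
```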

In [2]:

df = tibble(x=c(NA, "x.y", "x.z", "y.z"))
df >> separate(f.x, c("A", "B"))

Out[2]:

	A	B
	<object>	<object>
0	NaN	NaN
1	x	y
2	x	z
3	y	z

In [3]:

df >> separate(f.x, c(NA, "B"))

Out[3]:

	B
	<object>
0	NaN
1	y
2	z
3	z

In [4]:

df = tibble(x=c("x", "x y", "x y z", NA))
df >> separate(f.x, c("a", "b"))

[2022-12-02 14:25:28][datar][WARNING] Expected 2 pieces. Additional pieces discarded in 1 rows ['x y z'].
[2022-12-02 14:25:28][datar][WARNING] Expected 2 pieces. Missing pieces filled with `NA` in 1 rows ['x'].

Out[4]:

	a	b
	<object>	<object>
0	x	NaN
1	x	y
2	x	y
3	NaN	NaN

In [5]:

df >> separate(f.x, c("a", "b"), extra="drop", fill="right")

Out[5]:

	a	b
	<object>	<object>
0	x	NaN
1	x	y
2	x	y
3	NaN	NaN

In [6]:

df >> separate(f.x, c("a", "b"), extra="merge", fill="left")

Out[6]:

	a	b
	<object>	<object>
0	NaN	x
1	x	y
2	x	y z
3	NaN	NaN
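
The way extra and fill interact in the outputs above can be sketched in plain Python. This is only an illustration under stated assumptions, not datar's code: fit_pieces is a hypothetical helper, None stands in for a missing value, the separator pattern is assumed, and the "warn" option is treated like "drop"/"right" without emitting a warning.

```python
import re


def fit_pieces(value, n, sep=r"[^0-9A-Za-z]+", extra="drop", fill="right"):
    """Return exactly n pieces from value; None marks a missing piece."""
    if value is None:
        return [None] * n
    if extra == "merge":
        # maxsplit=0 means "no limit" in re.split, so handle n == 1 separately.
        pieces = [value] if n == 1 else re.split(sep, value, maxsplit=n - 1)
    else:
        # "drop" (and "warn", minus the warning): discard surplus pieces.
        pieces = re.split(sep, value)[:n]
    padding = [None] * (n - len(pieces))
    # fill="left" pads before the pieces, anything else pads after them.
    return padding + pieces if fill == "left" else pieces + padding


for v in ["x", "x y", "x y z", None]:
    print(fit_pieces(v, 2, extra="merge", fill="left"))
# [None, 'x']
# ['x', 'y']
# ['x', 'y z']
# [None, None]
```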

In [7]:

df >> separate(f.x, c("a", "b", "c"))

[2022-12-02 14:25:32][datar][WARNING] Expected 3 pieces. Missing pieces filled with `NA` in 2 rows ['x', 'x y'].

Out[7]:

	a	b	c
	<object>	<object>	<object>
0	x	NaN	NaN
1	x	y	NaN
2	x	y	z
3	NaN	NaN	NaN

In [8]:

df = tibble(x=c("x: 123", "y: error: 7"))
df >> separate(f.x, c("key", "value"), ": ", extra="merge")

Out[8]:

	key	value
	<object>	<object>
0	x	123
1	y	error: 7

In [9]:

df = tibble(x=c(NA, "x?y", "x.z", "y:z"))
df >> separate(f.x, c("A","B"), sep=r"[.?:]")

Out[9]:

	A	B
	<object>	<object>
0	NaN	NaN
1	x	y
2	x	z
3	y	z

In [10]:

df = tibble(x=c("x:1", "x:2", "y:4", "z", NA))
df >> separate(f.x, c("key","value"), ":")

[2022-12-02 14:25:35][datar][WARNING] Expected 2 pieces. Missing pieces filled with `NA` in 1 rows ['z'].

Out[10]:

	key	value
	<object>	<object>
0	x	1
1	x	2
2	y	4
3	z	NaN
4	NaN	NaN

In [11]:

out = df >> separate(f.x, c("key","value"), ":", convert={'value': float})
out.dtypes

[2022-12-02 14:25:36][datar][WARNING] Expected 2 pieces. Missing pieces filled with `NA` in 1 rows ['z'].

Out[11]:

key       object
value    float64
dtype: object
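
Since the pieces produced by a split are strings, an explicit dtype is what turns '1' into a number. The following sketch mimics that step with the standard library. convert_column is a hypothetical helper, not part of datar, and None stands in for a missing piece. One plausible reason to pick float for a column with gaps is that NaN, the usual missing-value marker, is itself a float.

```python
import math


def convert_column(values, dtype):
    """Cast each non-missing string with dtype; missing values become NaN."""
    return [dtype(v) if v is not None else math.nan for v in values]


# Pieces of the "value" column after splitting "x:1", "x:2", "y:4", "z", NA.
value = convert_column(["1", "2", "4", None, None], float)
print(value[:3])  # [1.0, 2.0, 4.0]
```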

In [12]:

df = tibble(
  x=[1,2,3],
  y=c("a", "d,e,f", "g,h"),
  z=c("1", "2,3,4", "5,6")
)

In [13]:

df >> separate_rows(f.y, f.z, convert={'z': int})

Out[13]:

	x	y	z
	<int64>	<object>	<int64>
0	1	a	1
1	2	d	2
2	2	e	3
3	2	f	4
4	3	g	5
5	3	h	6
