pivot_longer
In [1]:
Copied!
# https://tidyr.tidyverse.org/reference/pivot_longer.html
%run nb_helpers.py
from datar.data import relig_income, billboard, who, anscombe
from datar.all import *
nb_header(pivot_longer)
# https://tidyr.tidyverse.org/reference/pivot_longer.html
%run nb_helpers.py
from datar.data import relig_income, billboard, who, anscombe
from datar.all import *
nb_header(pivot_longer)
Try this notebook on binder.
★ pivot_longer¶
"lengthens" data, increasing the number of rows and decreasing the number of columns.
The row order is a bit different from `tidyr` and `pandas.DataFrame.melt`.
>>> df = tibble(x=c[1:2], y=c[3:4])
>>> pivot_longer(df, f[f.x:f.y])
>>> # name value
>>> # 0 x 1
>>> # 1 x 2
>>> # 2 y 3
>>> # 3 y 4
But with `tidyr::pivot_longer`, the output will be:
>>> # # A tibble: 4 x 2
>>> # name value
>>> # <chr> <int>
>>> # 1 x 1
>>> # 2 y 3
>>> # 3 x 2
>>> # 4 y 4
Args:
_data: A data frame to pivot.
cols: Columns to pivot into longer format.
names_to: A string specifying the name of the column to create from
the data stored in the column names of data.
Can be a character vector, creating multiple columns, if names_sep
or names_pattern is provided. In this case, there are two special
values you can take advantage of:
- `None`/`NA`/`NULL` will discard that component of the name.
- `.value`/`_value` indicates that component of the name defines
the name of the column containing the cell values,
overriding values_to.
- Different as `tidyr`: With `.value`/`_value`, if there are other
parts of the names to distinguish the groups, they must be
captured. For example, use `r'(\w)_(\d)'` to match `'a_1'` and
`['.value', NA]` to discard the suffix, instead of use
`r'(\w)_\d'` to match.
names_prefix: A regular expression used to remove matching text from
the start of each variable name.
names_sep: and
names_pattern: If names_to contains multiple values,
these arguments control how the column name is broken up.
names_sep takes the same specification as separate(), and
can either be a numeric vector (specifying positions to break on),
or a single string (specifying a regular expression to split on).
names_pattern: takes the same specification as extract(),
a regular expression containing matching groups (()).
names_dtypes: and
values_dtypes: A list of column name-prototype pairs.
A prototype (or dtypes for short) is a zero-length vector
(like integer() or numeric()) that defines the type, class, and
attributes of a vector. Use these arguments if you want to confirm
that the created columns are the types that you expect.
Note that if you want to change (instead of confirm) the types
of specific columns, you should use names_transform or
values_transform instead.
names_transform: and
values_transform: A list of column name-function pairs.
Use these arguments if you need to change the types of
specific columns. For example,
names_transform = dict(week = as.integer) would convert a
character variable called week to an integer.
If not specified, the type of the columns generated from names_to
will be character, and the type of the variables generated from
values_to will be the common type of the input columns used to
generate them.
names_repair: Not supported yet.
values_to: A string specifying the name of the column to create from
the data stored in cell values. If names_to is a character
containing the special `.value`/`_value` sentinel, this value
will be ignored, and the name of the value column will be derived
from part of the existing column names.
values_drop_na: If TRUE, will drop rows that contain only NAs in
the value_to column. This effectively converts explicit missing
values to implicit missing values, and should generally be used
only when missing values in data were created by its structure.
names_repair: treatment of problematic column names:
- "minimal": No name repair or checks, beyond basic existence,
- "unique": Make sure names are unique and not empty,
- "check_unique": (default value), no name repair,
but check they are unique,
- "universal": Make the names unique and syntactic
- a function: apply custom name repair
Returns:
The pivoted dataframe.
In [2]:
Copied!
relig_income
relig_income
Out[2]:
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused | |
---|---|---|---|---|---|---|---|---|---|---|---|
<object> | <int64> | <int64> | <int64> | <int64> | <int64> | <int64> | <int64> | <int64> | <int64> | <int64> | |
0 | Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
1 | Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
2 | Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
3 | Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
4 | Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
5 | Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 | 949 | 723 | 414 | 1529 |
6 | Hindu | 1 | 9 | 7 | 9 | 11 | 34 | 47 | 48 | 54 | 37 |
7 | Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 | 131 | 81 | 78 | 339 |
8 | Jehovah's Witness | 20 | 27 | 24 | 24 | 21 | 30 | 15 | 11 | 6 | 37 |
9 | Jewish | 19 | 19 | 25 | 25 | 30 | 95 | 69 | 87 | 151 | 162 |
10 | Mainline Prot | 289 | 495 | 619 | 655 | 651 | 1107 | 939 | 753 | 634 | 1328 |
11 | Mormon | 29 | 40 | 48 | 51 | 56 | 112 | 85 | 49 | 42 | 69 |
12 | Muslim | 6 | 7 | 9 | 10 | 9 | 23 | 16 | 8 | 6 | 22 |
13 | Orthodox | 13 | 17 | 23 | 32 | 32 | 47 | 38 | 42 | 46 | 73 |
14 | Other Christian | 9 | 7 | 11 | 13 | 13 | 14 | 18 | 14 | 12 | 18 |
15 | Other Faiths | 20 | 33 | 40 | 46 | 49 | 63 | 46 | 40 | 41 | 71 |
16 | Other World Religions | 5 | 2 | 3 | 4 | 2 | 7 | 3 | 4 | 4 | 8 |
17 | Unaffiliated | 217 | 299 | 374 | 365 | 341 | 528 | 407 | 321 | 258 | 597 |
In [3]:
Copied!
relig_income >> \
pivot_longer(~f.religion, names_to="income", values_to="count")
relig_income >> \
pivot_longer(~f.religion, names_to="income", values_to="count")
Out[3]:
religion | income | count | |
---|---|---|---|
<object> | <object> | <int64> | |
0 | Agnostic | <$10k | 27 |
1 | Atheist | <$10k | 12 |
2 | Buddhist | <$10k | 27 |
3 | Catholic | <$10k | 418 |
... | ... | ... | ... |
4 | Don’t know/refused | <$10k | 15 |
175 | Orthodox | Don't know/refused | 73 |
176 | Other Christian | Don't know/refused | 18 |
177 | Other Faiths | Don't know/refused | 71 |
178 | Other World Religions | Don't know/refused | 8 |
179 | Unaffiliated | Don't know/refused | 597 |
180 rows × 3 columns
In [4]:
Copied!
billboard
billboard
Out[4]:
artist | track | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | ... | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<object> | <object> | <object> | <int64> | <float64> | <float64> | <float64> | <float64> | <float64> | <float64> | ... | <float64> | <float64> | <float64> | <float64> | <float64> | <float64> | <float64> | <float64> | <float64> | <float64> | |
0 | 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | 87 | 82.0 | 72.0 | 77.0 | 87.0 | 94.0 | 99.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2Ge+her | The Hardest Part Of ... | 2000-09-02 | 91 | 87.0 | 92.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 Doors Down | Kryptonite | 2000-04-08 | 81 | 70.0 | 68.0 | 67.0 | 66.0 | 57.0 | 54.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3 Doors Down | Loser | 2000-10-21 | 76 | 76.0 | 72.0 | 69.0 | 67.0 | 65.0 | 55.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4 | 504 Boyz | Wobble Wobble | 2000-04-15 | 57 | 34.0 | 25.0 | 17.0 | 17.0 | 31.0 | 36.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
312 | Yankee Grey | Another Nine Minutes | 2000-04-29 | 86 | 83.0 | 77.0 | 74.0 | 83.0 | 79.0 | 88.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
313 | Yearwood, Trisha | Real Live Woman | 2000-04-01 | 85 | 83.0 | 83.0 | 82.0 | 81.0 | 91.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
314 | Ying Yang Twins | Whistle While You Tw... | 2000-03-18 | 95 | 94.0 | 91.0 | 85.0 | 84.0 | 78.0 | 74.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
315 | Zombie Nation | Kernkraft 400 | 2000-09-02 | 99 | 99.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
316 | matchbox twenty | Bent | 2000-04-29 | 60 | 37.0 | 29.0 | 24.0 | 22.0 | 21.0 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
317 rows × 79 columns
In [5]:
Copied!
billboard >> \
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)
billboard >> \
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)
Out[5]:
artist | date.entered | track | week | rank | |
---|---|---|---|---|---|
<object> | <object> | <object> | <object> | <float64> | |
0 | 2 Pac | 2000-02-26 | Baby Don't Cry (Keep... | 1 | 87.0 |
1 | 2Ge+her | 2000-09-02 | The Hardest Part Of ... | 1 | 91.0 |
2 | 3 Doors Down | 2000-04-08 | Kryptonite | 1 | 81.0 |
3 | 3 Doors Down | 2000-10-21 | Loser | 1 | 76.0 |
... | ... | ... | ... | ... | ... |
4 | 504 Boyz | 2000-04-15 | Wobble Wobble | 1 | 57.0 |
19716 | Creed | 1999-09-11 | Higher | 63 | 50.0 |
19833 | Lonestar | 1999-06-05 | Amazed | 63 | 45.0 |
20033 | Creed | 1999-09-11 | Higher | 64 | 50.0 |
20150 | Lonestar | 1999-06-05 | Amazed | 64 | 50.0 |
20350 | Creed | 1999-09-11 | Higher | 65 | 49.0 |
5307 rows × 5 columns
In [6]:
Copied!
who >> pivot_longer(
cols = f[f.new_sp_m014:f.newrel_f65],
names_to = c("diagnosis", "gender", "age"),
names_pattern = r"new_?(.*)_(.)(.*)",
values_to = "count"
)
who >> pivot_longer(
cols = f[f.new_sp_m014:f.newrel_f65],
names_to = c("diagnosis", "gender", "age"),
names_pattern = r"new_?(.*)_(.)(.*)",
values_to = "count"
)
Out[6]:
country | iso2 | iso3 | newrel_f65 | year | diagnosis | gender | age | count | |
---|---|---|---|---|---|---|---|---|---|
<object> | <object> | <object> | <float64> | <int64> | <object> | <object> | <object> | <float64> | |
0 | Afghanistan | AF | AFG | NaN | 1980 | sp | m | 014 | NaN |
1 | Afghanistan | AF | AFG | NaN | 1981 | sp | m | 014 | NaN |
2 | Afghanistan | AF | AFG | NaN | 1982 | sp | m | 014 | NaN |
3 | Afghanistan | AF | AFG | NaN | 1983 | sp | m | 014 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4 | Afghanistan | AF | AFG | NaN | 1984 | sp | m | 014 | NaN |
398195 | Zimbabwe | ZW | ZWE | NaN | 2009 | rel | f | 5564 | NaN |
398196 | Zimbabwe | ZW | ZWE | NaN | 2010 | rel | f | 5564 | NaN |
398197 | Zimbabwe | ZW | ZWE | NaN | 2011 | rel | f | 5564 | NaN |
398198 | Zimbabwe | ZW | ZWE | NaN | 2012 | rel | f | 5564 | NaN |
398199 | Zimbabwe | ZW | ZWE | 725.0 | 2013 | rel | f | 5564 | 811.0 |
398200 rows × 9 columns
In [7]:
Copied!
anscombe
anscombe
Out[7]:
x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 | |
---|---|---|---|---|---|---|---|---|
<int64> | <int64> | <int64> | <int64> | <float64> | <float64> | <float64> | <float64> | |
0 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
1 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
2 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
3 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
4 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
5 | 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
6 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
7 | 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
8 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
9 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
10 | 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
In [8]:
Copied!
anscombe >> \
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = r"(.)(.)"
)
anscombe >> \
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = r"(.)(.)"
)
Out[8]:
set | x | y | |
---|---|---|---|
<object> | <float64> | <float64> | |
0 | 1 | 10.0 | 8.04 |
1 | 2 | 10.0 | 9.14 |
2 | 3 | 10.0 | 7.46 |
3 | 4 | 8.0 | 6.58 |
4 | 1 | 8.0 | 6.95 |
5 | 2 | 8.0 | 8.14 |
6 | 3 | 8.0 | 6.77 |
7 | 4 | 8.0 | 5.76 |
8 | 1 | 13.0 | 7.58 |
9 | 2 | 13.0 | 8.74 |
10 | 3 | 13.0 | 12.74 |
11 | 4 | 8.0 | 7.71 |
12 | 1 | 9.0 | 8.81 |
13 | 2 | 9.0 | 8.77 |
14 | 3 | 9.0 | 7.11 |
15 | 4 | 8.0 | 8.84 |
16 | 1 | 11.0 | 8.33 |
17 | 2 | 11.0 | 9.26 |
18 | 3 | 11.0 | 7.81 |
19 | 4 | 8.0 | 8.47 |
20 | 1 | 14.0 | 9.96 |
21 | 2 | 14.0 | 8.10 |
22 | 3 | 14.0 | 8.84 |
23 | 4 | 8.0 | 7.04 |
24 | 1 | 6.0 | 7.24 |
25 | 2 | 6.0 | 6.13 |
26 | 3 | 6.0 | 6.08 |
27 | 4 | 8.0 | 5.25 |
28 | 1 | 4.0 | 4.26 |
29 | 2 | 4.0 | 3.10 |
30 | 3 | 4.0 | 5.39 |
31 | 4 | 19.0 | 12.50 |
32 | 1 | 12.0 | 10.84 |
33 | 2 | 12.0 | 9.13 |
34 | 3 | 12.0 | 8.15 |
35 | 4 | 8.0 | 5.56 |
36 | 1 | 7.0 | 4.82 |
37 | 2 | 7.0 | 7.26 |
38 | 3 | 7.0 | 6.42 |
39 | 4 | 8.0 | 7.91 |
40 | 1 | 5.0 | 5.68 |
41 | 2 | 5.0 | 4.74 |
42 | 3 | 5.0 | 5.73 |
43 | 4 | 8.0 | 6.89 |
In [ ]:
Copied!