datar.apis.tidyr

module

datar.apis.tidyr

Functions
  • chop(data, cols) (Any) Makes data frame shorter by converting rows within each group into list-columns.
  • complete(data, *args, fill, explict) (Any) Turns implicit missing values into explicit missing values.
  • crossing(*args, _name_repair, **kwargs) (Any) A wrapper around expand_grid() that de-duplicates and sorts its inputs.
  • drop_na(_data, *columns, _how) (Any) Drop rows containing missing values.
  • expand(data, *args, _name_repair, **kwargs) (Any) Generates all combinations of variables found in a dataset.
  • extract(data, col, into, regex, remove, convert) (Any) Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA.
  • fill(_data, *columns, _direction) (Any) Fills missing values in selected columns using the next or previous entry.
  • full_seq(x, period, tol) (Any) Create the full sequence of values in a vector.
  • nest(_data, _names_sep, **cols) (Any) Nesting creates a list-column of data frames.
  • nesting(*args, _name_repair, **kwargs) (Any) A helper that only finds combinations already present in the data.
  • pack(_data, _names_sep, **cols) (Any) Makes df narrow by collapsing a set of columns into a single df-column.
  • pivot_longer(_data, cols, names_to, names_prefix, names_sep, names_pattern, names_dtypes, names_transform, names_repair, values_to, values_drop_na, values_dtypes, values_transform) (Any) "lengthens" data, increasing the number of rows and decreasing the number of columns.
  • pivot_wider(_data, id_cols, names_from, names_prefix, names_sep, names_glue, names_sort, values_from, values_fill, values_fn) (Any) "widens" data, increasing the number of columns and decreasing the number of rows.
  • replace_na(data, data_or_replace, replace) (Any) Replace NA with a value.
  • separate(data, col, into, sep, remove, convert, extra, fill) (Any) Given either a regular expression or a vector of character positions, turns a single character column into multiple columns.
  • separate_rows(data, *columns, sep, convert) (Any) Separates the values and places each one in its own row.
  • unchop(data, cols, keep_empty, dtypes) (Any) Makes df longer by expanding list-columns so that each element of the list-column gets its own row in the output.
  • uncount(data, weights, _remove, _id) (Any) Duplicates rows according to a weighting variable.
  • unite(data, col, *columns, sep, remove, na_rm) (Any) Unite multiple columns into one by pasting strings together.
  • unnest(data, *cols, keep_empty, dtypes, names_sep, names_repair) (Any) Flattens list-column of data frames back out into regular columns.
  • unpack(data, cols, names_sep, names_repair) (Any) Makes df wider by expanding df-columns back out into individual columns.
function

datar.apis.tidyr.full_seq(x, period, tol=1e-06)

Create the full sequence of values in a vector

Parameters
  • x A numeric vector.
  • period Gap between each observation. The existing data will be checked to ensure that it is actually of this periodicity.
  • tol (optional) Numerical tolerance for checking periodicity.
Returns (Any)

The full sequence
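
To make the periodicity check concrete, here is a minimal plain-Python sketch of the behaviour described above. It is not datar's implementation: the name full_seq_sketch and the exact form of the tolerance test are assumptions made for illustration.

```python
def full_seq_sketch(x, period, tol=1e-6):
    """Return min(x), min(x) + period, ..., max(x) after a periodicity check.

    Hypothetical helper: it mirrors the documented parameters only.
    """
    lo, hi = min(x), max(x)
    for value in x:
        steps = (value - lo) / period
        # Every observation must sit a whole number of periods above the minimum.
        if abs(steps - round(steps)) > tol:
            raise ValueError(f"{value!r} does not fit a period of {period!r}")
    n = int(round((hi - lo) / period)) + 1
    return [lo + i * period for i in range(n)]


result = full_seq_sketch([1, 2, 4, 5, 10], 1)
```

Here the gaps at 3 and 6 to 9 are filled in, while an input such as [1, 2.5] with period 1 is rejected.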

function

datar.apis.tidyr.chop(data, cols=None)

Makes data frame shorter by converting rows within each group into list-columns.

Parameters
  • data A data frame
  • cols (optional) Columns to chop
Returns (Any)

Data frame with selected columns chopped
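
As a rough illustration of what "chopping" means, the sketch below groups rows by the columns that are not chopped and collects the chopped columns into lists. It works on a list of dicts rather than a data frame; chop_sketch is a hypothetical helper, not part of datar.

```python
def chop_sketch(rows, cols):
    """Collapse rows that share the same non-chopped values (illustration only)."""
    groups = {}  # key: the non-chopped (column, value) pairs, in first-seen order
    for row in rows:
        key = tuple((k, v) for k, v in row.items() if k not in cols)
        bucket = groups.setdefault(key, {c: [] for c in cols})
        for c in cols:
            bucket[c].append(row[c])
    return [{**dict(key), **lists} for key, lists in groups.items()]


rows = [{"x": 1, "y": "a"}, {"x": 1, "y": "b"}, {"x": 2, "y": "c"}]
chopped = chop_sketch(rows, ["y"])
```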

function

datar.apis.tidyr.unchop(data, cols=None, keep_empty=False, dtypes=None)

Makes df longer by expanding list-columns so that each element of the list-column gets its own row in the output.

See https://tidyr.tidyverse.org/reference/chop.html

Recycling of size-1 elements may differ from tidyr:

>>> from datar.all import f, tibble, unchop
>>> df = tibble(x=[1, [2,3]], y=[[2,3], 1])
>>> df >> unchop([f.x, f.y])
>>> # tibble(x=[1,2,3], y=[2,3,1])
>>> # instead of the following in tidyr:
>>> # tibble(x=[1,1,2,3], y=[2,3,1,1])

Parameters
  • data A data frame.
  • cols (optional) Columns to unchop.
  • keep_empty (bool, optional) By default, you get one row of output for each element of the list you're unchopping/unnesting. This means that if there's a size-0 element (like NULL or an empty data frame), that entire row will be dropped from the output. If you want to preserve all rows, use keep_empty=True to replace size-0 elements with a single row of missing values.
  • dtypes (optional) The dtypes for the output columns. Could be a single dtype, which will be applied to all columns, or a dictionary of dtypes with the columns as keys and the dtypes as values. For nested data frames, specify col$a as the key. If col is used as the key, all columns of the nested data frames will be cast to that dtype.
Returns (Any)

A data frame with selected columns unchopped.
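
The effect of keep_empty can be sketched in plain Python for a single list-column. This is an illustration under simplifying assumptions (one column, rows as dicts, None standing in for a missing value); unchop_sketch is a hypothetical name and the sketch does not reproduce datar's recycling rules for several columns.

```python
def unchop_sketch(rows, col, keep_empty=False):
    """Give each element of the list in `col` its own row (illustration only)."""
    out = []
    for row in rows:
        values = row[col]
        if len(values) == 0:
            # A size-0 element drops the row unless keep_empty is set.
            if keep_empty:
                out.append({**row, col: None})
            continue
        for v in values:
            out.append({**row, col: v})
    return out


rows = [{"id": 1, "v": [10, 20]}, {"id": 2, "v": []}]
dropped = unchop_sketch(rows, "v")
kept = unchop_sketch(rows, "v", keep_empty=True)
```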

function

datar.apis.tidyr.nest(_data, _names_sep=None, **cols)

Nesting creates a list-column of data frames

Parameters
  • _data A data frame
  • _names_sep (str, optional) If None, the default, the names will be left as is. Inner names will come from the former outer names. If a string, the inner and outer names will be used together. The names of the new outer columns will be formed by pasting together the outer and the inner column names, separated by _names_sep.
  • **cols (str | int) Columns to nest
Returns (Any)

Nested data frame.

function

datar.apis.tidyr.unnest(data, *cols, keep_empty=False, dtypes=None, names_sep=None, names_repair='check_unique')

Flattens list-column of data frames back out into regular columns.

Parameters
  • data A data frame to flatten.
  • *cols (str | int) Columns to unnest.
  • keep_empty (bool, optional) By default, you get one row of output for each element of the list you're unchopping/unnesting. This means that if there's a size-0 element (like NULL or an empty data frame), that entire row will be dropped from the output. If you want to preserve all rows, use keep_empty=True to replace size-0 elements with a single row of missing values.
  • dtypes (optional) The dtypes for the output columns. Could be a single dtype, which will be applied to all columns, or a dictionary of dtypes with the columns as keys and the dtypes as values.
  • names_sep (str, optional) If None, the default, the names will be left as is. Inner names will come from the former outer names. If a string, the inner and outer names will be used together. The names of the new outer columns will be formed by pasting together the outer and the inner column names, separated by names_sep.
  • names_repair (Union, optional) treatment of problematic column names:
    • - "minimal": No name repair or checks, beyond basic existence,
    • - "unique": Make sure names are unique and not empty,
    • - "check_unique": (default value), no name repair,
        but check they are unique,
    • - "universal": Make the names unique and syntactic
    • - a function: apply custom name repair
Returns (Any)

Data frame with selected columns unnested.

function

datar.apis.tidyr.pack(_data, _names_sep=None, **cols) → Any

Makes df narrow by collapsing a set of columns into a single df-column.

Parameters
  • _data A data frame
  • _names_sep (str, optional) If None, the default, the names will be left as is. Inner names will come from the former outer names. If a string, the inner and outer names will be used together. The names of the new outer columns will be formed by pasting together the outer and the inner column names, separated by _names_sep.
  • **cols (str | int) Columns to pack
function

datar.apis.tidyr.unpack(data, cols, names_sep=None, names_repair='check_unique')

Makes df wider by expanding df-columns back out into individual columns.

For empty columns, the column is kept as is instead of being removed.

Parameters
  • data A data frame
  • cols Columns to unpack
  • names_sep (str, optional) If None, the default, the names will be left as is. Inner names will come from the former outer names. If a string, the inner and outer names will be used together. The names of the new outer columns will be formed by pasting together the outer and the inner column names, separated by names_sep.
  • names_repair (optional) treatment of problematic column names:
    • - "minimal": No name repair or checks, beyond basic existence,
    • - "unique": Make sure names are unique and not empty,
    • - "check_unique": (default value), no name repair,
        but check they are unique,
    • - "universal": Make the names unique and syntactic
    • - a function: apply custom name repair
Returns (Any)

Data frame with given columns unpacked.
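
The relationship between pack and unpack, and the role of names_sep when unpacking, can be sketched on a single row represented as a dict. Both helper names are hypothetical and a nested dict stands in for a df-column; this is not how datar stores packed data.

```python
def pack_sketch(row, **cols):
    """Collapse the named groups of columns into nested dicts (illustration only)."""
    packed = {name for names in cols.values() for name in names}
    out = {k: v for k, v in row.items() if k not in packed}
    for new_name, names in cols.items():
        out[new_name] = {n: row[n] for n in names}
    return out


def unpack_sketch(row, col, names_sep=None):
    """Expand the nested dict in `col` back into top-level columns."""
    out = {k: v for k, v in row.items() if k != col}
    for inner, value in row[col].items():
        # With names_sep, outer and inner names are pasted together.
        name = inner if names_sep is None else f"{col}{names_sep}{inner}"
        out[name] = value
    return out


row = {"id": 1, "x1": 2, "x2": 3}
packed = pack_sketch(row, x=["x1", "x2"])
plain = unpack_sketch(packed, "x")
prefixed = unpack_sketch(packed, "x", names_sep="_")
```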

function

datar.apis.tidyr.expand(data, *args, _name_repair='check_unique', **kwargs)

Generates all combinations of variables found in a dataset.

Parameters
  • data A data frame
  • *args Columns to expand, passed positionally; see **kwargs.
  • _name_repair (Union, optional) treatment of problematic column names:
    • - "minimal": No name repair or checks, beyond basic existence,
    • - "unique": Make sure names are unique and not empty,
    • - "check_unique": (default value), no name repair,
        but check they are unique,
    • - "universal": Make the names unique and syntactic
    • - a function: apply custom name repair
  • **kwargs columns to expand. Columns can be atomic lists.
    • - To find all unique combinations of x, y and z, including
        those not present in the data, supply each variable as a
        separate argument: expand(df, x, y, z).
    • - To find only the combinations that occur in the data, use
        nesting: expand(df, nesting(x, y, z)).
    • - You can combine the two forms. For example,
        expand(df, nesting(school_id, student_id), date) would
        produce a row for each present school-student combination
        for all possible dates.
Returns (Any)

A data frame with all combinations of variables.
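
The difference between expanding separate variables and expanding a nesting() of them can be shown with the standard library alone. The sketch below works on tuples rather than a data frame and only illustrates the two notions of "combination" described in the parameter list; it does not call datar.

```python
from itertools import product

# Observed (x, y) pairs; ("b", 2) never occurs in the data.
rows = [("a", 1), ("a", 2), ("b", 1)]

xs = sorted({x for x, _ in rows})
ys = sorted({y for _, y in rows})

# Analogue of expand(df, x, y): every combination of the unique values.
all_combos = list(product(xs, ys))

# Analogue of expand(df, nesting(x, y)): only combinations present in the data.
present = sorted(set(rows))
```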

function

datar.apis.tidyr.nesting(*args, _name_repair='check_unique', **kwargs)

A helper that only finds combinations already present in the data.

Parameters
  • *args Columns to expand, passed positionally; see **kwargs.
  • _name_repair (Union, optional) treatment of problematic column names:
    • - "minimal": No name repair or checks, beyond basic existence,
    • - "unique": Make sure names are unique and not empty,
    • - "check_unique": (default value), no name repair,
        but check they are unique,
    • - "universal": Make the names unique and syntactic
    • - a function: apply custom name repair
  • **kwargs columns to expand. Columns can be atomic lists.
    • - To find all unique combinations of x, y and z, including
        those not present in the data, supply each variable as a
        separate argument: expand(df, x, y, z).
    • - To find only the combinations that occur in the data, use
        nesting: expand(df, nesting(x, y, z)).
    • - You can combine the two forms. For example,
        expand(df, nesting(school_id, student_id), date) would
        produce a row for each present school-student combination
        for all possible dates.
Returns (Any)

A data frame with all combinations in data.

function

datar.apis.tidyr.crossing(*args, _name_repair='check_unique', **kwargs)

A wrapper around expand_grid() that de-duplicates and sorts its inputs

When values are not specified by literal list, they will be sorted.

Parameters
  • *args Columns to expand, passed positionally; see **kwargs.
  • _name_repair (Union, optional) treatment of problematic column names:
    • - "minimal": No name repair or checks, beyond basic existence,
    • - "unique": Make sure names are unique and not empty,
    • - "check_unique": (default value), no name repair,
        but check they are unique,
    • - "universal": Make the names unique and syntactic
    • - a function: apply custom name repair
  • **kwargs columns to expand. Columns can be atomic lists.
    • - To find all unique combinations of x, y and z, including
        those not present in the data, supply each variable as a
        separate argument: expand(df, x, y, z).
    • - To find only the combinations that occur in the data, use
        nesting: expand(df, nesting(x, y, z)).
    • - You can combine the two forms. For example,
        expand(df, nesting(school_id, student_id), date) would
        produce a row for each present school-student combination
        for all possible dates.
Returns (Any)

A data frame with values deduplicated and sorted.

function

datar.apis.tidyr.complete(data, *args, fill=None, explict=True)

Turns implicit missing values into explicit missing values.

Parameters
  • data A data frame
  • *args columns to expand. Columns can be atomic lists.
    • - To find all unique combinations of x, y and z, including
        those not present in the data, supply each variable as a
        separate argument: expand(df, x, y, z).
    • - To find only the combinations that occur in the data, use
        nesting: expand(df, nesting(x, y, z)).
    • - You can combine the two forms. For example,
        expand(df, nesting(school_id, student_id), date) would
        produce a row for each present school-student combination
        for all possible dates.
  • fill (optional) A named list that for each variable supplies a single value to use instead of NA for missing combinations.
  • explict (bool, optional) Should both implicit (newly created) and explicit (pre-existing) missing values be filled by fill? By default, this is True, but if set to False this will limit the fill to only implicit missing values.
Returns (Any)

Data frame with missing values completed
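
A plain-Python sketch of the idea: build every combination of the key columns, keep the rows that already exist, and add explicit rows for the missing combinations, using fill where a value is supplied. complete_sketch is a hypothetical helper, None stands in for NA, and the explict switch is not modelled.

```python
from itertools import product


def complete_sketch(rows, keys, fill=None):
    """Make missing key combinations explicit (illustration only)."""
    fill = fill or {}
    value_cols = [c for c in rows[0] if c not in keys]
    by_key = {tuple(r[k] for k in keys): r for r in rows}
    levels = [sorted({r[k] for r in rows}) for k in keys]
    out = []
    for combo in product(*levels):
        if combo in by_key:
            out.append(dict(by_key[combo]))
        else:
            new = dict(zip(keys, combo))
            for c in value_cols:
                new[c] = fill.get(c)  # None when no fill value is given
            out.append(new)
    return out


rows = [{"g": "a", "t": 1, "n": 5}, {"g": "b", "t": 2, "n": 7}]
completed = complete_sketch(rows, ["g", "t"], fill={"n": 0})
```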

function

datar.apis.tidyr.drop_na(_data, *columns, _how='any')

Drop rows containing missing values

See https://tidyr.tidyverse.org/reference/drop_na.html

Parameters
  • *columns (str) Columns to inspect for missing values.
  • _how (str, optional) How to select the rows to drop (tidyr doesn't support this argument):
    • - all: Drop a row only if all of the selected columns are NA
    • - any: Drop a row if any of the selected columns is NA
  • _data A data frame.
Returns (Any)

Dataframe with rows containing NAs removed and indexes dropped
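
The two _how modes can be sketched with the built-ins any and all. The helper below is hypothetical, treats None as NA, and works on a list of dicts; it only illustrates the selection rule.

```python
def drop_na_sketch(rows, columns=None, how="any"):
    """Drop rows whose inspected columns are missing (illustration only)."""
    check = any if how == "any" else all
    kept = []
    for row in rows:
        cols = columns or list(row)  # inspect every column by default
        if not check(row[c] is None for c in cols):
            kept.append(row)
    return kept


rows = [{"x": 1, "y": None}, {"x": None, "y": None}, {"x": 2, "y": 3}]
any_dropped = drop_na_sketch(rows)
all_dropped = drop_na_sketch(rows, how="all")
```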

function

datar.apis.tidyr.extract(data, col, into, regex='(\\w+)', remove=True, convert=False)

Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA.

See https://tidyr.tidyverse.org/reference/extract.html

Parameters
  • data The dataframe
  • col (str | int) Column name or position.
  • into Names of new variables to create as character vector. Use None to omit the variable in the output.
  • regex (str, optional) A regular expression used to extract the desired values. There should be one group (defined by ()) for each element of into.
  • remove (bool, optional) If True, remove input column from output data frame.
  • convert (optional) The universal type for the extracted columns or a dict for individual ones.
Returns (Any)

Dataframe with extracted columns.
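
The mapping from capturing groups to new columns can be sketched with the standard re module. extract_sketch is a hypothetical helper operating on a list of strings, with None standing in for NA; it ignores the remove and convert options.

```python
import re


def extract_sketch(values, into, regex=r"(\w+)"):
    """Turn each capturing group into a named field (illustration only)."""
    pattern = re.compile(regex)
    out = []
    for value in values:
        match = pattern.search(value) if value is not None else None
        # No match, or a missing input, yields missing output for every group.
        groups = match.groups() if match else (None,) * len(into)
        out.append({name: g for name, g in zip(into, groups) if name is not None})
    return out


extracted = extract_sketch(
    ["a-1", "b-2", None, "zz"], ["letter", "number"], r"([a-z]+)-(\d+)"
)
```

Passing None as an entry of into skips that group, which mirrors the documented way of omitting a variable.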

function

datar.apis.tidyr.fill(_data, *columns, _direction='down')

Fills missing values in selected columns using the next or previous entry.

See https://tidyr.tidyverse.org/reference/fill.html

Parameters
  • _data A dataframe
  • *columns (str | int) Columns to fill
  • _direction (str, optional) Direction in which to fill missing values. Currently either "down" (the default), "up", "downup" (i.e. first down and then up) or "updown" (first up and then down).
Returns (Any)

The dataframe with NAs being replaced.
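
The four directions can be illustrated on a single column held as a Python list, with None standing in for NA. fill_sketch is a hypothetical helper and says nothing about how datar implements the fill.

```python
def fill_sketch(values, direction="down"):
    """Fill None entries from neighbouring values (illustration only)."""

    def down(seq):
        out, last = [], None
        for v in seq:
            if v is not None:
                last = v
            out.append(last)  # carry the last seen value forward
        return out

    def up(seq):
        return down(seq[::-1])[::-1]  # filling up is filling down in reverse

    steps = {"down": [down], "up": [up], "downup": [down, up], "updown": [up, down]}
    result = list(values)
    for step in steps[direction]:
        result = step(result)
    return result


column = [None, 1, None, None, 2, None]
filled_down = fill_sketch(column, "down")      # [None, 1, 1, 1, 2, 2]
filled_up = fill_sketch(column, "up")          # [1, 1, 2, 2, 2, None]
filled_downup = fill_sketch(column, "downup")  # [1, 1, 1, 1, 2, 2]
filled_updown = fill_sketch(column, "updown")  # [1, 1, 2, 2, 2, 2]
```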

function

datar.apis.tidyr.pivot_longer(_data, cols, names_to='name', names_prefix=None, names_sep=None, names_pattern=None, names_dtypes=None, names_transform=None, names_repair='check_unique', values_to='value', values_drop_na=False, values_dtypes=None, values_transform=None)

"lengthens" data, increasing the number of rows and decreasing the number of columns.

The row order is a bit different from tidyr and pandas.DataFrame.melt.

>>> from datar.all import c, f, tibble, pivot_longer
>>> df = tibble(x=c[1:2], y=c[3:4])
>>> pivot_longer(df, f[f.x:f.y])
>>> #    name   value
>>> # 0  x      1
>>> # 1  x      2
>>> # 2  y      3
>>> # 3  y      4
With `tidyr::pivot_longer`, the output will be:
>>> # # A tibble: 4 x 2
>>> # name  value
>>> # <chr> <int>
>>> # 1 x   1
>>> # 2 y   3
>>> # 3 x   2
>>> # 4 y   4

Parameters
  • _data A data frame to pivot.
  • cols Columns to pivot into longer format.
  • names_to (optional) A string specifying the name of the column to create from the data stored in the column names of data. Can be a character vector, creating multiple columns, if names_sep or names_pattern is provided. In this case, there are two special values you can take advantage of:
    • - None/NA/NULL will discard that component of the name.
    • - .value/_value indicates that component of the name defines
        the name of the column containing the cell values,
        overriding values_to.
    • - Unlike tidyr: with .value/_value, if there are other
        parts of the names to distinguish the groups, they must be
        captured. For example, use r'(\w)_(\d)' to match 'a_1' and
        ['.value', NA] to discard the suffix, instead of using
        r'(\w)_\d' to match.
  • names_prefix (str, optional) A regular expression used to remove matching text from the start of each variable name.
  • names_sep (str, optional) Together with names_pattern, controls how column names are split when names_to contains multiple entries.
  • names_pattern (str, optional) Takes the same specification as extract(), a regular expression containing matching groups (()).
  • names_dtypes (optional) See values_dtypes.
  • names_transform (Union, optional) See values_transform.
  • names_repair (optional) treatment of problematic column names:
    • - "minimal": No name repair or checks, beyond basic existence,
    • - "unique": Make sure names are unique and not empty,
    • - "check_unique": (default value), no name repair,
        but check they are unique,
    • - "universal": Make the names unique and syntactic
    • - a function: apply custom name repair
  • values_to (str, optional) A string specifying the name of the column to create from the data stored in cell values. If names_to is a character containing the special .value/_value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.
  • values_drop_na (bool, optional) If True, will drop rows that contain only NAs in the values_to column. This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.
  • values_dtypes (optional) A list of column name-prototype pairs. A prototype (or dtypes for short) is a zero-length vector (like integer() or numeric()) that defines the type, class, and attributes of a vector. Use these arguments if you want to confirm that the created columns are the types that you expect. Note that if you want to change (instead of confirm) the types of specific columns, you should use names_transform or values_transform instead.
  • values_transform (Union, optional) A list of column name-function pairs. Use these arguments if you need to change the types of specific columns. For example, names_transform = dict(week = as.integer) would convert a character variable called week to an integer. If not specified, the type of the columns generated from names_to will be character, and the type of the variables generated from values_to will be the common type of the input columns used to generate them.
Returns (Any)

The pivoted dataframe.
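
The two row orders shown in the example above correspond to walking the pivoted columns column by column versus row by row. The sketch below reproduces both orders with plain Python on the same x/y data; it does not call datar or tidyr.

```python
data = {"x": [1, 2], "y": [3, 4]}
n_rows = 2

# Column by column: all values of x, then all values of y
# (the order shown for datar in the example above).
column_major = [(name, v) for name, col in data.items() for v in col]

# Row by row: x and y of the first row, then of the second
# (the order shown for tidyr::pivot_longer).
row_major = [(name, data[name][i]) for i in range(n_rows) for name in data]
```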

function

datar.apis.tidyr.pivot_wider(_data, id_cols=None, names_from='name', names_prefix='', names_sep='_', names_glue=None, names_sort=False, values_from='value', values_fill=None, values_fn=None)

"widens" data, increasing the number of columns and decreasing the number of rows.

Parameters
  • _data A data frame to pivot.
  • id_cols (optional) A set of columns that uniquely identifies each observation. Defaults to all columns in data except for the columns specified in names_from and values_from.
  • names_from (optional) Column (or columns) to get the name of the output column from; see values_from.
  • names_prefix (str, optional) String added to the start of every variable name.
  • names_sep (str, optional) If names_from or values_from contains multiple variables, this will be used to join their values together into a single string to use as a column name.
  • names_glue (str, optional) Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special _value) to create custom column names.
  • names_sort (bool, optional) Should the column names be sorted? If False, the default, column names are ordered by first appearance.
  • values_from (optional) A pair of arguments describing which column (or columns) to get the name of the output column (names_from), and which column (or columns) to get the cell values from (values_from).
  • values_fill (optional) Optionally, a (scalar) value that specifies what each value should be filled in with when missing.
  • values_fn (Union, optional) Optionally, a function applied to the value in each cell in the output. You will typically use this when the combination of id_cols and value column does not uniquely identify an observation. This can be a dict if you want to apply different aggregations to different value columns. If not specified, numpy.mean is used.
  • names_repair todo
Returns (Any)

The pivoted dataframe.
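
How values_fn and values_fill interact can be sketched without a data frame library. The helper below is hypothetical, handles a single id column, and uses statistics.mean in place of the numpy.mean default mentioned above.

```python
from statistics import mean


def pivot_wider_sketch(rows, id_col, names_from="name", values_from="value",
                       values_fill=None, values_fn=mean):
    """Spread name/value pairs into columns (illustration only)."""
    names = []  # output columns, in order of first appearance
    cells = {}  # id -> {name: [values falling into that cell]}
    for r in rows:
        if r[names_from] not in names:
            names.append(r[names_from])
        cell = cells.setdefault(r[id_col], {}).setdefault(r[names_from], [])
        cell.append(r[values_from])
    out = []
    for key, by_name in cells.items():
        wide = {id_col: key}
        for name in names:
            # Duplicates are aggregated; absent cells take values_fill.
            wide[name] = values_fn(by_name[name]) if name in by_name else values_fill
        out.append(wide)
    return out


rows = [
    {"id": 1, "name": "a", "value": 1},
    {"id": 1, "name": "a", "value": 3},
    {"id": 1, "name": "b", "value": 5},
    {"id": 2, "name": "a", "value": 7},
]
wide = pivot_wider_sketch(rows, "id", values_fill=0)
```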

function

datar.apis.tidyr.separate(data, col, into, sep='[^0-9A-Za-z]+', remove=True, convert=False, extra='warn', fill='warn')

Given either a regular expression or a vector of character positions, turns a single character column into multiple columns.

Parameters
  • data The dataframe
  • col (int | str) Column name or position.
  • into Names of new variables to create as character vector. Use None/NA/NULL to omit the variable in the output.
  • sep (int | str, optional) Separator between columns. If str, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values. If int, sep is interpreted as character positions to split at.
  • remove (bool, optional) If True, remove input column from output data frame.
  • convert (optional) The universal type for the extracted columns or a dict for individual ones. Note that when given True, DataFrame.convert_dtypes() is called, but it will not convert str to other types (for example, '1' to 1). You have to specify the dtype yourself.
  • extra (str, optional) If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:
    • - "warn" (the default): emit a warning and drop extra values.
    • - "drop": drop any extra values without a warning.
    • - "merge": only splits at most length(into) times
  • fill (str, optional) If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:
    • - "warn" (the default): emit a warning and fill from the right
    • - "right": fill with missing values on the right
    • - "left": fill with missing values on the left
Returns (Any)

Dataframe with separated columns.
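
The extra and fill options can be illustrated on a single string with the standard re module. separate_sketch is a hypothetical helper; in particular, reading "merge" as "split into at most len(into) pieces" is an assumption of this sketch, and None stands in for NA.

```python
import re
import warnings


def separate_sketch(value, into, sep=r"[^0-9A-Za-z]+", extra="warn", fill="warn"):
    """Split one string into len(into) named pieces (illustration only)."""
    n = len(into)
    if extra == "merge":
        # Assumption: stop splitting once n pieces exist; the rest stays merged.
        pieces = re.split(sep, value, maxsplit=n - 1)
    else:
        pieces = re.split(sep, value)
    if len(pieces) > n:
        if extra == "warn":
            warnings.warn("extra pieces dropped")
        pieces = pieces[:n]
    elif len(pieces) < n:
        if fill == "warn":
            warnings.warn("missing pieces filled with None")
        pad = [None] * (n - len(pieces))
        pieces = pad + pieces if fill == "left" else pieces + pad
    return dict(zip(into, pieces))


dropped = separate_sketch("a-b-c", ["x", "y"], extra="drop")
merged = separate_sketch("a-b-c", ["x", "y"], extra="merge")
right = separate_sketch("a", ["x", "y"], fill="right")
left = separate_sketch("a", ["x", "y"], fill="left")
```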

function

datar.apis.tidyr.separate_rows(data, *columns, sep='[^0-9A-Za-z]+', convert=False)

Separates the values and places each one in its own row.

Parameters
  • data The dataframe
  • *columns (str) The columns to separate on
  • sep (str, optional) Separator between columns.
  • convert (optional) The universal type for the extracted columns or a dict for individual ones.
Returns (Any)

Dataframe with rows separated and repeated.
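
A short sketch of the row-wise behaviour: each delimited value is split with the default separator pattern and the other columns are repeated. separate_rows_sketch is a hypothetical helper for a single column on a list of dicts.

```python
import re


def separate_rows_sketch(rows, column, sep=r"[^0-9A-Za-z]+"):
    """Place each separated value in its own row (illustration only)."""
    out = []
    for row in rows:
        for piece in re.split(sep, row[column]):
            out.append({**row, column: piece})  # other columns are repeated
    return out


rows = [{"id": 1, "tags": "a,b"}, {"id": 2, "tags": "c"}]
long_rows = separate_rows_sketch(rows, "tags")
```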

function

datar.apis.tidyr.uncount(data, weights, _remove=True, _id=None)

Duplicating rows according to a weighting variable

Parameters
  • data A data frame
  • weights A vector of weights. Evaluated in the context of data
  • _remove (bool, optional) If True, and weights is the name of a column in data, then this column is removed.
  • _id (str, optional) Supply a string to create a new variable which gives a unique identifier for each created row (0-based).
Returns (Any)

dataframe with rows repeated.
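
The weighting and the 0-based identifier can be sketched as follows. uncount_sketch is a hypothetical helper on a list of dicts; its remove and id_col parameters stand in for _remove and _id, and weights is assumed to be a column name.

```python
def uncount_sketch(rows, weights, remove=True, id_col=None):
    """Repeat each row according to its weight (illustration only)."""
    out = []
    for row in rows:
        base = {k: v for k, v in row.items() if not (remove and k == weights)}
        for i in range(row[weights]):
            new = dict(base)
            if id_col is not None:
                new[id_col] = i  # 0-based index within the repeated row
            out.append(new)
    return out


rows = [{"x": "a", "n": 2}, {"x": "b", "n": 1}]
repeated = uncount_sketch(rows, "n", id_col="id")
```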

function

datar.apis.tidyr.unite(data, col, *columns, sep='_', remove=True, na_rm=True)

Unite multiple columns into one by pasting strings together

Parameters
  • data A data frame.
  • col (str) The name of the new column, as a string or symbol.
  • *columns (str | int) Columns to unite
  • sep (str, optional) Separator to use between values.
  • remove (bool, optional) If True, remove input columns from output data frame.
  • na_rm (bool, optional) If True, missing values will be removed prior to uniting each value.
Returns (Any)

The dataframe with selected columns united
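
The effect of na_rm and remove can be shown on a single row held as a dict. unite_sketch is a hypothetical helper; None stands in for NA, and the new column is simply appended at the end, which may not match where datar places it.

```python
def unite_sketch(row, col, columns, sep="_", remove=True, na_rm=True):
    """Paste several fields into one string field (illustration only)."""
    parts = [row[c] for c in columns]
    if na_rm:
        parts = [p for p in parts if p is not None]  # skip missing values
    out = {k: v for k, v in row.items() if not (remove and k in columns)}
    out[col] = sep.join(str(p) for p in parts)
    return out


row = {"id": 1, "a": "x", "b": None, "c": "z"}
united = unite_sketch(row, "abc", ["a", "b", "c"])
kept = unite_sketch(row, "abc", ["a", "b", "c"], remove=False)
```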

function

datar.apis.tidyr.replace_na(data, data_or_replace=None, replace=None)

Replace NA with a value

This function can also be used as a plain function rather than a verb. When it is called as an argument inside a verb, data is passed implicitly, so data_or_replace can be used to pass the data to replace.

Parameters
  • data The data piped in
  • data_or_replace (optional) When called as argument of a verb, this is the data to replace. Otherwise this is the replacement.
  • replace (optional) The value to replace with. Can only be a scalar or dict for data frame. Replacing NA with a list is not supported yet.
Returns (Any)

Corresponding data with NAs replaced
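
The scalar and dict forms of replace can be sketched on a list of dicts, with None standing in for NA. replace_na_sketch is a hypothetical helper and covers only the verb-style call on a whole data set.

```python
def replace_na_sketch(rows, replace):
    """Replace None with a scalar, or per column with a dict (illustration only)."""
    out = []
    for row in rows:
        new = {}
        for key, value in row.items():
            if value is None:
                # A dict replaces only the columns it names.
                value = replace.get(key) if isinstance(replace, dict) else replace
            new[key] = value
        out.append(new)
    return out


rows = [{"x": None, "y": 1}, {"x": 2, "y": None}]
with_scalar = replace_na_sketch(rows, 0)
with_dict = replace_na_sketch(rows, {"x": -1})
```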