Skip to content

Home

pipda

PyPI GitHub Codacy grade Codacy coverage Docs CI

A framework for data piping in Python.

Inspired by siuba, dfply, plydata, and dplython. Provides simple yet powerful APIs to mimic dplyr and tidyr in Python.

API | Changelog | Documentation

Installation

pip install -U pipda

Usage

Verbs

  • A verb is pipeable (able to be called like data >> verb(...))
  • A verb is dispatchable by the type of its first argument
  • A verb evaluates other arguments using the first one
  • A verb is passing down the context if not specified in the arguments
import pandas as pd
from pipda import (
    register_verb,
    register_func,
    register_operator,
    evaluate_expr,
    Operator,
    Symbolic,
    Context
)

f = Symbolic()

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']
})

df

#      x    y
# 0    0    zero
# 1    1    one
# 2    2    two
# 3    3    three

@register_verb(pd.DataFrame)
def head(data, n=5):
    return data.head(n)

df >> head(2)
#      x    y
# 0    0    zero
# 1    1    one

@register_verb(pd.DataFrame, context=Context.EVAL)
def mutate(data, **kwargs):
    data = data.copy()
    for key, val in kwargs.items():
        data[key] = val
    return data

df >> mutate(z=1)
#    x      y  z
# 0  0   zero  1
# 1  1    one  1
# 2  2    two  1
# 3  3  three  1

df >> mutate(z=f.x)
#    x      y  z
# 0  0   zero  0
# 1  1    one  1
# 2  2    two  2
# 3  3  three  3

Functions used as verb arguments

# verb can be used as an argument passed to another verb
# dependent=True makes the `data` argument invisible while calling
@register_verb(pd.DataFrame, context=Context.EVAL, dependent=True)
def if_else(data, cond, true, false):
    cond.loc[cond.isin([True]), ] = true
    cond.loc[cond.isin([False]), ] = false
    return cond

# The function is then also a singledispatch generic function

df >> mutate(z=if_else(f.x>1, 20, 10))
#    x      y   z
# 0  0   zero  10
# 1  1    one  10
# 2  2    two  20
# 3  3  three  20
# function without data argument
@register_func
def length(strings):
    return [len(s) for s in strings]

df >> mutate(z=length(f.y))

#    x     y    z
# 0  0  zero    4
# 1  1   one    3
# 2  2   two    3
# 3  3 three    5

Context

The context defines how a reference (f.A, f['A'], f.A.B) is evaluated

@register_verb(pd.DataFrame, context=Context.SELECT)
def select(df, *columns):
    return df[list(columns)]

df >> select(f.x, f.y)
#    x     y
# 0  0  zero
# 1  1   one
# 2  2   two
# 3  3 three

How it works

data %>% verb(arg1, ..., key1=kwarg1, ...)

The above is a typical dplyr/tidyr data piping syntax.

The Python counterpart is:

data >> verb(arg1, ..., key1=kwarg1, ...)

To implement this, execution of the verb must be deferred by turning it into a VerbCall object that holds the function and its arguments. The VerbCall is not evaluated until data is piped in via >>. This detection is made possible by the executing package, which inspects the AST to determine whether a function call appears on the right-hand side of a pipe operator.

Arguments that reference columns of the data must also be deferred. For example, in dplyr (R):

data %>% mutate(z = a)

This adds a column z with values from column a. In Python, the equivalent is:

data >> mutate(z=f.a)

Here f.a is a Reference object that captures the column name without immediately fetching the data.

The Symbolic object f acts as a proxy, chaining attribute/item accesses and operator expressions into a single Expression tree. That tree is later evaluated when data and context become available.

Documentation

https://pwwang.github.io/pipda/

See datar for real-world usage.