pipda
A framework for data piping in Python.
Inspired by siuba, dfply, plydata, and dplython. Provides simple yet powerful APIs to mimic dplyr and tidyr in Python.
Installation
pip install -U pipda
Usage
Verbs
- A verb is pipeable (it can be called as data >> verb(...))
- A verb is dispatchable by the type of its first argument
- A verb evaluates its other arguments using the first one
- A verb passes its context down to its arguments unless one is specified for them
import pandas as pd
from pipda import (
    register_verb,
    register_func,
    register_operator,
    evaluate_expr,
    Operator,
    Symbolic,
    Context
)
f = Symbolic()
df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']
})
df
#    x      y
# 0  0   zero
# 1  1    one
# 2  2    two
# 3  3  three
@register_verb(pd.DataFrame)
def head(data, n=5):
    return data.head(n)

df >> head(2)
#    x     y
# 0  0  zero
# 1  1   one
@register_verb(pd.DataFrame, context=Context.EVAL)
def mutate(data, **kwargs):
    # Work on a copy so the piped-in frame is not modified
    data = data.copy()
    for key, val in kwargs.items():
        data[key] = val
    return data

df >> mutate(z=1)
#    x      y  z
# 0  0   zero  1
# 1  1    one  1
# 2  2    two  1
# 3  3  three  1
df >> mutate(z=f.x)
#    x      y  z
# 0  0   zero  0
# 1  1    one  1
# 2  2    two  2
# 3  3  three  3
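The "dispatchable by the type of its first argument" property can be illustrated without pipda at all. The following is a minimal sketch using the standard library's functools.singledispatch; the toy `head` below and its list, str, and dict implementations are hypothetical and only mirror the idea, not pipda's own registration code.

```python
from functools import singledispatch


@singledispatch
def head(data, n=5):
    # Fallback for types without a registered implementation
    raise NotImplementedError(f"head() is not implemented for {type(data).__name__}")


@head.register(list)
def _head_list(data, n=5):
    # Lists: keep the first n elements
    return data[:n]


@head.register(str)
def _head_str(data, n=5):
    # Strings: keep the first n characters
    return data[:n]


@head.register(dict)
def _head_dict(data, n=5):
    # Dicts: keep the first n key/value pairs in insertion order
    return dict(list(data.items())[:n])
```

Calling `head([1, 2, 3], 2)` runs the list implementation, while `head("abcdef", 3)` runs the string one; the choice depends only on the type of the first argument.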
Functions used as verb arguments
# verb can be used as an argument passed to another verb
# dependent=True makes the `data` argument invisible while calling
@register_verb(pd.DataFrame, context=Context.EVAL, dependent=True)
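def if_else(data, cond, true, false):
    # Fill an object-dtype copy so the boolean input is left untouched and
    # the second mask cannot match values written by the first assignment
    result = cond.astype(object)
    result.loc[cond.isin([True])] = true
    result.loc[cond.isin([False])] = false
    return result

# The function is then also a singledispatch generic function
df >> mutate(z=if_else(f.x>1, 20, 10))
#    x      y   z
# 0  0   zero  10
# 1  1    one  10
# 2  2    two  20
# 3  3  three  20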
# function without data argument
@register_func
def length(strings):
    return [len(s) for s in strings]

df >> mutate(z=length(f.y))
#    x      y  z
# 0  0   zero  4
# 1  1    one  3
# 2  2    two  3
# 3  3  three  5
Context
The context defines how a reference (f.A, f['A'], f.A.B) is evaluated. A verb declares its context when it is registered:
@register_verb(pd.DataFrame, context=Context.SELECT)
def select(df, *columns):
    return df[list(columns)]

df >> select(f.x, f.y)
#    x      y
# 0  0   zero
# 1  1    one
# 2  2    two
# 3  3  three
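To make the role of the context more concrete, here is a small standalone sketch. It assumes that a SELECT-style context resolves a reference to the column name and an EVAL-style context resolves it to the column's values, which is what the `select` and `mutate` examples above suggest. `ToyContext`, `ToyReference`, and `resolve` are hypothetical names, not part of pipda.

```python
from enum import Enum


class ToyContext(Enum):
    # Hypothetical stand-ins for the two contexts used in the examples above
    SELECT = "select"
    EVAL = "eval"


class ToyReference:
    """A deferred column reference that only remembers the column name."""

    def __init__(self, name):
        self.name = name


def resolve(ref, data, context):
    # Assumption: SELECT yields the name, EVAL yields the values from the data
    if context is ToyContext.SELECT:
        return ref.name
    if context is ToyContext.EVAL:
        return data[ref.name]
    raise ValueError(f"unknown context: {context!r}")


data = {"x": [0, 1, 2, 3], "y": ["zero", "one", "two", "three"]}
ref = ToyReference("x")

name = resolve(ref, data, ToyContext.SELECT)   # the column name
values = resolve(ref, data, ToyContext.EVAL)   # the column values
```

The same reference object produces different results depending on the context it is resolved under, which is why a verb like `select` can receive names while a verb like `mutate` receives data.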
How it works
data %>% verb(arg1, ..., key1=kwarg1, ...)
The above is a typical dplyr/tidyr data piping syntax.
The Python counterpart is:
data >> verb(arg1, ..., key1=kwarg1, ...)
To implement this, execution of the verb must be deferred by turning it into a VerbCall object that holds the function and its arguments. The VerbCall is not evaluated until data is piped in via >>. This detection is made possible by the executing package, which inspects the AST to determine whether a function call appears on the right-hand side of a pipe operator.
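The deferral described above can be sketched in plain Python with the reflected right-shift operator. This is a simplified illustration under stated assumptions: the `VerbCall` class and `toy_verb` decorator below are hypothetical, and the wrapper always defers the call instead of inspecting the AST as pipda does.

```python
class VerbCall:
    """Hypothetical stand-in: stores a function and its arguments until data arrives."""

    def __init__(self, func, args, kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Python calls this for `data >> VerbCall(...)` when type(data)
        # does not handle `>>` itself; the deferred call runs here
        return self.func(data, *self.args, **self.kwargs)


def toy_verb(func):
    # Always defer: calling the verb only records the arguments
    def wrapper(*args, **kwargs):
        return VerbCall(func, args, kwargs)
    return wrapper


@toy_verb
def head(data, n=5):
    return data[:n]


@toy_verb
def scale(data, factor=1):
    return [x * factor for x in data]


# Pipes are left-associative, so head runs first and feeds scale
result = [1, 2, 3, 4] >> head(2) >> scale(factor=10)
```

Because this toy wrapper defers unconditionally, a direct call such as `head([1, 2, 3, 4], 2)` would not return data. Deciding between a deferred and a direct call is the gap that the AST-based detection is described as filling.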
Arguments that reference columns of the data must also be deferred. For example, in dplyr (R):
data %>% mutate(z = a)
This adds a column z with values from column a. In Python, the equivalent is:
data >> mutate(z=f.a)
Here f.a is a Reference object that captures the column name without immediately fetching the data.
The Symbolic object f acts as a proxy, chaining attribute/item accesses and operator expressions into a single Expression tree. That tree is later evaluated when data and context become available.
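The proxy idea can be shown with a small self-contained sketch. It is not pipda's implementation: the classes below reuse the names `Symbolic`, `Reference`, and `Expression` purely for illustration, `BinaryOp` and `evaluate` are hypothetical helpers, and a plain dict of scalars stands in for a DataFrame.

```python
import operator


class Expression:
    """Base class: attribute/item access and operators build new tree nodes."""

    def __getattr__(self, name):
        # Only reached when normal lookup fails; leave dunder probes alone
        if name.startswith("__"):
            raise AttributeError(name)
        return Reference(self, name)

    def __getitem__(self, key):
        return Reference(self, key)

    def __add__(self, other):
        return BinaryOp(operator.add, self, other)

    def __mul__(self, other):
        return BinaryOp(operator.mul, self, other)


class Symbolic(Expression):
    def evaluate(self, data):
        # The root of every tree evaluates to the data itself
        return data


class Reference(Expression):
    def __init__(self, parent, name):
        self.parent = parent
        self.name = name

    def evaluate(self, data):
        # Resolve the parent first so chains like f.a.b work
        return self.parent.evaluate(data)[self.name]


class BinaryOp(Expression):
    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def evaluate(self, data):
        return self.op(evaluate(self.left, data), evaluate(self.right, data))


def evaluate(node, data):
    # Plain values pass through; expression nodes are evaluated against data
    return node.evaluate(data) if isinstance(node, Expression) else node


f = Symbolic()
expr = f.x * 2 + f["y"]            # nothing is computed yet
value = evaluate(expr, {"x": 3, "y": 4})
nested = evaluate(f.a.b, {"a": {"b": 7}})
```

Building `expr` touches no data; the arithmetic happens only when `evaluate` walks the tree with a concrete mapping. This sketch has no notion of context, so every reference is resolved as a lookup.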
Documentation
https://pwwang.github.io/pipda/
See datar for real-world usage.