module

pipen.channel

Provides functions for creating and modifying channels (dataframes).

Classes
  • Channel A DataFrame wrapper with creators
Functions
  • collapse_files(data, col) (DataFrame) Collapse a Channel according to the files in col; other columns use the values from row 0.
  • expand_dir(data, col, pattern, ftype, sortby, reverse) (DataFrame) Expand a Channel according to the files in col; other columns keep their values.
class

pipen.channel.Channel(data=None, index=None, columns=None, dtype=None, copy=None)

Bases
pandas.DataFrame pandas.core.generic.NDFrame pandas.core.base.PandasObject pandas.core.accessor.DirNamesMixin pandas.core.indexing.IndexingMixin pandas.core.arraylike.OpsMixin

A DataFrame wrapper with creators

Parameters
  • data (optional) Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs.
    If data is a list of dicts, column order follows insertion order.
  • index (Axes | None, optional) Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.
  • columns (Axes | None, optional) Column labels to use for the resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels, column selection is performed instead.
  • dtype (Dtype | None, optional) Data type to force. Only a single dtype is allowed. If None, the dtype is inferred. If data is a DataFrame, this parameter is ignored.
  • copy (bool | None, optional) Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2d ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False will ensure that these inputs are not copied.
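The constructor rules above can be illustrated with a small sketch. It uses pandas.DataFrame directly, on the assumption that Channel inherits the constructor unchanged (its signature matches); the column and row labels are made up for illustration.

```python
import pandas as pd

# A dict of columns: column order follows insertion order.
from_dict = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
print(list(from_dict.columns))  # ['b', 'a']

# A list of dicts: keys become columns, again in insertion order.
from_records = pd.DataFrame([{"x": 1, "y": 2}, {"x": 3, "y": 4}])
print(list(from_records.columns))  # ['x', 'y']

# When the data already carries column labels, `columns` selects from
# them rather than renaming.
selected = pd.DataFrame({"b": [1, 2], "a": [3, 4]}, columns=["a"])
print(list(selected.columns))  # ['a']

# Series values are aligned on their index, not on position.
s1 = pd.Series([1, 2], index=["r1", "r2"])
s2 = pd.Series([20, 10], index=["r2", "r1"])
aligned = pd.DataFrame({"s1": s1, "s2": s2})
print(aligned.loc["r1", "s2"])  # 10
```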
Attributes
  • at (_AtIndexer) Access a single value for a row/column label pair.
    Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series.
  • attrs (dict[Hashable, Any]) Dictionary of global attributes of this dataset.
    Warning: attrs is experimental and may change without warning.
  • dtypes Return the dtypes in the DataFrame.
    This returns a Series with the data type of each column. The result's index is the original DataFrame's columns. Columns with mixed types are stored with the object dtype. See the User Guide section on dtypes for more.
  • empty Indicator whether Series/DataFrame is empty.
    True if Series/DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
  • flags (Flags) Get the properties associated with this pandas object.
    The available flags are:
    • Flags.allows_duplicate_labels
  • iat (_iAtIndexer) Access a single value for a row/column pair by integer position.
    Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or set a single value in a DataFrame or Series.
  • iloc (_iLocIndexer) Purely integer-location based indexing for selection by position.
    Changed in version 3.0: Callables which return a tuple are deprecated as input.
    .iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
    Allowed inputs are:
    • An integer, e.g. 5.
    • A list or array of integers, e.g. [4, 3, 0].
    • A slice object with ints, e.g. 1:7.
    • A boolean array.
    • A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don't have a reference to the calling object, but would like to base your selection on some value.
    • A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).
    .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
    See more at Selection by Position in the pandas indexing guide.
  • loc (_LocIndexer) Access a group of rows and columns by label(s) or a boolean array.
    .loc[] is primarily label based, but may also be used with a boolean array.
    Allowed inputs are:
    • A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
    • A list or array of labels, e.g. ['a', 'b', 'c'].
    • A slice object with labels, e.g. 'a':'f'.
    Warning: contrary to usual Python slices, both the start and the stop are included.
    • A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
    • An alignable boolean Series. The index of the key will be aligned before masking.
    • An alignable Index. The Index of the returned selection will be the input.
    • A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
    See more at Selection by Label in the pandas indexing guide.
  • ndim (int) Return an int representing the number of axes / array dimensions.
    Return 1 if Series. Otherwise return 2 if DataFrame.
  • size (int) Return an int representing the number of elements in this object.
    Return the number of rows if Series. Otherwise return the number of rows times number of columns if DataFrame.
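The indexer rules described for loc, iloc, at and iat can be compared side by side. The following is a minimal sketch on a plain pandas DataFrame with made-up labels; a Channel is assumed to behave the same way because it subclasses DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Label slices include both endpoints.
by_label = df.loc["a":"c"]
print(list(by_label.index))  # ['a', 'b', 'c']

# Positional slices follow Python semantics and exclude the stop.
by_position = df.iloc[0:2]
print(list(by_position.index))  # ['a', 'b']

# Out-of-bounds slices are clipped rather than rejected...
clipped = df.iloc[2:100]
print(len(clipped))  # 2

# ...but an out-of-bounds integer raises IndexError.
try:
    df.iloc[100]
    raised = False
except IndexError:
    raised = True
print(raised)  # True

# at/iat read or write exactly one cell.
print(df.at["b", "v"], df.iat[1, 0])  # 20 20

# A callable receives the calling object, which helps in method chains.
large = df.loc[lambda d: d["v"] > 25]
print(list(large.index))  # ['c', 'd']
```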
Methods
  • __add__(other) (DataFrame) Get Addition of DataFrame and other, column-wise.
  • __contains__(key) (bool) True if the key is in the info axis.
  • __delitem__(key) Delete item.
  • __dir__() (list) Provide method name lookup and completion.
  • __finalize__(other, method, **kwargs) (Self) Propagate metadata from other to self.
  • __getattr__(name) After regular attribute access, try looking up the name. This allows simpler access to columns for interactive use.
  • __iter__() (iterator) Iterate over info axis.
  • __setattr__(name, value) After regular attribute access, try setting the name. This allows simpler access to columns for interactive use.
  • __sizeof__() (int) Generates the total memory usage for an object that returns either a value or Series of values.
  • a_from_glob(pattern, ftype, sortby, reverse) (DataFrame) Create a channel with a glob pattern asynchronously.
  • a_from_pairs(pattern, ftype, sortby, reverse) (DataFrame) Create a width=2 channel with a glob pattern.
  • abs() (Series or DataFrame) Return a Series/DataFrame with absolute numeric value of each element.
  • add_prefix(prefix, axis) (Series or DataFrame) Prefix labels with string prefix.
  • add_suffix(suffix, axis) (Series or DataFrame) Suffix labels with string suffix.
  • align(other, join, axis, level, copy, fill_value) (tuple of (Series/DataFrame, type of other)) Align two objects on their axes with the specified join method.
  • asfreq(freq, method, how, normalize, fill_value) (Series/DataFrame) Convert time series to specified frequency.
  • asof(where, subset) (scalar, Series, or DataFrame) Return the last row(s) without any NaNs before where.
  • astype(dtype, copy, errors) (same type as caller) Cast a pandas object to a specified dtype.
  • at_time(time, asof, axis) (Series or DataFrame) Select values at particular time of day (e.g., 9:30AM).
  • between_time(start_time, end_time, inclusive, axis) (Series or DataFrame) Select values between particular times of the day (e.g., 9:00-9:30 AM).
  • bfill(axis, inplace, limit, limit_area) (Series/DataFrame) Fill NA/NaN values by using the next valid observation to fill the gap.
  • clip(lower, upper, axis, inplace, **kwargs) (Series or DataFrame) Trim values at input threshold(s).
  • convert_dtypes(infer_objects, convert_string, convert_integer, convert_boolean, convert_floating, dtype_backend) (Series or DataFrame) Convert columns from numpy dtypes to the best dtypes that support pd.NA.
  • copy(deep) (Series or DataFrame) Make a copy of this object's indices and data.
  • create(value) (DataFrame) Create a channel from a list.
  • describe(percentiles, include, exclude) (Series or DataFrame) Generate descriptive statistics.
  • droplevel(level, axis) (Series/DataFrame) Return Series/DataFrame with requested index / column level(s) removed.
  • equals(other) (bool) Test whether two objects contain the same elements.
  • ewm(com, span, halflife, alpha, min_periods, adjust, ignore_na, times, method) (pandas.api.typing.ExponentialMovingWindow) Provide exponentially weighted (EW) calculations.
  • expanding(min_periods, method) (pandas.api.typing.Expanding) Provide expanding window calculations.
  • ffill(axis, inplace, limit, limit_area) (Series/DataFrame) Fill NA/NaN values by propagating the last valid observation to next valid.
  • fillna(value, axis, inplace, limit) (Series/DataFrame) Fill NA/NaN values with value.
  • filter(items, like, regex, axis) (Same type as caller) Subset the DataFrame or Series according to the specified index labels.
  • first_valid_index() (type of index) Return index for first non-missing value or None, if no value is found.
  • from_csv(*args, **kwargs) Create a channel from a csv file.
  • from_excel(*args, **kwargs) Create a channel from an excel file.
  • from_glob(pattern, ftype, sortby, reverse) (DataFrame) Create a channel with a glob pattern.
  • from_pairs(pattern, ftype, sortby, reverse) (DataFrame) Create a width=2 channel with a glob pattern.
  • from_table(*args, **kwargs) Create a channel from a table file.
  • get(key, default) (same type as items contained in object) Get item from object for given key (ex: DataFrame column).
  • head(n) (same type as caller) Return the first n rows.
  • infer_objects(copy) (same type as input object) Attempt to infer better dtypes for object columns.
  • interpolate(method, axis, limit, inplace, limit_direction, limit_area, **kwargs) (Series or DataFrame) Fill NaN values using an interpolation method.
  • keys() (Index) Get the 'info axis' (see Indexing for more).
  • last_valid_index() (type of index) Return index for last non-missing value or None, if no value is found.
  • mask(cond, other, inplace, axis, level) (Series or DataFrame) Replace values where the condition is True.
  • pct_change(periods, fill_method, freq, **kwargs) (Series or DataFrame) Fractional change between the current and a prior element.
  • pipe(func, *args, **kwargs) (the return type of func) Apply chainable functions that expect Series or DataFrames.
  • rank(axis, method, numeric_only, na_option, ascending, pct) (same type as caller) Compute numerical data ranks (1 through n) along axis.
  • reindex_like(other, method, copy, limit, tolerance) (Series or DataFrame) Return an object with matching indices as other object.
  • rename_axis(mapper, index, columns, axis, copy, inplace) (DataFrame, or None) Set the name of the axis for the index or columns.
  • replace(to_replace, value, inplace, regex) (Series/DataFrame) Replace values given in to_replace with value.
  • resample(rule, closed, label, convention, on, level, origin, offset, group_keys) (pandas.api.typing.Resampler) Resample time-series data.
  • rolling(window, min_periods, center, win_type, on, closed, step, method) (pandas.api.typing.Window or pandas.api.typing.Rolling) Provide rolling window calculations.
  • sample(n, frac, replace, weights, random_state, axis, ignore_index) (Series or DataFrame) Return a random sample of items from an axis of object.
  • set_flags(copy, allows_duplicate_labels) (Series or DataFrame) Return a new object with updated flags.
  • squeeze(axis) (DataFrame, Series, or scalar) Squeeze 1 dimensional axis objects into scalars.
  • tail(n) (type of caller) Return the last n rows.
  • take(indices, axis, **kwargs) (same type as caller) Return the elements in the given positional indices along an axis.
  • to_clipboard(excel, sep, **kwargs) Copy object to the system clipboard.
  • to_csv(path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options) (None or str) Write object to a comma-separated values (csv) file.
  • to_excel(excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, inf_rep, freeze_panes, storage_options, engine_kwargs, autofilter) Write object to an Excel sheet.
  • to_hdf(path_or_buf, key, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding) Write the contained data to an HDF5 file using HDFStore.
  • to_json(path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options, mode) (None or str) Convert the object to a JSON string.
  • to_latex(buf, columns, header, index, na_rep, formatters, float_format, sparsify, index_names, bold_rows, column_format, longtable, escape, encoding, decimal, multicolumn, multicolumn_format, multirow, caption, label, position) (str or None) Render object to a LaTeX tabular, longtable, or nested table.
  • to_pickle(path, compression, protocol, storage_options) Pickle (serialize) object to file.
  • to_sql(name, con, schema, if_exists, index, index_label, chunksize, dtype, method) (None or int) Write records stored in a DataFrame to a SQL database.
  • to_xarray() (xarray.DataArray or xarray.Dataset) Return an xarray object from the pandas object.
  • truncate(before, after, axis, copy) (type of caller) Truncate a Series or DataFrame before and after some index value.
  • tz_convert(tz, axis, level, copy) (Series/DataFrame) Convert tz-aware axis to target time zone.
  • tz_localize(tz, axis, level, copy, ambiguous, nonexistent) (Series/DataFrame) Localize time zone naive index of a Series or DataFrame to target time zone.
  • where(cond, other, inplace, axis, level) (Series or DataFrame) Replace values where the condition is False.
  • xs(key, axis, level, drop_level) (Series or DataFrame) Return cross-section from the Series/DataFrame.
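The glob-based creators in the list above are only summarized in one line each. As a rough illustration of the resulting shapes, here is a sketch that does not call pipen at all: `frame_from_glob` and `frame_from_pairs` are hypothetical stand-ins built on the standard glob module, and the rule that consecutive sorted matches form a pair is an assumption made for this example, not something this page states. The file names are invented.

```python
import glob
import os
import tempfile

import pandas as pd


def frame_from_glob(pattern, reverse=False):
    """Hypothetical stand-in: one row per path matching `pattern`."""
    paths = sorted(glob.glob(pattern), reverse=reverse)
    # Wrap each path in a 1-tuple so every match becomes its own row.
    return pd.DataFrame([(path,) for path in paths])


def frame_from_pairs(pattern):
    """Hypothetical stand-in: consecutive matches paired into two columns."""
    paths = sorted(glob.glob(pattern))
    # Pair items 0-1, 2-3, ...; an unpaired trailing path is dropped.
    return pd.DataFrame(list(zip(paths[0::2], paths[1::2])))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("s1_R1.fq", "s1_R2.fq", "s2_R1.fq", "s2_R2.fq"):
        open(os.path.join(tmp, name), "w").close()

    single = frame_from_glob(os.path.join(tmp, "*.fq"))
    pairs = frame_from_pairs(os.path.join(tmp, "*.fq"))

print(single.shape)  # (4, 1)
print(pairs.shape)   # (2, 2)
```

The real methods also accept ftype and sortby, which this sketch leaves out because their accepted values are not documented in this section.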
method

__add__(other)

Get Addition of DataFrame and other, column-wise.

Equivalent to DataFrame.add(other).

Parameters
  • other (scalar, sequence, Series, dict or DataFrame) Object to be added to the DataFrame.
Returns (DataFrame)

The result of adding other to DataFrame.

See Also

DataFrame.add : Add a DataFrame and another object, with option for index- or column-oriented addition.

Examples
>>> df = pd.DataFrame(
...     {"height": [1.5, 2.6], "weight": [500, 800]}, index=["elk", "moose"]
... )
>>> df
       height  weight
elk       1.5     500
moose     2.6     800

Adding a scalar affects all rows and columns.

>>> df[["height", "weight"]] + 1.5
       height  weight
elk       3.0   501.5
moose     4.1   801.5

Each element of a list is added to a column of the DataFrame, in order.

>>> df[["height", "weight"]] + [0.5, 1.5]
       height  weight
elk       2.0   501.5
moose     3.1   801.5

Keys of a dictionary are aligned to the DataFrame, based on column names; each value in the dictionary is added to the corresponding column.

>>> df[["height", "weight"]] + {"height": 0.5, "weight": 1.5}
       height  weight
elk       2.0   501.5
moose     3.1   801.5

When other is a Series, the index of other is aligned with the columns of the DataFrame.

>>> s1 = pd.Series([0.5, 1.5], index=["weight", "height"])
>>> df[["height", "weight"]] + s1
       height  weight
elk       3.0   500.5
moose     4.1   800.5

Even when the index of other is the same as the index of the DataFrame, the Series will not be reoriented. If index-wise alignment is desired, DataFrame.add should be used with axis='index'.

>>> s2 = pd.Series([0.5, 1.5], index=["elk", "moose"])
>>> df[["height", "weight"]] + s2
       elk  height  moose  weight
elk    NaN     NaN    NaN     NaN
moose  NaN     NaN    NaN     NaN
>>> df[["height", "weight"]].add(s2, axis="index")
       height  weight
elk       2.0   500.5
moose     4.1   801.5

When other is a DataFrame, both column names and the index are aligned.

>>> other = pd.DataFrame(
...     {"height": [0.2, 0.4, 0.6]}, index=["elk", "moose", "deer"]
... )
>>> df[["height", "weight"]] + other
       height  weight
deer      NaN     NaN
elk       1.7     NaN
moose     3.0     NaN
method

__dir__() → list

Provide method name lookup and completion.

Notes

Only provide 'public' methods.

method

__sizeof__() → int

Generates the total memory usage for an object that returns either a value or Series of values.

method

set_flags(copy=<no_default>, allows_duplicate_labels=None)

Return a new object with updated flags.

This method creates a shallow copy of the original object, preserving its underlying data while modifying its global flags. In particular, it allows you to update properties such as whether duplicate labels are permitted. This behavior is especially useful in method chains, where one wishes to adjust DataFrame or Series characteristics without altering the original object.

Parameters
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    Deprecated since version 3.0.0: This keyword is ignored and will be removed in pandas 4.0. Since pandas 3.0, this method always returns a new object using a lazy copy mechanism that defers copies until necessary (Copy-on-Write). See the user guide on Copy-on-Write <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html> for more details.
  • allows_duplicate_labels (bool, optional) Whether the returned object allows duplicate labels.
Returns (Series or DataFrame)

The same type as the caller.

See Also

DataFrame.attrs : Global metadata applying to this dataset.
DataFrame.flags : Global flags applying to this object.

Notes

This method returns a new object that's a view on the same data as the input. Mutating the input or the output values will be reflected in the other.

This method is intended to be used in method chains.

"Flags" differ from "metadata". Flags reflect properties of the pandas object (the Series or DataFrame). Metadata refers to properties of the dataset, and should be stored in DataFrame.attrs.

Examples
>>> df = pd.DataFrame({"A": [1, 2]})
>>> df.flags.allows_duplicate_labels
True
>>> df2 = df.set_flags(allows_duplicate_labels=False)
>>> df2.flags.allows_duplicate_labels
False
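To make the distinction between flags and metadata more concrete, the sketch below sets a flag on a copy, stores a free-form value in attrs, and shows that the flag is enforced when labels are already duplicated. The "source" key and the labels are illustrative only.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

df = pd.DataFrame({"A": [1, 2]}, index=["x", "y"])

# Flags describe the pandas object; attrs describe the dataset.
strict = df.set_flags(allows_duplicate_labels=False)
strict.attrs["source"] = "example"  # free-form metadata (illustrative key)

print(df.flags.allows_duplicate_labels)      # True: the original is untouched
print(strict.flags.allows_duplicate_labels)  # False

# Setting the flag on data that already has duplicate labels is rejected.
try:
    pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(
        allows_duplicate_labels=False
    )
    rejected = False
except DuplicateLabelError:
    rejected = True
print(rejected)  # True
```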
method

droplevel(level, axis=0)

Return Series/DataFrame with requested index / column level(s) removed.

Parameters
  • level (int, str, or list-like) If a string is given, it must be the name of a level. If list-like, elements must be names or positional indexes of levels.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) Axis along which the level(s) is removed:
    • 0 or 'index': remove level(s) from the index (row labels).
    • 1 or 'columns': remove level(s) from the columns.
    For Series this parameter is unused and defaults to 0.
Returns (Series/DataFrame)

Series/DataFrame with requested index / column level(s) removed.

See Also

DataFrame.replace : Replace values given in to_replace with value.
DataFrame.pivot : Return reshaped DataFrame organized by given index / column values.

Examples
>>> df = (
...     pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
...     .set_index([0, 1])
...     .rename_axis(["a", "b"])
... )
>>> df.columns = pd.MultiIndex.from_tuples(
...     [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
... )
>>> df
level_1   c   d
level_2   e   f
a b
1 2      3   4
5 6      7   8
9 10    11  12
>>> df.droplevel("a")
level_1   c   d
level_2   e   f
b
2        3   4
6        7   8
10      11  12
>>> df.droplevel("level_2", axis=1)
level_1   c   d
a b
1 2      3   4
5 6      7   8
9 10    11  12
method

squeeze(axis=None)

Squeeze 1 dimensional axis objects into scalars.

Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

This method is most useful when you don't know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Parameters
  • axis ({0 or 'index', 1 or 'columns', None}, default None) A specific axis to squeeze. By default, all length-1 axes are squeezed. For Series this parameter is unused and defaults to None.
Returns (DataFrame, Series, or scalar)

The projection after squeezing axis or all the axes.

See Also

Series.iloc : Integer-location based indexing for selecting scalars.
DataFrame.iloc : Integer-location based indexing for selecting Series.
Series.to_frame : Inverse of DataFrame.squeeze for a single-column DataFrame.

Examples
>>> primes = pd.Series([2, 3, 5, 7])

Slicing might produce a Series with a single value:

>>> even_primes = primes[primes % 2 == 0]
>>> even_primes
0    2
dtype: int64
>>> even_primes.squeeze()
np.int64(2)

Squeezing objects with more than one value in every axis does nothing:

>>> odd_primes = primes[primes % 2 == 1]
>>> odd_primes
1    3
2    5
3    7
dtype: int64
>>> odd_primes.squeeze()
1    3
2    5
3    7
dtype: int64

Squeezing is even more effective when used with DataFrames.

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
>>> df
   a  b
0  1  2
1  3  4

Slicing a single column will produce a DataFrame with the columns having only one value:

>>> df_a = df[["a"]]
>>> df_a
   a
0  1
1  3

So the columns can be squeezed down, resulting in a Series:

>>> df_a.squeeze("columns")
0    1
1    3
Name: a, dtype: int64

Slicing a single row from a single column will produce a single scalar DataFrame:

>>> df_0a = df.loc[df.index < 1, ["a"]]
>>> df_0a
   a
0  1

Squeezing the rows produces a single scalar Series:

>>> df_0a.squeeze("rows")
a    1
Name: 0, dtype: int64

Squeezing all axes will project directly into a scalar:

>>> df_0a.squeeze()
np.int64(1)
method

rename_axis(mapper=<no_default>, index=<no_default>, columns=<no_default>, axis=0, copy=<no_default>, inplace=False)

Set the name of the axis for the index or columns.

Parameters
  • mapper (scalar, list-like, optional) Value to set the axis name attribute.
    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.
  • index (scalar, list-like, dict-like or function, optional) A scalar, list-like, dict-like or function transformation to apply to that axis' values.
  • columns (scalar, list-like, dict-like or function, optional) A scalar, list-like, dict-like or function transformation to apply to that axis' values.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) The axis to rename.
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    Deprecated since version 3.0.0: This keyword is ignored and will be removed in pandas 4.0. Since pandas 3.0, this method always returns a new object using a lazy copy mechanism that defers copies until necessary (Copy-on-Write). See the user guide on Copy-on-Write <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html> for more details.
  • inplace (bool, default False) Modifies the object directly, instead of creating a new Series or DataFrame.
Returns (DataFrame, or None)

The same type as the caller or None if inplace=True.

See Also

Series.rename : Alter Series index labels or name.
DataFrame.rename : Alter DataFrame index labels or name.
Index.rename : Set new names on index.

Notes

DataFrame.rename_axis supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)
  • (mapper, axis={'index', 'columns'}, ...)

The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.

The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.

We highly recommend using keyword arguments to clarify your intent.

Examples

DataFrame

>>> df = pd.DataFrame(
...     {"num_legs": [4, 4, 2], "num_arms": [0, 0, 2]}, ["dog", "cat", "monkey"]
... )
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

MultiIndex

>>> df.index = pd.MultiIndex.from_product(
...     [["mammal"], ["dog", "cat", "monkey"]], names=["type", "name"]
... )
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
>>> df.rename_axis(index={"type": "class"})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2
>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
method

equals(other)

Test whether two objects contain the same elements.

This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

The row/column index do not need to have the same type, as long as the values are considered equal. Corresponding columns and index must be of the same dtype.

Parameters
  • other (Series or DataFrame) The other Series or DataFrame to be compared with the first.
Returns (bool)

True if all elements are the same in both objects, False otherwise.

See Also

Series.eq : Compare two Series objects of the same length and return a Series where each element is True if the element in each Series is equal, False otherwise.
DataFrame.eq : Compare two DataFrame objects of the same shape and return a DataFrame where each element is True if the respective element in each DataFrame is equal, False otherwise.
testing.assert_series_equal : Raises an AssertionError if left and right are not equal. Provides an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal : Like assert_series_equal, but targets DataFrames.
numpy.array_equal : Return True if two arrays have the same shape and elements, False otherwise.

Examples
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20

DataFrames df and exactly_equal have the same types and values for their elements and column labels, which will return True.

>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

DataFrames df and different_column_type have the same element types and values, but have different types for the column labels, which will still return True.

>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True

DataFrames df and different_data_type have different types for the same values for their elements, and will return False even though their column labels are the same values and types.

>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False

DataFrames with NaN in the same locations compare equal.

>>> df_nan1 = pd.DataFrame({"a": [1, np.nan], "b": [3, np.nan]})
>>> df_nan2 = pd.DataFrame({"a": [1, np.nan], "b": [3, np.nan]})
>>> df_nan1.equals(df_nan2)
True

If the NaN values are not in the same locations, they compare unequal.

>>> df_nan3 = pd.DataFrame({"a": [1, np.nan], "b": [3, 4]})
>>> df_nan1.equals(df_nan3)
False
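The alternatives named under See Also behave differently from equals in two respects: NaN handling and dtype strictness. The short sketch below contrasts them; the frames are invented for illustration.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, np.nan]})
right = pd.DataFrame({"a": [1.0, np.nan]})

# Element-wise comparison treats NaN as unequal to itself...
print((left == right).all().all())  # False
# ...while equals() treats NaNs in the same location as equal.
print(left.equals(right))  # True

ints = pd.DataFrame({"a": [1, 2]})
floats = pd.DataFrame({"a": [1.0, 2.0]})

# equals() is strict about dtype.
print(ints.equals(floats))  # False
# assert_frame_equal can be told to ignore dtype; it raises on mismatch
# and returns None on success.
pd.testing.assert_frame_equal(ints, floats, check_dtype=False)
```

In tests, assert_frame_equal is usually the more informative choice because its error message reports where the two frames differ.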
method

abs()

Return a Series/DataFrame with absolute numeric value of each element.

This function only applies to elements that are all numeric.

Returns (Series or DataFrame)

Series/DataFrame containing the absolute value of each element.

See Also

numpy.absolute : Calculate the absolute value element-wise.

Notes

For a complex input a + bj, such as 1.2 + 1j, the absolute value is $\sqrt{a^2 + b^2}$.
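As a quick check of the formula, the modulus of 1.2 + 1j can be computed by hand with the standard library. This sketch uses plain Python numbers rather than pandas; the variable names are illustrative.

```python
import math

z = 1.2 + 1j
a, b = z.real, z.imag

# Modulus computed from the formula sqrt(a^2 + b^2).
by_formula = math.sqrt(a**2 + b**2)

# Python's built-in abs() returns the same modulus for complex numbers.
print(round(by_formula, 5))  # 1.56205
print(math.isclose(by_formula, abs(z)))  # True
```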

Examples

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])
>>> s.abs()
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta("1 days")])
>>> s.abs()
0   1 days
dtype: timedelta64[us]

Select rows with data closest to a certain value using argsort (from StackOverflow: https://stackoverflow.com/a/17758115).

>>> df = pd.DataFrame(
...     {"a": [4, 5, 6, 7], "b": [10, 20, 30, 40], "c": [100, 50, -30, -50]}
... )
>>> df
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
method

__iter__()

Iterate over info axis.

Returns (iterator)

Info axis as iterator.

See Also

DataFrame.items : Iterate over (column name, Series) pairs.
DataFrame.itertuples : Iterate over DataFrame rows as namedtuples.

Examples
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> for x in df:
...     print(x)
A
B
method

keys()

Get the 'info axis' (see Indexing for more).

This is index for Series, columns for DataFrame.

Returns (Index)

Info axis.

See Also

DataFrame.index : The index (row labels) of the DataFrame.
DataFrame.columns : The column labels of the DataFrame.

Examples
>>> d = pd.DataFrame(
...     data={"A": [1, 2, 3], "B": [0, 4, 8]}, index=["a", "b", "c"]
... )
>>> d
   A  B
a  1  0
b  2  4
c  3  8
>>> d.keys()
Index(['A', 'B'], dtype='str')
method

__contains__(key) → bool

True if the key is in the info axis
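Because the info axis is the columns for a DataFrame and the index for a Series, the `in` operator tests labels, never values. A small sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

print("A" in df)   # True: "A" is a column label
print("x" in df)   # False: row labels are not the info axis of a DataFrame
print(1 in df)     # False: values are not searched

# For a Series the info axis is the index.
col = df["A"]
print("x" in col)  # True
print(1 in col)    # False: 1 is a value, not an index label

# To test values, compare against the underlying data explicitly.
print(1 in col.values)  # True
```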

method

to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, inf_rep='inf', freeze_panes=None, storage_options=None, engine_kwargs=None, autofilter=False)

Write object to an Excel sheet.

To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.

Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already exists will overwrite the existing file because the default mode is write.
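The multi-sheet workflow described above might look like the following sketch. `write_sheets` is a hypothetical helper, not part of pandas or pipen, and it assumes an Excel engine such as openpyxl or xlsxwriter is installed.

```python
import pandas as pd


def write_sheets(path, frames):
    """Write each DataFrame in `frames` (a name -> frame dict) to its own sheet.

    Requires an Excel engine such as openpyxl or xlsxwriter to be installed.
    """
    # The context manager saves and closes the workbook on exit.
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)
```

A call such as `write_sheets("report.xlsx", {"first": df1, "second": df2})` would produce one workbook with two sheets; because the default mode is write, an existing report.xlsx would be overwritten.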

Parameters
  • excel_writer (path-like, file-like, or ExcelWriter object) File path or existing ExcelWriter.
  • sheet_name (str, default 'Sheet1') Name of sheet which will contain DataFrame.
  • na_rep (str, default '') Missing data representation.
  • float_format (str, optional) Format string for floating point numbers. For examplefloat_format="%.2f" will format 0.1234 to 0.12.
  • columns (sequence or list of str, optional) Columns to write.
  • header (bool or list of str, default True) Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
  • index (bool, default True) Write row names (index).
  • index_label (str or sequence, optional) Column label for index column(s) if desired. If not specified, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
  • startrow (int, default 0) Upper left cell row to dump data frame.
  • startcol (int, default 0) Upper left cell column to dump data frame.
  • engine (str, optional) Write engine to use, 'openpyxl' or 'xlsxwriter'. You can also set this via the options io.excel.xlsx.writer or io.excel.xlsm.writer.
  • merge_cells (bool or 'columns', default True) If True, write MultiIndex index and columns as merged cells. If 'columns', merge MultiIndex column cells only.
  • inf_rep (str, default 'inf') Representation for infinity (there is no native representation for infinity in Excel).
  • freeze_panes (tuple of int (length 2), optional) Specifies the one-based bottommost row and rightmost column that is to be frozen.
  • storage_options (dict, optional) Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here <https://pandas.pydata.org/docs/user_guide/io.html?highlight=storage_options#reading-writing-remote-files>_.
  • engine_kwargs (dict, optional) Arbitrary keyword arguments passed to excel engine.
  • autofilter (bool, default False) If True, add automatic filters to all columns.
See Also

to_csv : Write DataFrame to a comma-separated values (csv) file.
ExcelWriter : Class for writing DataFrame objects into excel sheets.
read_excel : Read an Excel file into a pandas DataFrame.
read_csv : Read a comma-separated values (csv) file into DataFrame.
io.formats.style.Styler.to_excel : Add styles to Excel sheet.

Notes

For compatibility with DataFrame.to_csv, to_excel serializes lists and dicts to strings before writing.

Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.

pandas will check the number of rows, columns, and cell character count does not exceed Excel's limitations. All other limitations must be checked by the user.

method

to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=None, indent=None, storage_options=None, mode='w')

Convert the object to a JSON string.

Note that NaN and None are converted to null, and datetime objects are converted to UNIX timestamps.

Parameters
  • path_or_buf (str, path object, file-like object, or None, default None) String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string.
  • orient (str) Indication of expected JSON string format.
    • Series:
      • default is 'index'
      • allowed values are: {'split', 'records', 'index', 'table'}.
    • DataFrame:
      • default is 'columns'
      • allowed values are: {'split', 'records', 'index', 'columns', 'values', 'table'}.
    • The format of the JSON string:
      • 'split' : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
      • 'records' : list like [{column -> value}, ... , {column -> value}]
      • 'index' : dict like {index -> {column -> value}}
      • 'columns' : dict like {column -> {index -> value}}
      • 'values' : just the values array
      • 'table' : dict like {'schema': {schema}, 'data': {data}}
      Describing the data, where data component is like orient='records'.
  • date_format ({None, 'epoch', 'iso'}) Type of date conversion. 'epoch' = epoch milliseconds, 'iso' = ISO8601. The default depends on the orient. For orient='table', the default is 'iso'. For all other orients, the default is 'epoch'.
    .. deprecated:: 3.0.0 'epoch' date format is deprecated and will be removed in a future version, please use 'iso' instead.
  • double_precision (int, default 10) The number of decimal places to use when encoding floating point values. The possible maximal value is 15. Passing double_precision greater than 15 will raise a ValueError.
  • force_ascii (bool, default True) Force encoded string to be ASCII.
  • date_unit (str, default 'ms' (milliseconds)) The time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us', 'ns' for second, millisecond, microsecond, and nanosecond respectively.
  • default_handler (callable, default None) Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.
  • lines (bool, default False) If 'orient' is 'records' write out line-delimited json format. Will throw ValueError if incorrect 'orient' since others are not list-like.
  • compression (str or dict, default 'infer') For on-the-fly compression of the output data. If 'infer' and 'path_or_buf' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
  • index (bool or None, default None) The index is only used when 'orient' is 'split', 'index', 'columns', or 'table'. Of these, 'index' and 'columns' do not support index=False. Using the string 'index' as a column name together with an empty Index, or with an Index that is itself named 'index', will raise a ValueError.
  • indent (int, optional) Length of whitespace used to indent each record.
  • storage_options (dict, optional) Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here <https://pandas.pydata.org/docs/user_guide/io.html?highlight=storage_options#reading-writing-remote-files>_.
  • mode (str, default 'w' (writing)) Specify the IO mode for output when supplying a path_or_buf. Accepted args are 'w' (writing) and 'a' (append) only. mode='a' is only supported when lines is True and orient is 'records'.
Returns (None or str)

If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.

See Also

read_json : Convert a JSON string to pandas object.

Notes

The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this may change in a future release.

orient='table' contains a 'pandas_version' field under 'schema'. This stores the version of pandas used in the latest revision of the schema.

Examples
>>> from json import loads, dumps
>>> df = pd.DataFrame(
...     [["a", "b"], ["c", "d"]],
...     index=["row 1", "row 2"],
...     columns=["col 1", "col 2"],
... )
>>> result = df.to_json(orient="split")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  # doctest: +SKIP
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are not preserved with this encoding.

>>> result = df.to_json(orient="records")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  # doctest: +SKIP
[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Encoding/decoding a Dataframe using 'index' formatted JSON:

>>> result = df.to_json(orient="index")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  # doctest: +SKIP
{
    "row 1": {
        "col 1": "a",
        "col 2": "b"
    },
    "row 2": {
        "col 1": "c",
        "col 2": "d"
    }
}

Encoding/decoding a Dataframe using 'columns' formatted JSON:

>>> result = df.to_json(orient="columns")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  # doctest: +SKIP
{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}

Encoding/decoding a Dataframe using 'values' formatted JSON:

>>> result = df.to_json(orient="values")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  # doctest: +SKIP
[
    [
        "a",
        "b"
    ],
    [
        "c",
        "d"
    ]
]

Encoding with Table Schema:

>>> result = df.to_json(orient="table")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  # doctest: +SKIP
{
    "schema": {
        "fields": [
            {
                "name": "index",
                "type": "string"
            },
            {
                "name": "col 1",
                "type": "string"
            },
            {
                "name": "col 2",
                "type": "string"
            }
        ],
        "primaryKey": [
            "index"
        ],
        "pandas_version": "1.4.0"
    },
    "data": [
        {
            "index": "row 1",
            "col 1": "a",
            "col 2": "b"
        },
        {
            "index": "row 2",
            "col 1": "c",
            "col 2": "d"
        }
    ]
}
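
The lines and mode parameters described above are not exercised by the examples. The following is a minimal sketch, assuming a pandas version that accepts mode='a' and using a temporary file whose name (records.jsonl) is made up for illustration. It writes one frame as line-delimited JSON and appends a second frame to the same file:

```python
import json
import os
import tempfile

import pandas as pd

first = pd.DataFrame({"col 1": ["a", "c"], "col 2": ["b", "d"]})
second = pd.DataFrame({"col 1": ["e"], "col 2": ["f"]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; any path works.
    path = os.path.join(tmp, "records.jsonl")

    # lines=True requires orient="records": one JSON object per line.
    first.to_json(path, orient="records", lines=True)

    # mode="a" is only accepted together with lines=True and orient="records".
    second.to_json(path, orient="records", lines=True, mode="a")

    with open(path) as fh:
        written = [json.loads(line) for line in fh if line.strip()]

print(written)
```

Each line of the file is an independent JSON document, which is why appending is restricted to this combination of options.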
method

to_hdf(path_or_buf, key, mode='a', complevel=None, complib=None, append=False, format=None, index=True, min_itemsize=None, nan_rep=None, dropna=None, data_columns=None, errors='strict', encoding='UTF-8')

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.

.. warning::

One can store a subclass of DataFrame or Series to HDF5, but the type of the subclass is lost upon storing.

For more information see the user guide (io.hdf5).

Parameters
  • path_or_buf (str or pandas.HDFStore) File path or HDFStore object.
  • key (str) Identifier for the group in the store.
  • mode ({'a', 'w', 'r+'}, default 'a') Mode to open file:
    • 'w': write, a new file is created (an existing file with the same name would be deleted).
    • 'a': append, an existing file is opened for reading and writing, and if the file does not exist it is created.
    • 'r+': similar to 'a', but the file must already exist.
  • complevel ({0-9}, default None) Specifies a compression level for data. A value of 0 or None disables compression.
  • complib ({'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib') Specifies the compression library to be used. These additional compressors for Blosc are supported (default if no compressor specified: 'blosc:blosclz'): {'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'}. Specifying a compression library which is not available issues a ValueError.
  • append (bool, default False) For Table formats, append the input data to the existing table.
  • format ({'fixed', 'table', None}, default 'fixed') Possible values:
    • 'fixed': Fixed format. Fast writing/reading. Not-appendable, nor searchable.
    • 'table': Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.
    • If None, pd.get_option('io.hdf.default_format') is checked, followed by fallback to "fixed".
  • index (bool, default True) Write DataFrame index as a column.
  • min_itemsize (dict or int, optional) Map column names to minimum string sizes for columns.
  • nan_rep (Any, optional) How to represent null values as str. Not allowed with append=True.
  • dropna (bool, default False, optional) Remove missing values.
  • data_columns (list of columns or True, optional) List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See "Query via data columns" (io.hdf5-query-data-columns) in the user guide for more information. Applicable only to format='table'.
  • errors (str, default 'strict') Specifies how encoding and decoding errors are to be handled. See the errors argument for the built-in open function for a full list of options.
  • encoding (str, default "UTF-8") Set character encoding.
See Also

read_hdf : Read from HDF file.
DataFrame.to_orc : Write a DataFrame to the binary orc format.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
DataFrame.to_sql : Write to a SQL table.
DataFrame.to_feather : Write out feather-format for DataFrames.
DataFrame.to_csv : Write out to a csv file.

Examples
>>> df = pd.DataFrame(
...     {"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"]
... )  # doctest: +SKIP
>>> df.to_hdf("data.h5", key="df", mode="w")  # doctest: +SKIP

We can add another object to the same file:

>>> s = pd.Series([1, 2, 3, 4])  # doctest: +SKIP
>>> s.to_hdf("data.h5", key="s")  # doctest: +SKIP

Reading from HDF file:

>>> pd.read_hdf("data.h5", "df")  # doctest: +SKIP
   A  B
a  1  4
b  2  5
c  3  6
>>> pd.read_hdf("data.h5", "s")  # doctest: +SKIP
0    1
1    2
2    3
3    4
dtype: int64
method

to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)

Write records stored in a DataFrame to a SQL database.

Databases supported by SQLAlchemy [1]_ are supported. Tables can be newly created, appended to, or overwritten.

.. warning:: The pandas library does not attempt to sanitize inputs provided via a to_sql call. Please refer to the documentation for the underlying database driver to see if it will properly prevent injection, or alternatively be advised of a security risk when executing arbitrary commands in a to_sql call.

Parameters
  • name (str) Name of SQL table.
  • con (ADBC connection, sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection) ADBC provides high performance I/O with native type support, where available. Using SQLAlchemy makes it possible to use any DB supported by that library. Legacy support is provided for sqlite3.Connection objects. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable. See here <https://docs.sqlalchemy.org/en/20/core/connections.html>_. If passing a sqlalchemy.engine.Connection which is already in a transaction, the transaction will not be committed. If passing a sqlite3.Connection, it will not be possible to roll back the record insertion.
  • schema (str, optional) Specify the schema (if database flavor supports this). If None, use default schema.
  • if_exists ({'fail', 'replace', 'append', 'delete_rows'}, default 'fail') How to behave if the table already exists.
    • fail: Raise a ValueError.
    • replace: Drop the table before inserting new values.
    • append: Insert new values to the existing table.
    • delete_rows: If a table exists, delete all records and insert data.
  • index (bool, default True) Write DataFrame index as a column. Uses index_label as the column name in the table. Creates a table index for this column.
  • index_label (str or sequence, default None) Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
  • chunksize (int, optional) Specify the number of rows in each batch to be written to the database connection at a time. By default, all rows will be written at once. Also see the method keyword.
  • dtype (dict or scalar, optional) Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.
  • method ({None, 'multi', callable}, optional) Controls the SQL insertion clause used:
    • None : Uses standard SQL INSERT clause (one per row).
    • 'multi': Pass multiple values in a single INSERT clause.
    • callable with signature (pd_table, conn, keys, data_iter).
    Details and a sample callable implementation can be found in the user guide section on the insert method (io.sql.method).
Returns (None or int)

Number of rows affected by to_sql. None is returned if the callable passed into method does not return an integer number of rows.

The number of returned rows affected is the sum of the rowcount attribute of sqlite3.Cursor or SQLAlchemy connectable, which may not reflect the exact number of written rows, as stipulated in the sqlite3 <https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.rowcount> or SQLAlchemy <https://docs.sqlalchemy.org/en/20/core/connections.html#sqlalchemy.engine.CursorResult.rowcount> documentation.

Raises
  • ValueError When the table already exists and if_exists is 'fail' (the default).
See Also

read_sql : Read a DataFrame from a table.

Notes

Timezone aware datetime columns will be written as Timestamp with timezone type with SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware timestamps local to the original timezone.

Not all datastores support method="multi". Oracle, for example, does not support multi-value insert.

References

.. [1] https://docs.sqlalchemy.org
.. [2] https://www.python.org/dev/peps/pep-0249/

Examples

Create an in-memory SQLite database.

>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite://', echo=False)

Create a table from scratch with 3 rows.

>>> df = pd.DataFrame({'name' : ['User 1', 'User 2', 'User 3']})
>>> df
     name
0  User 1
1  User 2
2  User 3
>>> df.to_sql(name='users', con=engine)
3
>>> from sqlalchemy import text
>>> with engine.connect() as conn:
...     conn.execute(text("SELECT * FROM users")).fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3')]

An sqlalchemy.engine.Connection can also be passed to con:

>>> with engine.begin() as connection:
...     df1 = pd.DataFrame({'name' : ['User 4', 'User 5']})
...     df1.to_sql(name='users', con=connection, if_exists='append')
2

This is allowed to support operations that require that the same DBAPI connection is used for the entire operation.

>>> df2 = pd.DataFrame({'name' : ['User 6', 'User 7']})
>>> df2.to_sql(name='users', con=engine, if_exists='append')
2
>>> with engine.connect() as conn:
...     conn.execute(text("SELECT * FROM users")).fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3'),
 (0, 'User 4'), (1, 'User 5'), (0, 'User 6'),
 (1, 'User 7')]

Overwrite the table with just df2.

>>> df2.to_sql(name='users', con=engine, if_exists='replace',
...            index_label='id')
2
>>> with engine.connect() as conn:
...     conn.execute(text("SELECT * FROM users")).fetchall()
[(0, 'User 6'), (1, 'User 7')]

Delete all rows before inserting new records with df3.

>>> df3 = pd.DataFrame({"name": ['User 8', 'User 9']})
>>> df3.to_sql(name='users', con=engine, if_exists='delete_rows',
...            index_label='id')
2
>>> with engine.connect() as conn:
...     conn.execute(text("SELECT * FROM users")).fetchall()
[(0, 'User 8'), (1, 'User 9')]

Use method to define a callable insertion method to do nothing if there's a primary key conflict on a table in a PostgreSQL database.

>>> from sqlalchemy.dialects.postgresql import insert
>>> def insert_on_conflict_nothing(table, conn, keys, data_iter):
...     # "a" is the primary key in "conflict_table"
...     data = [dict(zip(keys, row)) for row in data_iter]
...     stmt = insert(table.table).values(data).on_conflict_do_nothing(index_elements=["a"])
...     result = conn.execute(stmt)
...     return result.rowcount
>>> df_conflict.to_sql(name="conflict_table", con=conn, if_exists="append",  # noqa: F821
...                    method=insert_on_conflict_nothing)  # doctest: +SKIP
0

For MySQL, a callable to update columns b and c if there's a conflict on a primary key.

>>> from sqlalchemy.dialects.mysql import insert   # noqa: F811
>>> def insert_on_conflict_update(table, conn, keys, data_iter):
...     # update columns "b" and "c" on primary key conflict
...     data = [dict(zip(keys, row)) for row in data_iter]
...     stmt = (
...         insert(table.table)
...         .values(data)
...     )
...     stmt = stmt.on_duplicate_key_update(b=stmt.inserted.b, c=stmt.inserted.c)
...     result = conn.execute(stmt)
...     return result.rowcount
>>> df_conflict.to_sql(name="conflict_table", con=conn, if_exists="append",  # noqa: F821
...                    method=insert_on_conflict_update)  # doctest: +SKIP
2

Specify the dtype (especially useful for integers with missing values). Notice that while pandas is forced to store the data as floating point, the database supports nullable integers. When fetching the data with Python, we get back integer scalars.

>>> df = pd.DataFrame({"A": [1, None, 2]})
>>> df
     A
0  1.0
1  NaN
2  2.0
>>> from sqlalchemy.types import Integer
>>> df.to_sql(name='integers', con=engine, index=False,
...           dtype={"A": Integer()})
3
>>> with engine.connect() as conn:
...     conn.execute(text("SELECT * FROM integers")).fetchall()
[(1,), (None,), (2,)]

.. versionadded:: 2.2.0

pandas now supports writing via ADBC drivers

>>> df = pd.DataFrame({'name' : ['User 10', 'User 11', 'User 12']})
>>> df
      name
0  User 10
1  User 11
2  User 12
>>> from adbc_driver_sqlite import dbapi  # doctest:+SKIP
>>> with dbapi.connect("sqlite://") as conn:  # doctest:+SKIP
...     df.to_sql(name="users", con=conn)
3
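
The examples above rely on SQLAlchemy or ADBC. The sqlite3.Connection path mentioned under con can be sketched with the standard library driver alone. This is an illustrative sketch, assuming an in-memory database and the same made-up users table. It shows the default if_exists='fail' raising ValueError and if_exists='append' adding rows:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["User 1", "User 2", "User 3"]})

# In-memory SQLite database via the standard library driver.
conn = sqlite3.connect(":memory:")
try:
    # First write creates the table.
    df.to_sql(name="users", con=conn, index=False)

    # Writing again with the default if_exists="fail" raises ValueError.
    try:
        df.to_sql(name="users", con=conn, index=False)
        raised = False
    except ValueError:
        raised = True

    # if_exists="append" inserts the rows into the existing table.
    df.to_sql(name="users", con=conn, index=False, if_exists="append")

    total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
finally:
    conn.close()

print(raised, total)
```

As noted for the con parameter, inserts made through a sqlite3.Connection cannot be rolled back, so the caller owns closing the connection.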
method

to_pickle(path, compression='infer', protocol=5, storage_options=None)

Pickle (serialize) object to file.

Parameters
  • path (str, path object, or file-like object) String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. File path where the pickled object will be stored.
  • compression (str or dict, default 'infer') For on-the-fly compression of the output data. If 'infer' and 'path_or_buf' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
  • protocol (int) Int which indicates which protocol should be used by the pickler, default HIGHEST_PROTOCOL (see [1]_ paragraph 12.1.2). The possible values are 0, 1, 2, 3, 4, 5. A negative value for the protocol parameter is equivalent to setting its value to HIGHEST_PROTOCOL.
    .. [1] https://docs.python.org/3/library/pickle.html.
  • storage_options (dict, optional) Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here <https://pandas.pydata.org/docs/user_guide/io.html?highlight=storage_options#reading-writing-remote-files>_.
See Also

read_pickle : Load pickled pandas object (or any object) from file.
DataFrame.to_hdf : Write DataFrame to an HDF5 file.
DataFrame.to_sql : Write DataFrame to a SQL database.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.

Examples
>>> original_df = pd.DataFrame(
...     {"foo": range(5), "bar": range(5, 10)}
... )  # doctest: +SKIP
>>> original_df  # doctest: +SKIP
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> original_df.to_pickle("./dummy.pkl")  # doctest: +SKIP
>>> unpickled_df = pd.read_pickle("./dummy.pkl")  # doctest: +SKIP
>>> unpickled_df  # doctest: +SKIP
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
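
The compression parameter also accepts the dict form described above. A minimal sketch, assuming a temporary directory and a made-up file name (dummy.pkl.gz), writes a gzip-compressed pickle with the options quoted in the parameter description and reads it back with compression inferred from the extension:

```python
import os
import tempfile

import pandas as pd

original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the ".gz" suffix lets read_pickle infer gzip.
    path = os.path.join(tmp, "dummy.pkl.gz")

    # Extra keys are forwarded to gzip.GzipFile.
    original_df.to_pickle(
        path,
        compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
    )

    # A gzip stream starts with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as fh:
        magic = fh.read(2)

    restored_df = pd.read_pickle(path)

print(restored_df.equals(original_df))
```

Fixing mtime keeps the gzip header identical across runs, which is what makes the archive reproducible.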
method

to_clipboard(excel=True, sep=None, **kwargs)

Copy object to the system clipboard.

Write a text representation of object to the system clipboard. This can be pasted into Excel, for example.

Parameters
  • excel (bool, default True) Produce output in a csv format for easy pasting into excel.
    • True, use the provided separator for csv pasting.
    • False, write a string representation of the object to the clipboard.
  • sep (str, default ``'\t'``) Field delimiter.
  • **kwargs These parameters will be passed to DataFrame.to_csv.
See Also

DataFrame.to_csv : Write a DataFrame to a comma-separated values (csv) file. read_clipboard : Read text from clipboard and pass to read_csv.

Notes

Requirements for your platform.

  • Linux : xclip, or xsel (with PyQt4 modules)
  • Windows : none
  • macOS : none

This method uses the processes developed for the package pyperclip. A solution to render any output string format is given in the examples.

Examples

Copy the contents of a DataFrame to the clipboard.

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"])
>>> df.to_clipboard(sep=",")  # doctest: +SKIP
... # Wrote the following to the system clipboard:
... # ,A,B,C
... # 0,1,2,3
... # 1,4,5,6

We can omit the index by passing the keyword index and setting it to False.

>>> df.to_clipboard(sep=",", index=False)  # doctest: +SKIP
... # Wrote the following to the system clipboard:
... # A,B,C
... # 1,2,3
... # 4,5,6

Using the original pyperclip package for any string output format.

.. code-block:: python

   import pyperclip

   html = df.style.to_html()
   pyperclip.copy(html)

method

to_xarray()

Return an xarray object from the pandas object.

Returns (xarray.DataArray or xarray.Dataset)

Data in the pandas structure converted to Dataset if the object is a DataFrame, or a DataArray if the object is a Series.

See Also

DataFrame.to_hdf : Write DataFrame to an HDF5 file.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.

Notes

See the xarray docs <https://xarray.pydata.org/en/stable/>__

Examples
>>> df = pd.DataFrame(
...     [
...         ("falcon", "bird", 389.0, 2),
...         ("parrot", "bird", 24.0, 2),
...         ("lion", "mammal", 80.5, 4),
...         ("monkey", "mammal", np.nan, 4),
...     ],
...     columns=["name", "class", "max_speed", "num_legs"],
... )
>>> df
     name   class  max_speed  num_legs
0  falcon    bird      389.0         2
1  parrot    bird       24.0         2
2    lion  mammal       80.5         4
3  monkey  mammal        NaN         4
>>> df.to_xarray()  # doctest: +SKIP
<xarray.Dataset>
Dimensions:    (index: 4)
Coordinates:
  * index      (index) int64 32B 0 1 2 3
Data variables:
    name       (index) object 32B 'falcon' 'parrot' 'lion' 'monkey'
    class      (index) object 32B 'bird' 'bird' 'mammal' 'mammal'
    max_speed  (index) float64 32B 389.0 24.0 80.5 nan
    num_legs   (index) int64 32B 2 2 4 4
>>> df["max_speed"].to_xarray()  # doctest: +SKIP
<xarray.DataArray 'max_speed' (index: 4)>
array([389. ,  24. ,  80.5,   nan])
Coordinates:
  * index    (index) int64 0 1 2 3
>>> dates = pd.to_datetime(
...     ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"]
... )
>>> df_multiindex = pd.DataFrame(
...     {
...         "date": dates,
...         "animal": ["falcon", "parrot", "falcon", "parrot"],
...         "speed": [350, 18, 361, 15],
...     }
... )
>>> df_multiindex = df_multiindex.set_index(["date", "animal"])
>>> df_multiindex
                   speed
date       animal
2018-01-01 falcon    350
           parrot     18
2018-01-02 falcon    361
           parrot     15
>>> df_multiindex.to_xarray()  # doctest: +SKIP
<xarray.Dataset>
Dimensions:  (date: 2, animal: 2)
Coordinates:
  * date     (date) datetime64[s] 2018-01-01 2018-01-02
  * animal   (animal) object 'falcon' 'parrot'
Data variables:
    speed    (date, animal) int64 350 18 361 15
method

to_latex(buf=None, columns=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, bold_rows=False, column_format=None, longtable=None, escape=None, encoding=None, decimal='.', multicolumn=None, multicolumn_format=None, multirow=None, caption=None, label=None, position=None)

Render object to a LaTeX tabular, longtable, or nested table.

Requires \usepackage{booktabs}. The output can be copy/pasted into a main LaTeX document or read from an external file with \input{table.tex}.

.. versionchanged:: 2.0.0 Refactored to use the Styler implementation via jinja2 templating.

Parameters
  • buf (str, Path or StringIO-like, optional, default None) Buffer to write to. If None, the output is returned as a string.
  • columns (list of label, optional) The subset of columns to write. Writes all columns by default.
  • header (bool or list of str, default True) Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names. Braces must be escaped.
  • index (bool, default True) Write row names (index).
  • na_rep (str, default 'NaN') Missing data representation.
  • formatters (list of functions or dict of {str: function}, optional) Formatter functions to apply to columns' elements by position or name. The result of each function must be a unicode string. List must be of length equal to the number of columns.
  • float_format (one-parameter function or str, optional, default None) Formatter for floating point numbers. For example float_format="%.2f" and float_format="{:0.2f}".format will both result in 0.1234 being formatted as 0.12.
  • sparsify (bool, optional) Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row. By default, the value will be read from the config module.
  • index_names (bool, default True) Prints the names of the indexes.
  • bold_rows (bool, default False) Make the row labels bold in the output.
  • column_format (str, optional) The columns format as specified in LaTeX table format <https://en.wikibooks.org/wiki/LaTeX/Tables>__, e.g. 'rcl' for 3 columns. By default, 'l' will be used for all columns except columns of numbers, which default to 'r'.
  • longtable (bool, optional) Use a longtable environment instead of tabular. Requires adding a \usepackage{longtable} to your LaTeX preamble. By default, the value will be read from the pandas config module, and set to True if the option styler.latex.environment is "longtable".
    .. versionchanged:: 2.0.0 The pandas option affecting this argument has changed.
  • escape (bool, optional) By default, the value will be read from the pandas config module and set to True if the option styler.format.escape is "latex". When set to False, LaTeX special characters in column names are not escaped.
    .. versionchanged:: 2.0.0 The pandas option affecting this argument has changed, as has the default value to False.
  • encoding (str, optional) A string representing the encoding to use in the output file, defaults to 'utf-8'.
  • decimal (str, default '.') Character recognized as decimal separator, e.g. ',' in Europe.
  • multicolumn (bool, default True) Use \multicolumn to enhance MultiIndex columns. The default will be read from the config module, and is set as the option styler.sparse.columns.
    .. versionchanged:: 2.0.0 The pandas option affecting this argument has changed.
  • multicolumn_format (str, default 'r') The alignment for multicolumns, similar to column_format. The default will be read from the config module, and is set as the option styler.latex.multicol_align.
    .. versionchanged:: 2.0.0 The pandas option affecting this argument has changed, as has the default value to "r".
  • multirow (bool, default True) Use \multirow to enhance MultiIndex rows. Requires adding a \usepackage{multirow} to your LaTeX preamble. Will print centered labels (instead of top-aligned) across the contained rows, separating groups via clines. The default will be read from the pandas config module, and is set as the option styler.sparse.index.
    .. versionchanged:: 2.0.0 The pandas option affecting this argument has changed, as has the default value to True.
  • caption (str or tuple, optional) Tuple (full_caption, short_caption), which results in \caption[short_caption]{full_caption}; if a single string is passed, no short caption will be set.
  • label (str, optional) The LaTeX label to be placed inside \label{} in the output. This is used with \ref{} in the main .tex file.
  • position (str, optional) The LaTeX positional argument for tables, to be placed after \begin{} in the output.
Returns (str or None)

If buf is None, returns the result as a string. Otherwise returns None.

See Also

io.formats.style.Styler.to_latex : Render a DataFrame to LaTeX with conditional formatting.
DataFrame.to_string : Render a DataFrame to a console-friendly tabular output.
DataFrame.to_html : Render a DataFrame as an HTML table.

Notes

As of v2.0.0 this method has changed to use the Styler implementation as part of Styler.to_latex via jinja2 templating. This means that jinja2 is a requirement, and needs to be installed, for this method to function. It is advised that users switch to using Styler, since that implementation is more frequently updated and contains much more flexibility with the output.

Examples

Convert a general DataFrame to LaTeX with formatting:

>>> df = pd.DataFrame(dict(name=['Raphael', 'Donatello'],
...                        age=[26, 45],
...                        height=[181.23, 177.65]))
>>> print(df.to_latex(index=False,
...                   formatters={"name": str.upper},
...                   float_format="{:.1f}".format,
...                   ))  # doctest: +SKIP
\begin{tabular}{lrr}
\toprule
name & age & height \\
\midrule
RAPHAEL & 26 & 181.2 \\
DONATELLO & 45 & 177.7 \\
\bottomrule
\end{tabular}
method

to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', lineterminator=None, chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors='strict', storage_options=None)

Write object to a comma-separated values (csv) file.

Parameters
  • path_or_buf (str, path object, file-like object, or None, default None) String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a 'b'.
  • sep (str, default ',') String of length 1. Field delimiter for the output file.
  • na_rep (str, default '') Missing data representation.
  • float_format (str, Callable, default None) Format string for floating point numbers. If a Callable is given, it takes precedence over other numeric formatting parameters, like decimal.
  • columns (sequence, optional) Columns to write.
  • header (bool or list of str, default True) Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
  • index (bool, default True) Write row names (index).
  • index_label (str or sequence, or False, default None) Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False, do not print fields for index names. Use index_label=False for easier importing in R.
  • mode ({'w', 'x', 'a'}, default 'w') Forwarded to either open(mode=) or fsspec.open(mode=) to control the file opening. Typical values include:
    • 'w', truncate the file first.
    • 'x', exclusive creation, failing if the file already exists.
    • 'a', append to the end of file if it exists.
  • encoding (str, optional) A string representing the encoding to use in the output file, defaults to 'utf-8'. encoding is not supported if path_or_buf is a non-binary file object.
  • compression (str or dict, default 'infer') For on-the-fly compression of the output data. If 'infer' and 'path_or_buf' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
    May be a dict with key 'method' as compression mode and other entries as additional compression options if compression mode is 'zip'.
    Passing compression options as keys in dict is supported for compression modes 'gzip', 'bz2', 'zstd', and 'zip'.
  • quoting (optional constant from csv module) Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.
  • quotechar (str, default '"') String of length 1. Character used to quote fields.
  • lineterminator (str, optional) The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (e.g. '\n' for Linux, '\r\n' for Windows).
  • chunksize (int or None) Rows to write at a time.
  • date_format (str, default None) Format string for datetime objects.
  • doublequote (bool, default True) Control quoting of quotechar inside a field.
  • escapechar (str, default None) String of length 1. Character used to escape sep and quotechar when appropriate.
  • decimal (str, default '.') Character recognized as decimal separator. E.g. use ',' for European data.
  • errors (str, default 'strict') Specifies how encoding and decoding errors are to be handled. See the errors argument for open for a full list of options.
  • storage_options (dict, optional) Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer to the pandas user guide (https://pandas.pydata.org/docs/user_guide/io.html?highlight=storage_options#reading-writing-remote-files).
Returns (None or str)

If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

See Also

read_csv : Load a CSV file into a DataFrame.
to_excel : Write DataFrame to an Excel file.

Examples

Create 'out.csv' containing 'df' without indices

>>> df = pd.DataFrame(
...     [["Raphael", "red", "sai"], ["Donatello", "purple", "bo staff"]],
...     columns=["name", "mask", "weapon"],
... )
>>> df.to_csv("out.csv", index=False)  # doctest: +SKIP

Create 'out.zip' containing 'out.csv'

>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'
>>> compression_opts = dict(
...     method="zip", archive_name="out.csv"
... )  # doctest: +SKIP
>>> df.to_csv(
...     "out.zip", index=False, compression=compression_opts
... )  # doctest: +SKIP

To write a csv file to a new folder or nested folder, you first need to create the folder using either pathlib or os:

>>> from pathlib import Path  # doctest: +SKIP
>>> filepath = Path("folder/subfolder/out.csv")  # doctest: +SKIP
>>> filepath.parent.mkdir(parents=True, exist_ok=True)  # doctest: +SKIP
>>> df.to_csv(filepath)  # doctest: +SKIP
>>> import os  # doctest: +SKIP
>>> os.makedirs("folder/subfolder", exist_ok=True)  # doctest: +SKIP
>>> df.to_csv("folder/subfolder/out.csv")  # doctest: +SKIP

Format floats to two decimal places:

>>> df.to_csv("out1.csv", float_format="%.2f")  # doctest: +SKIP

Format floats using scientific notation:

>>> df.to_csv("out2.csv", float_format="{:.2e}".format)  # doctest: +SKIP
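
The interaction of sep, na_rep and float_format is easiest to see when the CSV text is returned as a string. The following is a minimal sketch, assuming pandas is installed; the frame contents and variable names are illustrative and not part of the original examples.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Raphael", "Donatello"], "height": [181.23, None]})

# With path_or_buf left as None, the CSV text is returned instead of written to disk.
text = df.to_csv(index=False, sep=";", na_rep="NA", float_format="%.1f")
print(text)

# Reading the text back with the same separator recovers the (rounded) values;
# the "NA" marker is parsed as a missing value again.
restored = pd.read_csv(io.StringIO(text), sep=";", na_values=["NA"])
print(restored)
```

Note that float_format rounds the written text, so the round trip is lossy for the affected columns.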
method

take(indices, axis=0, **kwargs)

Return the elements in the given positional indices along an axis.

This means that we are not indexing according to actual values in the index attribute of the object. We are indexing according to the actual position of the element in the object.

Parameters
  • indices (array-like) An array of ints indicating which positions to take.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) The axis on which to select elements. 0 means that we are selecting rows, 1 means that we are selecting columns. For Series this parameter is unused and defaults to 0.
  • **kwargs For compatibility with numpy.take. Has no effect on the output.
Returns (same type as caller)

An array-like containing the elements taken from the object.

See Also

DataFrame.loc : Select a subset of a DataFrame by labels.
DataFrame.iloc : Select a subset of a DataFrame by positions.
numpy.take : Take elements from an array along an axis.

Examples
>>> df = pd.DataFrame(
...     [
...         ("falcon", "bird", 389.0),
...         ("parrot", "bird", 24.0),
...         ("lion", "mammal", 80.5),
...         ("monkey", "mammal", np.nan),
...     ],
...     columns=["name", "class", "max_speed"],
...     index=[0, 2, 3, 1],
... )
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN

Take elements at positions 0 and 3 along the axis 0 (default).

Note how the actual indices selected (0 and 1) do not correspond to our selected indices 0 and 3. That's because we are selecting the 0th and 3rd rows, not rows whose indices equal 0 and 3.

>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN

Take elements at indices 1 and 2 along the axis 1 (column selection).

>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN

Negative integers select positions counted from the end of the object, just like with Python lists.

>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
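
The difference between positional and label-based selection can be checked side by side. This is a small sketch under the assumption that pandas and NumPy are installed; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["falcon", "parrot", "lion", "monkey"],
        "max_speed": [389.0, 24.0, 80.5, np.nan],
    },
    index=[0, 2, 3, 1],
)

by_position = df.take([0, 3])  # first and fourth rows, whatever their labels
by_iloc = df.iloc[[0, 3]]      # the same positional selection via iloc
by_label = df.loc[[0, 3]]      # rows whose index labels are 0 and 3

print(by_position["name"].tolist())
print(by_label["name"].tolist())
```

Here take and iloc agree, while loc picks a different second row because label 3 sits at position 2.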
method

xs(key, axis=0, level=None, drop_level=True)

Return cross-section from the Series/DataFrame.

This method takes a key argument to select data at a particular level of a MultiIndex.

Parameters
  • key (label or tuple of label) Label contained in the index, or partially in a MultiIndex.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) Axis to retrieve cross-section on.
  • level (object, defaults to first n levels (n=1 or len(key))) In case of a key partially contained in a MultiIndex, indicate which levels are used. Levels can be referred by label or position.
  • drop_level (bool, default True) If False, returns object with same levels as self.
Returns (Series or DataFrame)

Cross-section from the original Series or DataFrame corresponding to the selected index levels.

See Also

DataFrame.loc : Access a group of rows and columns by label(s) or a boolean array.
DataFrame.iloc : Purely integer-location based indexing for selection by position.

Notes

xs cannot be used to set values.

MultiIndex Slicers is a generic way to get/set values on any level or levels. It is a superset of xs functionality, see the "MultiIndex Slicers" section of the pandas advanced indexing guide.

Examples
>>> d = {
...     "num_legs": [4, 4, 2, 2],
...     "num_wings": [0, 0, 2, 2],
...     "class": ["mammal", "mammal", "mammal", "bird"],
...     "animal": ["cat", "dog", "bat", "penguin"],
...     "locomotion": ["walks", "walks", "flies", "walks"],
... }
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(["class", "animal", "locomotion"])
>>> df
                           num_legs  num_wings
class  animal  locomotion
mammal cat     walks              4          0
       dog     walks              4          0
       bat     flies              2          2
bird   penguin walks              2          2

Get values at specified index

>>> df.xs("mammal")
                   num_legs  num_wings
animal locomotion
cat    walks              4          0
dog    walks              4          0
bat    flies              2          2

Get values at several indexes

>>> df.xs(("mammal", "dog", "walks"))
num_legs     4
num_wings    0
Name: (mammal, dog, walks), dtype: int64

Get values at specified index and level

>>> df.xs("cat", level=1)
                   num_legs  num_wings
class  locomotion
mammal walks              4          0

Get values at several indexes and levels

>>> df.xs(("bird", "walks"), level=[0, "locomotion"])
         num_legs  num_wings
animal
penguin         2          2

Get values at specified column and axis

>>> df.xs("num_wings", axis=1)
class   animal   locomotion
mammal  cat      walks         0
        dog      walks         0
        bat      flies         2
bird    penguin  walks         2
Name: num_wings, dtype: int64
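
The effect of drop_level is not shown in the examples above. The sketch below, assuming pandas is installed and reusing the same illustrative animal data, compares the index levels that remain with and without it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "num_legs": [4, 4, 2, 2],
        "num_wings": [0, 0, 2, 2],
        "class": ["mammal", "mammal", "mammal", "bird"],
        "animal": ["cat", "dog", "bat", "penguin"],
        "locomotion": ["walks", "walks", "flies", "walks"],
    }
).set_index(["class", "animal", "locomotion"])

# Default: the level used for the selection is removed from the result.
dropped = df.xs("mammal")

# drop_level=False keeps all three index levels on the selected rows.
kept = df.xs("mammal", drop_level=False)

print(list(dropped.index.names))
print(list(kept.index.names))
```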
method

__delitem__(key)

Delete item

method

get(key, default=None)

Get item from object for given key (ex: DataFrame column).

Returns default value if not found.

Parameters
  • key (object) Key for which item should be returned.
  • default (object, default None) Default value to return if key is not found.
Returns (same type as items contained in object)

Item for given key or default value, if key is not found.

See Also

DataFrame.get : Get item from object for given key (ex: DataFrame column).
Series.get : Get item from object for given key (ex: DataFrame column).

Examples
>>> df = pd.DataFrame(
...     [
...         [24.3, 75.7, "high"],
...         [31, 87.8, "high"],
...         [22, 71.6, "medium"],
...         [35, 95, "medium"],
...     ],
...     columns=["temp_celsius", "temp_fahrenheit", "windspeed"],
...     index=pd.date_range(start="2014-02-12", end="2014-02-15", freq="D"),
... )
>>> df
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          24.3             75.7      high
2014-02-13          31.0             87.8      high
2014-02-14          22.0             71.6    medium
2014-02-15          35.0             95.0    medium
>>> df.get(["temp_celsius", "windspeed"])
            temp_celsius windspeed
2014-02-12          24.3      high
2014-02-13          31.0      high
2014-02-14          22.0    medium
2014-02-15          35.0    medium
>>> ser = df["windspeed"]
>>> ser.get("2014-02-13")
'high'

If the key isn't found, the default value will be used.

>>> df.get(["temp_celsius", "temp_kelvin"], default="default_value")
'default_value'
>>> ser.get("2014-02-10", "[unknown]")
'[unknown]'
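
A common use of get is to handle an optional column without catching KeyError. The following is a hedged sketch, assuming pandas is installed; the temp_kelvin column and the fallback computation are hypothetical and only serve as illustration.

```python
import pandas as pd

df = pd.DataFrame({"temp_celsius": [24.3, 31.0], "windspeed": ["high", "high"]})

# An existing column comes back as a Series.
temps = df.get("temp_celsius")

# A missing column returns the default (None unless specified) instead of raising.
kelvin = df.get("temp_kelvin")
if kelvin is None:
    # Hypothetical fallback: derive the missing column from one that exists.
    kelvin = df["temp_celsius"] + 273.15

print(kelvin.tolist())
```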
method

reindex_like(other, method=None, copy=<no_default>, limit=None, tolerance=None)

Return an object with matching indices as other object.

Conform the object to the same index on all axes. Optional filling logic, placing NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

Parameters
  • other (Object of the same data type) Its row and column indices are used to define the new indices of this object.
  • method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
    .. deprecated:: 3.0.0
    • None (default): don't fill gaps
    • pad / ffill: propagate last valid observation forward to next valid
    • backfill / bfill: use next valid observation to fill gap
    • nearest: use nearest valid observations to fill gap.
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    .. deprecated:: 3.0.0
    This keyword is ignored and will be removed in pandas 4.0. Since
    pandas 3.0, this method always returns a new object using a lazy
    copy mechanism that defers copies until necessary
    (Copy-on-Write). See the `user guide on Copy-on-Write
    <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html>`__
    for more details.
    
  • limit (int, default None) Maximum number of consecutive labels to fill for inexact matches.
  • tolerance (optional) Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
    Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
Returns (Series or DataFrame)

Same type as caller, but with changed indices on each axis.

See Also

DataFrame.set_index : Set row labels.
DataFrame.reset_index : Remove row labels or move them to new columns.
DataFrame.reindex : Change to new indices or expand indices.

Notes

Same as calling .reindex(index=other.index, columns=other.columns,...).

Examples
>>> df1 = pd.DataFrame(
...     [
...         [24.3, 75.7, "high"],
...         [31, 87.8, "high"],
...         [22, 71.6, "medium"],
...         [35, 95, "medium"],
...     ],
...     columns=["temp_celsius", "temp_fahrenheit", "windspeed"],
...     index=pd.date_range(start="2014-02-12", end="2014-02-15", freq="D"),
... )
>>> df1
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          24.3             75.7      high
2014-02-13          31.0             87.8      high
2014-02-14          22.0             71.6    medium
2014-02-15          35.0             95.0    medium
>>> df2 = pd.DataFrame(
...     [[28, "low"], [30, "low"], [35.1, "medium"]],
...     columns=["temp_celsius", "windspeed"],
...     index=pd.DatetimeIndex(["2014-02-12", "2014-02-13", "2014-02-15"]),
... )
>>> df2
            temp_celsius windspeed
2014-02-12          28.0       low
2014-02-13          30.0       low
2014-02-15          35.1    medium
>>> df2.reindex_like(df1)
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          28.0              NaN       low
2014-02-13          30.0              NaN       low
2014-02-14           NaN              NaN       NaN
2014-02-15          35.1              NaN    medium
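
The equivalence stated in the Notes can be verified directly. This is a minimal sketch, assuming pandas is installed; the small frames and their labels are illustrative stand-ins for the weather data above.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"temp_celsius": [24.3, 31.0, 22.0], "windspeed": ["high", "high", "medium"]},
    index=["a", "b", "c"],
)
df2 = pd.DataFrame({"temp_celsius": [28.0, 35.1]}, index=["a", "c"])

# Conform df2 to the row and column labels of df1.
via_like = df2.reindex_like(df1)

# The explicit form described in the Notes.
via_reindex = df2.reindex(index=df1.index, columns=df1.columns)

print(via_like.equals(via_reindex))
print(via_like)
```

Rows and columns that exist only in df1 are filled with missing values in the result.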
method

add_prefix(prefix, axis=None)

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters
  • prefix (str) The string to add before each label.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) Axis to add prefix on
    .. versionadded:: 2.0.0
Returns (Series or DataFrame)

New Series or DataFrame with updated labels.

See Also

Series.add_suffix: Suffix row labels with string suffix.
DataFrame.add_suffix: Suffix column labels with string suffix.

Examples
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_prefix("item_")
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_prefix("col_")
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6
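
The axis parameter is not exercised in the examples above. The sketch below, assuming pandas 2.0 or later is installed, prefixes row labels instead of column labels and compares the default behaviour with an equivalent rename call; the names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

cols = df.add_prefix("col_")          # default for a DataFrame: column labels
rows = df.add_prefix("row_", axis=0)  # axis=0: row labels (requires pandas >= 2.0)

# The default behaviour matches renaming the columns with a function.
renamed = df.rename(columns=lambda label: f"col_{label}")

print(list(cols.columns))
print(list(rows.index))
```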
method

add_suffix(suffix, axis=None)

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters
  • suffix (str) The string to add after each label.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) Axis to add suffix on
    .. versionadded:: 2.0.0
Returns (Series or DataFrame)

New Series or DataFrame with updated labels.

See Also

Series.add_prefix: Prefix row labels with string prefix.
DataFrame.add_prefix: Prefix column labels with string prefix.

Examples
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_suffix("_item")
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_suffix("_col")
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6
method

filter(items=None, like=None, regex=None, axis=None)

Subset the DataFrame or Series according to the specified index labels.

For DataFrame, filter rows or columns depending on axis argument. Note that this routine does not filter based on content. The filter is applied to the labels of the index.

Parameters
  • items (list-like) Keep labels from axis which are in items.
  • like (str) Keep labels from axis for which "like in label == True".
  • regex (str (regular expression)) Keep labels from axis for which re.search(regex, label) == True.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, 'columns' for DataFrame. For Series this parameter is unused and defaults to None.
Returns (Same type as caller)

The filtered subset of the DataFrame or Series.

See Also

DataFrame.loc : Access a group of rows and columns by label(s) or a boolean array.

Notes

The items, like, and regex parameters are enforced to be mutually exclusive.

axis defaults to the info axis that is used when indexing with [].

Examples
>>> df = pd.DataFrame(
...     np.array(([1, 2, 3], [4, 5, 6])),
...     index=["mouse", "rabbit"],
...     columns=["one", "two", "three"],
... )
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6
>>> # select columns by name
>>> df.filter(items=["one", "three"])
         one  three
mouse     1      3
rabbit    4      6
>>> # select columns by regular expression
>>> df.filter(regex="e$", axis=1)
         one  three
mouse     1      3
rabbit    4      6
>>> # select rows containing 'bbi'
>>> df.filter(like="bbi", axis=0)
         one  two  three
rabbit    4    5      6
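
The mutual exclusivity mentioned in the Notes can be observed by passing two selectors at once. This is a sketch assuming pandas and NumPy are installed; the exact wording of the error message may differ between pandas versions, so it is only captured here, not relied upon.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6]]),
    index=["mouse", "rabbit"],
    columns=["one", "two", "three"],
)

# Labels are matched, not cell values: keep columns whose name starts with "t".
t_cols = df.filter(regex="^t")
print(list(t_cols.columns))

# Passing more than one of items, like and regex is rejected.
error = None
try:
    df.filter(items=["one"], like="t")
except TypeError as exc:
    error = str(exc)
print(error)
```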
method

head(n=5)

Return the first n rows.

This function exhibits the same behavior as df[:n], returning the first n rows based on position. It is useful for quickly checking if your object has the right type of data in it.

When n is positive, it returns the first n rows. For n equal to 0, it returns an empty object. When n is negative, it returns all rows except the last |n| rows, mirroring the behavior of df[:n].

If n is larger than the number of rows, this function returns all rows.

Parameters
  • n (int, default 5) Number of rows to select.
Returns (same type as caller)

The first n rows of the caller object.

See Also

DataFrame.tail: Returns the last n rows.

Examples
>>> df = pd.DataFrame(
...     {
...         "animal": [
...             "alligator",
...             "bee",
...             "falcon",
...             "lion",
...             "monkey",
...             "parrot",
...             "shark",
...             "whale",
...             "zebra",
...         ]
...     }
... )
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra

Viewing the first 5 lines

>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey

Viewing the first n lines (three in this case)

>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon

For negative values of n

>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
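
The edge cases described above (n equal to 0, negative n, and n larger than the number of rows) can be checked against plain positional slicing. A minimal sketch, assuming pandas is installed; the four-row frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"animal": ["alligator", "bee", "falcon", "lion"]})

first_two = df.head(2)      # same rows as df.iloc[:2]
all_but_last = df.head(-1)  # same rows as df.iloc[:-1]
oversized = df.head(10)     # n exceeds the row count: every row is returned
empty = df.head(0)          # no rows, but the columns are kept

print(len(first_two), len(all_but_last), len(oversized), len(empty))
```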
method

tail(n=5)

Return the last n rows.

This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

For negative values of n, this function returns all rows except the first |n| rows, equivalent to df[|n|:].

If n is larger than the number of rows, this function returns all rows.

Parameters
  • n (int, default 5) Number of rows to select.
Returns (type of caller)

The last n rows of the caller object.

See Also

DataFrame.head : The first n rows of the caller object.

Examples
>>> df = pd.DataFrame(
...     {
...         "animal": [
...             "alligator",
...             "bee",
...             "falcon",
...             "lion",
...             "monkey",
...             "parrot",
...             "shark",
...             "whale",
...             "zebra",
...         ]
...     }
... )
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra

Viewing the last 5 lines

>>> df.tail()
   animal
4  monkey
5  parrot
6   shark
7   whale
8   zebra

Viewing the last n lines (three in this case)

>>> df.tail(3)
  animal
6  shark
7  whale
8  zebra

For negative values of n

>>> df.tail(-3)
   animal
3    lion
4  monkey
5  parrot
6   shark
7   whale
8   zebra
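
As the description notes, tail is handy right after sorting. The sketch below, assuming pandas is installed and using made-up speed values, picks the rows with the largest values and also checks the negative-n behaviour against positional slicing.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "animal": ["falcon", "parrot", "lion", "monkey"],
        "max_speed": [389.0, 24.0, 80.5, 55.0],
    }
)

# After an ascending sort, the last rows hold the largest values.
fastest = df.sort_values("max_speed").tail(2)
print(fastest["animal"].tolist())

# A negative n drops the first |n| rows.
skip_first = df.tail(-1)
print(skip_first["animal"].tolist())
```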
method

sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

Return a random sample of items from an axis of object.

You can use random_state for reproducibility.

Parameters
  • n (int, optional) Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
  • frac (float, optional) Fraction of axis items to return. Cannot be used with n.
  • replace (bool, default False) Allow or disallow sampling of the same row more than once.
  • weights (str or ndarray-like, optional) Default None results in equal probability weighting. If passed a Series, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DataFrame, will accept the name of a column when axis = 0. Unless weights are a Series, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed. When replace = False will not allow (n * max(weights) / sum(weights)) > 1 in order to avoid biased results. See the Notes below for more details.
  • random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given. Default None results in sampling with the current state of np.random.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) Axis to sample. Accepts axis number or name. Default is stat axis for given data type. For Series this parameter is unused and defaults to None.
  • ignore_index (bool, default False) If True, the resulting index will be labeled 0, 1, …, n - 1.
Returns (Series or DataFrame)

A new object of same type as caller containing n items randomly sampled from the caller object.

See Also

DataFrameGroupBy.sample: Generates random samples from each group of a DataFrame object.
SeriesGroupBy.sample: Generates random samples from each group of a Series object.
numpy.random.choice: Generates a random sample from a given 1-D numpy array.

Notes

If frac > 1, replacement should be set to True.

When replace = False will not allow (n * max(weights) / sum(weights)) > 1, since that would cause results to be biased. E.g. sampling 2 items without replacement with weights [100, 1, 1] would yield two last items in 1/2 of cases, instead of 1/102. This is similar to specifying n=4 without replacement on a Series with 3 elements.

Examples
>>> df = pd.DataFrame(
...     {
...         "num_legs": [2, 4, 8, 0],
...         "num_wings": [2, 0, 0, 0],
...         "num_specimen_seen": [10, 2, 1, 8],
...     },
...     index=["falcon", "dog", "spider", "fish"],
... )
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Extract 3 random elements from the Series df['num_legs']: Note that we use random_state to ensure the reproducibility of the examples.

>>> df["num_legs"].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64

A random 50% sample of the DataFrame with replacement:

>>> df.sample(frac=0.5, replace=True, random_state=1)
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8

An upsample sample of the DataFrame with replacement: Note that replace parameter has to be True for frac parameter > 1.

>>> df.sample(frac=2, replace=True, random_state=1)
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2

Using a DataFrame column as weights. Rows with larger value in the num_specimen_seen column are more likely to be sampled.

>>> df.sample(n=2, weights="num_specimen_seen", random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
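
Two behaviours described above lend themselves to a quick check: a fixed random_state makes the draw repeatable, and frac > 1 without replacement is rejected. The sketch assumes pandas is installed; the seed values are arbitrary, and no particular sampled rows are implied.

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 8, 0]}, index=["falcon", "dog", "spider", "fish"]
)

# The same seed yields the same sample on repeated calls.
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)
print(first.equals(second))

# Upsampling requires replace=True; otherwise a ValueError is raised.
upsample_error = None
try:
    df.sample(frac=2)
except ValueError as exc:
    upsample_error = str(exc)
print(upsample_error)

# frac=1 shuffles all rows; ignore_index=True relabels them 0..n-1.
shuffled = df.sample(frac=1, random_state=0, ignore_index=True)
print(shuffled)
```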
method

pipe(func, *args, **kwargs)

Apply chainable functions that expect Series or DataFrames.

Parameters
  • func (function) Function to apply to the Series/DataFrame. args and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame.
  • *args (iterable, optional) Positional arguments passed into func.
  • **kwargs (mapping, optional) A dictionary of keyword arguments passed into func.
Returns (the return type of func)

The result of applying func to the Series or DataFrame.

See Also

DataFrame.apply : Apply a function along input axis of DataFrame.
DataFrame.map : Apply a function elementwise on a whole DataFrame.
Series.map : Apply a mapping correspondence on a Series.

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects.

Examples

Constructing an income DataFrame from a nested list.

>>> data = [[8000, 1000], [9500, np.nan], [5000, 2000]]
>>> df = pd.DataFrame(data, columns=["Salary", "Others"])
>>> df
   Salary  Others
0    8000  1000.0
1    9500     NaN
2    5000  2000.0

Functions that perform tax reductions on an income DataFrame.

>>> def subtract_federal_tax(df):
...     return df * 0.9
>>> def subtract_state_tax(df, rate):
...     return df * (1 - rate)
>>> def subtract_national_insurance(df, rate, rate_increase):
...     new_rate = rate + rate_increase
...     return df * (1 - new_rate)

Instead of writing

>>> subtract_national_insurance(
...     subtract_state_tax(subtract_federal_tax(df), rate=0.12),
...     rate=0.05,
...     rate_increase=0.02,
... )  # doctest: +SKIP

You can write

>>> (
...     df.pipe(subtract_federal_tax)
...     .pipe(subtract_state_tax, rate=0.12)
...     .pipe(subtract_national_insurance, rate=0.05, rate_increase=0.02)
... )
    Salary   Others
0  5892.48   736.56
1  6997.32      NaN
2  3682.80  1473.12

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose subtract_national_insurance takes its data as df in the second argument:

>>> def subtract_national_insurance(rate, df, rate_increase):
...     new_rate = rate + rate_increase
...     return df * (1 - new_rate)
>>> (
...     df.pipe(subtract_federal_tax)
...     .pipe(subtract_state_tax, rate=0.12)
...     .pipe(
...         (subtract_national_insurance, "df"), rate=0.05, rate_increase=0.02
...     )
... )
    Salary   Others
0  5892.48   736.56
1  6997.32      NaN
2  3682.80  1473.12
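
The Notes above mention that .pipe also works on GroupBy objects, which the examples do not show. The following is a minimal sketch of that case; the helper value_spread and the column names k and v are hypothetical and chosen only for illustration.

```python
import pandas as pd


def value_spread(grouped):
    """Hypothetical helper: max minus min of column 'v' within each group."""
    return grouped["v"].max() - grouped["v"].min()


df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 4, 10]})

# GroupBy.pipe passes the GroupBy object itself to the function,
# so the helper can be chained like any other step.
spread = df.groupby("k").pipe(value_spread)
print(spread)
```

Writing the step as a named function keeps the chain readable when several grouped reductions are combined.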
method

__finalize__(other, method=None, **kwargs) → Self

Propagate metadata from other to self.

This is the default implementation. Subclasses may override this method to implement their own metadata handling.

Parameters
  • other The object from which to get the attributes that we are going to propagate. If other has an input_objs attribute, then this attribute must contain an iterable of objects, each with an attrs attribute.
  • method (str, optional) A passed method name providing context on where __finalize__ was called.
    Warning: The values passed as method are not currently considered stable across pandas releases.

Notes

In case other has an input_objs attribute, this method only propagates its metadata if each object in input_objs has the exact same metadata as the others.

method

__getattr__(name)

After regular attribute access, try looking up the name. This allows simpler access to columns for interactive use.

method

__setattr__(name, value)

After regular attribute access, try setting the name. This allows simpler access to columns for interactive use.
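
As a minimal sketch of what these two hooks enable, the snippet below reads and overwrites an existing column through attribute syntax. The column name score is hypothetical. Note that pandas only routes attribute assignment to a column that already exists; creating a new column still requires item assignment such as df["new"] = ....

```python
import pandas as pd

df = pd.DataFrame({"score": [1, 2, 3]})

# Attribute lookup falls back to the column of the same name.
same = df.score.equals(df["score"])

# Attribute assignment updates the existing column in place.
df.score = [10, 20, 30]
updated = df["score"].tolist()
print(same, updated)
```

Attribute access is convenient interactively, but item access (df["score"]) is the safer choice in library code because column names can collide with DataFrame methods.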

method

astype(dtype, copy=<no_default>, errors='raise')

Cast a pandas object to a specified dtype.

This method allows the conversion of the data types of pandas objects, including DataFrames and Series, to the specified dtype. It supports casting entire objects to a single data type or applying different data types to individual columns using a mapping.

Parameters
  • dtype (str, data type, Series or Mapping of column name -> data type) Use a str, numpy.dtype, pandas.ExtensionDtype or Python type to cast entire pandas object to the same type. Alternatively, use a mapping, e.g. {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame's columns to column-specific types.
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    Deprecated since version 3.0.0: This keyword is ignored and will be removed in pandas 4.0. Since pandas 3.0, this method always returns a new object using a lazy copy mechanism that defers copies until necessary (Copy-on-Write). See the user guide on Copy-on-Write (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html) for more details.
  • errors ({'raise', 'ignore'}, default 'raise') Control raising of exceptions on invalid data for provided dtype.
    • raise : allow exceptions to be raised
    • ignore : suppress exceptions. On error return original object.
Returns (same type as caller)

The pandas object cast to the specified dtype.

See Also

to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to a numeric type.
numpy.ndarray.astype : Cast a numpy array to a specified type.

Notes

Changed in version 2.0.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype will raise an exception. Use Series.dt.tz_localize instead.

Examples

Create a DataFrame:

>>> d = {"col1": [1, 2], "col2": [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype("int32").dtypes
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({"col1": "int32"}).dtypes
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype="int32")
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype("int64")
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype("category")
0    1
1    2
dtype: category
Categories (2, int32): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> from pandas.api.types import CategoricalDtype
>>> cat_dtype = CategoricalDtype(categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Create a series of dates:

>>> ser_date = pd.Series(pd.date_range("20200101", periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[us]
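
The Notes above say that timezone-naive data should be made timezone-aware with Series.dt.tz_localize rather than astype. A minimal sketch of that route, using "UTC" as an arbitrary example zone:

```python
import pandas as pd

naive = pd.Series(pd.date_range("2020-01-01", periods=2))

# tz_localize attaches a timezone to naive timestamps without
# shifting the wall-clock values.
aware = naive.dt.tz_localize("UTC")
print(aware.dtype)
```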
method

copy(deep=True)

Make a copy of this object's indices and data.

When deep=True (default), a new object will be created with a copy of the calling object's data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object's data or index (only references to the data and index are copied). With Copy-on-Write, changes to the original will not be reflected in the shallow copy (and vice versa). The shallow copy uses a lazy (deferred) copy mechanism that copies the data only when any changes to the original or shallow copy are made, ensuring memory efficiency while maintaining data integrity.

Note: In pandas versions prior to 3.0, the default behavior without Copy-on-Write was different: changes to the original were reflected in the shallow copy (and vice versa). See the Copy-on-Write user guide for more information.

Parameters
  • deep (bool, default True) Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
Returns (Series or DataFrame)

Object type matches caller.

See Also

copy.copy : Return a shallow copy of an object.
copy.deepcopy : Return a deep copy of an object.

Notes

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.

Since pandas is not thread safe, see the thread-safety section of the gotchas page when copying in a threading environment.

Copy-on-Write protects shallow copies against accidental modifications. This means that any changes to the copied data would make a new copy of the data upon write (and vice versa). Changes made to either the original or copied variable would not be reflected in the counterpart. See the Copy-on-Write user guide for more information.

Examples

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64
>>> s_copy = s.copy(deep=True)
>>> s_copy
a    1
b    2
dtype: int64

Due to Copy-on-Write, shallow copies are still protected against modifications of the original. Note that shallow is not modified below.

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> shallow = s.copy(deep=False)
>>> s.iloc[1] = 200
>>> shallow
a    1
b    2
dtype: int64

When the data has object dtype, even a deep copy does not copy the underlying Python objects. Updating a nested data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
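
When fully independent nested objects are needed, one option is to deep-copy each element explicitly. The following is a sketch of that workaround, not something the copy method does itself; it assumes the elements are ordinary Python objects that copy.deepcopy can handle.

```python
import copy

import pandas as pd

s = pd.Series([[1, 2], [3, 4]])

# Map copy.deepcopy over the elements so that every nested list
# in the new Series is a separate Python object.
independent = s.map(copy.deepcopy)

# Mutating a list held by the original no longer affects the copy.
s.iloc[0][0] = 10
print(s.iloc[0], independent.iloc[0])
```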
method

__deepcopy__(memo=None) → Self

method

infer_objects(copy=<no_default>)

Attempt to infer better dtypes for object columns.

Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.

Parameters
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    Deprecated since version 3.0.0: This keyword is ignored and will be removed in pandas 4.0. Since pandas 3.0, this method always returns a new object using a lazy copy mechanism that defers copies until necessary (Copy-on-Write). See the user guide on Copy-on-Write (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html) for more details.
Returns (same type as input object)

Returns an object of the same type as the input object.

See Also

to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to numeric type.
convert_dtypes : Convert argument to best possible dtype.

Examples

>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
>>> df = df.iloc[1:]
>>> df
   A
1  1
2  2
3  3
>>> df.dtypes
A    object
dtype: object
>>> df.infer_objects().dtypes
A    int64
dtype: object
method

convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True, dtype_backend='numpy_nullable')

Convert columns from numpy dtypes to the best dtypes that support pd.NA.

Parameters
  • infer_objects (bool, default True) Whether object dtypes should be converted to the best possible types.
  • convert_string (bool, default True) Whether object dtypes should be converted to StringDtype().
  • convert_integer (bool, default True) Whether, if possible, conversion can be done to integer extension types.
  • convert_boolean (bool, default True) Whether object dtypes should be converted to BooleanDtype().
  • convert_floating (bool, default True) Whether, if possible, conversion can be done to floating extension types. If convert_integer is also True, preference will be given to integer dtypes if the floats can be faithfully cast to integers.
  • dtype_backend ({'numpy_nullable', 'pyarrow'}, default 'numpy_nullable') Back-end data type applied to the resultant DataFrame or Series (still experimental). Behaviour is as follows:
    • "numpy_nullable": returns a nullable-dtype-backed DataFrame or Series.
    • "pyarrow": returns a pyarrow-backed nullable ArrowDtype DataFrame or Series.
    Added in version 2.0.
Returns (Series or DataFrame)

Copy of input object with new dtype.

See Also

infer_objects : Infer dtypes of objects.
to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to a numeric type.

Notes

By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame) to dtypes that support pd.NA. By using the options convert_string, convert_integer, convert_boolean and convert_floating, it is possible to turn off individual conversions to StringDtype, the integer extension types, BooleanDtype or floating extension types, respectively.

For object-dtyped columns, if infer_objects is True, use the inference rules as during normal Series/DataFrame construction. Then, if possible, convert to StringDtype, BooleanDtype or an appropriate integer or floating extension type, otherwise leave as object.

If the dtype is integer, convert to an appropriate integer extension type.

If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type. Otherwise, convert to an appropriate floating extension type.

In the future, as new dtypes are added that support pd.NA, the results of this method will change to support those new dtypes.

Examples

>>> df = pd.DataFrame(
...     {
...         "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...         "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...         "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...         "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...         "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...         "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...     }
... )

Start with a DataFrame with default dtypes.

>>> df
   a  b      c    d     e      f
0  1  x   True    h  10.0    NaN
1  2  y  False    i   NaN  100.5
2  3  z    NaN  NaN  20.0  200.0
>>> df.dtypes
a      int32
b     object
c     object
d     object
e    float64
f    float64
dtype: object

Convert the DataFrame to use best possible dtypes.

>>> dfn = df.convert_dtypes()
>>> dfn
   a  b      c     d     e      f
0  1  x   True     h    10   <NA>
1  2  y  False     i  <NA>  100.5
2  3  z   <NA>  <NA>    20  200.0
>>> dfn.dtypes
a      Int32
b     string
c    boolean
d     string
e      Int64
f    Float64
dtype: object

Start with a Series of strings and missing data represented by np.nan.

>>> s = pd.Series(["a", "b", np.nan])
>>> s
0      a
1      b
2    NaN
dtype: str

Obtain a Series with dtype StringDtype.

>>> s.convert_dtypes()
0       a
1       b
2    <NA>
dtype: string
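
The Notes above describe switching off individual conversions. As a minimal sketch, the snippet below compares the default call with convert_integer=False; the column names n and x are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "n": pd.Series([1, 2, 3], dtype="int64"),
        "x": pd.Series([10, np.nan, 20], dtype="float64"),
    }
)

# Default: integers become Int64, and whole-number floats are also
# promoted to Int64 because convert_integer takes preference.
default = df.convert_dtypes()

# With integer conversion turned off, the int64 column is left alone
# and the float column becomes the nullable Float64 type instead.
no_int = df.convert_dtypes(convert_integer=False)

print(default.dtypes)
print(no_int.dtypes)
```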
method

fillna(value, axis=None, inplace=False, limit=None)

Fill NA/NaN values with value.

Parameters
  • value (scalar, dict, Series, or DataFrame) Value to use to fill holes (e.g. 0), alternatively a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
  • axis ({0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame) Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.
  • inplace (bool, default False) If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
  • limit (int, default None) This is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
Returns (Series/DataFrame)

Object with missing values filled.

See Also

ffill : Fill values by propagating the last valid observation to next valid.
bfill : Fill values by using the next valid observation to fill the gap.
interpolate : Fill NaN values using interpolation.
reindex : Conform object to new index.
asfreq : Convert TimeSeries to specified frequency.

Notes

For non-object dtype, value=None will use the NA value of the dtype. See more details in the "Filling missing data" section of the user guide.

Examples

>>> df = pd.DataFrame(
...     [
...         [np.nan, 2, np.nan, 0],
...         [3, 4, np.nan, 1],
...         [np.nan, np.nan, np.nan, np.nan],
...         [np.nan, 3, np.nan, 4],
...     ],
...     columns=list("ABCD"),
... )
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

Replace all NaN elements with 0s.

>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1, 2, and 3 respectively.

>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0

Only replace the first NaN element in each column.

>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0

When filling using a DataFrame, replacement happens along the same column names and same indices.

>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0

Note that column D is not affected since it is not present in df2.

method

ffill(axis=None, inplace=False, limit=None, limit_area=None)

Fill NA/NaN values by propagating the last valid observation to next valid.

Parameters
  • axis ({0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame) Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.
  • inplace (bool, default False) If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
  • limit (int, default None) Maximum number of consecutive NaN values to forward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. Must be greater than 0 if not None.
  • limit_area ({`None`, 'inside', 'outside'}, default None) If limit is specified, consecutive NaNs will be filled with this restriction.
    • None: No fill restriction.
    • 'inside': Only fill NaNs surrounded by valid values (interpolate).
    • 'outside': Only fill NaNs outside valid values (extrapolate).
    Added in version 2.2.0.
Returns (Series/DataFrame)

Object with missing values filled.

See Also

DataFrame.bfill : Fill NA/NaN values by using the next valid observation to fill the gap.

Examples

>>> df = pd.DataFrame(
...     [
...         [np.nan, 2, np.nan, 0],
...         [3, 4, np.nan, 1],
...         [np.nan, np.nan, np.nan, np.nan],
...         [np.nan, 3, np.nan, 4],
...     ],
...     columns=list("ABCD"),
... )
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0
>>> df.ffill()
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0
>>> ser = pd.Series([1, np.nan, 2, 3])
>>> ser.ffill()
0   1.0
1   1.0
2   2.0
3   3.0
dtype: float64
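
The examples above do not exercise limit. As a minimal sketch, a gap of three NaNs filled with limit=2 is only partially closed:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Only the first two NaNs after the last valid value are filled;
# the third NaN in the gap is left untouched.
filled = s.ffill(limit=2)
print(filled)
```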
method

bfill(axis=None, inplace=False, limit=None, limit_area=None)

Fill NA/NaN values by using the next valid observation to fill the gap.

This method fills missing values in a backward direction along the specified axis, propagating non-null values from later positions to earlier positions containing NaN.

Parameters
  • axis ({0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame) Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.
  • inplace (bool, default False) If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
  • limit (int, default None) Maximum number of consecutive NaN values to backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. Must be greater than 0 if not None.
  • limit_area ({`None`, 'inside', 'outside'}, default None) If limit is specified, consecutive NaNs will be filled with this restriction.
    • None: No fill restriction.
    • 'inside': Only fill NaNs surrounded by valid values (interpolate).
    • 'outside': Only fill NaNs outside valid values (extrapolate).
    Added in version 2.2.0.
Returns (Series/DataFrame)

Object with missing values filled.

See Also

DataFrame.ffill : Fill NA/NaN values by propagating the last valid observation to next valid.

Examples

For Series:

>>> s = pd.Series([1, None, None, 2])
>>> s.bfill()
0    1.0
1    2.0
2    2.0
3    2.0
dtype: float64
>>> s.bfill(limit=1)
0    1.0
1    NaN
2    2.0
3    2.0
dtype: float64

With DataFrame:

>>> df = pd.DataFrame({"A": [1, None, None, 4], "B": [None, 5, None, 7]})
>>> df
      A     B
0   1.0   NaN
1   NaN   5.0
2   NaN   NaN
3   4.0   7.0
>>> df.bfill()
      A     B
0   1.0   5.0
1   4.0   5.0
2   4.0   7.0
3   4.0   7.0
>>> df.bfill(limit=1)
      A     B
0   1.0   5.0
1   NaN   5.0
2   4.0   7.0
3   4.0   7.0
method

replace(to_replace=None, value=<no_default>, inplace=False, regex=False)

Replace values given in to_replace with value.

Values of the Series/DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters
  • to_replace (str, regex, list, dict, Series, int, float, or None) How to find the values that will be replaced.
    • numeric, str or regex:
      • numeric: numeric values equal to to_replace will be replaced with value
      • str: string exactly matching to_replace will be replaced with value
      • regex: regexes matching to_replace will be replaced with value
    • list of str, regex, or numeric:
      • First, if to_replace and value are both lists, they must be the same length.
      • Second, if regex=True then all of the strings in both lists will be interpreted as regexes otherwise they will match directly. This doesn't matter much for value since there are only a few possible substitution regexes you can use.
      • str, regex and numeric rules apply as above.
    • dict:
      • Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and 'y' with 'z'. To use a dict in this way, the optional value parameter should not be given.
      • For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and the value 'z' in column 'b' and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
      • For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column 'a' for the value 'b' and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
    • None:
      • This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.
    See the examples section for examples of each of these.
  • value (scalar, dict, list, str, regex, default None) Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
  • inplace (bool, default False) If True, performs operation inplace.
  • regex (bool or same types as `to_replace`, default False) Whether to interpret to_replace and/or value as regular expressions. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
Returns (Series/DataFrame)

Object after replacement.

Raises
  • AssertionError
    • If regex is not a bool and to_replace is not None.
  • TypeError
    • If to_replace is not a scalar, array-like, dict, or None
    • If to_replace is a dict and value is not a list, dict, ndarray, or Series
    • If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
    • When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced
  • ValueError
    • If a list or an ndarray is passed to to_replace and value but they are not the same length.
See Also

Series.fillna : Fill NA values.
DataFrame.fillna : Fill NA values.
Series.where : Replace values based on boolean condition.
DataFrame.where : Replace values based on boolean condition.
DataFrame.map : Apply a function to a DataFrame elementwise.
Series.map : Map values of Series according to an input mapping or function.
Series.str.replace : Simple string replacement.

Notes

  • Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
  • Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
  • This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
  • When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.
Examples

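Scalar to_replace and value

>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s.replace(1, 5)
0    5
1    2
2    3
3    4
4    5
dtype: int64

>>> df = pd.DataFrame(
...     {
...         "A": [0, 1, 2, 3, 4],
...         "B": [5, 6, 7, 8, 9],
...         "C": ["a", "b", "c", "d", "e"],
...     }
... )
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like to_replace

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

dict-like to_replace

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({"A": 0, "B": 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({"A": {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression to_replace

>>> df = pd.DataFrame({"A": ["bat", "foo", "bait"], "B": ["abc", "bar", "xyz"]})
>>> df.replace(to_replace=r"^ba.$", value="new", regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({"A": r"^ba.$"}, {"A": "new"}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r"^ba.$", value="new")
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r"^ba.$": "new", "foo": "xyz"})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r"^ba.$", "foo"], value="new")
      A    B
0   new  abc
1   new  new
2  bait  xyz

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, "a", "a", "b", "a"])

When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None):

>>> s.replace({"a": None})
0      10
1    None
2    None
3       b
4    None
dtype: object

If None is explicitly passed for value, it will also be respected:

>>> s.replace("a", None)
0      10
1    None
2    None
3       b
4    None
dtype: object

When regex=True, value is not None and to_replace is a string, the replacement will be applied in all columns of the DataFrame.

>>> df = pd.DataFrame(
...     {
...         "A": [0, 1, 2, 3, 4],
...         "B": ["a", "b", "c", "d", "e"],
...         "C": ["f", "g", "h", "i", "j"],
...     }
... )

>>> df.replace(to_replace="^[a-g]", value="e", regex=True)
   A  B  C
0  0  e  e
1  1  e  e
2  2  e  h
3  3  e  i
4  4  e  j

If value is not None and to_replace is a dictionary, the dictionary keys will be the DataFrame columns that the replacement will be applied.

>>> df.replace(to_replace={"B": "^[a-c]", "C": "^[h-j]"}, value="e", regex=True)
   A  B  C
0  0  e  f
1  1  e  g
2  2  e  e
3  3  d  e
4  4  e  e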
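
The Notes above state that regular expressions only substitute on strings. A minimal sketch of that point, with hypothetical column names num and txt holding the same values as floats and as strings:

```python
import pandas as pd

df = pd.DataFrame({"num": [1.5, 2.5], "txt": ["1.5", "2.5"]})

# The pattern matches the text "1.5"; the float column is not
# searched, so only the string column changes.
out = df.replace(r"^1\.5$", "X", regex=True)
print(out)
```

To replace the numeric value as well, pass it as a number (for example df.replace(1.5, ...)) rather than as a pattern.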

method

interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, **kwargs)

Fill NaN values using an interpolation method.

Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.

Parameters
  • method (str, default 'linear') Interpolation technique to use. One of:
    • 'linear': Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
    • 'time': Works on daily and higher resolution data to interpolate given length of interval. This interpolates values based on time interval between observations.
    • 'index': The interpolation uses the numerical values of the DataFrame's index to linearly calculate missing values.
    • 'values': Interpolation based on the numerical values in the DataFrame, treating them as equally spaced along the index.
    • 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'polynomial': Passed to scipy.interpolate.interp1d, whereas 'spline' is passed to scipy.interpolate.UnivariateSpline. These methods use the numerical values of the index. Both 'polynomial' and 'spline' require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5). Note that the 'slinear' method in pandas refers to the SciPy first-order spline, not a pandas first-order spline.
    • 'krogh', 'piecewise_polynomial', 'spline', 'pchip', 'akima', 'cubicspline': Wrappers around the SciPy interpolation methods of similar names. See Notes.
    • 'from_derivatives': Refers to scipy.interpolate.BPoly.from_derivatives.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) Axis to interpolate along. For Series this parameter is unused and defaults to 0.
  • limit (int, optional) Maximum number of consecutive NaNs to fill. Must be greater than 0.
  • inplace (bool, default False) Update the data in place if possible.
  • limit_direction ({'forward', 'backward', 'both'}, optional, default 'forward') Consecutive NaNs will be filled in this direction.
  • limit_area ({`None`, 'inside', 'outside'}, default None) If limit is specified, consecutive NaNs will be filled with this restriction.
    • None: No fill restriction.
    • 'inside': Only fill NaNs surrounded by valid values (interpolate).
    • 'outside': Only fill NaNs outside valid values (extrapolate).
  • **kwargs (optional) Keyword arguments to pass on to the interpolating function.
Returns (Series or DataFrame)

Returns the same object type as the caller, interpolated at some or all NaN values.

See Also

fillna : Fill missing values using different methods.
scipy.interpolate.Akima1DInterpolator : Piecewise cubic polynomials (Akima interpolator).
scipy.interpolate.BPoly.from_derivatives : Piecewise polynomial in the Bernstein basis.
scipy.interpolate.interp1d : Interpolate a 1-D function.
scipy.interpolate.KroghInterpolator : Interpolate polynomial (Krogh interpolator).
scipy.interpolate.PchipInterpolator : PCHIP 1-d monotonic cubic interpolation.
scipy.interpolate.CubicSpline : Cubic spline data interpolator.

Notes

The 'krogh', 'piecewise_polynomial', 'spline', 'pchip' and 'akima' methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index. For more information on their behavior, see the SciPy documentation (https://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation).

Examples

Filling in NaN in a Series via linear interpolation.

>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

Filling in NaN in a Series via polynomial interpolation or splines: Both 'polynomial' and 'spline' methods require that you also specify an order (int).

>>> s = pd.Series([0, 2, np.nan, 8])
>>> s.interpolate(method="polynomial", order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64

Fill the DataFrame forward (that is, going down) along each column using linear interpolation.

Note how the last entry in column 'a' is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column 'b' remains NaN, because there is no entry before it to use for interpolation.

>>> df = pd.DataFrame(
...     [
...         (0.0, np.nan, -1.0, 1.0),
...         (np.nan, 2.0, np.nan, np.nan),
...         (2.0, 3.0, np.nan, 9.0),
...         (np.nan, 4.0, -4.0, 16.0),
...     ],
...     columns=list("abcd"),
... )
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method="linear", limit_direction="forward", axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0

Using polynomial interpolation.

>>> df["d"].interpolate(method="polynomial", order=2)
0     1.0
1     4.0
2     9.0
3    16.0
Name: d, dtype: float64
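
The limit, limit_direction and limit_area parameters are not exercised in the examples above. The following is a minimal sketch of how they interact with the default linear method, on a Series with a leading NaN, a gap of three, and a trailing NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# limit=1: fill at most one NaN after each valid value (forward).
limited = s.interpolate(limit=1)

# limit_area="inside": fill only NaNs enclosed by valid values.
inside = s.interpolate(limit_area="inside")

# limit_direction="both": also fill the leading NaN.
both = s.interpolate(limit_direction="both")

print(limited.tolist())
print(inside.tolist())
print(both.tolist())
```

With the linear method, NaNs outside the valid range are filled with the nearest valid value rather than extrapolated along the slope.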
method

asof(where, subset=None)

Return the last row(s) without any NaNs before where.

The last row (for each element in where, if list) without any NaN is taken. In case of a DataFrame, the last row without NaN is taken considering only the subset of columns (if not None).

If there is no good value, NaN is returned for a Series, or a Series of NaN values for a DataFrame.

Parameters
  • where (date or array-like of dates) Date(s) before which the last row(s) are returned.
  • subset (str or array-like of str, default `None`) For DataFrame, if not None, only use these columns to check for NaNs.
Returns (scalar, Series, or DataFrame)

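The return can be:

  • scalar: when self is a Series and where is a scalar
  • Series: when self is a Series and where is an array-like, or when self is a DataFrame and where is a scalar
  • DataFrame: when self is a DataFrame and where is an array-like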

See Also

merge_asof : Perform an asof merge. Similar to left join.

Notes

Dates are assumed to be sorted. Raises if this is not the case.

Examples

A Series and a scalar where.

>>> s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
>>> s
10    1.0
20    2.0
30    NaN
40    4.0
dtype: float64
>>> s.asof(20)
np.float64(2.0)

For a sequence where, a Series is returned. The first value is NaN, because the first element of where is before the first index value.

>>> s.asof([5, 20])
5     NaN
20    2.0
dtype: float64

Missing values are not considered. The following is 2.0, not NaN, even though NaN is at the index location for 30.

>>> s.asof(30)
np.float64(2.0)

Take all columns into consideration

>>> df = pd.DataFrame(
...     {
...         "a": [10.0, 20.0, 30.0, 40.0, 50.0],
...         "b": [None, None, None, None, 500],
...     },
...     index=pd.DatetimeIndex(
...         [
...             "2018-02-27 09:01:00",
...             "2018-02-27 09:02:00",
...             "2018-02-27 09:03:00",
...             "2018-02-27 09:04:00",
...             "2018-02-27 09:05:00",
...         ]
...     ),
... )
>>> df.asof(pd.DatetimeIndex(["2018-02-27 09:03:30", "2018-02-27 09:04:30"]))
                      a   b
2018-02-27 09:03:30 NaN NaN
2018-02-27 09:04:30 NaN NaN

Take a single column into consideration

>>> df.asof(
...     pd.DatetimeIndex(["2018-02-27 09:03:30", "2018-02-27 09:04:30"]),
...     subset=["a"],
... )
                        a   b
2018-02-27 09:03:30  30.0 NaN
2018-02-27 09:04:30  40.0 NaN
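
The Notes for asof say the index is assumed to be sorted. The sketch below illustrates what that means in practice. It assumes a recent pandas in which an unsorted index raises ValueError; the data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: the index is deliberately out of order.
unsorted = pd.Series([4.0, 1.0, np.nan, 2.0], index=[40, 10, 30, 20])

try:
    unsorted.asof(20)
    raised = False
except ValueError:
    # Assumption: asof rejects an index that is not sorted.
    raised = True

# Sorting the index first makes the lookup well defined.
ordered = unsorted.sort_index()
value_at_20 = ordered.asof(20)  # last non-NaN value at or before label 20
value_at_35 = ordered.asof(35)  # label 30 holds NaN, so the value at 20 is used
```

Calling sort_index() before asof is a cheap safeguard when the ordering of the input is not guaranteed.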
method

clip(lower=None, upper=None, axis=None, inplace=False, **kwargs)

Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters
  • lower (float or array-like, default None) Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
  • upper (float or array-like, default None) Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) Align object with lower and upper along the given axis. For Series this parameter is unused and defaults to None.
  • inplace (bool, default False) Whether to perform the operation in place on the data.
  • **kwargs Additional keywords have no effect but might be accepted for compatibility with numpy.
Returns (Series or DataFrame)

Same type as calling object with the values outside the clip boundaries replaced.

See Also

Series.clip : Trim values at input threshold in series.
DataFrame.clip : Trim values at input threshold in DataFrame.
numpy.clip : Clip (limit) the values in an array.

Examples
>>> data = {"col_0": [9, -3, 0, -1, 5], "col_1": [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column:

>>> df.clip([-2, -1], [4, 5])
   col_0  col_1
0      4     -1
1     -2     -1
2      0      5
3     -1      5
4      4     -1

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64
>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

Clips using specific lower threshold per column element, with missing values:

>>> t = pd.Series([2, -4, np.nan, 6, 3])
>>> t
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64
>>> df.clip(t, axis=0)
   col_0  col_1
0    9.0    2.0
1   -3.0   -4.0
2    0.0    6.0
3    6.0    8.0
4    5.0    3.0
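
Clipping can be read as an element-wise "raise to the lower bound, then cap at the upper bound". The sketch below is illustrative rather than a description of how pandas implements clip: it shows a one-sided clip and rebuilds the two-sided case by hand with where(). The names non_negative and by_hand are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({"col_0": [9, -3, 0, -1, 5], "col_1": [-2, -7, 6, 8, -5]})

# Only a lower bound: negative values become 0, everything else is untouched.
non_negative = df.clip(lower=0)

# Clipping to [-4, 6] written out by hand:
# keep values >= -4 (else use -4), then keep values <= 6 (else use 6).
by_hand = df.where(df >= -4, -4).where(df <= 6, 6)

# Element-wise comparison of the built-in clip with the hand-written version.
same = (df.clip(-4, 6) == by_hand).all().all()
```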
method

asfreq(freq, method=None, how=None, normalize=False, fill_value=None)

Convert time series to specified frequency.

Returns the original data conformed to a new index with the specified frequency.

If the index of this Series/DataFrame is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).

Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the min and max entries in the original index (see pandas.date_range). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).

The resample method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.

Parameters
  • freq (DateOffset or str) Frequency DateOffset or string.
  • method ({'backfill'/'bfill', 'pad'/'ffill'}, default None) Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):
    • 'pad' / 'ffill': propagate last valid observation forward to next valid based on the order of the index
    • 'backfill' / 'bfill': use NEXT valid observation to fill.
  • how ({'start', 'end'}, default end) For PeriodIndex only (see PeriodIndex.asfreq).
  • normalize (bool, default False) Whether to reset output index to midnight.
  • fill_value (scalar, optional) Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).
Returns (Series/DataFrame)

Series/DataFrame object reindexed to the specified frequency.

See Also

reindex : Conform DataFrame to new index with optional filling logic.

Notes

To learn more about the frequency strings, please see the offset aliases section (timeseries.offset_aliases) of the pandas time series user guide.

Examples

Start by creating a series with 4 one minute timestamps.

>>> index = pd.date_range("1/1/2000", periods=4, freq="min")
>>> series = pd.Series([0.0, None, 2.0, 3.0], index=index)
>>> df = pd.DataFrame({"s": series})
>>> df
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:01:00    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:03:00    3.0

Upsample the series into 30 second bins.

>>> df.asfreq(freq="30s")
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    NaN
2000-01-01 00:03:00    3.0

Upsample again, providing a fill value.

>>> df.asfreq(freq="30s", fill_value=9.0)
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    9.0
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    9.0
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    9.0
2000-01-01 00:03:00    3.0

Upsample again, providing a method.

>>> df.asfreq(freq="30s", method="bfill")
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    2.0
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    3.0
2000-01-01 00:03:00    3.0
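
The description of asfreq makes two claims that are easy to check: the new index matches pd.date_range over the old minimum and maximum, and resample is the better tool when each bin should be aggregated. The sketch below checks both on the frame from the examples. Variable names are illustrative.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=4, freq="min")
df = pd.DataFrame({"s": [0.0, None, 2.0, 3.0]}, index=index)

# asfreq only reindexes: the new index spans min..max of the old one.
upsampled = df.asfreq("30s")
expected_index = pd.date_range(df.index.min(), df.index.max(), freq="30s")
index_matches = upsampled.index.equals(expected_index)

# asfreq at a coarser frequency just picks the rows at 00:00 and 00:02.
picked = df.asfreq("2min")

# resample aggregates instead: each 2-minute bin is reduced to its mean
# (NaN entries are skipped by mean()).
binned_mean = df.resample("2min").mean()
```

Here picked keeps the values 0.0 and 2.0, while binned_mean combines every observation in each bin.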
method

at_time(time, asof=False, axis=None)

Select values at particular time of day (e.g., 9:30AM).

Parameters
  • time (datetime.time or str) The values to select.
  • asof (bool, default False) This parameter is currently not supported.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) For Series this parameter is unused and defaults to 0.
Returns (Series or DataFrame)

The values with the specified time.

Raises
  • TypeError If the index is not a DatetimeIndex.
See Also

between_time : Select values between particular times of the day.
first : Select initial periods of time series based on a date offset.
last : Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time : Get just the index locations for values at particular time of the day.

Examples
>>> i = pd.date_range("2018-04-09", periods=4, freq="12h")
>>> ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-09 12:00:00  2
2018-04-10 00:00:00  3
2018-04-10 12:00:00  4
>>> ts.at_time("12:00")
                     A
2018-04-09 12:00:00  2
2018-04-10 12:00:00  4
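
The time parameter accepts either a datetime.time or a string, and the Raises entry says a TypeError occurs when the index is not a DatetimeIndex. The sketch below exercises both on illustrative data. The frame named plain is hypothetical and exists only to trigger the error.

```python
import datetime

import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="12h")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# A datetime.time object is accepted as well as a string such as "12:00".
noon = ts.at_time(datetime.time(12, 0))

# Hypothetical frame with a default integer index instead of a DatetimeIndex.
plain = pd.DataFrame({"A": [1, 2, 3, 4]})
try:
    plain.at_time("12:00")
    raised = False
except TypeError:
    raised = True
```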
method

between_time(start_time, end_time, inclusive='both', axis=None)

Select values between particular times of the day (e.g., 9:00-9:30 AM).

By setting start_time to be later than end_time, you can get the times that are not between the two times.

Parameters
  • start_time (datetime.time or str) Initial time as a time filter limit.
  • end_time (datetime.time or str) End time as a time filter limit.
  • inclusive ({"both", "neither", "left", "right"}, default "both") Include boundaries; whether to set each bound as closed or open.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) Determine range time on index or columns value. For Series this parameter is unused and defaults to 0.
Returns (Series or DataFrame)

Data from the original object filtered to the specified time range.

Raises
  • TypeError If the index is not a DatetimeIndex.
See Also

at_time : Select values at a particular time of the day.
first : Select initial periods of time series based on a date offset.
last : Select final periods of time series based on a date offset.
DatetimeIndex.indexer_between_time : Get just the index locations for values between particular times of the day.

Examples
>>> i = pd.date_range("2018-04-09", periods=4, freq="1D20min")
>>> ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4
>>> ts.between_time("0:15", "0:45")
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3

You get the times that are not between two times by setting start_time later than end_time:

>>> ts.between_time("0:45", "0:15")
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
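
The inclusive parameter only matters when rows fall exactly on a boundary. The sketch below picks boundaries that coincide with two of the timestamps so that all four settings give different results. It assumes a pandas version that supports the inclusive keyword, and the index is spelled out explicitly for clarity.

```python
import pandas as pd

i = pd.DatetimeIndex(
    [
        "2018-04-09 00:00:00",
        "2018-04-10 00:20:00",
        "2018-04-11 00:40:00",
        "2018-04-12 01:00:00",
    ]
)
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# Two rows sit exactly on the boundaries 00:20 and 00:40.
both = ts.between_time("0:20", "0:40")                          # closed on both sides
left = ts.between_time("0:20", "0:40", inclusive="left")        # keeps 00:20 only
right = ts.between_time("0:20", "0:40", inclusive="right")      # keeps 00:40 only
neither = ts.between_time("0:20", "0:40", inclusive="neither")  # keeps nothing
```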
method

resample(rule, closed=None, label=None, convention='start', on=None, level=None, origin='start_day', offset=None, group_keys=False)

Resample time-series data.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.

Parameters
  • rule (DateOffset, Timedelta or str) The offset string or object representing target conversion.
  • closed ({'right', 'left'}, default None) Which side of bin interval is closed. The default is 'left' for all frequency offsets except for 'ME', 'YE', 'QE', 'BME', 'BA', 'BQE', and 'W' which all have a default of 'right'.
  • label ({'right', 'left'}, default None) Which bin edge label to label bucket with. The default is 'left' for all frequency offsets except for 'ME', 'YE', 'QE', 'BME', 'BA', 'BQE', and 'W' which all have a default of 'right'.
  • convention ({'start', 'end', 's', 'e'}, default 'start') For PeriodIndex only, controls whether to use the start or end of rule.
  • on (str, optional) For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
  • level (str or int, optional) For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.
  • origin (Timestamp or str, default 'start_day') The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be Timestamp convertible or one of the following:
    • 'epoch': origin is 1970-01-01
    • 'start': origin is the first value of the timeseries
    • 'start_day': origin is the first day at midnight of the timeseries
    • 'end': origin is the last value of the timeseries
    • 'end_day': origin is the ceiling midnight of the last day
    Note: only takes effect for Tick-frequencies (i.e. fixed frequencies like days, hours, and minutes, rather than months or quarters).
  • offset (Timedelta or str, default is None) An offset timedelta added to the origin.
  • group_keys (bool, default False) Whether to include the group keys in the result index when using .apply() on the resampled object.
    Changed in version 2.0.0: group_keys now defaults to False.
Returns (pandas.api.typing.Resampler)

Resampler object.

See Also

Series.resample : Resample a Series.
DataFrame.resample : Resample a DataFrame.
groupby : Group Series/DataFrame by mapping, function, label, or list of labels.
asfreq : Reindex a Series/DataFrame with the given frequency without grouping.

Notes

See the user guide (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling) for more.

To learn more about the offset strings, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range("1/1/2000", periods=9, freq="min")
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: min, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample("3min").sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3min, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3).

>>> series.resample("3min", label="right").sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3min, dtype: int64

To include this value close the right side of the bin interval, as shown below.

>>> series.resample("3min", label="right", closed="right").sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3min, dtype: int64
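
The interaction between label and closed can be checked by summing one bin by hand. The sketch below is illustrative: it compares the bucket labelled 00:03 under both settings with an explicit slice of the original series. The names left_bin and right_bin are chosen only for this example.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

start = pd.Timestamp("2000-01-01 00:00:00")
label = pd.Timestamp("2000-01-01 00:03:00")

left_closed = series.resample("3min", label="right").sum()
right_closed = series.resample("3min", label="right", closed="right").sum()

# Bin [00:00, 00:03): the 00:03 observation (value 3) is excluded.
left_bin = series[(series.index >= start) & (series.index < label)].sum()

# Bin (00:00, 00:03]: the 00:03 observation is included, 00:00 is not.
right_bin = series[(series.index > start) & (series.index <= label)].sum()

matches_left = left_closed.loc[label] == left_bin     # 0 + 1 + 2
matches_right = right_closed.loc[label] == right_bin  # 1 + 2 + 3
```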

Upsample the series into 30 second bins.

>>> series.resample("30s").asfreq()[0:5]  # Select first 5 rows
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30s, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the ffill method.

>>> series.resample("30s").ffill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30s, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample("30s").bfill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30s, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(arraylike):
...     return np.sum(arraylike) + 5
>>> series.resample("3min").apply(custom_resampler)
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3min, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using 'start' convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series(
...     [1, 2], index=pd.period_range("2012-01-01", freq="Y", periods=2)
... )
>>> s
2012    1
2013    2
Freq: Y-DEC, dtype: int64
>>> s.resample("Q", convention="start").asfreq()
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using 'end' convention. Values are assigned to the last month of the period.

>>> q = pd.Series(
...     [1, 2, 3, 4], index=pd.period_range("2018-01-01", freq="Q", periods=4)
... )
>>> q
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample("M", convention="end").asfreq()
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> df = pd.DataFrame([10, 11, 9, 13, 14, 18, 17, 19], columns=["price"])
>>> df["volume"] = [50, 60, 40, 100, 50, 100, 40, 50]
>>> df["week_starting"] = pd.date_range("01/01/2018", periods=8, freq="W")
>>> df
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample("ME", on="week_starting").mean()
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range("1/1/2000", periods=4, freq="D")
>>> df2 = pd.DataFrame(
...     [
...         [10, 50],
...         [11, 60],
...         [9, 40],
...         [13, 100],
...         [14, 50],
...         [18, 100],
...         [17, 40],
...         [19, 50],
...     ],
...     columns=["price", "volume"],
...     index=pd.MultiIndex.from_product([days, ["morning", "afternoon"]]),
... )
>>> df2
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample("D", level=0).sum()
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90

If you want to adjust the start of the bins based on a fixed timestamp:

>>> start, end = "2000-10-01 23:30:00", "2000-10-02 00:30:00"
>>> rng = pd.date_range(start, end, freq="7min")
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7min, dtype: int64
>>> ts.resample("17min").sum()
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17min, dtype: int64
>>> ts.resample("17min", origin="epoch").sum()
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17min, dtype: int64
>>> ts.resample("17min", origin="2000-01-01").sum()
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17min, dtype: int64

If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:

>>> ts.resample("17min", origin="start").sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17min, dtype: int64
>>> ts.resample("17min", offset="23h30min").sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17min, dtype: int64

If you want to take the largest Timestamp as the end of the bins:

>>> ts.resample("17min", origin="end").sum()
2000-10-01 23:35:00     0
2000-10-01 23:52:00    18
2000-10-02 00:09:00    27
2000-10-02 00:26:00    63
Freq: 17min, dtype: int64

In contrast with the start_day, you can use end_day to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:

>>> ts.resample("17min", origin="end_day").sum()
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17min, dtype: int64
method

rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False)

Compute numerical data ranks (1 through n) along axis.

By default, equal values are assigned a rank that is the average of the ranks of those values.

Parameters
  • axis ({0 or 'index', 1 or 'columns'}, default 0) Index to direct ranking. For Series this parameter is unused and defaults to 0.
  • method ({'average', 'min', 'max', 'first', 'dense'}, default 'average') How to rank the group of records that have the same value (i.e. ties):
    • average: average rank of the group
    • min: lowest rank in the group
    • max: highest rank in the group
    • first: ranks assigned in order they appear in the array
    • dense: like 'min', but rank always increases by 1 between groups.
  • numeric_only (bool, default False) For DataFrame objects, rank only numeric columns if set to True.
    Changed in version 2.0.0: The default value of numeric_only is now False.
  • na_option ({'keep', 'top', 'bottom'}, default 'keep') How to rank NaN values:
    • keep: assign NaN rank to NaN values
    • top: assign lowest rank to NaN values
    • bottom: assign highest rank to NaN values
  • ascending (bool, default True) Whether or not the elements should be ranked in ascending order.
  • pct (bool, default False) Whether or not to display the returned rankings in percentile form.
Returns (same type as caller)

Return a Series or DataFrame with data ranks as values.

See Also

core.groupby.DataFrameGroupBy.rank : Rank of values within each group.
core.groupby.SeriesGroupBy.rank : Rank of values within each group.

Examples
>>> df = pd.DataFrame(
...     data={
...         "Animal": ["cat", "penguin", "dog", "spider", "snake"],
...         "Number_legs": [4, 2, 4, 8, np.nan],
...     }
... )
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN

Ties are assigned the mean of the ranks (by default) for the group.

>>> s = pd.Series(range(5), index=list("abcde"))
>>> s["d"] = s["b"]
>>> s.rank()
a    1.0
b    2.5
c    4.0
d    2.5
e    5.0
dtype: float64

The following example shows how the method behaves with the above parameters:

  • default_rank: this is the default behaviour obtained without using any parameter.
  • max_rank: with method = 'max', records that have the same value are ranked using the highest rank (e.g. since 'cat' and 'dog' are both in the 2nd and 3rd position, rank 3 is assigned).
  • NA_bottom: choosing na_option = 'bottom', if there are records with NaN values they are placed at the bottom of the ranking.
  • pct_rank: when setting pct = True, the ranking is expressed as percentile rank.
>>> df["default_rank"] = df["Number_legs"].rank()
>>> df["max_rank"] = df["Number_legs"].rank(method="max")
>>> df["NA_bottom"] = df["Number_legs"].rank(na_option="bottom")
>>> df["pct_rank"] = df["Number_legs"].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
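
The five tie-breaking methods are easiest to compare side by side. The sketch below ranks a small Series containing one tie with each method and collects the results in a single frame. The data is illustrative and leaves out the NaN row to keep the focus on ties.

```python
import pandas as pd

legs = pd.Series([4, 2, 4, 8], index=["cat", "penguin", "dog", "spider"])

# One column per tie-breaking method; "cat" and "dog" are tied on 4 legs.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: legs.rank(method=m) for m in methods})
```

With 'dense' the spider gets rank 3 rather than 4, because ranks increase by exactly 1 between groups of tied values.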
method

align(other, join='outer', axis=None, level=None, copy=<no_default>, fill_value=None)

Align two objects on their axes with the specified join method.

Join method is specified for each axis Index.

Parameters
  • other (DataFrame or Series) The object to align with.
  • join ({'outer', 'inner', 'left', 'right'}, default 'outer') Type of alignment to be performed.
    • left: use only keys from left frame, preserve key order.
    • right: use only keys from right frame, preserve key order.
    • outer: use union of keys from both frames, sort keys lexicographically.
    • inner: use intersection of keys from both frames, preserve the order of the left keys.
  • axis (allowed axis of the other object, default None) Align on index (0), columns (1), or both (None).
  • level (int or level name, default None) Broadcast across a level, matching Index values on the passed MultiIndex level.
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    Deprecated since version 3.0.0: This keyword is ignored and will be removed in pandas 4.0. Since pandas 3.0, this method always returns a new object using a lazy copy mechanism that defers copies until necessary (Copy-on-Write). See the user guide on Copy-on-Write (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html) for more details.
  • fill_value (scalar, default np.nan) Value to use for missing values. Defaults to NaN, but can be any "compatible" value.
Returns (tuple of (Series/DataFrame, type of other))

Aligned objects.

See Also

Series.align : Align two objects on their axes with specified join method.
DataFrame.align : Align two objects on their axes with specified join method.

Examples
>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)
>>> left
     D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
       A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default axis=None will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
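
The examples for align all use join="outer". The sketch below, on the same two frames, shows the other end of the spectrum: join="inner" keeps only labels shared by both objects, and fill_value replaces the NaN placeholders an outer join would otherwise introduce. It is an illustration, not an exhaustive tour of the join options.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
)
other = pd.DataFrame(
    [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
    columns=["A", "B", "C", "D"],
    index=[2, 3, 4],
)

# join="inner" keeps only labels present in both objects, on both axes:
# row 2 and columns A, B, D.
left, right = df.align(other, join="inner")

# fill_value replaces the missing entries created by an outer join.
left_filled, right_filled = df.align(other, join="outer", axis=1, fill_value=0)
```

After an inner alignment the two results share the same index and columns, so element-wise arithmetic such as left + right needs no further reindexing.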
method

where(cond, other=<no_default>, inplace=False, axis=None, level=None)

Replace values where the condition is False.

This method allows conditional replacement of values. Where the condition evaluates to True, the original values are retained; where it evaluates to False, values are replaced with corresponding entries from other.

Parameters
  • cond (bool Series/DataFrame, array-like, or callable) Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn't check it).
  • other (scalar, Series/DataFrame, or callable) Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn't check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).
  • inplace (bool, default False) Whether to perform the operation in place on the data.
  • axis (int, default None) Alignment axis if needed. For Series this parameter is unused and defaults to 0.
  • level (int, default None) Alignment level if needed.
Returns (Series or DataFrame)

When applied to a Series, the function will return a Series, and when applied to a DataFrame, it will return a DataFrame.

See Also

DataFrame.mask : Return an object of same shape as caller.
Series.mask : Return an object of same shape as caller.

Notes

The where method is an application of the if-then idiom. For each element in the caller, if cond is True the element is used; otherwise the corresponding element from other is used. If the axis of other does not align with axis of cond Series/DataFrame, the values of cond on misaligned index positions will be filled with False.

The signature for Series.where or DataFrame.where differs from numpy.where. Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in the indexing user guide (indexing.where_mask).

The dtype of the object takes precedence. The fill value is cast to the object's dtype, if this can be done losslessly.

Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))
>>> t = pd.Series([True, False])
>>> s.where(t, 99)
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=["A", "B"])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
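
The parameter descriptions mention that both cond and other may be callables, which the examples for where do not show. The sketch below illustrates that form and compares it with the equivalent call using precomputed arguments. Variable names are illustrative.

```python
import pandas as pd

s = pd.Series(range(5))

# cond and other given as callables; each receives the Series itself.
flipped = s.where(lambda x: x % 2 == 0, lambda x: -x)

# The same result with the condition and replacement precomputed.
explicit = s.where(s % 2 == 0, -s)

same = flipped.equals(explicit)
```

The callable form is convenient in method chains, where the intermediate Series has no name to refer to.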
method

mask(cond, other=<no_default>, inplace=False, axis=None, level=None)

Replace values where the condition is True.

Parameters
  • cond (bool Series/DataFrame, array-like, or callable) Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn't check it).
  • other (scalar, Series/DataFrame, or callable) Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn't check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).
  • inplace (bool, default False) Whether to perform the operation in place on the data.
  • axis (int, default None) Alignment axis if needed. For Series this parameter is unused and defaults to 0.
  • level (int, default None) Alignment level if needed.
Returns (Series or DataFrame)

When applied to a Series, the function will return a Series, and when applied to a DataFrame, it will return a DataFrame.

See Also

DataFrame.where : Return an object of same shape as caller.
Series.where : Return an object of same shape as caller.

Notes

The mask method is an application of the if-then idiom. For each element in the caller, if cond is False the element is used; otherwise the corresponding element from other is used. If the axis of other does not align with axis of cond Series/DataFrame, the values of cond on misaligned index positions will be filled with True.

The signature for :func:Series.where or :func:DataFrame.where differs from :func:numpy.where. Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in :ref:indexing <indexing.where_mask>.

The dtype of the object takes precedence. The fill value is cast to the object's dtype, if this can be done losslessly.

Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))
>>> t = pd.Series([True, False])
>>> s.where(t, 99)
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=["A", "B"])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
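
The callable forms of cond and other described under Parameters are not shown above. The following is a minimal sketch of them; the sample values and variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch).
s = pd.Series([1, 2, 3, 4, 5])

# cond as a callable: computed on s, must return a boolean Series.
# other as a callable: computed on s, supplies the replacement values.
masked = s.mask(lambda x: x % 2 == 0, lambda x: x * 10)

# where() with the inverted condition should select the same values.
same_via_where = s.where(lambda x: x % 2 != 0, lambda x: x * 10)

print(masked.tolist())
print(masked.equals(same_via_where))
```

Even values are replaced by ten times themselves, while odd values are kept.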
method

truncate(before=None, after=None, axis=None, copy=<no_default>)

Truncate a Series or DataFrame before and after some index value.

This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.

Parameters
  • before (date, str, int) Truncate all rows before this index value.
  • after (date, str, int) Truncate all rows after this index value.
  • axis ({0 or 'index', 1 or 'columns'}, optional) Axis to truncate. Truncates the index (rows) by default. For Series this parameter is unused and defaults to 0.
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    .. deprecated:: 3.0.0
    This keyword is ignored and will be removed in pandas 4.0. Since
    pandas 3.0, this method always returns a new object using a lazy
    copy mechanism that defers copies until necessary
    (Copy-on-Write). See the `user guide on Copy-on-Write
    <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html>`__
    for more details.
    
Returns (type of caller)

The truncated Series or DataFrame.

See Also

DataFrame.loc : Select a subset of a DataFrame by label.
DataFrame.iloc : Select a subset of a DataFrame by position.

Notes

If the index being truncated contains only datetime values, before and after may be specified as strings instead of Timestamps.

Examples
>>> df = pd.DataFrame(
...     {
...         "A": ["a", "b", "c", "d", "e"],
...         "B": ["f", "g", "h", "i", "j"],
...         "C": ["k", "l", "m", "n", "o"],
...     },
...     index=[1, 2, 3, 4, 5],
... )
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o
>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n

The columns of a DataFrame can be truncated.

>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j

For Series, only rows can be truncated.

>>> df["A"].truncate(before=2, after=4)
2    b
3    c
4    d
Name: A, dtype: str

The index values in truncate can be datetimes or string dates.

>>> dates = pd.date_range("2016-01-01", "2016-02-01", freq="s")
>>> df = pd.DataFrame(index=dates, data={"A": 1})
>>> df.tail()
                     A
2016-01-31 23:59:56  1
2016-01-31 23:59:57  1
2016-01-31 23:59:58  1
2016-01-31 23:59:59  1
2016-02-01 00:00:00  1
>>> df.truncate(
...     before=pd.Timestamp("2016-01-05"), after=pd.Timestamp("2016-01-10")
... ).tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1

Because the index is a DatetimeIndex containing only dates, we can specify before and after as strings. They will be coerced to Timestamps before truncation.

>>> df.truncate("2016-01-05", "2016-01-10").tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1

Note that truncate assumes a 0 value for any unspecified time component (midnight). This differs from partial string slicing, which returns any partially matching dates.

>>> df.loc["2016-01-05":"2016-01-10", :].tail()
                     A
2016-01-10 23:59:55  1
2016-01-10 23:59:56  1
2016-01-10 23:59:57  1
2016-01-10 23:59:58  1
2016-01-10 23:59:59  1
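
The description calls truncate a shorthand for boolean indexing on index values. As a small sketch of that equivalence on a sorted integer index (the frame and the names by_mask and by_loc are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["a", "b", "c", "d", "e"]},
    index=[1, 2, 3, 4, 5],
)

truncated = df.truncate(before=2, after=4)

# The same selection written as an explicit boolean mask on the index.
by_mask = df[(df.index >= 2) & (df.index <= 4)]

# Label-based slicing with .loc is inclusive on both ends as well.
by_loc = df.loc[2:4]

print(truncated.equals(by_mask), truncated.equals(by_loc))
```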
method

tz_convert(tz, axis=0, level=None, copy=<no_default>)

Convert tz-aware axis to target time zone.

Parameters
  • tz (str or tzinfo object or None) Target time zone. Passing None will convert to UTC and remove the timezone information.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) The axis to convert.
  • level (int, str, default None) If axis is a MultiIndex, convert a specific level. Otherwise must be None.
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    .. deprecated:: 3.0.0
    This keyword is ignored and will be removed in pandas 4.0. Since
    pandas 3.0, this method always returns a new object using a lazy
    copy mechanism that defers copies until necessary
    (Copy-on-Write). See the `user guide on Copy-on-Write
    <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html>`__
    for more details.
    
Returns (Series/DataFrame)

Object with time zone converted axis.

Raises
  • TypeError If the axis is tz-naive.
See Also

DataFrame.tz_localize: Localize tz-naive index of DataFrame to target time zone.
Series.tz_localize: Localize tz-naive index of Series to target time zone.

Examples

Change to another time zone:

>>> s = pd.Series(
...     [1],
...     index=pd.DatetimeIndex(["2018-09-15 01:30:00+02:00"]),
... )
>>> s.tz_convert("Asia/Shanghai")
2018-09-15 07:30:00+08:00    1
dtype: int64

Pass None to convert to UTC and get a tz-naive index:

>>> s = pd.Series([1], index=pd.DatetimeIndex(["2018-09-15 01:30:00+02:00"]))
>>> s.tz_convert(None)
2018-09-14 23:30:00    1
dtype: int64
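
The sketch below illustrates two points from this entry: conversion changes only the wall-clock labels, not the instants they denote, and a tz-naive axis raises TypeError until it is localized. Variable names and the choice of target zones are assumptions for illustration.

```python
import pandas as pd

aware = pd.Series(
    [1],
    index=pd.DatetimeIndex(["2018-09-15 01:30:00+02:00"]),
)
converted = aware.tz_convert("Asia/Shanghai")

# Conversion changes the wall-clock labels, not the underlying instants.
same_instant = (
    converted.index.tz_convert("UTC") == aware.index.tz_convert("UTC")
).all()

# A tz-naive index cannot be converted; it has to be localized first.
naive = pd.Series([1], index=pd.DatetimeIndex(["2018-09-15 01:30:00"]))
try:
    naive.tz_convert("UTC")
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Localize first, then convert.
localized_then_converted = naive.tz_localize("UTC").tz_convert("Asia/Shanghai")

print(same_instant, raised_type_error)
print(localized_then_converted.index[0])
```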
method

tz_localize(tz, axis=0, level=None, copy=<no_default>, ambiguous='raise', nonexistent='raise')

Localize time zone naive index of a Series or DataFrame to target time zone.

This operation localizes the Index. To localize the values in a time zone naive Series, use :meth:Series.dt.tz_localize.

Parameters
  • tz (str or tzinfo or None) Time zone to localize. Passing None will remove the time zone information and preserve local time.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) The axis to localize.
  • level (int, str, default None) If axis is a MultiIndex, localize a specific level. Otherwise must be None.
  • copy (bool, default False) This keyword is now ignored; changing its value will have no impact on the method.
    .. deprecated:: 3.0.0
    This keyword is ignored and will be removed in pandas 4.0. Since
    pandas 3.0, this method always returns a new object using a lazy
    copy mechanism that defers copies until necessary
    (Copy-on-Write). See the `user guide on Copy-on-Write
    <https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html>`__
    for more details.
    
  • ambiguous ('infer', bool, bool-ndarray, 'NaT', default 'raise') When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
    • 'infer' will attempt to infer fall dst-transition hours based on order
    • bool (or bool-ndarray) where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
    • 'NaT' will return NaT where there are ambiguous times
    • 'raise' will raise a ValueError if there are ambiguous times.
  • nonexistent (str, default 'raise') A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST. Valid values are:
    • 'shift_forward' will shift the nonexistent time forward to the closest existing time
    • 'shift_backward' will shift the nonexistent time backward to the closest existing time
    • 'NaT' will return NaT where there are nonexistent times
    • timedelta objects will shift nonexistent times by the timedelta
    • 'raise' will raise a ValueError if there are nonexistent times.
Returns (Series/DataFrame)

Same type as the input, with time zone naive or aware index, depending on tz.

Raises
  • TypeError If the TimeSeries is tz-aware and tz is not None.
See Also

Series.dt.tz_localize: Localize the values in a time zone naive Series.
Timestamp.tz_localize: Localize the Timestamp to a timezone.

Examples

Localize local times:

>>> s = pd.Series(
...     [1],
...     index=pd.DatetimeIndex(["2018-09-15 01:30:00"]),
... )
>>> s.tz_localize("CET")
2018-09-15 01:30:00+02:00    1
dtype: int64

Pass None to convert to tz-naive index and preserve local time:

>>> s = pd.Series([1], index=pd.DatetimeIndex(["2018-09-15 01:30:00+02:00"]))
>>> s.tz_localize(None)
2018-09-15 01:30:00    1
dtype: int64

Be careful with DST changes. When there is sequential data, pandas can infer the DST time:

>>> s = pd.Series(
...     range(7),
...     index=pd.DatetimeIndex(
...         [
...             "2018-10-28 01:30:00",
...             "2018-10-28 02:00:00",
...             "2018-10-28 02:30:00",
...             "2018-10-28 02:00:00",
...             "2018-10-28 02:30:00",
...             "2018-10-28 03:00:00",
...             "2018-10-28 03:30:00",
...         ]
...     ),
... )
>>> s.tz_localize("CET", ambiguous="infer")
2018-10-28 01:30:00+02:00    0
2018-10-28 02:00:00+02:00    1
2018-10-28 02:30:00+02:00    2
2018-10-28 02:00:00+01:00    3
2018-10-28 02:30:00+01:00    4
2018-10-28 03:00:00+01:00    5
2018-10-28 03:30:00+01:00    6
dtype: int64

In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous parameter to set the DST explicitly:

>>> s = pd.Series(
...     range(3),
...     index=pd.DatetimeIndex(
...         [
...             "2018-10-28 01:20:00",
...             "2018-10-28 02:36:00",
...             "2018-10-28 03:46:00",
...         ]
...     ),
... )
>>> s.tz_localize("CET", ambiguous=np.array([True, True, False]))
2018-10-28 01:20:00+02:00    0
2018-10-28 02:36:00+02:00    1
2018-10-28 03:46:00+01:00    2
dtype: int64

If the DST transition causes nonexistent times, you can shift these dates forward or backward with a timedelta object or 'shift_forward' or 'shift_backward'.

>>> dti = pd.DatetimeIndex(
...     ["2015-03-29 02:30:00", "2015-03-29 03:30:00"], dtype="M8[ns]"
... )
>>> s = pd.Series(range(2), index=dti)
>>> s.tz_localize("Europe/Warsaw", nonexistent="shift_forward")
2015-03-29 03:00:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
>>> s.tz_localize("Europe/Warsaw", nonexistent="shift_backward")
2015-03-29 01:59:59.999999999+01:00    0
2015-03-29 03:30:00+02:00              1
dtype: int64
>>> s.tz_localize("Europe/Warsaw", nonexistent=pd.Timedelta("1h"))
2015-03-29 03:30:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
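
The 'NaT' option of ambiguous is not shown above. As a hedged sketch, the snippet below localizes three wall-clock times around the autumn DST change and marks the ambiguous one as NaT; the zone name "Europe/Berlin" and the variable names are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(
    range(3),
    index=pd.DatetimeIndex(
        [
            "2018-10-28 01:30:00",
            "2018-10-28 02:30:00",  # occurs twice on this date in Europe/Berlin
            "2018-10-28 03:30:00",
        ]
    ),
)

# Mark ambiguous wall-clock times as NaT instead of raising.
localized = s.tz_localize("Europe/Berlin", ambiguous="NaT")
is_missing = localized.index.isna().tolist()

# Drop the rows whose index label could not be resolved.
resolved = localized[localized.index.notna()]

print(is_missing, len(resolved))
```

Only the 02:30 entry is ambiguous, so only that index label becomes NaT.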
method

describe(percentiles=None, include=None, exclude=None)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters
  • percentiles (list-like of numbers, optional) The percentiles to include in the output. All should fall between 0 and 1. The default, None, will automatically return the 25th, 50th, and 75th percentiles.
  • include ('all', list-like of dtypes or None (default), optional) A white list of data types to include in the result. Ignored for Series. Here are the options:
    • 'all' : All columns of the input will be included in the output.
    • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
    • None (default) : The result will include all numeric columns.
  • exclude (list-like of dtypes or None (default), optional) A black list of data types to omit from the result. Ignored for Series. Here are the options:
    • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
    • None (default) : The result will exclude nothing.
Returns (Series or DataFrame)

Summary statistics of the Series or DataFrame provided.

See Also

DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result's index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings), the result's index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value's frequency.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(["a", "a", "b", "c"])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series(
...     [
...         np.datetime64("2000-01-01"),
...         np.datetime64("2010-01-01"),
...         np.datetime64("2010-01-01"),
...     ]
... )
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame(
...     {
...         "categorical": pd.Categorical(["d", "e", "f"]),
...         "numeric": [1, 2, 3],
...         "object": ["a", "b", "c"],
...     }
... )
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include="all")  # doctest: +SKIP
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  # doctest: +SKIP
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=["category"])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  # doctest: +SKIP
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  # doctest: +SKIP
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
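
None of the examples above pass percentiles. The following minimal sketch requests the 10th and 90th percentiles as fractions between 0 and 1; the data and the variable name summary are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(range(1, 101))

# Percentiles are given as fractions between 0 and 1.
summary = s.describe(percentiles=[0.1, 0.9])

# The requested percentiles appear in the result index as "10%" and "90%".
print(summary.loc[["10%", "90%"]])
```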
method

pct_change(periods=1, fill_method=None, freq=None, **kwargs)

Fractional change between the current and a prior element.

Computes the fractional change from the immediately previous row by default. This is useful in comparing the fraction of change in a time series of elements.

.. note::

Despite the name of this method, it calculates fractional change
(also known as per unit change or relative change) and not
percentage change. If you need the percentage change, multiply
these values by 100.
Parameters
  • periods (int, default 1) Periods to shift for forming percent change.
  • fill_method (None) Must be None. This argument will be removed in a future version of pandas.
  • freq (DateOffset, timedelta, or str, optional) Increment to use from time series API (e.g. 'ME' or BDay()).
  • **kwargs Additional keyword arguments are passed into DataFrame.shift or Series.shift.
Returns (Series or DataFrame)

The same type as the calling object.

See Also

Series.diff : Compute the difference of two elements in a Series.
DataFrame.diff : Compute the difference of two elements in a DataFrame.
Series.shift : Shift the index by some number of periods.
DataFrame.shift : Shift the index by some number of periods.

Examples

Series

>>> s = pd.Series([90, 91, 85])
>>> s
0    90
1    91
2    85
dtype: int64
>>> s.pct_change()
0         NaN
1    0.011111
2   -0.065934
dtype: float64
>>> s.pct_change(periods=2)
0         NaN
1         NaN
2   -0.055556
dtype: float64

Percentage change in a Series after forward-filling NAs with the last valid observation.

>>> s = pd.Series([90, 91, None, 85])
>>> s
0    90.0
1    91.0
2     NaN
3    85.0
dtype: float64
>>> s.ffill().pct_change()
0         NaN
1    0.011111
2    0.000000
3   -0.065934
dtype: float64

DataFrame

Percentage change in French franc, Deutsche Mark, and Italian lira from 1980-01-01 to 1980-03-01.

>>> df = pd.DataFrame(
...     {
...         "FR": [4.0405, 4.0963, 4.3149],
...         "GR": [1.7246, 1.7482, 1.8519],
...         "IT": [804.74, 810.01, 860.13],
...     },
...     index=["1980-01-01", "1980-02-01", "1980-03-01"],
... )
>>> df
                FR      GR      IT
1980-01-01  4.0405  1.7246  804.74
1980-02-01  4.0963  1.7482  810.01
1980-03-01  4.3149  1.8519  860.13
>>> df.pct_change()
                  FR        GR        IT
1980-01-01       NaN       NaN       NaN
1980-02-01  0.013810  0.013684  0.006549
1980-03-01  0.053365  0.059318  0.061876

Percentage of change in GOOG and APPL stock volume. Shows computing the percentage change between columns.

>>> df = pd.DataFrame(
...     {
...         "2016": [1769950, 30586265],
...         "2015": [1500923, 40912316],
...         "2014": [1371819, 41403351],
...     },
...     index=["GOOG", "APPL"],
... )
>>> df
          2016      2015      2014
GOOG   1769950   1500923   1371819
APPL  30586265  40912316  41403351
>>> df.pct_change(axis="columns", periods=-1)
          2016      2015  2014
GOOG  0.179241  0.094112   NaN
APPL -0.252395 -0.011860   NaN
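
As the note at the top of this entry says, the result is a fraction rather than a percentage. The sketch below writes the calculation out by hand with shift and then scales it by 100; the variable names are assumptions for illustration.

```python
import pandas as pd

s = pd.Series([90, 91, 85])

fractional = s.pct_change()

# The same quantity written out: (current - previous) / previous.
manual = s / s.shift(1) - 1

# Multiply by 100 to express the change as a percentage.
percent = fractional * 100

print(percent.round(2).tolist())
```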
method

rolling(window, min_periods=None, center=False, win_type=None, on=None, closed=None, step=None, method='single')

Provide rolling window calculations.

Parameters
  • window (int, timedelta, str, offset, or BaseIndexer subclass) Interval of the moving window.
    If an integer, the delta between the start and end of each window. The number of points in the window depends on the closed argument.
    If a timedelta, str, or offset, the time period of each window. Each window will be variable-sized, based on the observations included in the time period. This is only valid for datetimelike indexes. To learn more about the offsets & frequency strings, please see :ref:this link <timeseries.offset_aliases>.
    If a BaseIndexer subclass, the window boundaries based on the defined get_window_bounds method. Additional rolling keyword arguments, namely min_periods, center, closed and step will be passed to get_window_bounds.
  • min_periods (int, default None) Minimum number of observations in window required to have a value; otherwise, result is np.nan.
    For a window that is specified by an offset, min_periods will default to 1.
    For a window that is specified by an integer, min_periods will default to the size of the window.
  • center (bool, default False) If False, set the window labels as the right edge of the window index.
    If True, set the window labels as the center of the window index.
  • win_type (str, default None) If None, all points are evenly weighted.
    If a string, it must be a valid scipy.signal window function (https://docs.scipy.org/doc/scipy/reference/signal.windows.html#module-scipy.signal.windows).
    Certain Scipy window types require additional parameters to be passed in the aggregation function. The additional parameters must match the keywords specified in the Scipy window type method signature.
  • on (str, optional) For a DataFrame, a column label or Index level on which to calculate the rolling window, rather than the DataFrame's index.
    Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window.
  • closed (str, default None) Determines the inclusivity of points in the window
    If 'right', uses the window (first, last] meaning the last point is included in the calculations.
    If 'left', uses the window [first, last) meaning the first point is included in the calculations.
    If 'both', uses the window [first, last] meaning all points in the window are included in the calculations.
    If 'neither', uses the window (first, last) meaning the first and last points in the window are excluded from calculations.
    () and [] refer to open and closed set notation, respectively.
    Default None ('right').
  • step (int, default None) Evaluate the window at every step result, equivalent to slicing as [::step]. window must be an integer. Using a step argument other than None or 1 will produce a result with a different shape than the input.
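  • method (str {'single', 'table'}, default 'single') Execute the rolling operation per single column or row ('single') or over the entire object ('table').
    This argument is only implemented when specifying engine='numba' in the method call.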
Returns (pandas.api.typing.Window or pandas.api.typing.Rolling)

An instance of Window is returned if win_type is passed. Otherwise, an instance of Rolling is returned.

See Also

expanding : Provides expanding transformations.
ewm : Provides exponential weighted functions.

Notes

See :ref:Windowing Operations <window.generic> for further usage details and examples.

Examples
>>> df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0

window

Rolling sum with a window length of 2 observations.

>>> df.rolling(2).sum()
     B
0  NaN
1  1.0
2  3.0
3  NaN
4  NaN

Rolling sum with a window span of 2 seconds.

>>> df_time = pd.DataFrame(
...     {"B": [0, 1, 2, np.nan, 4]},
...     index=[
...         pd.Timestamp("20130101 09:00:00"),
...         pd.Timestamp("20130101 09:00:02"),
...         pd.Timestamp("20130101 09:00:03"),
...         pd.Timestamp("20130101 09:00:05"),
...         pd.Timestamp("20130101 09:00:06"),
...     ],
... )
>>> df_time
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0
>>> df_time.rolling("2s").sum()
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0

Rolling sum with forward looking windows with 2 observations.

>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
>>> df.rolling(window=indexer, min_periods=1).sum()
     B
0  1.0
1  3.0
2  2.0
3  4.0
4  4.0

min_periods

Rolling sum with a window length of 2 observations, but only needs a minimum of 1 observation to calculate a value.

>>> df.rolling(2, min_periods=1).sum()
     B
0  0.0
1  1.0
2  3.0
3  2.0
4  4.0

center

Rolling sum with the result assigned to the center of the window index.

>>> df.rolling(3, min_periods=1, center=True).sum()
     B
0  1.0
1  3.0
2  3.0
3  6.0
4  4.0
>>> df.rolling(3, min_periods=1, center=False).sum()
     B
0  0.0
1  1.0
2  3.0
3  3.0
4  6.0

step

Rolling sum with a window length of 2 observations, minimum of 1 observation to calculate a value, and a step of 2.

>>> df.rolling(2, min_periods=1, step=2).sum()
     B
0  0.0
2  3.0
4  4.0
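
The step parameter is described as equivalent to slicing the full result with [::step]. A small sketch of that equivalence, reusing the frame from the examples above (the names stepped and sliced are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})

stepped = df.rolling(2, min_periods=1, step=2).sum()

# Computing every window and then keeping every second row gives the same rows.
sliced = df.rolling(2, min_periods=1).sum()[::2]

print(stepped["B"].tolist())
print(sliced["B"].tolist())
```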

win_type

Rolling sum with a window length of 2, using the Scipy 'gaussian' window type. std is required in the aggregation function.

>>> df.rolling(2, win_type="gaussian").sum(std=3)
          B
0       NaN
1  0.986207
2  2.958621
3       NaN
4       NaN

on

Rolling sum with a window length of 2 days.

>>> df = pd.DataFrame(
...     {
...         "A": [
...             pd.to_datetime("2020-01-01"),
...             pd.to_datetime("2020-01-01"),
...             pd.to_datetime("2020-01-02"),
...         ],
...         "B": [1, 2, 3],
...     },
...     index=pd.date_range("2020", periods=3),
... )
>>> df
                    A  B
2020-01-01 2020-01-01  1
2020-01-02 2020-01-01  2
2020-01-03 2020-01-02  3
>>> df.rolling("2D", on="A").sum()
                    A    B
2020-01-01 2020-01-01  1.0
2020-01-02 2020-01-01  3.0
2020-01-03 2020-01-02  6.0
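
The closed parameter has no example above. The following sketch compares three of its settings on a 2-second window over irregular timestamps; the frame and the variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": 1},
    index=[
        pd.Timestamp("20130101 09:00:01"),
        pd.Timestamp("20130101 09:00:02"),
        pd.Timestamp("20130101 09:00:03"),
        pd.Timestamp("20130101 09:00:04"),
        pd.Timestamp("20130101 09:00:06"),
    ],
)

# (first, last]: the default for offset-based windows.
right = df.rolling("2s", closed="right")["x"].sum()

# [first, last]: both endpoints are included.
both = df.rolling("2s", closed="both")["x"].sum()

# [first, last): the current row itself is excluded.
left = df.rolling("2s", closed="left")["x"].sum()

print(pd.DataFrame({"right": right, "both": both, "left": left}))
```

With closed='left' the first window contains no observations, so its sum is NaN.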
method

expanding(min_periods=1, method='single')

Provide expanding window calculations.

An expanding window yields the value of an aggregation statistic with all the data available up to that point in time.

Parameters
  • min_periods (int, default 1) Minimum number of observations in window required to have a value; otherwise, result is np.nan.
  • method (str {'single', 'table'}, default 'single') Execute the rolling operation per single column or row ('single') or over the entire object ('table').
    This argument is only implemented when specifying engine='numba' in the method call.
Returns (pandas.api.typing.Expanding)

An instance of Expanding for further expanding window calculations, e.g. using the sum method.

See Also

rolling : Provides rolling window calculations.
ewm : Provides exponential weighted functions.

Notes

See :ref:Windowing Operations <window.expanding> for further usage details and examples.

Examples
>>> df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0

min_periods

Expanding sum with 1 vs 3 observations needed to calculate a value.

>>> df.expanding(1).sum()
     B
0  0.0
1  1.0
2  3.0
3  3.0
4  7.0
>>> df.expanding(3).sum()
     B
0  NaN
1  NaN
2  3.0
3  3.0
4  7.0
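
An expanding window aggregates all data available up to each row. As an illustrative sketch, the running mean below is rebuilt by hand from a cumulative sum and a running count; the data and variable names are assumptions.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

running_mean = s.expanding().mean()

# The same statistic built by hand: cumulative sum over cumulative count.
counts = pd.Series(range(1, len(s) + 1), index=s.index)
manual = s.cumsum() / counts

print(running_mean.tolist())
```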
method

ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, times=None, method='single')

Provide exponentially weighted (EW) calculations.

Exactly one of com, span, halflife, or alpha must be provided if times is not provided. If times is provided and adjust=True, halflife and one of com, span or alpha may be provided. If times is provided and adjust=False, halflife must be the only provided decay-specification parameter.

Parameters
  • com (float, optional) Specify decay in terms of center of mass
    :math:`\alpha = 1 / (1 + com)`, for :math:`com \geq 0`.
  • span (float, optional) Specify decay in terms of span
    :math:`\alpha = 2 / (span + 1)`, for :math:`span \geq 1`.
  • halflife (float, str, timedelta, optional) Specify decay in terms of half-life
    :math:`\alpha = 1 - \exp\left(-\ln(2) / halflife\right)`, for :math:`halflife > 0`.
    If times is specified, a timedelta convertible unit over which an observation decays to half its value. Only applicable to mean(), and halflife value will not apply to the other functions.
  • alpha (float, optional) Specify smoothing factor :math:`\alpha` directly
    :math:`0 < \alpha \leq 1`.
  • min_periods (int, default 0) Minimum number of observations in window required to have a value; otherwise, result is np.nan.
  • adjust (bool, default True) Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average).
    • When adjust=True (default), the EW function is calculated using weights :math:`w_i = (1 - \alpha)^i`. For example, the EW moving average of the series [:math:`x_0, x_1, ..., x_t`] would be:
    .. math:: y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}
    • When adjust=False, the exponentially weighted function is calculated recursively:
    .. math:: \begin{split} y_0 &= x_0 \\ y_t &= (1 - \alpha) y_{t-1} + \alpha x_t, \end{split}
  • ignore_na (bool, default False) Ignore missing values when calculating weights.
    • When ignore_na=False (default), weights are based on absolute positions. For example, the weights of :math:`x_0` and :math:`x_2` used in calculating the final weighted average of [:math:`x_0`, None, :math:`x_2`] are :math:`(1-\alpha)^2` and :math:`1` if adjust=True, and :math:`(1-\alpha)^2` and :math:`\alpha` if adjust=False.
    • When ignore_na=True, weights are based on relative positions. For example, the weights of :math:`x_0` and :math:`x_2` used in calculating the final weighted average of [:math:`x_0`, None, :math:`x_2`] are :math:`1-\alpha` and :math:`1` if adjust=True, and :math:`1-\alpha` and :math:`\alpha` if adjust=False.
  • times (np.ndarray, Series, default None) Times corresponding to the observations.
    Only applicable to mean().
  • method (str {'single', 'table'}, default 'single') Execute the rolling operation per single column or row ('single') or over the entire object ('table').
    This argument is only implemented when specifying engine='numba' in the method call.
    Only applicable to mean()
Returns (pandas.api.typing.ExponentialMovingWindow)

An instance of ExponentialMovingWindow for further exponentially weighted (EW) calculations, e.g. using the mean method.

See Also

rolling : Provides rolling window calculations.
expanding : Provides expanding transformations.

Notes

See :ref:Windowing Operations <window.exponentially_weighted> for further usage details and examples.

Examples
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0
>>> df.ewm(com=0.5).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213
>>> df.ewm(alpha=2 / 3).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213
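
The two calls above give identical results because com=0.5 and alpha=2/3 describe the same smoothing factor. A minimal sketch of the conversions, assuming the usual pandas definitions (alpha = 1 / (1 + com), and alpha = 1 - exp(-ln(2) / halflife) for a numeric halflife); the helper names are hypothetical and not part of pandas:

```python
import math


def alpha_from_com(com: float) -> float:
    """Smoothing factor for a given center of mass (com >= 0)."""
    return 1.0 / (1.0 + com)


def alpha_from_halflife(halflife: float) -> float:
    """Smoothing factor for a numeric (non-timedelta) halflife > 0."""
    return 1.0 - math.exp(-math.log(2.0) / halflife)


# com=0.5 corresponds to alpha=2/3, as in the two examples above.
alpha_for_example = alpha_from_com(0.5)
```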

adjust

>>> df.ewm(com=0.5, adjust=True).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213
>>> df.ewm(com=0.5, adjust=False).mean()
          B
0  0.000000
1  0.666667
2  1.555556
3  1.555556
4  3.650794
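
The difference between the two modes can be checked by evaluating the formulas from the adjust parameter description by hand. The sketch below does this for a series without missing values; the helper names are hypothetical and the loops favour readability over speed:

```python
import numpy as np
import pandas as pd


def ew_mean_adjusted(x, alpha):
    """adjust=True: weighted average with weights (1 - alpha)^i."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for t in range(len(x)):
        # Weight of x_i at time t is (1 - alpha)^(t - i).
        weights = (1 - alpha) ** np.arange(t, -1, -1)
        out[t] = np.dot(weights, x[: t + 1]) / weights.sum()
    return out


def ew_mean_recursive(x, alpha):
    """adjust=False: y_0 = x_0, y_t = (1 - alpha) y_{t-1} + alpha x_t."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = (1 - alpha) * out[t - 1] + alpha * x[t]
    return out


x = [0.0, 1.0, 2.0, 3.0, 4.0]
alpha = 2 / 3  # same smoothing factor as com=0.5
series = pd.Series(x)

adjusted = ew_mean_adjusted(x, alpha)
recursive = ew_mean_recursive(x, alpha)

# Compare the hand-rolled results with pandas.
matches_adjusted = np.allclose(adjusted, series.ewm(alpha=alpha, adjust=True).mean())
matches_recursive = np.allclose(recursive, series.ewm(alpha=alpha, adjust=False).mean())
```

The second values (0.75 and 0.666667) agree with the outputs shown above, since the first two observations are the same in both series.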

ignore_na

>>> df.ewm(com=0.5, ignore_na=True).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.225000
>>> df.ewm(com=0.5, ignore_na=False).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213
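
The last row differs because the missing value at position 3 either counts as a step (absolute positions) or is skipped (relative positions) when the weights are computed. A small sketch for adjust=True that reproduces only the final value; the function name is hypothetical and it assumes the last observation is not missing:

```python
import numpy as np


def final_ew_mean(values, alpha, ignore_na):
    """Final adjust=True EW mean, with weights (1 - alpha)^distance."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    if ignore_na:
        # Relative positions: only valid observations advance the counter.
        positions = np.cumsum(valid) - 1
    else:
        # Absolute positions: missing values still occupy a slot.
        positions = np.arange(len(values))
    valid_positions = positions[valid]
    distance = valid_positions[-1] - valid_positions
    weights = (1 - alpha) ** distance
    return float(np.dot(weights, values[valid]) / weights.sum())


data = [0, 1, 2, np.nan, 4]
absolute = final_ew_mean(data, alpha=2 / 3, ignore_na=False)
relative = final_ew_mean(data, alpha=2 / 3, ignore_na=True)
```

Here absolute evaluates to 345/94 and relative to 129/40, which round to the 3.670213 and 3.225000 shown above.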

times

Exponentially weighted mean with weights calculated with a timedelta halflife relative to times.

>>> times = ['2020-01-01', '2020-01-03', '2020-01-10', '2020-01-15', '2020-01-17']
>>> df.ewm(halflife='4 days', times=pd.DatetimeIndex(times)).mean()
          B
0  0.000000
1  0.585786
2  1.523889
3  1.523889
4  3.233686
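
The output above is consistent with each observation being weighted by 0.5 raised to its age divided by the halflife. The sketch below checks that reading for the last row only; it is an illustration under that assumption, not the pandas implementation, and the function name is hypothetical:

```python
import numpy as np


def time_weighted_final_mean(values, times, halflife):
    """Weighted mean at the last timestamp with weights 0.5 ** (age / halflife)."""
    values = np.asarray(values, dtype=float)
    times = np.asarray(times, dtype="datetime64[ns]")
    valid = ~np.isnan(values)
    # Dividing two timedeltas yields a plain float ratio.
    age = (times[-1] - times[valid]) / halflife
    weights = 0.5 ** age
    return float(np.dot(weights, values[valid]) / weights.sum())


observations = [0, 1, 2, np.nan, 4]
stamps = ["2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17"]
last_value = time_weighted_final_mean(observations, stamps, np.timedelta64(4, "D"))
```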

Return index for first non-missing value or None, if no value is found.

See the :ref:User Guide <missing_data> for more information on which values are considered missing.

Returns (type of index)

Index of first non-missing value.

See Also

DataFrame.last_valid_index : Return index for last non-NA value or None, if no non-NA value is found.
Series.last_valid_index : Return index for last non-NA value or None, if no non-NA value is found.
DataFrame.isna : Detect missing values.

Examples

For Series:

>>> s = pd.Series([None, 3, 4])
>>> s.first_valid_index()
1
>>> s.last_valid_index()
2
>>> s = pd.Series([None, None])
>>> print(s.first_valid_index())
None
>>> print(s.last_valid_index())
None

If all elements in Series are NA/null, returns None.

>>> s = pd.Series()
>>> print(s.first_valid_index())
None
>>> print(s.last_valid_index())
None

If Series is empty, returns None.

For DataFrame:

>>> df = pd.DataFrame({"A": [None, None, 2], "B": [None, 3, 4]})
>>> df
     A      B
0  NaN    NaN
1  NaN    3.0
2  2.0    4.0
>>> df.first_valid_index()
1
>>> df.last_valid_index()
2
>>> df = pd.DataFrame({"A": [None, None, None], "B": [None, None, None]})
>>> df
     A      B
0  None   None
1  None   None
2  None   None
>>> print(df.first_valid_index())
None
>>> print(df.last_valid_index())
None

If all elements in DataFrame are NA/null, returns None.

>>> df = pd.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []
>>> print(df.first_valid_index())
None
>>> print(df.last_valid_index())
None

If DataFrame is empty, returns None.

Return index for last non-missing value or None, if no value is found.

See the :ref:User Guide <missing_data> for more information on which values are considered missing.

Returns (type of index)

Index of last non-missing value.

See Also

DataFrame.first_valid_index : Return index for first non-NA value or None, if no non-NA value is found.
Series.first_valid_index : Return index for first non-NA value or None, if no non-NA value is found.
DataFrame.isna : Detect missing values.

Examples

For Series:

>>> s = pd.Series([None, 3, 4])
>>> s.first_valid_index()
1
>>> s.last_valid_index()
2
>>> s = pd.Series([None, None])
>>> print(s.first_valid_index())
None
>>> print(s.last_valid_index())
None

If all elements in Series are NA/null, returns None.

>>> s = pd.Series()
>>> print(s.first_valid_index())
None
>>> print(s.last_valid_index())
None

If Series is empty, returns None.

For DataFrame:

>>> df = pd.DataFrame({"A": [None, None, 2], "B": [None, 3, 4]})
>>> df
     A      B
0  NaN    NaN
1  NaN    3.0
2  2.0    4.0
>>> df.first_valid_index()
1
>>> df.last_valid_index()
2
>>> df = pd.DataFrame({"A": [None, None, None], "B": [None, None, None]})
>>> df
     A      B
0  None   None
1  None   None
2  None   None
>>> print(df.first_valid_index())
None
>>> print(df.last_valid_index())
None

If all elements in DataFrame are NA/null, returns None.

>>> df = pd.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []
>>> print(df.first_valid_index())
None
>>> print(df.last_valid_index())
None

If DataFrame is empty, returns None.
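
A common use of the two methods together is trimming leading and trailing missing rows. The helper below is a hypothetical sketch that assumes a unique, sorted index so that label-based slicing is well defined:

```python
import pandas as pd


def trim_missing_edges(obj):
    """Drop leading/trailing rows that contain only missing values."""
    first = obj.first_valid_index()
    last = obj.last_valid_index()
    if first is None:
        # Everything is missing (or obj is empty): return an empty slice.
        return obj.iloc[0:0]
    # .loc slicing by label includes both endpoints.
    return obj.loc[first:last]


s = pd.Series([None, 3.0, None, 4.0, None])
trimmed = trim_missing_edges(s)
```

Missing values between the first and last valid entries are kept; only the edges are removed.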

classmethod

create(value)

Create a channel from a list.

The second dimension is identified by tuples. If all elements are tuples, a channel is created directly. Otherwise, the elements are first converted to tuples and the channel is then created from them.

Examples
>>> Channel.create([1, 2, 3]) # 3 rows, 1 column
>>> Channel.create([(1,2,3)]) # 1 row, 3 columns
Parameters
  • value (Union) The value to create a channel from
Returns (DataFrame)

A channel (dataframe)
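
The row/column rule described above can be mimicked with plain pandas. The sketch below is not pipen's implementation; create_like is a hypothetical stand-in that only illustrates how non-tuple elements become one-column rows:

```python
import pandas as pd


def create_like(value):
    """Build a DataFrame where each list element is one row."""
    # Non-tuple elements are wrapped into 1-tuples, i.e. single-column rows.
    rows = [item if isinstance(item, tuple) else (item,) for item in value]
    return pd.DataFrame(rows)


tall = create_like([1, 2, 3])      # 3 rows, 1 column
wide = create_like([(1, 2, 3)])    # 1 row, 3 columns
```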

classmethod

from_glob(pattern, ftype='any', sortby='name', reverse=False)

Create a channel with a glob pattern

Parameters
  • ftype (str, optional) The file type, one of any, link, dir and file
  • sortby (str, optional) How the files should be sorted. One of name, mtime and size
  • reverse (bool, optional) Whether to sort them in reverse order.
Returns (DataFrame)

The channel
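
To make the ftype, sortby and reverse options concrete, here is a standard-library sketch of the behaviour the parameters describe. glob_channel is a hypothetical helper, not pipen's code, and the temporary files exist only for the demonstration:

```python
import glob
import os
import tempfile

import pandas as pd

# File-type filters and sort keys matching the documented option names.
TYPE_CHECKS = {
    "any": lambda path: True,
    "link": os.path.islink,
    "dir": os.path.isdir,
    "file": os.path.isfile,
}
SORT_KEYS = {
    "name": lambda path: path,
    "mtime": os.path.getmtime,
    "size": os.path.getsize,
}


def glob_channel(pattern, ftype="any", sortby="name", reverse=False):
    """Return a one-column DataFrame of paths matching the glob pattern."""
    paths = [p for p in glob.glob(pattern) if TYPE_CHECKS[ftype](p)]
    paths.sort(key=SORT_KEYS[sortby], reverse=reverse)
    return pd.DataFrame(paths)


with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("b.txt", 10), ("a.txt", 3), ("c.log", 1)]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("x" * size)
    pattern = os.path.join(tmp, "*.txt")
    by_name = [os.path.basename(p) for p in glob_channel(pattern)[0]]
    by_size_desc = [
        os.path.basename(p)
        for p in glob_channel(pattern, ftype="file", sortby="size", reverse=True)[0]
    ]
```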

classmethod

a_from_glob(pattern, ftype='any', sortby='name', reverse=False)

Create a channel with a glob pattern asynchronously

Parameters
  • pattern (str) The glob pattern, supported: "dir1/dir2/*.txt"
  • ftype (str, optional) The file type, one of any, link, dir and file
  • sortby (str, optional) How the files should be sorted. One of name, mtime and size
  • reverse (bool, optional) Whether to sort them in reverse order.
Returns (DataFrame)

The channel

classmethod

from_pairs(pattern, ftype='any', sortby='name', reverse=False)

Create a width=2 channel with a glob pattern

Parameters
  • ftype (str, optional) The file type, one of any, link, dir and file
  • sortby (str, optional) How the files should be sorted. One of name, mtime and size
  • reverse (bool, optional) Whether to sort them in reverse order.
Returns (DataFrame)

The channel
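
One plausible reading of "width=2" is that the sorted matches are grouped two at a time, one pair per row. The sketch below illustrates that assumption only; pair_up is a hypothetical helper and the file names are made up:

```python
import pandas as pd


def pair_up(paths):
    """Group an even-length list of paths into rows of two columns."""
    if len(paths) % 2 != 0:
        raise ValueError("An even number of paths is required to form pairs.")
    rows = [tuple(paths[i : i + 2]) for i in range(0, len(paths), 2)]
    return pd.DataFrame(rows)


# Hypothetical, already-sorted file names.
pairs = pair_up(["a_1.fq", "a_2.fq", "b_1.fq", "b_2.fq"])
```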

classmethod

a_from_pairs(pattern, ftype='any', sortby='name', reverse=False)

Create a width=2 channel with a glob pattern asynchronously

Parameters
  • ftype (str, optional) The file type, one of any, link, dir and file
  • sortby (str, optional) How the files should be sorted. One of name, mtime and size
  • reverse (bool, optional) Whether to sort them in reverse order.
Returns (DataFrame)

The channel

classmethod

from_csv(*args, **kwargs)

Create a channel from a csv file

Uses pandas.read_csv() to create a channel

Parameters
  • *args Positional arguments passed to pandas.read_csv()
  • **kwargs Keyword arguments passed to pandas.read_csv()
classmethod

from_excel(*args, **kwargs)

Create a channel from an excel file.

Uses pandas.read_excel() to create a channel

Parameters
  • *args Positional arguments passed to pandas.read_excel()
  • **kwargs Keyword arguments passed to pandas.read_excel()
classmethod

from_table(*args, **kwargs)

Create a channel from a table file.

Uses pandas.read_table() to create a channel

Parameters
  • *args Positional arguments passed to pandas.read_table()
  • **kwargs Keyword arguments passed to pandas.read_table()
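
Because these creators are described as thin wrappers around the pandas readers, the underlying pandas calls give a fair idea of the resulting frame. The sketch below uses pandas directly with made-up in-memory data; with pipen installed, Channel.from_csv and Channel.from_table would presumably accept the same arguments:

```python
import io

import pandas as pd

# Hypothetical sample sheet, first as CSV and then as a tab-separated table.
csv_text = "sample,path\ns1,/data/s1.txt\ns2,/data/s2.txt\n"
tsv_text = csv_text.replace(",", "\t")

csv_frame = pd.read_csv(io.StringIO(csv_text))
table_frame = pd.read_table(io.StringIO(tsv_text))  # default separator is a tab
```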
function

pipen.channel.expand_dir(data, col=0, pattern='*', ftype='any', sortby='name', reverse=False)

Expand a Channel according to the files in the directory given in column col; other columns keep the same values.

This is only applicable to a 1-row channel.

Examples
>>> ch = channel.create([('./', 1)])
>>> ch >> expand()
[['./a', 1], ['./b', 1], ['./c', 1]]
Parameters
  • col (str | int, optional) The index or name of the column used to expand
  • pattern (str, optional) A pattern used to filter the files/dirs, default: *
  • ftype (str, optional) The type of the files/dirs to include: 'dir', 'file', 'link' or 'any' (default)
  • sortby (str, optional) How the list is sorted: 'name' (default), 'mtime' or 'size'
  • reverse (bool, optional) Whether to reverse the sort order.
Returns (DataFrame)

The expanded channel
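
The expansion can be pictured as repeating the single row once per matching file and replacing the directory with each file path. The following pandas sketch illustrates that idea under simplifying assumptions (sorting by name only, no ftype filter); expand_like is a hypothetical helper, not the pipen function:

```python
import glob
import os
import tempfile

import pandas as pd


def expand_like(data, col=0, pattern="*"):
    """Expand a 1-row DataFrame over the files found in the directory in `col`."""
    if len(data) != 1:
        raise ValueError("Only a 1-row frame can be expanded.")
    # Accept either a positional index or a column name.
    name = data.columns[col] if isinstance(col, int) else col
    directory = data[name].iloc[0]
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    # Repeat the only row once per file, then overwrite the expanded column.
    expanded = data.iloc[[0] * len(files)].reset_index(drop=True)
    expanded[name] = files
    return expanded


with tempfile.TemporaryDirectory() as tmp:
    for filename in ("a", "b", "c"):
        open(os.path.join(tmp, filename), "w").close()
    channel_frame = pd.DataFrame([(tmp, 1)])
    result = expand_like(channel_frame)
    expanded_names = [os.path.basename(p) for p in result[0]]
    other_column = result[1].tolist()
```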

function

pipen.channel.collapse_files(data, col=0)

Collapse a Channel according to the files in column col; other columns use the values in row 0.

Note that values in the other rows will be discarded.

Examples
>>> ch = channel.create([['./a', 1], ['./b', 1], ['./c', 1]])
>>> ch >> collapse()
[['.', 1]]
Parameters
  • data (DataFrame) The original channel
  • col (str | int, optional) the index or name of the column used to collapse on
Returns (DataFrame)

The collapsed channel
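
The example above suggests that the collapsed value is the directory shared by all paths in the column. The sketch below follows that assumption using os.path.commonpath; collapse_like is a hypothetical helper rather than pipen's implementation:

```python
import os

import pandas as pd


def collapse_like(data, col=0):
    """Collapse rows to one, replacing `col` with the paths' common directory."""
    # Accept either a positional index or a column name.
    name = data.columns[col] if isinstance(col, int) else col
    paths = [str(path) for path in data[name]]
    # commonpath returns '' for relative paths with no shared parent; use '.'.
    common = os.path.commonpath(paths) or "."
    # Keep row 0 for all other columns; the remaining rows are discarded.
    collapsed = data.iloc[[0]].reset_index(drop=True)
    collapsed[name] = common
    return collapsed


frame = pd.DataFrame([["./a", 1], ["./b", 1], ["./c", 2]])
collapsed_rows = collapse_like(frame).values.tolist()
```

The value 2 in the last row is dropped, which matches the note that values in the other rows are discarded.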