TidyPy4DS: Bringing Tidyverse-Style Data Cleaning to pandas

Lenardar

2026-03-10

Why I Built TidyPy4DS

Anyone who does data analysis has probably switched back and forth between pandas and tidyverse.

If you’ve spent a lot of time in R, dplyr / tidyr / stringr feel natural: column selectors, across() for batch transforms, consistent entry points for reshaping, string handling, and missing value treatment.

Switch to Python and pandas is powerful, but a few things still come up regularly:

Batch-selecting columns by prefix, type, or condition doesn’t have a unified approach
Applying the same operation to a set of columns means hand-rolling a dictionary in assign()
Writing cleaning steps that are stable, reusable, and readable leads to accumulating one-off lambdas
Muscle memory from tidyverse has nowhere to go

So I built TidyPy4DS.

It’s not trying to rewrite pandas or build another complex framework. It fills one specific gap: column selection, batch transforms, string handling, reshape operations, and a more natural .pipe() chaining experience.

Project: https://github.com/lenardar/TidyPy4DS

What Problem This Solves

In one sentence: function names follow tidyverse conventions; behavior follows Python / pandas idioms.

Specifically:

No SQL DSL
No lazy evaluation engine
No replacing groupby, merge, or assign — those are already well-handled by pandas
Just collecting the most frequently repeated, most easily scattered operations into a consistent interface

The core capabilities currently in the project:

Selector system: starts_with, contains, numeric, where, etc.
Batch transforms: mutate_across
Batch renaming: rename_with
Summarizing: summarize
Structure overview: glimpse
Conditional mapping: case_when, if_else, recode
Reshape: pivot_longer, pivot_wider
Missing values: drop_na, fill_na, replace_na
Header cleaning: clean_names, remove_empty, row_to_names

A Direct Comparison: Pure pandas vs. tidypy

Starting with a common dataset:

import pandas as pd

df = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105],
    "dept": ["Sales", "Sales", "Tech", "Tech", "HR"],
    "score_math": [92.0, None, 88.0, 79.0, None],
    "score_eng": [85.0, 90.0, None, 82.0, 87.0],
    "name": [" Alice ", "Bob ", " Carol", "David", " Frank "],
})

Goal:

Keep ID, grouping, score columns, and name
Fill missing values with column medians
Strip the score_ prefix from column names
Strip leading/trailing whitespace from names
Assign a letter grade based on math score
Summarize by group

Pure pandas

pandas_result = (
    df.loc[:, ["employee_id", "dept", "score_math", "score_eng", "name"]]
    .assign(
        score_math=lambda x: x["score_math"].fillna(x["score_math"].median()),
        score_eng=lambda x: x["score_eng"].fillna(x["score_eng"].median()),
    )
    .rename(columns={"score_math": "math", "score_eng": "eng"})
    .assign(
        name=lambda x: x["name"].str.strip(),
        level=lambda x: pd.Series(
            ["A" if v >= 90 else "B" if v >= 80 else "C" for v in x["math"]],
            index=x.index,
        ),
    )
)

This is fine — it’s standard pandas. But notice:

Score columns are explicitly named twice
Column names before and after renaming both need to be kept in sync manually
As more similar columns are added, the number of places to change grows

tidypy

from tidypy.tidy import (
    case_when,
    clean_names,
    mutate_across,
    rename_with,
    select,
    starts_with,
)

tidypy_result = (
    clean_names(df)
    .pipe(
        select,
        "employee_id",
        "dept",
        starts_with("score_"),
        "name",
    )
    .pipe(
        mutate_across,
        starts_with("score_"),
        lambda s: s.fillna(s.median()),
    )
    .pipe(
        rename_with,
        lambda c: c.replace("score_", ""),
        starts_with("score_"),
    )
    .assign(
        name=lambda x: x["name"].str.strip(),
        level=lambda x: case_when(
            (x["math"] >= 90, "A"),
            (x["math"] >= 80, "B"),
            default="C",
        ),
    )
)

This isn’t magically shorter than pandas.

The real difference: you’re now expressing “which columns belong together” and “what to do to those columns” as separate concerns.

In a small dataset, this might just feel slightly cleaner. Once you add a score_logic column, the gap becomes concrete:

The pure pandas version usually requires updating multiple explicit column names
The selector version usually just needs the new column to match the score_ pattern — the main pipeline doesn’t need to change

The difference is even more pronounced with real-world data that isn’t a clean DataFrame to begin with — like Excel exports with:

Column names containing spaces, brackets, and punctuation
Actual headers buried in row 1
Interspersed fully empty rows and columns

The janitor-style functions handle this:

from tidypy.tidy import clean_names, remove_empty, row_to_names

raw = pd.DataFrame(
    [
        ["Patient ID", "Score Math", None],
        [1, 90, None],
        [2, 88, None],
        [None, None, None],
    ]
)

result = (
    raw
    .pipe(row_to_names, row=0)
    .pipe(remove_empty, axis="both")
    .pipe(clean_names)
)

Each of these is simple on its own. But they consistently appear exactly in the gap between “pandas can obviously do this” and “I end up rewriting this from scratch every time.”

The Core Design: Get Column Selection Right First

The most important abstraction in this project isn’t mutate_across() — it’s ColSelector.

All helpers return a selector object. The actual column resolution happens when the selector is combined with a specific DataFrame.

1
2
3

select(df, numeric() | starts_with("id"))
select(df, everything() - contains("tmp"))
mutate_across(df, where(lambda s: s.isnull().any()), lambda s: s.fillna(0))

The benefits:

Helpers can inspect column names, dtypes, or actual data values
Selectors compose
Column selection logic can be reused rather than scattered

The rules I landed on:

| is union
- is exclusion
Results are deduplicated and order-preserving
Unresolvable columns raise immediately — no silent failures

Once this layer is stable, select, mutate_across, rename_with, and pivot_longer all follow naturally.

One Thing I Deliberately Didn’t Do: Bare-Name NSE

If you know tidyverse, your first question might be:

Can I write mutate(df, total=a + b) like in R?

The short answer: technically possible, not worth it.

In R, that experience is backed by tidy evaluation. Python has no equivalent language mechanism. The only paths are eval, AST rewriting, or proxy objects.

The problem isn’t writing it — it’s what you get:

Unintuitive error messages
Poor debuggability
Inconsistency with the pandas ecosystem
Once edge cases appear, it becomes fragile syntax magic

My tradeoff:

No bare-name NSE
Keep assign(...) for a small number of explicit new columns
Use mutate_across(...) for batch transforms
Use case_when(...), if_else(...), recode(...) for conditional mapping and recoding

Less flashy, more stable.

glimpse(): The Thing I Always Wanted

In a notebook, head() shows values and info() shows structure, but neither is quite what I want.

So I added glimpse():

1
2
3

from tidypy.tidy import glimpse

glimpse(df)

It shows:

Row count and column count
Each column’s dtype
Non-null count, missing count
Unique value count
A few sample values as preview

With additional options:

1 2	glimpse(df, cols=starts_with("score_")) print(glimpse(df, as_text=True, display=False))

Small thing, but consistently useful in actual notebook work.

Current State

TidyPy4DS has a working implementation. The infrastructure is in place:

Bilingual README
Bilingual function documentation
Bilingual notebook examples
unittest test suite
GitHub Actions CI
A batch of janitor-style utility functions

Example notebooks are in three categories:

Why tidypy: direct comparison with pure pandas
Core APIs: selectors, mutate_across, summarize
Reshape and missing values: pivot_longer, pivot_wider, separate, unite, missing value handling

The project is still early, but the direction is clear:

The selector system should be stable
API surface should be small and explicit
No unnecessary wrapping
Don’t fight Python’s natural idioms
Single-file implementation for now, but internally organized for later decomposition

Closing Thought

This project isn’t fighting against pandas. It’s acknowledging something:

pandas is powerful, but some high-frequency cleaning operations are just not as smooth as they could be.

If a thin helper layer can collect those operations into a more consistent, reusable, readable interface — it earns its place.

If you find yourself switching mentally between pandas and tidyverse thinking, TidyPy4DS might fill exactly the gap you’re feeling.

What I’ve come to see as its most valuable part isn’t “porting tidyverse to Python” — it’s collecting those operations you write every day but always write in a slightly scattered way, into a small set of clean tools.

Project: https://github.com/lenardar/TidyPy4DS