TidyPy4DS: Bringing Tidyverse-Style Data Cleaning to pandas

Why I Built TidyPy4DS

Anyone who does data analysis has probably switched back and forth between pandas and tidyverse.

If you’ve spent a lot of time in R, dplyr / tidyr / stringr feel natural: column selectors, across() for batch transforms, consistent entry points for reshaping, string handling, and missing value treatment.

Switch to Python and pandas is powerful, but a few things still come up regularly:

  • Batch-selecting columns by prefix, type, or condition doesn’t have a unified approach
  • Applying the same operation to a set of columns means hand-rolling a dictionary in assign()
  • Writing cleaning steps that are stable, reusable, and readable leads to accumulating one-off lambdas
  • Muscle memory from tidyverse has nowhere to go

So I built TidyPy4DS.

It’s not trying to rewrite pandas or build another complex framework. It fills one specific gap: column selection, batch transforms, string handling, reshape operations, and a more natural .pipe() chaining experience.

Project: https://github.com/lenardar/TidyPy4DS

What Problem This Solves

In one sentence: function names follow tidyverse conventions; behavior follows Python / pandas idioms.

Specifically:

  • No SQL DSL
  • No lazy evaluation engine
  • No replacing groupby, merge, or assign — those are already well-handled by pandas
  • Just collecting the most frequently repeated, most easily scattered operations into a consistent interface

The core capabilities currently in the project:

  • Selector system: starts_with, contains, numeric, where, etc.
  • Batch transforms: mutate_across
  • Batch renaming: rename_with
  • Summarizing: summarize
  • Structure overview: glimpse
  • Conditional mapping: case_when, if_else, recode
  • Reshape: pivot_longer, pivot_wider
  • Missing values: drop_na, fill_na, replace_na
  • Header cleaning: clean_names, remove_empty, row_to_names

A Direct Comparison: Pure pandas vs. tidypy

Starting with a common dataset:

1
2
3
4
5
6
7
8
9
import pandas as pd

df = pd.DataFrame({
"employee_id": [101, 102, 103, 104, 105],
"dept": ["Sales", "Sales", "Tech", "Tech", "HR"],
"score_math": [92.0, None, 88.0, 79.0, None],
"score_eng": [85.0, 90.0, None, 82.0, 87.0],
"name": [" Alice ", "Bob ", " Carol", "David", " Frank "],
})

Goal:

  • Keep ID, grouping, score columns, and name
  • Fill missing values with column medians
  • Strip the score_ prefix from column names
  • Strip leading/trailing whitespace from names
  • Assign a letter grade based on math score
  • Summarize by group

Pure pandas

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
pandas_result = (
df.loc[:, ["employee_id", "dept", "score_math", "score_eng", "name"]]
.assign(
score_math=lambda x: x["score_math"].fillna(x["score_math"].median()),
score_eng=lambda x: x["score_eng"].fillna(x["score_eng"].median()),
)
.rename(columns={"score_math": "math", "score_eng": "eng"})
.assign(
name=lambda x: x["name"].str.strip(),
level=lambda x: pd.Series(
["A" if v >= 90 else "B" if v >= 80 else "C" for v in x["math"]],
index=x.index,
),
)
)

This is fine — it’s standard pandas. But notice:

  • Score columns are explicitly named twice
  • Column names before and after renaming both need to be kept in sync manually
  • As more similar columns are added, the number of places to change grows

tidypy

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
from tidypy.tidy import (
case_when,
clean_names,
mutate_across,
rename_with,
select,
starts_with,
)

tidypy_result = (
clean_names(df)
.pipe(
select,
"employee_id",
"dept",
starts_with("score_"),
"name",
)
.pipe(
mutate_across,
starts_with("score_"),
lambda s: s.fillna(s.median()),
)
.pipe(
rename_with,
lambda c: c.replace("score_", ""),
starts_with("score_"),
)
.assign(
name=lambda x: x["name"].str.strip(),
level=lambda x: case_when(
(x["math"] >= 90, "A"),
(x["math"] >= 80, "B"),
default="C",
),
)
)

This isn’t magically shorter than pandas.

The real difference: you’re now expressing “which columns belong together” and “what to do to those columns” as separate concerns.

In a small dataset, this might just feel slightly cleaner. Once you add a score_logic column, the gap becomes concrete:

  • The pure pandas version usually requires updating multiple explicit column names
  • The selector version usually just needs the new column to match the score_ pattern — the main pipeline doesn’t need to change

The difference is even more pronounced with real-world data that isn’t a clean DataFrame to begin with — like Excel exports with:

  • Column names containing spaces, brackets, and punctuation
  • Actual headers buried in row 1
  • Interspersed fully empty rows and columns

The janitor-style functions handle this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
from tidypy.tidy import clean_names, remove_empty, row_to_names

raw = pd.DataFrame(
[
["Patient ID", "Score Math", None],
[1, 90, None],
[2, 88, None],
[None, None, None],
]
)

result = (
raw
.pipe(row_to_names, row=0)
.pipe(remove_empty, axis="both")
.pipe(clean_names)
)

Each of these is simple on its own. But they consistently appear exactly in the gap between “pandas can obviously do this” and “I end up rewriting this from scratch every time.”

The Core Design: Get Column Selection Right First

The most important abstraction in this project isn’t mutate_across() — it’s ColSelector.

All helpers return a selector object. The actual column resolution happens when the selector is combined with a specific DataFrame.

1
2
3
select(df, numeric() | starts_with("id"))
select(df, everything() - contains("tmp"))
mutate_across(df, where(lambda s: s.isnull().any()), lambda s: s.fillna(0))

The benefits:

  • Helpers can inspect column names, dtypes, or actual data values
  • Selectors compose
  • Column selection logic can be reused rather than scattered

The rules I landed on:

  • | is union
  • - is exclusion
  • Results are deduplicated and order-preserving
  • Unresolvable columns raise immediately — no silent failures

Once this layer is stable, select, mutate_across, rename_with, and pivot_longer all follow naturally.

One Thing I Deliberately Didn’t Do: Bare-Name NSE

If you know tidyverse, your first question might be:

Can I write mutate(df, total=a + b) like in R?

The short answer: technically possible, not worth it.

In R, that experience is backed by tidy evaluation. Python has no equivalent language mechanism. The only paths are eval, AST rewriting, or proxy objects.

The problem isn’t writing it — it’s what you get:

  • Unintuitive error messages
  • Poor debuggability
  • Inconsistency with the pandas ecosystem
  • Once edge cases appear, it becomes fragile syntax magic

My tradeoff:

  • No bare-name NSE
  • Keep assign(...) for a small number of explicit new columns
  • Use mutate_across(...) for batch transforms
  • Use case_when(...), if_else(...), recode(...) for conditional mapping and recoding

Less flashy, more stable.

glimpse(): The Thing I Always Wanted

In a notebook, head() shows values and info() shows structure, but neither is quite what I want.

So I added glimpse():

1
2
3
from tidypy.tidy import glimpse

glimpse(df)

It shows:

  • Row count and column count
  • Each column’s dtype
  • Non-null count, missing count
  • Unique value count
  • A few sample values as preview

With additional options:

1
2
glimpse(df, cols=starts_with("score_"))
print(glimpse(df, as_text=True, display=False))

Small thing, but consistently useful in actual notebook work.

Current State

TidyPy4DS has a working implementation. The infrastructure is in place:

  • Bilingual README
  • Bilingual function documentation
  • Bilingual notebook examples
  • unittest test suite
  • GitHub Actions CI
  • A batch of janitor-style utility functions

Example notebooks are in three categories:

  • Why tidypy: direct comparison with pure pandas
  • Core APIs: selectors, mutate_across, summarize
  • Reshape and missing values: pivot_longer, pivot_wider, separate, unite, missing value handling

The project is still early, but the direction is clear:

  • The selector system should be stable
  • API surface should be small and explicit
  • No unnecessary wrapping
  • Don’t fight Python’s natural idioms
  • Single-file implementation for now, but internally organized for later decomposition

Closing Thought

This project isn’t fighting against pandas. It’s acknowledging something:

pandas is powerful, but some high-frequency cleaning operations are just not as smooth as they could be.

If a thin helper layer can collect those operations into a more consistent, reusable, readable interface — it earns its place.

If you find yourself switching mentally between pandas and tidyverse thinking, TidyPy4DS might fill exactly the gap you’re feeling.

What I’ve come to see as its most valuable part isn’t “porting tidyverse to Python” — it’s collecting those operations you write every day but always write in a slightly scattered way, into a small set of clean tools.

Project: https://github.com/lenardar/TidyPy4DS