Why I Built TidyPy4DS
Anyone who does data analysis has probably switched back and forth between pandas and tidyverse.
If you’ve spent a lot of time in R, dplyr / tidyr / stringr feel natural: column selectors, across() for batch transforms, consistent entry points for reshaping, string handling, and missing value treatment.
Switch to Python and pandas is powerful, but a few things still come up regularly:
- Batch-selecting columns by prefix, type, or condition doesn’t have a unified approach
- Applying the same operation to a set of columns means hand-rolling a dictionary in assign()
- Writing cleaning steps that are stable, reusable, and readable leads to accumulating one-off lambdas
- Muscle memory from tidyverse has nowhere to go
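To make the first two points concrete, here is a small plain-pandas sketch. The data and variable names are made up for illustration; none of this is tidypy code.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": [" Ann ", "Bob", " Cy"],
        "score_math": [90.0, np.nan, 70.0],
        "score_art": [85.0, 60.0, np.nan],
    }
)

# Three different idioms for "pick a set of columns":
by_prefix = df.filter(regex=r"^score_").columns.tolist()
by_dtype = df.select_dtypes(include="number").columns.tolist()
by_condition = [c for c in df.columns if df[c].isna().any()]

# One operation applied to several columns: a hand-rolled dict passed to
# assign(). The `c=c` default pins each column name inside its lambda.
filled = df.assign(
    **{c: (lambda d, c=c: d[c].fillna(d[c].median())) for c in by_prefix}
)
```

Each line is reasonable on its own; the friction is that the three selections use three unrelated spellings, and the batch transform needs the dict-unpacking trick.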
So I built TidyPy4DS.
It’s not trying to rewrite pandas or build another complex framework. It fills one specific gap: column selection, batch transforms, string handling, reshape operations, and a more natural .pipe() chaining experience.
Project: https://github.com/lenardar/TidyPy4DS
What Problem This Solves
In one sentence: function names follow tidyverse conventions; behavior follows Python / pandas idioms.
Specifically:
- No SQL DSL
- No lazy evaluation engine
- No replacing groupby, merge, or assign; those are already well-handled by pandas
- Just collecting the most frequently repeated, most easily scattered operations into a consistent interface
The core capabilities currently in the project:
- Selector system: starts_with, contains, numeric, where, etc.
- Batch transforms: mutate_across
- Batch renaming: rename_with
- Summarizing: summarize
- Structure overview: glimpse
- Conditional mapping: case_when, if_else, recode
- Reshape: pivot_longer, pivot_wider
- Missing values: drop_na, fill_na, replace_na
- Header cleaning: clean_names, remove_empty, row_to_names
A Direct Comparison: Pure pandas vs. tidypy
Starting with a common dataset:
import pandas as pd
Goal:
- Keep ID, grouping, score columns, and name
- Fill missing values with column medians
- Strip the score_ prefix from column names
- Strip leading/trailing whitespace from names
- Assign a letter grade based on math score
- Summarize by group
Pure pandas
pandas_result = (
This is fine — it’s standard pandas. But notice:
- Score columns are explicitly named twice
- Column names before and after renaming both need to be kept in sync manually
- As more similar columns are added, the number of places to change grows
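For readers who want something runnable, here is one way the goal above could be written in plain pandas. It is a sketch on hypothetical data with hypothetical column names and grade cutoffs, not the article's original listing.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names and values are assumptions.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "group": ["a", "a", "b", "b"],
        "name": [" Ann ", "Bob ", " Cy", "Dee"],
        "score_math": [92.0, np.nan, 58.0, 74.0],
        "score_art": [80.0, 70.0, np.nan, 90.0],
        "notes": ["x", "y", "z", "w"],
    }
)

cleaned = (
    df[["id", "group", "name", "score_math", "score_art"]]
    .assign(
        # The score columns are spelled out here ...
        score_math=lambda d: d["score_math"].fillna(d["score_math"].median()),
        score_art=lambda d: d["score_art"].fillna(d["score_art"].median()),
        name=lambda d: d["name"].str.strip(),
    )
    # ... and again here, with old and new names kept in sync by hand.
    .rename(columns={"score_math": "math", "score_art": "art"})
    .assign(
        # Assumed cutoffs: 90 and above is A, 60 and above is B, else C.
        grade=lambda d: np.select(
            [d["math"] >= 90, d["math"] >= 60], ["A", "B"], default="C"
        )
    )
)

summary = cleaned.groupby("group", as_index=False).agg(
    mean_math=("math", "mean"), mean_art=("art", "mean")
)
```

Every step is ordinary pandas. The maintenance cost sits in the repeated column names: the selection, the fill, and the rename each have to be edited when a new score column arrives.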
tidypy
from tidypy.tidy import (
This isn’t magically shorter than pandas.
The real difference: you’re now expressing “which columns belong together” and “what to do to those columns” as separate concerns.
In a small dataset, this might just feel slightly cleaner. Once you add a score_logic column, the gap becomes concrete:
- The pure pandas version usually requires updating multiple explicit column names
- The selector version usually just needs the new column to match the score_ pattern; the main pipeline doesn't need to change
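The same effect can be imitated in plain pandas by resolving the columns from a pattern instead of listing them. The helper below, fill_and_rename_scores, is a hypothetical stand-in written for this illustration; it is not part of tidypy.

```python
import numpy as np
import pandas as pd


def fill_and_rename_scores(df: pd.DataFrame, prefix: str = "score_") -> pd.DataFrame:
    """Median-fill every column starting with `prefix`, then drop the prefix."""
    score_cols = [c for c in df.columns if c.startswith(prefix)]
    filled = df.assign(**{c: df[c].fillna(df[c].median()) for c in score_cols})
    return filled.rename(columns={c: c[len(prefix):] for c in score_cols})


base = pd.DataFrame({"id": [1, 2, 3], "score_math": [90.0, np.nan, 70.0]})
# A new column that matches the pattern; the function itself is unchanged.
wider = base.assign(score_logic=[np.nan, 50.0, 60.0])

print(fill_and_rename_scores(base).columns.tolist())
print(fill_and_rename_scores(wider).columns.tolist())
```

The pipeline body never mentions score_logic, which is the property the selector approach is after.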
The difference is even more pronounced with real-world data that isn’t a clean DataFrame to begin with — like Excel exports with:
- Column names containing spaces, brackets, and punctuation
- Actual headers buried in row 1
- Interspersed fully empty rows and columns
The janitor-style functions handle this:
from tidypy.tidy import clean_names, remove_empty, row_to_names
Each of these is simple on its own. But they consistently appear exactly in the gap between “pandas can obviously do this” and “I end up rewriting this from scratch every time.”
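As a rough picture of the work involved, here are plain-pandas equivalents of the three steps on a made-up frame. The to_snake helper and the step order are assumptions for this sketch; tidypy's own functions may behave differently in the details.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in for a messy export: the real header sits in the first
# data row, and there is one fully empty row and one fully empty column.
raw = pd.DataFrame(
    [
        ["Student ID", "Full Name", np.nan, "Score (Math)"],
        [1, " Ann ", np.nan, 90],
        [np.nan, np.nan, np.nan, np.nan],
        [2, "Bob", np.nan, 70],
    ]
)


def to_snake(label: object) -> str:
    """Lowercase a label and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", str(label).strip().lower()).strip("_")


# 1. Drop rows and columns that are entirely empty.
trimmed = raw.dropna(axis=0, how="all").dropna(axis=1, how="all")

# 2. Promote the first remaining row to the header.
body = trimmed.iloc[1:].reset_index(drop=True)
body.columns = trimmed.iloc[0].tolist()

# 3. Normalize the header labels.
tidy = body.rename(columns=to_snake)
```

None of the three steps is hard, but each one is a few lines that tend to get retyped, slightly differently, in every notebook.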
The Core Design: Get Column Selection Right First
The most important abstraction in this project isn’t mutate_across() — it’s ColSelector.
All helpers return a selector object. The actual column resolution happens when the selector is combined with a specific DataFrame.
select(df, numeric() | starts_with("id"))
The benefits:
- Helpers can inspect column names, dtypes, or actual data values
- Selectors compose
- Column selection logic can be reused rather than scattered
The rules I landed on:
- | is union
- - is exclusion
- Results are deduplicated and order-preserving
- Unresolvable columns raise immediately — no silent failures
Once this layer is stable, select, mutate_across, rename_with, and pivot_longer all follow naturally.
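The idea of a deferred, composable selector can be sketched in a few dozen lines. This is a minimal illustration of the rules above, not the project's ColSelector: the class name Selector, the helper named(), and the choice of "first-seen order" for unions are all assumptions.

```python
from typing import Callable

import pandas as pd


class Selector:
    """Deferred column selection: resolved only when given a DataFrame."""

    def __init__(self, resolver: Callable[[pd.DataFrame], list]):
        self._resolver = resolver

    def resolve(self, df: pd.DataFrame) -> list:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return list(dict.fromkeys(self._resolver(df)))

    def __or__(self, other: "Selector") -> "Selector":
        # Union: columns from the left operand, then any new ones from the right.
        return Selector(lambda df: self.resolve(df) + other.resolve(df))

    def __sub__(self, other: "Selector") -> "Selector":
        # Exclusion: keep left-hand columns that the right-hand side does not match.
        def resolver(df: pd.DataFrame) -> list:
            excluded = set(other.resolve(df))
            return [c for c in self.resolve(df) if c not in excluded]

        return Selector(resolver)


def starts_with(prefix: str) -> Selector:
    return Selector(lambda df: [c for c in df.columns if str(c).startswith(prefix)])


def numeric() -> Selector:
    return Selector(lambda df: df.select_dtypes(include="number").columns.tolist())


def named(*names: str) -> Selector:
    """Hypothetical helper: explicit names, failing loudly if any is missing."""

    def resolver(df: pd.DataFrame) -> list:
        missing = [n for n in names if n not in df.columns]
        if missing:
            raise KeyError(f"columns not found: {missing}")
        return list(names)

    return Selector(resolver)


df = pd.DataFrame(
    {"id": ["a", "b"], "score_math": [1.0, 2.0], "score_art": [3, 4], "name": ["x", "y"]}
)
union = (numeric() | starts_with("id")).resolve(df)
without_art = (starts_with("score_") - named("score_art")).resolve(df)
```

Because nothing is resolved until resolve(df) runs, the same selector expression can be stored once and reused against different frames.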
One Thing I Deliberately Didn’t Do: Bare-Name NSE
If you know tidyverse, your first question might be:
Can I write mutate(df, total=a + b) like in R?
The short answer: technically possible, not worth it.
In R, that experience is backed by tidy evaluation. Python has no equivalent language mechanism. The only paths are eval, AST rewriting, or proxy objects.
The problem isn’t writing it — it’s what you get:
- Unintuitive error messages
- Poor debuggability
- Inconsistency with the pandas ecosystem
- Once edge cases appear, it becomes fragile syntax magic
My tradeoff:
- No bare-name NSE
- Keep assign(...) for a small number of explicit new columns
- Use mutate_across(...) for batch transforms
- Use case_when(...), if_else(...), recode(...) for conditional mapping and recoding
Less flashy, more stable.
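A short plain-pandas illustration of the tradeoff, on made-up data. It shows why bare names fail in Python, the explicit assign() form, and the string-based route that pandas already offers through DataFrame.eval.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Python evaluates arguments before the call happens, so bare column names
# are looked up as ordinary variables and fail when none exist.
try:
    df.assign(total=a + b)  # noqa: F821 (deliberately undefined names)
    bare_names_worked = True
except NameError:
    bare_names_worked = False

# Explicit new column: the lambda receives the frame being assigned to.
explicit = df.assign(total=lambda d: d["a"] + d["b"])

# String-based evaluation, one of the "eval" paths, as shipped by pandas.
via_eval = df.eval("total = a + b")

# Conditional mapping spelled out with numpy rather than bare names.
labelled = explicit.assign(
    size=lambda d: np.where(d["total"] > 20, "big", "small")
)
```

The explicit forms are a little longer, but an error in them points at a real line of Python rather than at a rewritten expression.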
glimpse(): The Thing I Always Wanted
In a notebook, head() shows values and info() shows structure, but neither is quite what I want.
So I added glimpse():
from tidypy.tidy import glimpse
It shows:
- Row count and column count
- Each column’s dtype
- Non-null count, missing count
- Unique value count
- A few sample values as preview
With additional options:
glimpse(df, cols=starts_with("score_"))
Small thing, but consistently useful in actual notebook work.
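For a sense of what such an overview involves, the fields listed above can be approximated in plain pandas. The function glimpse_table and its n_preview parameter are hypothetical names for this sketch; the real glimpse() may compute or display things differently.

```python
import numpy as np
import pandas as pd


def glimpse_table(df: pd.DataFrame, n_preview: int = 3) -> pd.DataFrame:
    """One row per column: dtype, null counts, unique count, sample values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
            "missing": df.isna().sum(),
            "unique": df.nunique(),  # NaN is not counted as a value
            "preview": [df[c].dropna().head(n_preview).tolist() for c in df.columns],
        },
        index=df.columns,
    )


# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "score_math": [90.0, np.nan, 70.0],
        "name": ["Ann", "Bob", "Ann"],
    }
)

print(f"Rows: {len(df)}, Columns: {df.shape[1]}")
info = glimpse_table(df)
print(info)

# Restricting the overview to a subset of columns, here by prefix.
scores_only = glimpse_table(df.filter(regex=r"^score_"))
```

This assumes unique column names; with duplicated labels, df[c] would return a frame instead of a single column.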
Current State
TidyPy4DS has a working implementation. The infrastructure is in place:
- Bilingual README
- Bilingual function documentation
- Bilingual notebook examples
- unittest test suite
- GitHub Actions CI
- A batch of janitor-style utility functions
Example notebooks are in three categories:
- Why tidypy: direct comparison with pure pandas
- Core APIs: selectors, mutate_across, summarize
- Reshape and missing values: pivot_longer, pivot_wider, separate, unite, missing value handling
The project is still early, but the direction is clear:
- The selector system should be stable
- API surface should be small and explicit
- No unnecessary wrapping
- Don’t fight Python’s natural idioms
- Single-file implementation for now, but internally organized for later decomposition
Closing Thought
This project isn’t fighting against pandas. It’s acknowledging something:
pandas is powerful, but some high-frequency cleaning operations are just not as smooth as they could be.
If a thin helper layer can collect those operations into a more consistent, reusable, readable interface — it earns its place.
If you find yourself switching mentally between pandas and tidyverse thinking, TidyPy4DS might fill exactly the gap you’re feeling.
What I’ve come to see as its most valuable part isn’t “porting tidyverse to Python” — it’s collecting those operations you write every day but always write in a slightly scattered way, into a small set of clean tools.