BuckarooCompare — Diff Your DataFrames¶
When you change a pipeline, how do you know what changed in the output? When you migrate a table from one database to another, how do you verify the data matches? When two teams produce different versions of the same report, where are the differences?
You diff them. But df1.equals(df2) returns a single boolean, and
df1.compare(df2) only works if the DataFrames have identical shapes and
indexes. Real-world comparisons are messier: rows may be reordered, columns
may be added or removed, and the join key might not be the index.
Buckaroo’s col_join_dfs function handles all of this and renders the
result as a color-coded interactive table where differences jump out
visually.
Quick start¶
from buckaroo.compare import col_join_dfs
import pandas as pd
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'score': [88.5, 92.1, 75.3, 96.7],
})
df2 = pd.DataFrame({
'id': [1, 2, 3, 5],
'name': ['Alice', 'Robert', 'Charlie', 'Eve'],
'score': [88.5, 92.1, 80.0, 81.0],
})
merged_df, column_config_overrides, eqs = col_join_dfs(
df1, df2,
join_columns=['id'],
how='outer'
)
The function returns three things:
merged_df: The joined DataFrame with all rows from both inputs, plus hidden metadata columns for diff state
column_config_overrides: A dict of buckaroo styling config that color-codes each cell based on whether it matches, differs, or is missing from one side
eqs: A summary dict showing the diff count per column — how many rows differ for each column
How the diff works¶
col_join_dfs performs a pd.merge on the join columns, then for each
data column:
Creates a hidden
{col}|df2column with the df2 valueCreates a hidden
{col}|eqcolumn encoding the combined state: is the row in df1 only, df2 only, both-and-matching, or both-and-different?Generates a
color_map_configthat maps these states to colors
The color scheme:
State |
Color |
Meaning |
|---|---|---|
df1 only |
Pink |
Row exists in df1 but not df2 |
df2 only |
Green |
Row exists in df2 but not df1 |
Match |
Light blue |
Row in both, values identical |
Diff |
Dark blue |
Row in both, values differ |
Join key columns are highlighted in purple so you can immediately see what was used for matching.
The eqs summary¶
The third return value tells you at a glance where the differences are:
>>> eqs
{
'id': {'diff_count': 'join_key'},
'name': {'diff_count': 2}, # 2 rows differ
'score': {'diff_count': 1}, # 1 row differs
}
Special values:
"join_key"— this column was used for matching, not compared"df_1"— column only exists in df1"df_2"— column only exists in df2An integer — number of rows where values differ
Using it with the server¶
The buckaroo server exposes a /load_compare endpoint that loads two
files, runs the diff, and pushes the styled result to any connected browser:
curl -X POST http://localhost:8888/load_compare \
-H "Content-Type: application/json" \
-d '{
"session": "my-session",
"path1": "/data/report_v1.csv",
"path2": "/data/report_v2.csv",
"join_columns": ["id"],
"how": "outer"
}'
The response includes the diff summary:
{
"session": "my-session",
"rows": 5,
"columns": ["id", "name", "score"],
"eqs": {
"id": {"diff_count": "join_key"},
"name": {"diff_count": 2},
"score": {"diff_count": 1}
}
}
The browser view updates immediately with the color-coded merged table. Hover over any differing cell to see the df2 value in a tooltip.
Multi-column joins¶
merged_df, overrides, eqs = col_join_dfs(
df1, df2,
join_columns=['region', 'date'],
how='inner'
)
Composite join keys work naturally. Both region and date will be
highlighted in purple.
Use cases¶
- Data migration validation
Migrating from Postgres to Snowflake? Export both tables, diff them. The color coding immediately shows which rows are missing and which values changed.
- Pipeline output comparison
Changed a transform? Diff the before and after. The
eqssummary tells you exactly which columns were affected and by how many rows.- A/B test result inspection
Compare experiment vs control DataFrames on a user ID join key. See which metrics actually differ.
- Schema evolution
When df2 has columns that df1 doesn’t (or vice versa), those columns are marked as
"df_1"or"df_2"in the eqs summary, so you can see schema changes alongside data changes.
Integration with datacompy¶
The docs/example-notebooks/datacompy_app.py example shows how to use
datacompy for metadata-rich
comparison (column matching stats, row-level match rates) while using
buckaroo for the visual rendering.
This gives you the best of both: datacompy’s statistical summary plus buckaroo’s interactive, color-coded table view.
Limitations¶
Join columns must be unique in each DataFrame (no many-to-many joins). If duplicates are detected,
col_join_dfsraises aValueError.Column names cannot contain
|df2or__buckaroo_merge(these are used internally).Very large DataFrames (>100K rows) will work but the browser may be slow to render the full color-coded table.