BuckarooCompare — Diff Your DataFrames¶

When you change a pipeline, how do you know what changed in the output? When you migrate a table from one database to another, how do you verify the data matches? When two teams produce different versions of the same report, where are the differences?

You diff them. But df1.equals(df2) returns a single boolean, and df1.compare(df2) only works if the DataFrames have identical shapes and indexes. Real-world comparisons are messier: rows may be reordered, columns may be added or removed, and the join key might not be the index.

Buckaroo’s col_join_dfs function handles all of this and renders the result as a color-coded interactive table where differences jump out visually.

Quick start¶

from buckaroo.compare import col_join_dfs
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'score': [88.5, 92.1, 75.3, 96.7],
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'name': ['Alice', 'Robert', 'Charlie', 'Eve'],
    'score': [88.5, 92.1, 80.0, 81.0],
})

merged_df, column_config_overrides, eqs = col_join_dfs(
    df1, df2,
    join_columns=['id'],
    how='outer'
)

The function returns three things:

merged_df: The joined DataFrame with all rows from both inputs, plus hidden metadata columns for diff state
column_config_overrides: A dict of buckaroo styling config that color-codes each cell based on whether it matches, differs, or is missing from one side
eqs: A summary dict showing the diff count per column — how many rows differ for each column

How the diff works¶

col_join_dfs performs a pd.merge on the join columns, then for each data column:

Creates a hidden {col}|df2 column with the df2 value
Creates a hidden {col}|eq column encoding the combined state: is the row in df1 only, df2 only, both-and-matching, or both-and-different?
Generates a color_map_config that maps these states to colors

The color scheme:

State	Color	Meaning
df1 only	Pink	Row exists in df1 but not df2
df2 only	Green	Row exists in df2 but not df1
Match	Light blue	Row in both, values identical
Diff	Dark blue	Row in both, values differ

Join key columns are highlighted in purple so you can immediately see what was used for matching.

The eqs summary¶

The third return value tells you at a glance where the differences are:

>>> eqs
{
    'id': {'diff_count': 'join_key'},
    'name': {'diff_count': 2},      # 2 rows differ
    'score': {'diff_count': 1},      # 1 row differs
}

Special values:

"join_key" — this column was used for matching, not compared
"df_1" — column only exists in df1
"df_2" — column only exists in df2
An integer — number of rows where values differ

Using it with the server¶

The buckaroo server exposes a /load_compare endpoint that loads two files, runs the diff, and pushes the styled result to any connected browser:

curl -X POST http://localhost:8888/load_compare \
  -H "Content-Type: application/json" \
  -d '{
    "session": "my-session",
    "path1": "/data/report_v1.csv",
    "path2": "/data/report_v2.csv",
    "join_columns": ["id"],
    "how": "outer"
  }'

The response includes the diff summary:

{
  "session": "my-session",
  "rows": 5,
  "columns": ["id", "name", "score"],
  "eqs": {
    "id": {"diff_count": "join_key"},
    "name": {"diff_count": 2},
    "score": {"diff_count": 1}
  }
}

The browser view updates immediately with the color-coded merged table. Hover over any differing cell to see the df2 value in a tooltip.

Multi-column joins¶

merged_df, overrides, eqs = col_join_dfs(
    df1, df2,
    join_columns=['region', 'date'],
    how='inner'
)

Composite join keys work naturally. Both region and date will be highlighted in purple.

Use cases¶

Data migration validation: Migrating from Postgres to Snowflake? Export both tables, diff them. The color coding immediately shows which rows are missing and which values changed.
Pipeline output comparison: Changed a transform? Diff the before and after. The eqs summary tells you exactly which columns were affected and by how many rows.
A/B test result inspection: Compare experiment vs control DataFrames on a user ID join key. See which metrics actually differ.
Schema evolution: When df2 has columns that df1 doesn’t (or vice versa), those columns are marked as "df_1" or "df_2" in the eqs summary, so you can see schema changes alongside data changes.

Integration with datacompy¶

The docs/example-notebooks/datacompy_app.py example shows how to use datacompy for metadata-rich comparison (column matching stats, row-level match rates) while using buckaroo for the visual rendering.

This gives you the best of both: datacompy’s statistical summary plus buckaroo’s interactive, color-coded table view.

Limitations¶

Join columns must be unique in each DataFrame (no many-to-many joins). If duplicates are detected, col_join_dfs raises a ValueError.
Column names cannot contain |df2 or __buckaroo_merge (these are used internally).
Very large DataFrames (>100K rows) will work but the browser may be slow to render the full color-coded table.