BuckarooCompare — Diff Your DataFrames

When you change a pipeline, how do you know what changed in the output? When you migrate a table from one database to another, how do you verify the data matches? When two teams produce different versions of the same report, where are the differences?

You diff them. But df1.equals(df2) returns a single boolean, and df1.compare(df2) only works if the DataFrames have identical shapes and indexes. Real-world comparisons are messier: rows may be reordered, columns may be added or removed, and the join key might not be the index.

Buckaroo’s col_join_dfs function handles all of this and renders the result as a color-coded interactive table where differences jump out visually.

Quick start

from buckaroo.compare import col_join_dfs
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'score': [88.5, 92.1, 75.3, 96.7],
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'name': ['Alice', 'Robert', 'Charlie', 'Eve'],
    'score': [88.5, 92.1, 80.0, 81.0],
})

merged_df, column_config_overrides, eqs = col_join_dfs(
    df1, df2,
    join_columns=['id'],
    how='outer'
)

The function returns three things:

  1. merged_df: The joined DataFrame with all rows from both inputs, plus hidden metadata columns for diff state

  2. column_config_overrides: A dict of buckaroo styling config that color-codes each cell based on whether it matches, differs, or is missing from one side

  3. eqs: A summary dict showing the diff count per column — how many rows differ for each column

How the diff works

col_join_dfs performs a pd.merge on the join columns, then for each data column:

  • Creates a hidden {col}|df2 column with the df2 value

  • Creates a hidden {col}|eq column encoding the combined state: is the row in df1 only, df2 only, both-and-matching, or both-and-different?

  • Generates a color_map_config that maps these states to colors

The color scheme:

State

Color

Meaning

df1 only

Pink

Row exists in df1 but not df2

df2 only

Green

Row exists in df2 but not df1

Match

Light blue

Row in both, values identical

Diff

Dark blue

Row in both, values differ

Join key columns are highlighted in purple so you can immediately see what was used for matching.

The eqs summary

The third return value tells you at a glance where the differences are:

>>> eqs
{
    'id': {'diff_count': 'join_key'},
    'name': {'diff_count': 2},      # 2 rows differ
    'score': {'diff_count': 1},      # 1 row differs
}

Special values:

  • "join_key" — this column was used for matching, not compared

  • "df_1" — column only exists in df1

  • "df_2" — column only exists in df2

  • An integer — number of rows where values differ

Using it with the server

The buckaroo server exposes a /load_compare endpoint that loads two files, runs the diff, and pushes the styled result to any connected browser:

curl -X POST http://localhost:8888/load_compare \
  -H "Content-Type: application/json" \
  -d '{
    "session": "my-session",
    "path1": "/data/report_v1.csv",
    "path2": "/data/report_v2.csv",
    "join_columns": ["id"],
    "how": "outer"
  }'

The response includes the diff summary:

{
  "session": "my-session",
  "rows": 5,
  "columns": ["id", "name", "score"],
  "eqs": {
    "id": {"diff_count": "join_key"},
    "name": {"diff_count": 2},
    "score": {"diff_count": 1}
  }
}

The browser view updates immediately with the color-coded merged table. Hover over any differing cell to see the df2 value in a tooltip.

Multi-column joins

merged_df, overrides, eqs = col_join_dfs(
    df1, df2,
    join_columns=['region', 'date'],
    how='inner'
)

Composite join keys work naturally. Both region and date will be highlighted in purple.

Use cases

Data migration validation

Migrating from Postgres to Snowflake? Export both tables, diff them. The color coding immediately shows which rows are missing and which values changed.

Pipeline output comparison

Changed a transform? Diff the before and after. The eqs summary tells you exactly which columns were affected and by how many rows.

A/B test result inspection

Compare experiment vs control DataFrames on a user ID join key. See which metrics actually differ.

Schema evolution

When df2 has columns that df1 doesn’t (or vice versa), those columns are marked as "df_1" or "df_2" in the eqs summary, so you can see schema changes alongside data changes.

Integration with datacompy

The docs/example-notebooks/datacompy_app.py example shows how to use datacompy for metadata-rich comparison (column matching stats, row-level match rates) while using buckaroo for the visual rendering.

This gives you the best of both: datacompy’s statistical summary plus buckaroo’s interactive, color-coded table view.

Limitations

  • Join columns must be unique in each DataFrame (no many-to-many joins). If duplicates are detected, col_join_dfs raises a ValueError.

  • Column names cannot contain |df2 or __buckaroo_merge (these are used internally).

  • Very large DataFrames (>100K rows) will work but the browser may be slow to render the full color-coded table.