BuckarooCompare — Diff Your DataFrames ======================================= When you change a pipeline, how do you know what changed in the output? When you migrate a table from one database to another, how do you verify the data matches? When two teams produce different versions of the same report, where are the differences? You diff them. But ``df1.equals(df2)`` returns a single boolean, and ``df1.compare(df2)`` only works if the DataFrames have identical shapes and indexes. Real-world comparisons are messier: rows may be reordered, columns may be added or removed, and the join key might not be the index. Buckaroo's ``col_join_dfs`` function handles all of this and renders the result as a color-coded interactive table where differences jump out visually. Quick start ----------- .. code-block:: python from buckaroo.compare import col_join_dfs import pandas as pd df1 = pd.DataFrame({ 'id': [1, 2, 3, 4], 'name': ['Alice', 'Bob', 'Charlie', 'Diana'], 'score': [88.5, 92.1, 75.3, 96.7], }) df2 = pd.DataFrame({ 'id': [1, 2, 3, 5], 'name': ['Alice', 'Robert', 'Charlie', 'Eve'], 'score': [88.5, 92.1, 80.0, 81.0], }) merged_df, column_config_overrides, eqs = col_join_dfs( df1, df2, join_columns=['id'], how='outer' ) The function returns three things: 1. **merged_df**: The joined DataFrame with all rows from both inputs, plus hidden metadata columns for diff state 2. **column_config_overrides**: A dict of buckaroo styling config that color-codes each cell based on whether it matches, differs, or is missing from one side 3. **eqs**: A summary dict showing the diff count per column — how many rows differ for each column How the diff works ------------------ ``col_join_dfs`` performs a ``pd.merge`` on the join columns, then for each data column: - Creates a hidden ``{col}|df2`` column with the df2 value - Creates a hidden ``{col}|eq`` column encoding the combined state: is the row in df1 only, df2 only, both-and-matching, or both-and-different? - Generates a ``color_map_config`` that maps these states to colors The color scheme: .. list-table:: :header-rows: 1 * - State - Color - Meaning * - df1 only - Pink - Row exists in df1 but not df2 * - df2 only - Green - Row exists in df2 but not df1 * - Match - Light blue - Row in both, values identical * - Diff - Dark blue - Row in both, values differ Join key columns are highlighted in purple so you can immediately see what was used for matching. The eqs summary --------------- The third return value tells you at a glance where the differences are: .. code-block:: python >>> eqs { 'id': {'diff_count': 'join_key'}, 'name': {'diff_count': 2}, # 2 rows differ 'score': {'diff_count': 1}, # 1 row differs } Special values: - ``"join_key"`` — this column was used for matching, not compared - ``"df_1"`` — column only exists in df1 - ``"df_2"`` — column only exists in df2 - An integer — number of rows where values differ Using it with the server ------------------------ The buckaroo server exposes a ``/load_compare`` endpoint that loads two files, runs the diff, and pushes the styled result to any connected browser: .. code-block:: bash curl -X POST http://localhost:8888/load_compare \ -H "Content-Type: application/json" \ -d '{ "session": "my-session", "path1": "/data/report_v1.csv", "path2": "/data/report_v2.csv", "join_columns": ["id"], "how": "outer" }' The response includes the diff summary: .. code-block:: json { "session": "my-session", "rows": 5, "columns": ["id", "name", "score"], "eqs": { "id": {"diff_count": "join_key"}, "name": {"diff_count": 2}, "score": {"diff_count": 1} } } The browser view updates immediately with the color-coded merged table. Hover over any differing cell to see the df2 value in a tooltip. Multi-column joins ------------------ .. code-block:: python merged_df, overrides, eqs = col_join_dfs( df1, df2, join_columns=['region', 'date'], how='inner' ) Composite join keys work naturally. Both ``region`` and ``date`` will be highlighted in purple. Use cases --------- **Data migration validation** Migrating from Postgres to Snowflake? Export both tables, diff them. The color coding immediately shows which rows are missing and which values changed. **Pipeline output comparison** Changed a transform? Diff the before and after. The ``eqs`` summary tells you exactly which columns were affected and by how many rows. **A/B test result inspection** Compare experiment vs control DataFrames on a user ID join key. See which metrics actually differ. **Schema evolution** When df2 has columns that df1 doesn't (or vice versa), those columns are marked as ``"df_1"`` or ``"df_2"`` in the eqs summary, so you can see schema changes alongside data changes. Integration with datacompy -------------------------- The ``docs/example-notebooks/datacompy_app.py`` example shows how to use `datacompy `_ for metadata-rich comparison (column matching stats, row-level match rates) while using buckaroo for the visual rendering. This gives you the best of both: datacompy's statistical summary plus buckaroo's interactive, color-coded table view. Limitations ----------- - Join columns must be unique in each DataFrame (no many-to-many joins). If duplicates are detected, ``col_join_dfs`` raises a ``ValueError``. - Column names cannot contain ``|df2`` or ``__buckaroo_merge`` (these are used internally). - Very large DataFrames (>100K rows) will work but the browser may be slow to render the full color-coded table.