The Dastardly DataFrame Dataset
================================

Every DataFrame viewer works fine on ``pd.DataFrame({'a': [1, 2, 3]})``. The question is what happens when the data gets weird.

Buckaroo ships a collection of deliberately tricky DataFrames called the **Dastardly DataFrame Dataset** (DDD). These are the DataFrames that break other viewers: the ones with MultiIndex columns, NaN mixed with infinity, columns literally named ``index``, integers too large for JavaScript, and types that most tools pretend don't exist.

This page shows each one rendered live in buckaroo's static embed. No Jupyter kernel, no server, just HTML and JavaScript. If you can see the tables below, the static embedding system is working.

Why this matters
----------------

If you build dashboards, you choose what data goes into your table. You control the types, the column names, the index. But if you're doing exploratory data analysis (loading CSVs from vendors, joining tables from different systems, debugging a pipeline that produces unexpected output), you don't control any of that. The data is what it is.

``df.head()`` hides the problem. It shows you 5 rows and lets you believe everything is fine. Buckaroo is built for the opposite workflow: show you everything, especially the parts that are surprising.

The Dastardly DataFrames
------------------------

Each section below shows the exact function from ``buckaroo.ddd_library`` that creates the DataFrame, explains why it's tricky, and renders it live in a buckaroo static embed.

.. code-block:: bash

    pip install buckaroo

.. code-block:: python

    from buckaroo.ddd_library import *

Infinity and NaN
~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_infinity() -> pd.DataFrame:
        return pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]})

    df_with_infinity()

Three values, three completely different things: a missing value, positive infinity, and negative infinity. Many viewers display all three as blank or "NaN".
Buckaroo distinguishes them. This also tests whether summary stats (mean, min, max) handle infinity correctly. They should, because ``np.inf`` is a valid float, not missing data.

.. raw:: html

Really Big Numbers
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_really_big_number() -> pd.DataFrame:
        return pd.DataFrame({"col1": [9999999999999999999, 1]})

    df_with_really_big_number()

Python integers have arbitrary precision. JavaScript's ``Number`` type has 53 bits of integer precision (``Number.MAX_SAFE_INTEGER`` = 9007199254740991). The value 9999999999999999999 exceeds this: if you naively convert it to a JS number, it silently rounds to 10000000000000000000. Buckaroo detects values above ``MAX_SAFE_INTEGER`` and preserves them as strings to maintain exact precision.

This matters for database primary keys, blockchain transaction IDs, and any system that uses 64-bit integers.

.. raw:: html

Column Named "index"
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_col_named_index() -> pd.DataFrame:
        return pd.DataFrame({
            'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"],
            'index': ["7777", "ooooo", "--- -", "33333", "assdf"]})

    df_with_col_named_index()

When you call ``df.reset_index()`` on a DataFrame with an unnamed index, pandas creates a column called ``index``. Many widgets break because they confuse this column with the DataFrame's actual index. Buckaroo handles the ambiguity by internally renaming columns to ``a, b, c...`` and mapping back via ``orig_col_name``.

.. raw:: html

Named Index
~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_df_with_named_index() -> pd.DataFrame:
        """someone put the effort into naming the index,
        you'd probably want to display that"""
        return pd.DataFrame(
            {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]},
            index=pd.Index([10, 20, 30, 40, 50], name='foo'))

    get_df_with_named_index()

Someone took the time to name this index ``foo``.
That name carries meaning: it might be a join key, a time series frequency, or a categorical grouping. Buckaroo displays named indexes as a distinct pinned column so the name is visible.

.. raw:: html

MultiIndex Columns
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex_with_names_cols_df(rows=15) -> pd.DataFrame:
        cols = pd.MultiIndex.from_tuples(
            [('foo', 'a'), ('foo', 'b'),
             ('bar', 'a'), ('bar', 'b'), ('bar', 'c')],
            names=['level_a', 'level_b'])
        return pd.DataFrame(
            [["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]] * rows,
            columns=cols)

    get_multiindex_with_names_cols_df(rows=6)

Hierarchical column headers are common after ``.pivot_table()`` and ``.groupby().agg()``. Most viewers either crash or flatten them into ugly tuple strings like ``('foo', 'a')``. Buckaroo flattens them into readable headers while preserving the level information.

.. raw:: html

MultiIndex on Rows
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex_index_df() -> pd.DataFrame:
        row_index = pd.MultiIndex.from_tuples([
            ('foo', 'a'), ('foo', 'b'),
            ('bar', 'a'), ('bar', 'b'), ('bar', 'c'),
            ('baz', 'a')])
        return pd.DataFrame({
            'foo_col': [10, 20, 30, 40, 50, 60],
            'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]},
            index=row_index)

    get_multiindex_index_df()

Multi-level row indexes are the counterpart to MultiIndex columns. They appear after ``.groupby()`` without ``.reset_index()``, or when loading data from hierarchical sources. The tricky part: each index level becomes an additional column that has to be displayed alongside the data columns without breaking the column count.

This DataFrame also has a ``None`` in the last row of ``bar_col``, a missing string value mixed with non-missing strings.

.. raw:: html

Three-Level MultiIndex
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex3_index_df() -> pd.DataFrame:
        row_index = pd.MultiIndex.from_tuples([
            ('foo', 'a', 3), ('foo', 'b', 2),
            ('bar', 'a', 1), ('bar', 'b', 3), ('bar', 'c', 5),
            ('baz', 'a', 6)])
        return pd.DataFrame({
            'foo_col': [10, 20, 30, 40, 50, 60],
            'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]},
            index=row_index)

    get_multiindex3_index_df()

If two levels are hard, three levels are harder. This exercises the column-renaming logic that has to handle an arbitrary number of index levels without collision.

.. raw:: html

MultiIndex on Both Axes
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex_with_names_both() -> pd.DataFrame:
        row_index = pd.MultiIndex.from_tuples([
            ('foo', 'a'), ('foo', 'b'),
            ('bar', 'a'), ('bar', 'b'), ('bar', 'c'),
            ('baz', 'a')],
            names=['index_name_1', 'index_name_2'])
        cols = pd.MultiIndex.from_tuples(
            [('foo', 'a'), ('foo', 'b'),
             ('bar', 'a'), ('bar', 'b'), ('bar', 'c'),
             ('baz', 'a')],
            names=['level_a', 'level_b'])
        return pd.DataFrame([
            [10, 20, 30, 40, 50, 60]] * 6,
            columns=cols, index=row_index)

    get_multiindex_with_names_both()

The boss fight: hierarchical headers on both axes, with named levels on both sides. This is what ``pd.pivot_table()`` produces on complex groupings. Everything about column counting, index handling, and header rendering gets tested simultaneously.

.. raw:: html

Weird Types (Pandas)
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_weird_types() -> pd.DataFrame:
        """DataFrame with unusual dtypes that historically broke rendering.
        Exercises: categorical, timedelta, period, interval."""
        return pd.DataFrame({
            'categorical': pd.Categorical(
                ['red', 'green', 'blue', 'red', 'green']),
            'timedelta': pd.to_timedelta(
                ['1 days 02:03:04', '0 days 00:00:01', '365 days',
                 '0 days 00:00:00.001', '0 days 00:00:00.000100']),
            'period': pd.Series(
                pd.period_range('2021-01', periods=5, freq='M')),
            'interval': pd.Series(
                pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3, 4, 5])),
            'int_col': [10, 20, 30, 40, 50],
        })

    df_with_weird_types()

Four types that most viewers ignore:

- **Categorical**: Has a fixed set of allowed values. Not a string.
- **Timedelta**: A duration, not a timestamp. "1 day, 2 hours, 3 minutes, 4 seconds" is a single value.
- **Period**: A span of time ("January 2021"), not a point in time.
- **Interval**: A range like ``(0, 1]``. Common in ``pd.cut()`` output.

Buckaroo detects each type and applies the appropriate formatter. Timedeltas display as human-readable durations ("1d 2h 3m 4s"), not raw microsecond counts.

.. raw:: html

Weird Types (Polars)
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # from buckaroo/ddd_library.py
    def pl_df_with_weird_types():
        """Polars DataFrame with unusual dtypes that historically broke rendering.
        Exercises: Duration (#622), Time, Categorical, Decimal, Binary."""
        import datetime as dt
        import polars as pl
        return pl.DataFrame({
            'duration': pl.Series(
                [100_000, 3_723_000_000, 86_400_000_000, 500, 60_000_000],
                dtype=pl.Duration('us')),
            'time': [dt.time(14, 30), dt.time(9, 15, 30), dt.time(0, 0, 1),
                     dt.time(23, 59, 59), dt.time(12, 0)],
            'categorical': pl.Series(
                ['red', 'green', 'blue', 'red', 'green']
            ).cast(pl.Categorical),
            'decimal': pl.Series(
                ['100.50', '200.75', '0.01', '99999.99', '3.14']
            ).cast(pl.Decimal(10, 2)),
            'binary': [b'hello', b'world', b'\x00\x01\x02', b'test', b'\xff\xfe'],
            'int_col': [10, 20, 30, 40, 50],
        })

    pl_df_with_weird_types()

Polars has its own set of tricky types:

- **Duration**: Microsecond-precision time spans. Was completely blank before issue #622.
- **Time**: Time-of-day without a date component.
- **Decimal**: Fixed-precision decimal (not float). Important for financial data.
- **Binary**: Raw bytes. Displayed as hex strings.

Buckaroo renders both pandas and polars DataFrames with the same viewer. If you're migrating from pandas to polars, buckaroo moves with you.

.. raw:: html

What's happening under the hood
--------------------------------

Every table on this page is a **static embedding** of the full buckaroo widget. There is no Python kernel running. Here's what happened:

1. A Python script called ``buckaroo.artifact.to_html()`` on each DataFrame
2. The function serialized the data to base64-encoded Parquet (compact binary)
3. The summary stats (dtype, mean, histogram, etc.) were computed and serialized
4. Everything was embedded in an HTML file as a JSON ``