The Dastardly DataFrame Dataset
================================
Every DataFrame viewer works fine on ``pd.DataFrame({'a': [1, 2, 3]})``.
The question is what happens when the data gets weird.
Buckaroo ships a collection of deliberately tricky DataFrames called the
**Dastardly DataFrame Dataset** (DDD). These are the DataFrames that break
other viewers — the ones with MultiIndex columns, NaN mixed with infinity,
columns literally named ``index``, integers too large for JavaScript, and
types that most tools pretend don't exist.
This page shows each one rendered live in buckaroo's static embed. No
Jupyter kernel, no server — just HTML and JavaScript. If you can see the
tables below, the static embedding system is working.
Why this matters
----------------
If you build dashboards, you choose what data goes into your table. You
control the types, the column names, the index. But if you're doing
exploratory data analysis — loading CSVs from vendors, joining tables from
different systems, debugging a pipeline that produces unexpected output —
you don't control any of that. The data is what it is.
``df.head()`` hides the problem. It shows you 5 rows and lets you believe
everything is fine. Buckaroo is built for the opposite workflow: show you
everything, especially the parts that are surprising.
The Dastardly DataFrames
------------------------
Each section below shows the exact function from ``buckaroo.ddd_library``
that creates the DataFrame, explains why it's tricky, and renders it live
in a buckaroo static embed.
.. code-block:: bash

    pip install buckaroo

.. code-block:: python

    from buckaroo.ddd_library import *

Infinity and NaN
~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_infinity() -> pd.DataFrame:
        return pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]})

    df_with_infinity()

Three values, three completely different things: a missing value, positive
infinity, and negative infinity. Many viewers display all three as blank or
"NaN". Buckaroo distinguishes them.
This also tests whether summary stats (mean, min, max) handle infinity
correctly — they should, because ``np.inf`` is a valid float, not missing
data.
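The distinction can be checked without any viewer. The sketch below uses only the Python standard library, not buckaroo or pandas, and ``summarize`` is a hypothetical helper introduced for illustration. It treats NaN as missing and keeps both infinities as real values.

```python
import math

values = [math.nan, math.inf, -math.inf]

def summarize(xs):
    """Min, max, and mean over the non-missing values (hypothetical helper)."""
    present = [x for x in xs if not math.isnan(x)]  # NaN is missing; inf is not
    return {
        "count": len(present),
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
    }

stats = summarize(values)
print(stats)
```

Min and max come back as the infinities themselves. The mean is NaN, because adding positive and negative infinity has no defined result.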
.. raw:: html
Really Big Numbers
~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_really_big_number() -> pd.DataFrame:
        return pd.DataFrame({"col1": [9999999999999999999, 1]})

    df_with_really_big_number()

Python integers have arbitrary precision. JavaScript's ``Number`` type has
53 bits of integer precision (``Number.MAX_SAFE_INTEGER`` = 9007199254740991).
The value 9999999999999999999 exceeds this — if you naively convert it to a
JS number, it silently rounds to 10000000000000000000.
Buckaroo detects values above ``MAX_SAFE_INTEGER`` and preserves them as
strings to maintain exact precision. This matters for database primary keys,
blockchain transaction IDs, and any system that uses 64-bit integers.
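The rounding can be reproduced in Python, because a Python ``float`` is the same IEEE 754 double that backs JavaScript's ``Number``. The sketch below illustrates the idea under that assumption. It is not buckaroo's actual serializer, and ``to_js_safe`` is a hypothetical name.

```python
MAX_SAFE_INTEGER = 2**53 - 1  # 9007199254740991, the value of Number.MAX_SAFE_INTEGER

big = 9999999999999999999

# Converting to a double drops the low digits, as a naive JS conversion would.
rounded = int(float(big))

def to_js_safe(value):
    """Return ints outside the safe range as strings (hypothetical helper)."""
    if isinstance(value, int) and abs(value) > MAX_SAFE_INTEGER:
        return str(value)
    return value

print(rounded)               # 10000000000000000000
print(repr(to_js_safe(big)))  # '9999999999999999999'
print(repr(to_js_safe(1)))    # 1
```

Small values pass through unchanged, so only the columns that actually need it pay the cost of string handling.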
.. raw:: html
Column Named "index"
~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_col_named_index() -> pd.DataFrame:
        return pd.DataFrame({
            'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"],
            'index': ["7777", "ooooo", "--- -", "33333", "assdf"]})

    df_with_col_named_index()

When you call ``df.reset_index()`` on a DataFrame with an unnamed index,
pandas adds a column called ``index``. Many widgets break because they
confuse this column with the DataFrame's actual index. Buckaroo handles the
ambiguity by internally renaming columns to ``a, b, c...`` and mapping back
via ``orig_col_name``.
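One way to picture the renaming is the sketch below. Only the ``orig_col_name`` key comes from the description above. The spreadsheet-style sequence (``a`` to ``z``, then ``aa``, ``ab``), the ``col_name`` key, and both function names are assumptions made for illustration, not buckaroo's real internals.

```python
from itertools import count
from string import ascii_lowercase

def short_names():
    """Yield a, b, ..., z, aa, ab, ... (spreadsheet-style, an assumption)."""
    for n in count(1):
        name = ""
        while n > 0:
            n, rem = divmod(n - 1, 26)
            name = ascii_lowercase[rem] + name
        yield name

def rewrite_columns(columns):
    """Pair each original column with a collision-free internal name."""
    names = short_names()
    return [{"col_name": next(names), "orig_col_name": orig} for orig in columns]

config = rewrite_columns(["a", "index"])
print(config)
```

Under this scheme the user's ``index`` column is carried internally as ``b``, so it can no longer be mistaken for the DataFrame's real index, and the display name is recovered from ``orig_col_name``.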
.. raw:: html
Named Index
~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_df_with_named_index() -> pd.DataFrame:
        """someone put the effort into naming the index,
        you'd probably want to display that"""
        return pd.DataFrame(
            {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]},
            index=pd.Index([10, 20, 30, 40, 50], name='foo'))

    get_df_with_named_index()

Someone took the time to name this index ``foo``. That name carries meaning —
it might be a join key, a time series frequency, or a categorical grouping.
Buckaroo displays named indexes as a distinct pinned column so the name is
visible.
.. raw:: html
MultiIndex Columns
~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex_with_names_cols_df(rows=15) -> pd.DataFrame:
        cols = pd.MultiIndex.from_tuples(
            [('foo', 'a'), ('foo', 'b'), ('bar', 'a'),
             ('bar', 'b'), ('bar', 'c')],
            names=['level_a', 'level_b'])
        return pd.DataFrame(
            [["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]] * rows,
            columns=cols)

    get_multiindex_with_names_cols_df(rows=6)

Hierarchical column headers are common after ``.pivot_table()`` and
``.groupby().agg()``. Most viewers either crash or flatten them into ugly
tuple strings like ``('foo', 'a')``. Buckaroo flattens them into readable
headers while preserving the level information.
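A minimal sketch of the flattening idea, using plain tuples in place of a pandas ``MultiIndex``. The separator, the ``flatten_header`` name, and the record layout are assumptions for illustration. Buckaroo's real header format may differ.

```python
def flatten_header(col, sep=" / "):
    """Join the levels of one hierarchical column label into a single string."""
    return sep.join(str(level) for level in col)

columns = [("foo", "a"), ("foo", "b"), ("bar", "a"), ("bar", "b"), ("bar", "c")]
level_names = ["level_a", "level_b"]

headers = [
    {
        "header": flatten_header(col),
        # Keep the per-level values so nothing is lost by flattening.
        "levels": dict(zip(level_names, col)),
    }
    for col in columns
]

print(headers[0])
```

Keeping the level values next to the flattened string means the header stays readable while the hierarchy can still be reconstructed.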
.. raw:: html
MultiIndex on Rows
~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex_index_df() -> pd.DataFrame:
        row_index = pd.MultiIndex.from_tuples([
            ('foo', 'a'), ('foo', 'b'),
            ('bar', 'a'), ('bar', 'b'), ('bar', 'c'),
            ('baz', 'a')])
        return pd.DataFrame({
            'foo_col': [10, 20, 30, 40, 50, 60],
            'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]},
            index=row_index)

    get_multiindex_index_df()

Multi-level row indexes are the counterpart to MultiIndex columns. They
appear after aggregating a ``.groupby()`` on several keys without calling
``.reset_index()``, or when loading data from hierarchical sources. The
tricky part: each index level becomes an additional column that has to be
displayed alongside the data columns without breaking the column count.
This DataFrame also has a ``None`` in the last row of ``bar_col`` — a missing
string value mixed with non-missing strings.
.. raw:: html
Three-Level MultiIndex
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex3_index_df() -> pd.DataFrame:
        row_index = pd.MultiIndex.from_tuples([
            ('foo', 'a', 3), ('foo', 'b', 2),
            ('bar', 'a', 1), ('bar', 'b', 3), ('bar', 'c', 5),
            ('baz', 'a', 6)])
        return pd.DataFrame({
            'foo_col': [10, 20, 30, 40, 50, 60],
            'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]},
            index=row_index)

    get_multiindex3_index_df()

If two levels are hard, three levels are harder. This exercises the
column-renaming logic that has to handle an arbitrary number of index levels
without collision.
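The collision problem can be sketched with plain Python data. The ``index_N`` naming, the underscore-prefix fallback, and the ``flatten_row_index`` function are hypothetical choices for illustration and are not taken from buckaroo's source.

```python
def flatten_row_index(index_tuples, data, prefix="index_"):
    """Turn each index level into its own column, avoiding name collisions."""
    n_levels = len(index_tuples[0])
    level_cols = []
    for i in range(n_levels):
        name = f"{prefix}{i}"
        # Never shadow a data column or an earlier level column.
        while name in data or name in level_cols:
            name = "_" + name
        level_cols.append(name)
    records = []
    for row_num, key in enumerate(index_tuples):
        record = dict(zip(level_cols, key))
        record.update({col: values[row_num] for col, values in data.items()})
        records.append(record)
    return level_cols, records

index = [("foo", "a", 3), ("foo", "b", 2), ("bar", "a", 1)]
# A data column that happens to use the name the first level would want.
data = {"foo_col": [10, 20, 30], "index_0": ["x", "y", "z"]}

level_cols, records = flatten_row_index(index, data)
print(level_cols)
print(records[0])
```

Each record ends up with one key per index level plus one per data column, so the column count stays predictable for any number of levels.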
.. raw:: html
MultiIndex on Both Axes
~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def get_multiindex_with_names_both() -> pd.DataFrame:
        row_index = pd.MultiIndex.from_tuples([
            ('foo', 'a'), ('foo', 'b'),
            ('bar', 'a'), ('bar', 'b'), ('bar', 'c'),
            ('baz', 'a')],
            names=['index_name_1', 'index_name_2'])
        cols = pd.MultiIndex.from_tuples(
            [('foo', 'a'), ('foo', 'b'), ('bar', 'a'),
             ('bar', 'b'), ('bar', 'c'), ('baz', 'a')],
            names=['level_a', 'level_b'])
        return pd.DataFrame([
            [10, 20, 30, 40, 50, 60]] * 6,
            columns=cols, index=row_index)

    get_multiindex_with_names_both()

The boss fight: hierarchical headers on both axes, with named levels on
both sides. This is what ``pd.pivot_table()`` produces on complex groupings.
Everything about column counting, index handling, and header rendering gets
tested simultaneously.
.. raw:: html
Weird Types (Pandas)
~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def df_with_weird_types() -> pd.DataFrame:
        """DataFrame with unusual dtypes that historically broke rendering.
        Exercises: categorical, timedelta, period, interval."""
        return pd.DataFrame({
            'categorical': pd.Categorical(
                ['red', 'green', 'blue', 'red', 'green']),
            'timedelta': pd.to_timedelta(
                ['1 days 02:03:04', '0 days 00:00:01',
                 '365 days', '0 days 00:00:00.001',
                 '0 days 00:00:00.000100']),
            'period': pd.Series(
                pd.period_range('2021-01', periods=5, freq='M')),
            'interval': pd.Series(
                pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3, 4, 5])),
            'int_col': [10, 20, 30, 40, 50],
        })

    df_with_weird_types()

Four types that most viewers ignore:
- **Categorical**: Has a fixed set of allowed values. Not a string.
- **Timedelta**: A duration, not a timestamp. "1 day, 2 hours, 3 minutes,
4 seconds" is a single value.
- **Period**: A span of time ("January 2021"), not a point in time.
- **Interval**: A range like ``(0, 1]``. Common in ``pd.cut()`` output.
Buckaroo detects each type and applies the appropriate formatter. Timedeltas
display as human-readable durations ("1d 2h 3m 4s"), not raw microsecond
counts.
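A duration formatter of this kind can be sketched with the standard library's ``datetime.timedelta``. The ``format_duration`` name and the ``ms``/``us`` suffixes for sub-second parts are assumptions. Only the "1d 2h 3m 4s" shape comes from the description above.

```python
from datetime import timedelta

def format_duration(td):
    """Render a timedelta like '1d 2h 3m 4s' (hypothetical formatter)."""
    total_us = td // timedelta(microseconds=1)  # exact integer microseconds
    sign = "-" if total_us < 0 else ""
    total_us = abs(total_us)
    units = [("d", 86_400_000_000), ("h", 3_600_000_000),
             ("m", 60_000_000), ("s", 1_000_000),
             ("ms", 1_000), ("us", 1)]
    parts = []
    for label, size in units:
        amount, total_us = divmod(total_us, size)
        if amount:  # skip units that contribute nothing
            parts.append(f"{amount}{label}")
    return sign + (" ".join(parts) or "0s")

print(format_duration(timedelta(days=1, hours=2, minutes=3, seconds=4)))
print(format_duration(timedelta(milliseconds=1)))
print(format_duration(timedelta(microseconds=100)))
```

Working in integer microseconds avoids floating-point drift, so short durations such as 100 microseconds format exactly.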
.. raw:: html
Weird Types (Polars)
~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    # from buckaroo/ddd_library.py
    def pl_df_with_weird_types():
        """Polars DataFrame with unusual dtypes that historically broke
        rendering.  Exercises: Duration (#622), Time, Categorical,
        Decimal, Binary."""
        import datetime as dt
        import polars as pl
        return pl.DataFrame({
            'duration': pl.Series([100_000, 3_723_000_000,
                                   86_400_000_000, 500, 60_000_000],
                                  dtype=pl.Duration('us')),
            'time': [dt.time(14, 30), dt.time(9, 15, 30),
                     dt.time(0, 0, 1), dt.time(23, 59, 59),
                     dt.time(12, 0)],
            'categorical': pl.Series(
                ['red', 'green', 'blue', 'red', 'green']
            ).cast(pl.Categorical),
            'decimal': pl.Series(
                ['100.50', '200.75', '0.01', '99999.99', '3.14']
            ).cast(pl.Decimal(10, 2)),
            'binary': [b'hello', b'world', b'\x00\x01\x02',
                       b'test', b'\xff\xfe'],
            'int_col': [10, 20, 30, 40, 50],
        })

    pl_df_with_weird_types()

Polars has its own set of tricky types:
- **Duration**: Microsecond-precision time spans. Rendered completely blank
  until issue #622 was addressed.
- **Time**: Time-of-day without a date component.
- **Decimal**: Fixed-precision decimal (not float). Important for financial data.
- **Binary**: Raw bytes. Displayed as hex strings.
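Two of these behaviors can be illustrated with the standard library alone. The sketch below does not use polars or buckaroo. It only shows why hex is a lossless display for bytes and why decimals differ from floats.

```python
from decimal import Decimal

# Binary: bytes that are not valid text can still be shown losslessly as hex.
raw_values = [b"hello", b"\x00\x01\x02", b"\xff\xfe"]
hex_values = [v.hex() for v in raw_values]
print(hex_values)  # ['68656c6c6f', '000102', 'fffe']

# Decimal: exact base-10 arithmetic, unlike binary floating point.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.10") + Decimal("0.20")
print(float_sum == 0.3)                # False
print(decimal_sum == Decimal("0.30"))  # True
```

The hex form round-trips through ``bytes.fromhex``, so no information is lost in display.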
Buckaroo renders both pandas and polars DataFrames with the same viewer. If
you're migrating from pandas to polars, buckaroo moves with you.
.. raw:: html
What's happening under the hood
--------------------------------
Every table on this page is a **static embedding** of the full buckaroo
widget. There is no Python kernel running. Here's what happened:
1. A Python script invoked ``buckaroo.artifact.to_html()`` on each DataFrame
2. The function serialized the data to base64-encoded Parquet (compact binary)
3. The summary stats (dtype, mean, histogram, etc.) were computed and serialized
4. Everything was embedded in an HTML file as a JSON ``