How Types and Data Move from Engine to Browser

You have a DataFrame in Python. Moments later it’s rendered in a browser — scrollable, formatted, with histograms in the summary row. What happened in between?

This article traces the full path: column renaming, type coercion, Parquet encoding, base64 transport, hyparquet decoding, and finally the displayer/formatter system that turns raw values into what you see on screen.

Column renaming: why everything becomes a, b, c

The very first thing buckaroo does when serializing a DataFrame is rename every column. The original column "revenue" becomes a. "cost" becomes b. The 27th column becomes aa, then ab, ac, and so on — base-26 using lowercase ASCII.

# buckaroo/df_util.py
def to_chars(n: int) -> str:
    digits = to_digits(n, 26)
    return "".join(map(lambda x: chr(x + 97), digits))

def old_col_new_col(df):
    return [(orig, to_chars(i)) for i, orig in enumerate(df.columns)]

Why? Three reasons:

  1. Column names can be anything. Tuples (from MultiIndex), integers, strings with spaces and special characters, even a column literally called "index". Parquet column names must be strings. AG-Grid field names should be simple identifiers. Renaming to a, b, c sidesteps every edge case at once.

  2. Collision avoidance. When a DataFrame has a column named "index" and we need to serialize the actual index as a column too, there’s a name collision. Renaming to short opaque names means the index columns (index, index_a, index_b for MultiIndex levels) never collide with data columns.

  3. Smaller payloads. The column name is repeated in every row of the JSON/Parquet output. "a" is smaller than "quarterly_revenue_usd".

The original name is preserved in the column_config that travels alongside the data. On the JS side, each column’s header_name (or col_path for MultiIndex) tells AG-Grid what to display in the header. The user never sees a, b, c — they see the real names.

# In styling_core.py — fix_column_config maps col→header_name
base_cc['col_name'] = col        # "a"
base_cc['header_name'] = str(orig_col_name)  # "revenue"

Cleaning before serialization

Python’s type system is richer than what Parquet (or JSON) can express directly. Before writing to Parquet, buckaroo coerces the awkward types:

Python type

Becomes

Why

pd.Period (e.g. “2021-01”)

str

Parquet has no period type

pd.Interval (e.g. (0, 1])

str

Parquet has no interval type

pd.Timedelta

str (e.g. “1 days 02:03:04”)

fastparquet can’t encode timedeltas

bytes (e.g. from pl.Binary)

hex string (e.g. "68656c6c6f")

Parquet object columns need strings

PyArrow-backed strings

object dtype

fastparquet needs object, not ArrowDtype

Timezone-naive datetimes

UTC datetimes

Avoids ambiguous serialization

For the main DataFrame, this happens in to_parquet() (serialization_utils.py). The function also calls prepare_df_for_serialization() which does the column rename and flattens MultiIndex levels into regular columns (index_a, index_b, etc.).

Summary stats have an additional wrinkle: each column’s stats dict contains mixed types (strings like "int64" for dtype, floats for mean, lists for histogram bins). fastparquet can’t handle mixed-type columns, so sd_to_parquet_b64() JSON-encodes every cell value first, making each column a pure string column. The JS side knows to JSON.parse each cell back.

# Every cell becomes a JSON string before parquet encoding
def _json_encode_cell(val):
    return json.dumps(_make_json_safe(val), default=str)

Parquet encoding and base64 transport

buckaroo uses fastparquet with a custom JSON codec to write the DataFrame to an in-memory Parquet file. Categorical and object columns get JSON-encoded within the Parquet file (fastparquet’s object_encoding='json').

The raw Parquet bytes are then base64-encoded into an ASCII string:

def to_parquet_b64(df):
    raw_bytes = to_parquet(df)
    return base64.b64encode(raw_bytes).decode('ascii')

The result is a tagged payload:

{"format": "parquet_b64", "data": "UEFSMQ..."}

This travels over the wire — via Jupyter’s comm protocol, a WebSocket, or embedded directly in an HTML <script> tag for static embeds. The format tag lets the JS side know it needs to decode Parquet rather than expecting raw JSON arrays.

Why Parquet instead of JSON? Parquet is a columnar binary format — it’s typically 5–10x smaller than the equivalent JSON for numeric data, and it preserves type information (int64 vs float64 vs string) that JSON discards.

hyparquet: decoding Parquet in the browser

On the JavaScript side, hyparquet is a pure-JS Parquet reader. No WASM, no server — it reads the binary format directly in the browser.

// resolveDFData.ts
const buf = b64ToArrayBuffer(val.data);         // base64 → ArrayBuffer
const metadata = parquetMetadata(buf);           // read parquet footer
parquetRead({
    file: buf,
    metadata,
    rowFormat: 'object',
    onComplete: (data) => {
        result = data.map(parseParquetRow);      // JSON.parse each cell
    },
});

The parseParquetRow step handles two things the raw Parquet decode doesn’t:

  1. JSON-encoded cells (from summary stats): each string cell gets JSON.parse’d back to its real type — numbers, arrays, objects.

  2. BigInt safety: hyparquet decodes Parquet INT64 columns as JavaScript BigInt. If the value fits in Number.MAX_SAFE_INTEGER (2^53 - 1), it’s converted to a regular Number. Otherwise it’s stringified to preserve precision — this is why 9999999999999999999 displays correctly instead of silently rounding.

Results are cached (LRU, 8 entries) so switching between summary stats views doesn’t re-decode the same Parquet bytes.

Displayers and formatters: the last mile

At this point we have rows of data (DFData) and a column_config that describes how each column should look. The column_config for each column includes a displayer_args object that names a displayer — this is the bridge between “raw value” and “what the user sees in the cell.”

The Python side picks the displayer based on summary stats:

# In a StylingAnalysis subclass
def style_column(cls, col, col_meta):
    dtype = col_meta.get('dtype')
    if dtype == 'float64':
        return {'displayer_args': {
            'displayer': 'float',
            'min_fraction_digits': 2,
            'max_fraction_digits': 4}}
    elif dtype == 'timedelta64[ns]':
        return {'displayer_args': {'displayer': 'duration'}}
    ...

The JS side receives this config and dispatches to the right formatter:

// Displayer.ts — getFormatter() is the dispatcher
switch (fArgs.displayer) {
    case "integer":  return getIntegerFormatter(fArgs);
    case "float":    return getFloatFormatter(fArgs);
    case "string":   return getStringFormatter(fArgs);
    case "boolean":  return booleanFormatter;
    case "duration": return getDurationFormatter();
    case "obj":      return getObjectFormatter(fArgs);
    ...
}

Each formatter is an AG-Grid ValueFormatterFunc — it receives the raw cell value and returns the display string. Some highlights:

  • Integers get thousands separators via Intl.NumberFormat and right-padding for alignment.

  • Floats get configurable decimal places, also via Intl.NumberFormat, with padding to align decimal points across rows.

  • Durations parse pandas timedelta strings ("1 days 02:03:04") and render as "1d 2h 3m 4s", with sub-second precision down to microseconds.

  • Booleans display as Python-convention True/False, not JS-convention true/false.

  • Objects (dicts, lists, None) get a recursive Python-like repr: { 'key': value }, [ 1, 2, 3 ], None.

For richer displays, there are cell renderers instead of formatters — these return React components rather than strings. Histograms, charts, links, images, and SVGs all use this path.

// Cell renderers return React components
case "histogram": return HistogramCell;
case "linkify":   return LinkCellRenderer;
case "chart":     return getChartCell(crArgs);

The full pipeline

Putting it all together, here’s the journey of a single cell value — say, a pd.Timedelta of “1 day, 2 hours, 3 minutes, 4 seconds”:

Python                          Wire              Browser
──────                          ────              ───────
pd.Timedelta('1d 2h 3m 4s')
    │
    ▼
rename columns (a, b, c...)
    │
    ▼
coerce to str: "1 days 02:03:04"
    │
    ▼
write to Parquet (fastparquet)
    │
    ▼
base64 encode ──────────────► {"format": "parquet_b64",
                               "data": "UEFSMQ..."}
                                    │
                                    ▼
                              b64 → ArrayBuffer
                                    │
                                    ▼
                              hyparquet.parquetRead()
                                    │
                                    ▼
                              parseParquetRow() → "1 days 02:03:04"
                                    │
                                    ▼
                              getDurationFormatter()
                                    │
                                    ▼
                              formatDuration() → "1d 2h 3m 4s"
                                    │
                                    ▼
                              AG-Grid renders: │ 1d 2h 3m 4s │

The column header shows the original name from header_name in the config. The user sees a human-readable duration in a column with its real name. Everything in between — the rename, the coercion, the binary encoding, the BigInt handling — is invisible.

That’s the point. The pipeline exists so that every type, every edge case, every weird DataFrame gets displayed correctly without the user having to think about it.