How Types and Data Move from Engine to Browser¶
You have a DataFrame in Python. Moments later it’s rendered in a browser — scrollable, formatted, with histograms in the summary row. What happened in between?
This article traces the full path: column renaming, type coercion, Parquet encoding, base64 transport, hyparquet decoding, and finally the displayer/formatter system that turns raw values into what you see on screen.
Column renaming: why everything becomes a, b, c¶
The very first thing buckaroo does when serializing a DataFrame is
rename every column. The original column "revenue" becomes a.
"cost" becomes b. The 27th column becomes aa, then ab,
ac, and so on — base-26 using lowercase ASCII.
# buckaroo/df_util.py
def to_chars(n: int) -> str:
    digits = to_digits(n, 26)
    return "".join(map(lambda x: chr(x + 97), digits))

def old_col_new_col(df):
    return [(orig, to_chars(i)) for i, orig in enumerate(df.columns)]
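to_digits isn't shown in the snippet above. Here is a self-contained sketch of the scheme it implies: bijective base-26, so that index 25 maps to z and index 26 rolls over to aa rather than ba. This to_digits is a stand-in for illustration, not buckaroo's actual helper.

```python
def to_digits(n: int, base: int) -> list[int]:
    # Bijective numeration: 0 -> [0] ("a"), 25 -> [25] ("z"), 26 -> [0, 0] ("aa")
    digits = []
    while True:
        digits.append(n % base)
        n = n // base - 1
        if n < 0:
            break
    return digits[::-1]

def to_chars(n: int) -> str:
    # Map each digit 0..25 onto 'a'..'z'
    return "".join(chr(d + 97) for d in to_digits(n, 26))

to_chars(0)   # 'a'
to_chars(26)  # 'aa'
to_chars(27)  # 'ab'
```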
Why? Three reasons:

1. Column names can be anything. Tuples (from a MultiIndex), integers, strings with spaces and special characters, even a column literally called "index". Parquet column names must be strings, and AG-Grid field names should be simple identifiers. Renaming to a, b, c sidesteps every edge case at once.
2. Collision avoidance. When a DataFrame has a column named "index" and we need to serialize the actual index as a column too, there's a name collision. Renaming to short opaque names means the index columns (index, plus index_a, index_b for MultiIndex levels) never collide with data columns.
3. Smaller payloads. The column name is repeated in every row of the JSON/Parquet output; "a" is smaller than "quarterly_revenue_usd".
The original name is preserved in the column_config that travels
alongside the data. On the JS side, each column’s header_name
(or col_path for MultiIndex) tells AG-Grid what to display in the
header. The user never sees a, b, c — they see the real names.
# In styling_core.py — fix_column_config maps col→header_name
base_cc['col_name'] = col # "a"
base_cc['header_name'] = str(orig_col_name) # "revenue"
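Putting the rename and the header bookkeeping together, a minimal sketch (first 26 columns only, and assuming nothing about buckaroo's internals beyond the col_name/header_name keys shown above):

```python
import string
import pandas as pd

df = pd.DataFrame({"revenue": [100, 200], "cost": [40, 90]})

# Build the opaque-name mapping; buckaroo's to_chars handles >26 columns.
mapping = {orig: string.ascii_lowercase[i] for i, orig in enumerate(df.columns)}
renamed = df.rename(columns=mapping)

# Record the original names so the grid can display them in the header.
column_config = [
    {"col_name": new, "header_name": str(orig)} for orig, new in mapping.items()
]
# list(renamed.columns) -> ['a', 'b']; the headers still show the originals
```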
Cleaning before serialization¶
Python’s type system is richer than what Parquet (or JSON) can express directly. Before writing to Parquet, buckaroo coerces the awkward types:
| Python type | Becomes | Why |
|---|---|---|
| Period | string | Parquet has no period type |
| Interval | string | Parquet has no interval type |
| timedelta64 | string | fastparquet can't encode timedeltas |
| bytes in object columns | hex string | Parquet object columns need strings |
| PyArrow-backed strings | object-dtype strings | fastparquet needs object, not ArrowDtype |
| Timezone-naive datetimes | UTC datetimes | Avoids ambiguous serialization |
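A hedged sketch of what these coercions could look like in pandas; buckaroo's real to_parquet() in serialization_utils.py covers more cases and takes different code paths:

```python
import pandas as pd

def coerce_for_parquet(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative only: stringify period/interval/timedelta columns and
    # localize naive datetimes to UTC, per the table above.
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if isinstance(dtype, (pd.PeriodDtype, pd.IntervalDtype)):
            out[col] = out[col].astype(str)
        elif pd.api.types.is_timedelta64_dtype(dtype):
            out[col] = out[col].astype(str)
        elif pd.api.types.is_datetime64_any_dtype(dtype) and getattr(dtype, "tz", None) is None:
            out[col] = out[col].dt.tz_localize("UTC")
    return out
```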
For the main DataFrame, this happens in to_parquet()
(serialization_utils.py). The function also calls
prepare_df_for_serialization() which does the column rename and
flattens MultiIndex levels into regular columns (index_a,
index_b, etc.).
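The index flattening described above can be sketched as follows. flatten_index is a hypothetical stand-in for part of prepare_df_for_serialization(), using the index/index_a/index_b naming from earlier:

```python
import pandas as pd

def flatten_index(df: pd.DataFrame) -> pd.DataFrame:
    # MultiIndex levels become index_a, index_b, ...; a plain index
    # becomes a single "index" column.
    out = df.reset_index()
    if isinstance(df.index, pd.MultiIndex):
        n = df.index.nlevels
        out.columns = [f"index_{chr(97 + i)}" for i in range(n)] + list(df.columns)
    else:
        out = out.rename(columns={out.columns[0]: "index"})
    return out
```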
Summary stats have an additional wrinkle: each column’s stats dict
contains mixed types (strings like "int64" for dtype, floats for
mean, lists for histogram bins). fastparquet can’t handle mixed-type
columns, so sd_to_parquet_b64() JSON-encodes every cell value first,
making each column a pure string column. The JS side knows to
JSON.parse each cell back.
# Every cell becomes a JSON string before parquet encoding
def _json_encode_cell(val):
    return json.dumps(_make_json_safe(val), default=str)
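_make_json_safe isn't shown above. A plausible self-contained version of the idea, with stand-in helper names rather than buckaroo's actual code, might look like:

```python
import json
import numpy as np

def make_json_safe(val):
    # Hypothetical stand-in: unwrap numpy scalars and arrays into plain
    # Python objects that json.dumps understands.
    if isinstance(val, np.generic):
        return val.item()
    if isinstance(val, np.ndarray):
        return val.tolist()
    return val

def json_encode_cell(val):
    # default=str catches anything else (Timestamps, dtypes, ...)
    return json.dumps(make_json_safe(val), default=str)

json_encode_cell(np.int64(7))   # '7'
json_encode_cell([1.5, 2.5])    # '[1.5, 2.5]'
```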
Parquet encoding and base64 transport¶
buckaroo uses fastparquet with a custom JSON codec to write the
DataFrame to an in-memory Parquet file. Categorical and object columns
get JSON-encoded within the Parquet file (fastparquet’s object_encoding='json').
The raw Parquet bytes are then base64-encoded into an ASCII string:
def to_parquet_b64(df):
    raw_bytes = to_parquet(df)
    return base64.b64encode(raw_bytes).decode('ascii')
The result is a tagged payload:
{"format": "parquet_b64", "data": "UEFSMQ..."}
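The data field is ordinary base64, and a quick sanity check explains the payload's UEFSMQ prefix: every Parquet file begins with the magic bytes PAR1, and base64-encoding those four bytes yields exactly that prefix.

```python
import base64

# "PAR1" is the Parquet file magic; its base64 encoding starts every payload.
base64.b64encode(b"PAR1").decode("ascii")   # 'UEFSMQ=='
```

(The trailing == padding only appears here because we encoded four bytes in isolation; in a real payload the rest of the file follows.)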
This travels over the wire — via Jupyter’s comm protocol, a WebSocket,
or embedded directly in an HTML <script> tag for static embeds. The
format tag lets the JS side know it needs to decode Parquet rather than
expecting raw JSON arrays.
Why Parquet instead of JSON? Parquet is a columnar binary format — it’s typically 5–10x smaller than the equivalent JSON for numeric data, and it preserves type information (int64 vs float64 vs string) that JSON discards.
hyparquet: decoding Parquet in the browser¶
On the JavaScript side, hyparquet is a pure-JS Parquet reader. No WASM, no server — it reads the binary format directly in the browser.
// resolveDFData.ts
const buf = b64ToArrayBuffer(val.data);   // base64 → ArrayBuffer
const metadata = parquetMetadata(buf);    // read parquet footer
parquetRead({
  file: buf,
  metadata,
  rowFormat: 'object',
  onComplete: (data) => {
    result = data.map(parseParquetRow);   // JSON.parse each cell
  },
});
The parseParquetRow step handles two things the raw Parquet decode
doesn’t:
1. JSON-encoded cells (from summary stats): each string cell gets JSON.parse'd back to its real type — numbers, arrays, objects.
2. BigInt safety: hyparquet decodes Parquet INT64 columns as JavaScript BigInt. If the value fits in Number.MAX_SAFE_INTEGER (2^53 - 1), it's converted to a regular Number. Otherwise it's stringified to preserve precision — this is why 9999999999999999999 displays correctly instead of silently rounding.
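The precision cliff is easy to reproduce from Python, since a JS Number is the same IEEE-754 double as a Python float:

```python
# Largest integer a double (JS Number) represents exactly:
MAX_SAFE_INTEGER = 2**53 - 1

big = 9999999999999999999
assert big > MAX_SAFE_INTEGER

# Squeezing it through a double (what a naive Number() cast would do)
# silently rounds it to the nearest representable value:
int(float(big))   # 10000000000000000000
```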
Results are cached (LRU, 8 entries) so switching between summary stats views doesn’t re-decode the same Parquet bytes.
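In Python terms, the cache behaves like an 8-slot functools.lru_cache keyed on the payload string. This is an analogy for the JS-side behavior, not buckaroo's implementation:

```python
import base64
from functools import lru_cache

@lru_cache(maxsize=8)
def decode(b64_data: str) -> bytes:
    # Stand-in for the expensive parquet decode step.
    return base64.b64decode(b64_data)

decode("UEFSMQ==")
decode("UEFSMQ==")   # second call is a cache hit, no re-decode
```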
Displayers and formatters: the last mile¶
At this point we have rows of data (DFData) and a column_config
that describes how each column should look. The column_config for
each column includes a displayer_args object that names a
displayer — this is the bridge between “raw value” and “what the
user sees in the cell.”
The Python side picks the displayer based on summary stats:
# In a StylingAnalysis subclass
@classmethod
def style_column(cls, col, col_meta):
    dtype = col_meta.get('dtype')
    if dtype == 'float64':
        return {'displayer_args': {
            'displayer': 'float',
            'min_fraction_digits': 2,
            'max_fraction_digits': 4}}
    elif dtype == 'timedelta64[ns]':
        return {'displayer_args': {'displayer': 'duration'}}
...
The JS side receives this config and dispatches to the right formatter:
// Displayer.ts — getFormatter() is the dispatcher
switch (fArgs.displayer) {
  case "integer": return getIntegerFormatter(fArgs);
  case "float":   return getFloatFormatter(fArgs);
  case "string":  return getStringFormatter(fArgs);
  case "boolean": return booleanFormatter;
  case "duration": return getDurationFormatter();
  case "obj":     return getObjectFormatter(fArgs);
  ...
}
Each formatter is an AG-Grid ValueFormatterFunc — it receives the
raw cell value and returns the display string. Some highlights:
- Integers get thousands separators via Intl.NumberFormat and right-padding for alignment.
- Floats get configurable decimal places, also via Intl.NumberFormat, with padding to align decimal points across rows.
- Durations parse pandas timedelta strings ("1 days 02:03:04") and render as "1d 2h 3m 4s", with sub-second precision down to microseconds.
- Booleans display as Python-convention True/False, not JS-convention true/false.
- Objects (dicts, lists, None) get a recursive Python-like repr: { 'key': value }, [ 1, 2, 3 ], None.
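The duration rule can be sketched in a few lines of Python. This handles whole seconds only (the real formatter also handles sub-second precision) and is an illustration, not buckaroo's formatter:

```python
import re

def format_duration(td_str: str) -> str:
    # Parse a pandas-style "N days HH:MM:SS" string and drop zero components.
    m = re.match(r"(?:(\d+) days? )?(\d+):(\d+):(\d+)", td_str)
    d, h, mi, s = (int(g or 0) for g in m.groups())
    parts = [(d, "d"), (h, "h"), (mi, "m"), (s, "s")]
    return " ".join(f"{v}{u}" for v, u in parts if v) or "0s"

format_duration("1 days 02:03:04")   # '1d 2h 3m 4s'
format_duration("0 days 00:00:05")   # '5s'
```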
For richer displays, there are cell renderers instead of formatters — these return React components rather than strings. Histograms, charts, links, images, and SVGs all use this path.
// Cell renderers return React components
case "histogram": return HistogramCell;
case "linkify": return LinkCellRenderer;
case "chart": return getChartCell(crArgs);
The full pipeline¶
Putting it all together, here’s the journey of a single cell value —
say, a pd.Timedelta of “1 day, 2 hours, 3 minutes, 4 seconds”:
Python                           Wire                         Browser
──────                           ────                         ───────
pd.Timedelta('1d 2h 3m 4s')
  │
  ▼
rename columns (a, b, c...)
  │
  ▼
coerce to str: "1 days 02:03:04"
  │
  ▼
write to Parquet (fastparquet)
  │
  ▼
base64 encode ──────────────►  {"format": "parquet_b64",
                                "data": "UEFSMQ..."}
                                        │
                                        ▼
                                             b64 → ArrayBuffer
                                                    │
                                                    ▼
                                             hyparquet.parquetRead()
                                                    │
                                                    ▼
                                             parseParquetRow() → "1 days 02:03:04"
                                                    │
                                                    ▼
                                             getDurationFormatter()
                                                    │
                                                    ▼
                                             formatDuration() → "1d 2h 3m 4s"
                                                    │
                                                    ▼
                                             AG-Grid renders: │ 1d 2h 3m 4s │
The column header shows the original name from header_name in the
config. The user sees a human-readable duration in a column with its
real name. Everything in between — the rename, the coercion, the binary
encoding, the BigInt handling — is invisible.
That’s the point. The pipeline exists so that every type, every edge case, every weird DataFrame gets displayed correctly without the user having to think about it.