So You Want to Write a DataFrame Viewer¶

You want to write a better viewer for tabular data. That’s great: the world needs better interfaces in this space, and there is a lot that can be improved. This article covers some of the biggest design decisions and their potential side effects, along with projects that chose different routes. There are many closed source data table viewers with various levels of capability, and it seems like every new notebook hosting environment feels compelled to build its own dataframe viewer. I will draw on my own experience creating Buckaroo, as well as observations from looking at popular open source table viewers like Perspective, Great Tables, DTale, Marimo, iTables, ipydatagrid, Panel Tabulator, and Streamlit’s st.dataframe.

I have run into each one of these issues while building buckaroo.

Use-case questions¶

Before starting, think about which use case you are trying to solve. Are you trying to build tables for relatively static display (anything from a PDF to a Hugging Face data browser)? Do you want to serve dashboards (a limited set of interactions, with users willing to customize heavily, especially for styling)? Do you want to facilitate interactive use in an IDE-like environment (VSCode notebooks, some internal data bench)? Do you want to work in notebook environments? What size datasets do you expect your users to work with? What performance expectations do your users have? Do you want users to be able to customize the experience? Without writing JS? Do you want to deal with streaming data? Do you want to allow editing of data?

Processing: server-side or browser-based¶

The biggest decision to make when building a table viewer is what to do with the data. Do you want the entire dataset to reside in the browser, or do you want to leave it on the server and page the currently viewed section back and forth to the browser? Both approaches have their place.

Browser-based approaches are much cheaper to serve at scale. Browsers have improved significantly in the past decade, and there are many applications that put over a gigabyte of data into the browser with no ill effects. Furthermore, with HTTP range requests, the full dataset doesn’t even have to be loaded at once. Apache Arrow and Parquet make this approach more performant and attractive. This approach scales with little cost because S3 and Cloudflare are incredibly performant and inexpensive compared to spinning up server infrastructure.
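As a rough illustration of why range requests pair well with Parquet: a Parquet file ends with its metadata, followed by a 4-byte little-endian metadata length and the 4-byte magic `PAR1`. A client can fetch those last 8 bytes first, then only the metadata, then only the column chunks it needs. The sketch below just computes the `Range` headers for the first two steps. The helper names are hypothetical and no request is actually sent.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian metadata length + 4-byte magic


def tail_range_header() -> dict:
    """Header for a suffix range request that fetches only the last 8 bytes."""
    return {"Range": f"bytes=-{TAIL_SIZE}"}


def metadata_range_header(file_size: int, tail: bytes) -> dict:
    """Given the file size and its last 8 bytes, build the Range header for the metadata."""
    if len(tail) != TAIL_SIZE or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of an unencrypted Parquet file")
    (metadata_len,) = struct.unpack("<I", tail[:4])
    end = file_size - TAIL_SIZE - 1   # last metadata byte (HTTP ranges are inclusive)
    start = end - metadata_len + 1
    if start < 0:
        raise ValueError("metadata length exceeds file size")
    return {"Range": f"bytes={start}-{end}"}
```

In a browser the same arithmetic would run in JS against `fetch`, but the byte layout is identical.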

Browser-based approaches fall down with datasets over 1 GB. 1 GB is also about the most memory you want a single page to use, so keep that in mind if you want to display multiple dataframes simultaneously. Finally, browser-based solutions require browser-based analytics engines instead of familiar tools like pandas and polars. Apache Arrow can be packaged into a WebAssembly module, but packaging it into a JS build is tricky.

Server-based solutions are more familiar because they resemble traditional web apps, sometimes with some twists. They excel for very large datasets that are backed by analytics engines. If your 10 GB table is already in a relational database, let the database do the sorting, and only send over the limited rows that are being displayed. Server-based solutions with persistent connections also allow many more tables to be displayed simultaneously while limiting browser memory usage. If you have infrastructure built around analytics pipelines in traditional environments, server-side solutions are often the better way to go. Sorting and histograms in particular can be hard to implement identically in different numerical engines.
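A minimal sketch of the "let the database sort, send only the visible rows" idea, using an in-memory SQLite database as a stand-in for a real warehouse. The table name, columns, and the `fetch_window` helper are all hypothetical.

```python
import sqlite3

# Column names cannot be bound as SQL parameters, so validate them against a whitelist.
ALLOWED_SORT_COLUMNS = {"id", "name", "score"}


def fetch_window(conn, sort_col, descending, offset, limit):
    """Return only the rows currently visible in the viewer, sorted by the database."""
    if sort_col not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_col!r}")
    direction = "DESC" if descending else "ASC"
    # The secondary sort on id keeps paging stable when sort values tie.
    query = (
        f"SELECT id, name, score FROM measurements "
        f"ORDER BY {sort_col} {direction}, id ASC LIMIT ? OFFSET ?"
    )
    return conn.execute(query, (limit, offset)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO measurements (id, name, score) VALUES (?, ?, ?)",
    [(i, f"row-{i}", float(i % 7)) for i in range(1000)],
)

# The browser asks for rows 20..29 of the table sorted by score, descending.
visible_rows = fetch_window(conn, "score", True, offset=20, limit=10)
```

Only ten rows cross the wire no matter how large the table is. The tie-breaking sort column matters in practice: without it, the same row can appear on two pages.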

The downside of a server-based approach is that the server always needs to be running for the table to work. At the small end, this means you can’t simply host an artifact with your table in it: you can’t serve a Jupyter notebook statically in a GitHub repo. If you intend to host an analytics system with your table, you now need server infrastructure to back it. Server infrastructure connected to a relational database or data warehouse is one level of expense; hosting Python-based analytics server-side is even more expensive in terms of memory and CPU.

Serializing data¶

For buckaroo, serializing data to JSON used to be the slowest part of the initial render (better lazy fetching has since fixed that). Serializing dataframes is hard. There are multiple concepts in numerical Python (Arrow, computation) that don’t have direct equivalents in JS or JSON. Notably, infinity and NaN aren’t valid in JSON. Furthermore, datetime handling across JSON requires a processing layer: you will encode either strings or millisecond offsets, and either choice requires metadata that tells the frontend how to interpret the values. Then there are common Python datatypes like timedelta that have no native JS equivalent.
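To make the problem concrete: Python’s `json.dumps(float("nan"))` emits the bare token `NaN` by default, which a strict parser such as the browser’s `JSON.parse` rejects. One way around this, sketched below, is to tag values that JSON cannot represent so the frontend can reconstruct them. The `$type` tagging scheme and the `encode_cell` helper are assumptions made for illustration, not buckaroo’s actual wire format.

```python
import json
import math
from datetime import datetime, timedelta, timezone


def encode_cell(value):
    """Encode one cell as a JSON-safe value, tagging anything JSON cannot represent."""
    if isinstance(value, float) and not math.isfinite(value):
        # NaN, Infinity and -Infinity are not valid JSON tokens.
        return {"$type": "float", "value": repr(value)}  # "nan", "inf" or "-inf"
    if isinstance(value, datetime):
        # Milliseconds since the Unix epoch; pass timezone-aware datetimes.
        return {"$type": "datetime_ms", "value": int(value.timestamp() * 1000)}
    if isinstance(value, timedelta):
        return {"$type": "timedelta_ms", "value": value // timedelta(milliseconds=1)}
    return value


row = [
    1.5,
    float("nan"),
    float("-inf"),
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    timedelta(seconds=90),
]
# allow_nan=False makes json.dumps raise instead of silently emitting invalid JSON.
payload = json.dumps([encode_cell(v) for v in row], allow_nan=False)
```

The cost of this design is that the frontend needs a matching decoding step for every tagged type, which is the metadata layer described above.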

Next we get to the difficulty of serializing pandas data structures. Pandas indexes, which apply to both rows and columns, occur in a variety of formats. Multi-level indexes can be challenging for display: they have to be special-cased in your display code regardless of how they are serialized. Pandas columns can also be named in a variety of ways, including with numbers or strings.
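One defensive pattern, sketched here under my own assumptions rather than taken from any particular viewer, is to never use the original column labels as keys on the frontend. Each column gets a positional string key, and the original label travels alongside as display metadata. With a pandas MultiIndex on the columns, iterating over `df.columns` yields tuples, which is the case the `isinstance` check handles. The `describe_columns` helper is hypothetical.

```python
def describe_columns(labels):
    """Map arbitrary column labels to unique string keys plus display metadata."""
    columns = []
    for position, label in enumerate(labels):
        # Multi-level labels arrive as tuples; treat flat labels as one level.
        levels = label if isinstance(label, tuple) else (label,)
        columns.append({
            "key": f"col_{position}",                           # positional, so always unique
            "header_levels": [str(level) for level in levels],  # what to render in the header
            "label_type": type(label).__name__,                 # tells the int 1 from the str "1"
        })
    return columns


# Labels as they might come from a dataframe: two multi-level columns,
# an integer column name, and a string that collides with it once stringified.
labels = [("price", "open"), ("price", "close"), 1, "1"]
columns = describe_columns(labels)
```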

These different dataframe configurations are challenging because they are hard to completely anticipate. In my experience, a dataframe constructed with an unexpected structure was one of the most likely things to blow up buckaroo with a JS typing error. Exceptions were also thrown from most of the pandas processing code.

Polars is a bit easier in this regard. Polars eschews having an index.

Many of these issues also exist when serializing to a binary format like Feather or Parquet, though they take a slightly different form. With Feather/Parquet, make sure Python objects and lists serialize properly. Also, if you want a single-file static HTML export to work, you will need to base64 encode the binary data. True binary-to-binary transfer requires a network connection.
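A small sketch of the base64 step for a single-file export. The bytes here are a stand-in for real Feather or Parquet output, and the `embed_binary_in_html` helper, the element id, and the `data-encoding` attribute are all hypothetical choices. The base64 alphabet contains no `<` character, so the encoded text cannot prematurely close the `<script>` element.

```python
import base64
import html


def embed_binary_in_html(payload: bytes, element_id: str = "table-data") -> str:
    """Return a non-executing <script> data block that carries binary data as base64 text."""
    encoded = base64.b64encode(payload).decode("ascii")
    safe_id = html.escape(element_id, quote=True)
    return (
        f'<script type="application/octet-stream" data-encoding="base64" id="{safe_id}">'
        f"{encoded}</script>"
    )


payload = b"PAR1" + bytes(range(256))  # stand-in for real Feather/Parquet bytes
tag = embed_binary_in_html(payload)
```

Keep in mind that base64 inflates the payload by roughly a third, which counts against the browser memory budget discussed earlier.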

The table viewer component¶

There are many table components, so many that there is a site dedicated to tracking their popularity. In increasing order of complexity, you have everything from static HTML, to jQuery-based libraries, to modern table grids, to AG-Grid, to extreme custom-coded frontend libraries. HTML-based tables allow simple customizability along with a great story for static export to the widest list of targets. jQuery-based libraries (limited table rows, pagination) are relatively simple to use and limit complexity. Previously they were much easier to package into the Jupyter frontend environment than full JS build chains.

Then there are modern table libraries that aren’t AG-Grid: React-data-grid, angular-grid, tanstack-table, Handsontable. These libraries might be familiar. They have a straightforward licensing story. They also tend to have rough edges and limited adoption, and they tend to be abandoned. I haven’t investigated these packages as much.

Next up is AG-Grid. AG-Grid is the reliable gold standard for tables, under active development for over a decade. AG-Grid has a full commercial company behind it, along with a permissively licensed community edition. As far as I can tell, they haven’t kneecapped the community edition in favor of the commercial edition, and they aim to make the community edition the best free table widget on the market. The tool is extensively documented with working examples. In my experience, the company is completely unresponsive to bug reports from non-paying users. I chose AG-Grid after listening to an interview with their founder on the JS Jabber podcast.

Then there are custom table widgets like Perspective, glide-data-grid, and whatever you cooked up yourself. Perspective has a very impressive table, and I suspect it has better performance than AG-Grid. It is minimally documented and doesn’t have the wide community adoption that generates Stack Overflow guidance. glide-data-grid is an impressive piece of software that renders to canvas. It is solo-maintained by its creator at Glide Apps. Development is active but quiet, and Streamlit is its biggest downstream consumer.

If you are writing your own table, congrats. You will have ultimate control over your user experience. You won’t have to worry about dependencies on isEven or other npm trash. You will have a very complex core piece to maintain. At a minimum I’d recommend thoroughly investigating other widgets to see how they approached problems.

The notebook environment¶

There are many different notebook environments: Jupyter Notebook, Google Colab, VSCode notebooks, classic notebooks (before Notebook 7.0), Marimo, and Jupyter running on WASM (JupyterLite). All have slight differences that become especially significant for frontend code. Styling works differently, and loading JavaScript is a bit different in each. Anywidget was developed to make all of this easier, and it does. Before anywidget, this section would have been much longer.

Even determining which environment you are running in is challenging. This will come up when users file bugs. widget_utils.py is where I keep my code for determining which Jupyter environment I’m running in.
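To give a feel for what such detection looks like, here is a minimal sketch built on commonly used heuristics. It is not the contents of widget_utils.py. Every check is an assumption that can misfire (for example, a user may simply have imported marimo inside Jupyter), and the function name and return labels are hypothetical.

```python
import os
import sys


def guess_notebook_environment(modules=None, environ=None, platform=None):
    """Best-effort guess at the hosting frontend. Every check here is a heuristic."""
    modules = sys.modules if modules is None else modules
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    if platform == "emscripten":
        return "wasm"       # Pyodide-based kernels such as JupyterLite report this platform
    if "google.colab" in modules:
        return "colab"      # the Colab runtime imports this package
    if "marimo" in modules:
        return "marimo"     # assumption: marimo is imported only when running under it
    if "VSCODE_PID" in environ:
        return "vscode"     # assumption: set for processes launched by VS Code
    if "ipykernel" in modules:
        return "jupyter"    # cannot tell JupyterLab, Notebook 7 and classic apart
    return "unknown"


environment = guess_notebook_environment()
```

The arguments exist so the function can be exercised with fake inputs, which helps when reproducing a bug report from an environment you don’t have at hand.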

Conclusion¶

I’m not suggesting that you avoid creating a table for the Jupyter environment. I am suggesting that you understand how broad a task it is, and the ways it could fail.

Comparison of open source DataFrame viewers¶

Name	Server / Browser	JSON / Numeric	Static Export	Jupyter Compatible	Dynamic	Table Viewer	Built on Anywidget?
Buckaroo	Server	Numeric	Yes	Yes	Yes	AG-Grid	Yes
ipydatagrid	Server	JSON	No	Yes	Yes	Lumino DataGrid (canvas)	No
Perspective	Both	Arrow	Yes	Yes	Yes	Custom	No
iTables	Browser	JSON	Yes	Yes	No	datatables (jQuery based)	Optional
Great Tables	Browser	HTML	Yes	Yes	No	HTML	No
DTale	Server	JSON	No	Yes	Yes	react-virtualized	No
Mito	Server	JSON	No	Yes	Yes	Endo (custom)	No
Marimo	Server	JSON	Yes	No	Yes	tanstack-table	No
Panel Tabulator	Both	JSON	Yes	Yes	Yes	Tabulator.js	No
Streamlit	Server	Arrow	No	No	Yes	glide-data-grid (canvas)	No
quak	Server	Arrow	No	Yes	Yes	Custom HTML	Yes
