Why Buckaroo Uses Depot for CI

Depot sponsors Buckaroo’s CI infrastructure. I ran a controlled benchmark — 21 runs across different scenarios — to understand exactly what that sponsorship buys. The results surprised me.

The problem with GitHub Actions

GitHub Actions is slow, but not in the way I expected. The jobs themselves are fine — the runners are fast enough. The problem is queueing. When you have a 23-job pipeline and GitHub is busy, your jobs don’t start simultaneously. They trickle in one at a time over minutes.

When your CI takes 10 minutes because of queueing, you stop pushing small changes. You batch things up. You skip the test run “just this once.” You merge without waiting for green because you’ve already context-switched to something else. Slow CI makes you write worse code.

What Buckaroo’s CI does

Buckaroo is a DataFrame viewer with a Python backend and TypeScript/React frontend. It deploys to 8 environments — Jupyter, Marimo, JupyterLite (WASM), Marimo WASM, VSCode, Google Colab, static embeds, and a standalone server (used for MCP). I can’t manually test each environment on every code change.

The CI pipeline runs 23 jobs across 2 waves:

Wave 1 (no dependencies): lint, JS build + test, wheel build, Python tests across 4 versions with two dependency strategies (8 matrix jobs), styling screenshots, docs build
Wave 2 (needs the built wheel): 6 Playwright integration suites (Storybook, JupyterLab, Marimo, WASM Marimo, Server, Static Embed), MCP integration, smoke tests, TestPyPI publish

Three months ago this pipeline had 3 jobs.

LLMs changed the equation

LLM coding changed the way I approach devops. Claude made it possible to get my Playwright integration tests to a place where I trust them to run reliably. But LLMs also make testing more important than ever. When Claude makes a change across 5 files, I need to know within a few minutes, not 10, whether it broke something. The tighter the feedback loop, the more ambitious the changes I can attempt.

The benchmark

I ran the same 23-job pipeline on both Depot and GitHub Actions runners: one baseline run on a Sunday night and 21 runs on Monday morning, covering cold cache, warm cache, parallel, and sequential scenarios. All runs used 2-CPU Linux runners.

Reproduction scripts are in the buckaroo repo:

# Critical path for a single run
bash scripts/ci_critical_path.sh <run-id>

# List runs for a PR or branch
bash scripts/ci_list_runs.sh <pr-number-or-branch>

# Full timing data as JSON (pipe to ci_timing_table.py)
bash scripts/ci_all_timings.sh <run-id> [<run-id> ...] \
  | python3 scripts/ci_timing_table.py --labels "Run 1" "Run 2" ...

# Launch paired cold-cache benchmark runs
bash scripts/cold_cache_benchmark.sh
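For readers unfamiliar with the term, here is a minimal sketch of what "critical path" means for a two-wave pipeline. It is not the logic of scripts/ci_critical_path.sh, which isn't reproduced here. The job names and durations are hypothetical, and the sketch ignores queue delay entirely, which is the very thing the benchmark measures.

```python
from functools import lru_cache

# Hypothetical job graph (not real benchmark data):
# name -> (duration in seconds, jobs it needs, whether it blocks the run)
JOBS = {
    "lint": (40, [], True),
    "build-wheel": (70, [], True),
    "python-tests": (120, [], True),
    "playwright-jupyterlab": (150, ["build-wheel"], True),
    "mcp-integration": (60, ["build-wheel"], True),
    "windows-tests": (400, [], False),  # non-blocking, so excluded below
}


@lru_cache(maxsize=None)
def finish_time(name):
    """Earliest finish time of a job, assuming it starts as soon as its needs finish."""
    duration, needs, _blocking = JOBS[name]
    start = max((finish_time(dep) for dep in needs), default=0)
    return start + duration


def critical_path_seconds():
    """Finish time of the slowest blocking job: the time until the run can go green."""
    return max(
        finish_time(name)
        for name, (_duration, _needs, blocking) in JOBS.items()
        if blocking
    )


print(critical_path_seconds())  # 220: build-wheel (70) then playwright-jupyterlab (150)
```

In this toy graph the slow Windows job does not count because it is marked non-blocking, which mirrors how the results below exclude it.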

The results

Critical path time (excluding the non-blocking Windows job):

Scenario	Mean	Std Dev	Min	Max	n
GitHub, Sunday night, 1 PR	3m09s	—	3m09s	3m09s	1
GitHub, Monday, cold, 3 parallel	9m15s	±30s	8m49s	9m49s	3
GitHub, Monday, warm, 3 parallel	8m09s	±158s	5m06s	11m11s	6
GitHub, Monday, warm, sequential	5m19s	±62s	4m25s	6m28s	3
Depot, Monday, cold, 3 parallel	3m53s	±2s	3m50s	3m55s	3
Depot, Monday, warm, 3 parallel	4m08s	±23s	3m38s	4m32s	6

Aggregated across all Monday runs:

Runner	Mean	Std Dev	Min	Max	n
GitHub Actions	7m46s	±143s	4m25s	11m11s	12
Depot	4m03s	±20s	3m38s	4m32s	9

Depot’s standard deviation is ±20 seconds. GitHub’s is ±143 seconds.
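To put those aggregate numbers side by side, here is a small sketch that turns the figures from the table above into ratios. The helper name `to_seconds` is mine; the inputs are copied from the aggregated Monday table.

```python
def to_seconds(text):
    """Convert a duration such as '7m46s' into whole seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


# Values copied from the aggregated Monday table.
github_mean = to_seconds("7m46s")
depot_mean = to_seconds("4m03s")
github_std = 143  # seconds
depot_std = 20    # seconds

mean_ratio = github_mean / depot_mean
spread_ratio = github_std / depot_std

print(f"mean: {mean_ratio:.2f}x, spread: {spread_ratio:.2f}x")
# mean: 1.92x, spread: 7.15x
```

The mean is a bit under 2x, but the spread is about 7x, which is the more striking gap.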

What’s actually happening

Each Depot runner takes a few seconds longer to provision than a GitHub runner that’s already available — there’s a fixed overhead per machine spin-up. That makes individual job durations slightly longer on Depot. But it doesn’t matter because Depot provisions all runners in parallel. GitHub provisions them sequentially from a shared pool, so you wait for each one.

“Wave 1 stagger” is the time between the first and last Wave 1 job starting. It measures how long the platform takes to provision runners for all the parallel jobs:

Depot: 14–35 seconds. All jobs start within half a minute.
GitHub, Monday morning: 90–447 seconds. Jobs trickle in over 1.5–7 minutes as runners become available.

On a Sunday night with one PR, GitHub’s stagger was 1 second, even tighter than Depot’s. The difference only shows up under load on Monday morning.
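The stagger metric is simple enough to sketch. This is an illustration under my own assumptions, not the repo's script: the function name is hypothetical and the timestamps are made up rather than taken from the benchmark.

```python
from datetime import datetime


def wave_stagger_seconds(started_at):
    """Seconds between the first and last job start within one wave."""
    starts = [datetime.fromisoformat(ts) for ts in started_at]
    return (max(starts) - min(starts)).total_seconds()


# Hypothetical start timestamps for three Wave 1 jobs (not real benchmark data).
example_starts = [
    "2026-01-05T09:00:02+00:00",
    "2026-01-05T09:00:16+00:00",
    "2026-01-05T09:00:31+00:00",
]

print(wave_stagger_seconds(example_starts))  # 29.0
```

A stagger near zero means every job got a runner at once; a stagger of several minutes means the last job sat in the queue that long before it even began.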

Cache performance is close. Depot reads caches ~30% faster (2.8s vs 4.1s per step), but GitHub writes caches ~2.6x faster (0.8s vs 2.1s per step on Monday). Cache writes happen in post-job cleanup steps and don’t affect the critical path. Neither difference materially changes the overall timing.
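For anyone checking the cache arithmetic, here is a quick sketch using the per-step timings quoted above. The function names are mine.

```python
def percent_less_time(fast, slow):
    """How much less time the faster side takes, as a percentage of the slower side."""
    return (1 - fast / slow) * 100


def speedup(fast, slow):
    """How many times faster the faster side is."""
    return slow / fast


# Per-step timings quoted in the text, in seconds.
read_saving = percent_less_time(fast=2.8, slow=4.1)  # Depot vs GitHub cache reads
write_speedup = speedup(fast=0.8, slow=2.1)          # GitHub vs Depot cache writes

print(f"reads: {read_saving:.0f}% less time, writes: {write_speedup:.1f}x faster")
# reads: 32% less time, writes: 2.6x faster
```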

What Depot actually gave me

Two things, in order of importance:

Consistent provisioning. Depot started every Wave 1 job within 14–35 seconds in every run. GitHub ranges from instant to 7 minutes depending on load. When you’re pushing 10 times a day and iterating with an LLM, unpredictable queue times kill your flow.
Confidence to invest in CI. Because I knew the infrastructure was solid, I actually spent time making CI better — removing unnecessary setup steps, parallelizing into two waves, tuning the pipeline. When your CI infrastructure feels like a liability, you don’t invest in it — you avoid it.

Before and after

On December 24, 2025 — the day Depot’s CTO responded to my sponsorship request — Buckaroo’s CI had 3 jobs: lint, Python tests, and a wheel build.

Since then I’ve added 20 new jobs:

6 Playwright integration suites — Storybook, JupyterLab, Marimo, WASM Marimo, Server, and Static Embed. These catch real bugs — “it renders in Jupyter but is blank in Marimo” is the kind of thing I don’t want to eyeball on every PR.
Python tests across 4 versions with two dependency strategies (min pinned + max latest) — 8 matrix jobs total
MCP integration tests — verifying the MCP server works against the built wheel
Smoke tests for each optional extras group
Styling screenshot comparisons — before/after captures on every PR
Docs build + link checker
TestPyPI publish on every PR with an install command in the PR comment

The critical path completes in about 4 minutes on Depot. The Windows job runs longer but is non-blocking (continue-on-error: true).

Testing against dependency versions

Depending on pandas, PyArrow, and polars simultaneously is tricky. A new pandas release can change default string dtype behavior. A polars update can change how Duration columns serialize. PyArrow versions affect Parquet compatibility.

Buckaroo runs its tests in two configurations: the regular suite tests against the minimum pinned versions in pyproject.toml, and the “Max Versions” suite tests against the latest releases of every dependency. Both run on Python 3.11 through 3.14. The goal is to catch compatibility issues before users do.

This strategy only works if the test suite is fast enough to run both configurations on every push. On slow CI, you’d run one and hope for the best.
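The matrix described above is just a cross product, which is where the 8 Python test jobs come from. A minimal sketch follows. The strategy labels are hypothetical placeholders, and the real matrix lives in the workflow file rather than in Python.

```python
from itertools import product

PYTHON_VERSIONS = ["3.11", "3.12", "3.13", "3.14"]
# Hypothetical labels for the two dependency strategies.
STRATEGIES = ["min-pinned", "max-latest"]

matrix = [
    {"python": python, "deps": deps}
    for python, deps in product(PYTHON_VERSIONS, STRATEGIES)
]

for job in matrix:
    print(f"python {job['python']} / {job['deps']}")

print(len(matrix))  # 8
```

Each added Python version or strategy multiplies the job count, which is why this approach depends on jobs starting in parallel rather than queueing.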

The scariest part

The scariest part of switching to Depot wasn’t Depot itself — it was that their open source program requires a GitHub organization. Buckaroo lived at paddymul/buckaroo under my personal account. To use Depot I had to create the buckaroo-data organization and transfer the repository there.

I was terrified of losing my GitHub stars. That sounds vain, but stars are the main signal to potential users that a project is real. Losing them would set the project back.

It turns out GitHub’s repository transfer preserves everything — stars, issues, pull requests, forks, watchers. It even sets up URL redirects from the old path. The transfer itself took seconds. But I didn’t know that going in, and I spent more time worrying about it than about any technical aspect of the Depot migration.

If you’re in the same situation: do the transfer. You won’t lose anything.

What I’d tell other open source maintainers

If your CI takes more than 5 minutes and you’ve been meaning to fix it but haven’t, Depot’s open source sponsorship program is worth applying to. The switch is straightforward — change the runs-on label in your workflow YAML, everything else stays the same. If you need to create an organization and transfer your repo, that’s painless too — stars and all metadata carry over.

The real value isn’t raw speed — individual jobs run at about the same pace. It’s that your jobs all start at once instead of queueing. That consistency changes your behavior. You push more often, you test more things, you catch problems earlier. Slow CI is a tax on every decision you make. Removing that tax compounds.