
Git was designed for how humans use repos. Agents use repos completely differently. I spent the last few months building something for the second use case.

Ledge is a git server rebuilt for agent workloads. Point a stock git client at it -- no plugins, no protocol changes. Underneath: BLAKE3 content addressing, Raft replication, TLA+ verification, and eager warming that makes cold and warm clones take the same 0.13s. Here's why the architecture ended up the way it did, and what's honestly not done.

June 14, 2026

Let me explain what I actually built and what the design decisions were, because the README is honest about what it does but doesn't explain why the architecture ended up the way it did.

The short version: Ledge is a git server rebuilt for agent workloads. You point a stock git client at it -- no plugins, no protocol changes, git clone http://localhost:3000/ws/<id> works today. Underneath, it's content-addressed with BLAKE3, replicated with Raft, formally verified in TLA+, and designed around the assumption that the clients are agents, not humans.


The problem with git at agent scale.

Git's storage model was designed for a specific usage pattern: one developer, one local repo, periodic commits, maybe one active branch at a time. The server (if there is one) is mostly read-heavy. You push occasionally. You clone occasionally. The write pattern is human-paced.

Agents use repos differently. Hundreds of parallel forks of the same base state. Ephemeral workspaces that exist for the duration of a task and then get discarded. Write-heavy cycles where the agent is committing checkpoints every few minutes. Many tenants (many independent agent instances) sharing the same infrastructure. The pattern is: create workspace, clone base, write fast, push, discard -- at machine pace, not human pace.

Standard git servers -- GitHub, GitLab, Gitea -- are optimized for the human pattern. When you hit them with the machine pattern, the things that are slow get exposed.

The specific thing that's slow: pack computation happens at clone time. When you git clone, the server runs git upload-pack, which computes a delta-compressed packfile from the objects you need, and streams it to you. For a repo with meaningful history, this takes time. For a warm server serving the same popular ref repeatedly, this computation is redundant -- you're computing the same packfile for every clone.

This is the problem Ledge solves first.


Eager warming: move computation to push time, not clone time.

When you push to Ledge, it runs pack computation immediately. The packfile for the pushed tip is precomputed and cached, keyed by the want-set (the set of object hashes the client asks for). When a clone arrives requesting the same tip, the response is the cached pack. No computation at serve time.

The result: cold clone and warm clone are the same latency. 0.13 seconds. The same computation that git runs at clone time runs at push time in Ledge, and the result is stored. The first clone is as fast as the hundredth.

For the agent pattern -- create workspace, clone immediately, run task, discard -- this is the difference between "clone is in the critical path" and "clone is off the critical path." At 0.13 seconds vs 0.31 seconds, the delta looks small. Across hundreds of parallel agent instances cloning simultaneously under time pressure, the aggregate matters.

The upload-pack response is memoized by want-set hash. Same set of objects requested = same response returned from cache. Different want-set = compute and cache the new packfile. The cache is warm for the common case (latest main branch tip) and lazy-computes for the uncommon case.
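
To make the memoization concrete, here is a minimal Python sketch. It is not Ledge's code (Ledge is written in Rust), and every name in it (PackCache, build_pack, warm, get) is hypothetical. The key derivation, SHA-256 over the sorted want list, is also an assumption chosen only so that the order of wants does not matter.

```python
import hashlib
from typing import Callable, Dict, FrozenSet, Iterable


class PackCache:
    """Hypothetical cache of pack bytes, keyed by a hash of the want-set."""

    def __init__(self, build_pack: Callable[[FrozenSet[str]], bytes]) -> None:
        self._build_pack = build_pack      # the expensive step (pack computation)
        self._packs: Dict[str, bytes] = {}
        self.builds = 0                    # how many times the expensive step ran

    @staticmethod
    def key(wants: Iterable[str]) -> str:
        # Sort and dedupe so the key is independent of the order of wants.
        joined = "\n".join(sorted(set(wants))).encode()
        return hashlib.sha256(joined).hexdigest()

    def get(self, wants: Iterable[str]) -> bytes:
        wants = frozenset(wants)
        k = self.key(wants)
        if k not in self._packs:           # uncommon case: compute lazily, then cache
            self.builds += 1
            self._packs[k] = self._build_pack(wants)
        return self._packs[k]

    def warm(self, wants: Iterable[str]) -> None:
        # Called at push time so the first clone already hits the cache.
        self.get(wants)
```

In this sketch a push handler would call warm() with the new tip, so the first clone of that tip is served by get() without running build_pack again.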


Dual namespace: one artifact, two addressing schemes.

This was the hardest design decision and the one I spent the most time on.

Git uses SHA-1 for object addressing. Everything in git -- commits, trees, blobs, tags -- is addressed by SHA-1 of its content. The entire git ecosystem (clients, servers, CI, tooling) assumes SHA-1. You can't just replace it.

But SHA-1 is broken as a content-addressing scheme. Collision attacks are practical. For an infrastructure system that needs to guarantee content integrity -- "this pack contains exactly what it says it contains" -- SHA-1 is the wrong primitive. BLAKE3 is the right one: faster than SHA-1, cryptographically sound, and designed for exactly this use case.

The solution: don't replace SHA-1. Add BLAKE3 on top.

Ledge writes real git v2 packfiles -- git verify-pack accepts them, git unpack-objects accepts them, every git client works against them unchanged. The packfile is stored as-is. A sidecar index maps BLAKE3 object IDs to byte offsets within the pack. One artifact, two address spaces. You can address any object by its git SHA-1 (for compatibility) or by its BLAKE3 hash (for integrity verification). The BLAKE3↔offset bridge index adds about 3% to total on-disk storage. That's the content-addressing tax. I think it's worth it.

The practical consequence: when a client clones, they get a real git packfile. They verify it using SHA-1 (which is what git does). Internally, Ledge's replication and content-verification layer uses BLAKE3. Both consumers get what they need from the same artifact.
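
The "one artifact, two address spaces" idea can be illustrated with a small Python sketch. Several things here are assumptions, not Ledge's format: the store is a flat byte buffer rather than a real packfile (real pack entries are compressed and may be deltas), the index layout is invented, and BLAKE2b from the standard library stands in for BLAKE3, which the standard library does not provide. Only the git blob ID computation follows git's actual rule.

```python
import hashlib
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (byte offset, length) within the shared store


def git_blob_id(data: bytes) -> str:
    # Git hashes a "blob <size>\0" header followed by the content.
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def strong_id(data: bytes) -> str:
    # Stand-in for BLAKE3 (assumption): BLAKE2b with a 32-byte digest.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def build_store(blobs: List[bytes]) -> Tuple[bytes, Dict[str, Span], Dict[str, Span]]:
    """Append blobs to one buffer and index every blob under both IDs."""
    store = bytearray()
    by_sha1: Dict[str, Span] = {}
    by_strong: Dict[str, Span] = {}   # plays the role of the sidecar index
    for data in blobs:
        span = (len(store), len(data))
        store += data
        by_sha1[git_blob_id(data)] = span
        by_strong[strong_id(data)] = span
    return bytes(store), by_sha1, by_strong


def read(store: bytes, index: Dict[str, Span], object_id: str) -> bytes:
    offset, length = index[object_id]
    return store[offset:offset + length]
```

Both indexes point at the same bytes, so a lookup by either ID returns identical content without storing the object twice.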


Workspaces: ephemeral, lease-backed forks.

The agent workflow is: take a base state (main branch at some commit), fork it, work on the fork, possibly push back, discard the fork. This is a specific resource management problem that git servers don't handle well because they were designed for long-lived branches.

Ledge's workspace model: each workspace is a named fork of a base ref, backed by a lease with an expiry. You create a workspace via API, get back a workspace ID, and clone from that workspace's URL. The workspace has its own ref namespace. You can push to it without affecting the base. When the lease expires (or you explicitly delete it), the workspace's refs are removed and the objects only they could reach are reclaimed by mark-and-sweep GC.
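
As a rough illustration of the lifecycle (create, use, expire), here is a Python sketch of a lease-backed workspace registry. It is an assumption-heavy toy, not Ledge's API: WorkspaceRegistry, its methods, and the choice to pass the clock in as a `now` argument are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Workspace:
    base_commit: str
    expires_at: float                      # lease deadline, in seconds
    refs: Dict[str, str] = field(default_factory=dict)


class WorkspaceRegistry:
    """Hypothetical in-memory registry of lease-backed workspaces."""

    def __init__(self) -> None:
        self._workspaces: Dict[str, Workspace] = {}

    def create(self, base_commit: str, ttl_seconds: float, now: float) -> str:
        ws_id = uuid.uuid4().hex
        # The fork gets its own ref namespace, seeded from the base commit.
        refs = {"refs/heads/main": base_commit}
        self._workspaces[ws_id] = Workspace(base_commit, now + ttl_seconds, refs)
        return ws_id

    def _live(self, ws_id: str, now: float) -> Workspace:
        ws = self._workspaces.get(ws_id)
        if ws is None or ws.expires_at <= now:
            raise KeyError(f"workspace {ws_id} is gone or its lease expired")
        return ws

    def push(self, ws_id: str, ref: str, commit: str, now: float) -> None:
        # Only this workspace's refs move; the base is untouched.
        self._live(ws_id, now).refs[ref] = commit

    def refs(self, ws_id: str, now: float) -> Dict[str, str]:
        return dict(self._live(ws_id, now).refs)

    def expire(self, now: float) -> List[str]:
        # Drop every workspace whose lease has run out and report their IDs.
        dead = [i for i, ws in self._workspaces.items() if ws.expires_at <= now]
        for i in dead:
            del self._workspaces[i]
        return dead
```

Passing `now` explicitly keeps the sketch deterministic; a real server would read a clock and would also have to decide how lease renewal works, which this toy leaves out.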

The GC traverses the ref graph, marks all objects reachable from live workspace refs and live base refs, and sweeps unreachable objects. Workspaces that have been deleted or whose leases have expired contribute no reachable objects. Their pack data gets collected.
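
The mark and sweep phases described above can be sketched in a few lines of Python. This is a generic single-process version under stated assumptions: the object graph is a plain dict from object ID to the IDs it references, and the function names are hypothetical. The distributed GC that Ledge specifies in TLA+ is a harder problem than this.

```python
from typing import Dict, Iterable, List, Set


def mark(roots: Iterable[str], edges: Dict[str, List[str]]) -> Set[str]:
    """Return every object reachable from the given root object IDs."""
    live: Set[str] = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in live:
            continue
        live.add(oid)
        stack.extend(edges.get(oid, ()))   # commit -> parents/tree, tree -> blobs
    return live


def sweep(all_objects: Iterable[str], live: Set[str]) -> Set[str]:
    """Return the objects that no live ref can reach."""
    return set(all_objects) - live


# Toy graph: a base commit c0, and a workspace commit c1 built on top of it.
edges = {
    "c0": ["t0"], "t0": ["b0"],
    "c1": ["c0", "t1"], "t1": ["b1"],
}
# While the workspace ref is live, nothing is garbage.
garbage_live_ws = sweep(edges.keys() | {"b0", "b1"}, mark(["c0", "c1"], edges))
# After the lease expires only the base ref remains as a root.
garbage_expired = sweep(edges.keys() | {"b0", "b1"}, mark(["c0"], edges))
```

With the workspace ref gone, the objects only it could reach (c1, t1, b1) are the ones swept, while everything the base ref reaches survives.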

This is the right model for agent scale because it makes the lifecycle explicit: create, use, expire. The server doesn't accumulate branches that nobody is maintaining anymore. The cleanup is automatic.


Raft replication and TLA+ verification.

Ledge replicates the ref store using openraft -- a production Raft implementation in Rust. Refs are the critical state: they're what determines what git fetch and git clone return. The object store (the actual pack data) is content-addressed and append-only, so it doesn't need consensus -- the same content at the same hash is identical anywhere. Only refs need linearizable updates.

The Raft state machine provides linearizable compare-and-swap on refs. You can atomically update a ref from expected_sha to new_sha and fail if someone else moved it first. Leader failover loses no committed data. The cluster handles single-node failures without manual intervention.
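
A compare-and-swap on a ref is small enough to show in full. The Python sketch below covers only the state-machine step, under the assumption that some consensus layer has already put the commands in a single agreed order. It does not use or imitate openraft's API, and RefStateMachine and apply_cas are hypothetical names.

```python
from typing import Dict, Optional


class RefStateMachine:
    """Hypothetical ref table; commands are assumed to arrive in log order."""

    def __init__(self) -> None:
        self.refs: Dict[str, str] = {}

    def apply_cas(self, name: str, expected_sha: Optional[str], new_sha: str) -> bool:
        # expected_sha=None means "the ref must not exist yet".
        if self.refs.get(name) != expected_sha:
            return False                   # someone else moved the ref first
        self.refs[name] = new_sha
        return True


sm = RefStateMachine()
created = sm.apply_cas("refs/heads/main", None, "c1")
# Two writers both saw c1; whichever command is ordered first wins.
first = sm.apply_cas("refs/heads/main", "c1", "c2")
second = sm.apply_cas("refs/heads/main", "c1", "c3")
```

Because every replica applies the same commands in the same order, each one reaches the same verdict: the first update succeeds and the second fails, leaving the ref at c2.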

The formal/ directory in the repo contains the TLA+ specifications. Five things are verified: the ref store's consistency invariants, the cross-shard 2PC protocol (for operations that span multiple ref shards), distributed GC correctness (no live objects collected, all unreachable objects eventually collected), the sharding protocol, and reachability (you can always get from a ref to the objects it points to). TLA+ model checking can't verify every possible execution at production scale -- the state space is too large -- but it can catch structural bugs in the protocol design before they show up as data loss at 3am.

I want to be honest about what the TLA+ verification does and doesn't give you. It gives you confidence that the protocol design is correct -- that the state machine transitions preserve the invariants you care about. It doesn't give you confidence that the Rust implementation matches the spec. That takes fuzzing, chaos testing, and time running under real load. The formal/ specs are a design tool, not a deployment guarantee.


What's not done.

The README is honest about this and I want to be too.

Multi-host is untested on real networks. Every Raft/cluster/chaos test run has been single-host Docker. The Raft implementation is sound in theory. What happens with real network partitions, real clock skew, and real packet loss between nodes is not yet measured. Treat multi-node as experimental.

Incremental fetch doesn't do have-line negotiation. When you git fetch, a standard server negotiates with the client to find the minimal set of objects to transfer -- "I have these commits, you need these commits, here's just the delta." Ledge currently sends the full closure of the wanted tips. The client deduplicates locally, so correctness is fine. But the wire transfer is not incremental. For agents that are frequently fetching from a repo they already have most of, this is inefficient. It's on the roadmap.
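
The gap between "full closure" and a negotiated fetch is easy to see on a toy history. The Python sketch below models commits only (no trees or blobs) and reduces negotiation to a single set difference, which is a simplification of what git's want/have exchange actually does. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def ancestors(tips: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the tips plus everything reachable through parent links."""
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def full_closure_transfer(wants: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    # What the post describes today: send everything the wanted tips reach.
    return ancestors(wants, parents)


def negotiated_transfer(wants: Iterable[str], haves: Iterable[str],
                        parents: Dict[str, List[str]]) -> Set[str]:
    # Simplified have-line negotiation: skip what the client already has.
    return ancestors(wants, parents) - ancestors(haves, parents)


# Linear history c1 <- c2 <- c3 <- c4; the client already has c3 and wants c4.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c3"]}
sent_today = full_closure_transfer(["c4"], parents)
sent_negotiated = negotiated_transfer(["c4"], ["c3"], parents)
```

On this history the full closure sends four commits where a negotiated fetch would send one, which is the inefficiency for agents that fetch often from a repo they mostly have.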

No SSH transport, no LFS, no shallow clone. HTTP-only for now. LFS requires a separate object store protocol that's orthogonal to the git wire protocol. Shallow/partial/sparse clone involves complex negotiation that isn't implemented. These are real limitations for some use cases.

No external security audit. The tenant isolation has documented sharp edges in SECURITY.md. I'm not claiming this is safe to expose to untrusted multi-tenant workloads yet.


I built this because the agent infrastructure problem is real and the storage layer for it doesn't exist yet in a well-designed form. Git is everywhere. Every agent that touches code is already using git mental models. The right answer is not "build a new storage system that doesn't speak git." It's "speak git on the surface and rebuild what's underneath for the workload that matters."

The 0.13 second clone is what agent-scale storage should feel like. The eager warming, the want-set memoization, the workspace lifecycle -- these are the decisions that fall out of designing for machines instead of humans.

267 commits. Rust, TLA+, Cap'n Proto, openraft. Weeks old.

the repo is at github.com/v-code01/ledge.

the core works. the edges are honest.

if you're building agent infrastructure and thinking about how your agents checkpoint and share state across forks, the workspace model is worth reading. it's in docs/. the lease-backed ephemeral fork primitive is the piece i haven't seen described cleanly elsewhere and i think it's the right abstraction for the pattern.


P.S. The dual-namespace decision -- one packfile, two address schemes, BLAKE3 sidecar index -- is the thing I'd do differently if I were starting over. Not the decision itself, which I still think is right, but the implementation: the sidecar index format is custom and not yet standardized, which means it's not interoperable with anything else that might want to address git objects by BLAKE3. There's an emerging discussion in the git community about the SHA-256 transition (git already has experimental SHA-256 support), and BLAKE3 isn't part of that conversation yet. If I were building this today I'd either commit to SHA-256 compatibility (which has a real migration path) or make the sidecar format extensible enough to support both. The current format is correct but isolated. That's the technical debt I'm most aware of.
