About

i started in trading. not the kind where you have opinions about the market -- the kind where your system processes 25TB before the bell rings or you lose actual money. sub-millisecond latency. real consequences. that is where i learned what production means. not the conference talk version. the version where something breaks at 2am and the P&L moves in the wrong direction.

i have been a founding engineer, an ML platform builder, and the person who gets paged when production systems need to be rearchitected without downtime. every role had the same job -- make the system reliable enough that nobody has to think about it.

the layers between the model and the user -- inference runtimes, GPU scheduling, observability, cost attribution -- those are the parts i care about. not because they are glamorous. because they are the parts that determine whether an AI product is viable at scale... or just a prototype with good funding.

i am currently a founding engineer at a generative AI startup building the ML platform from zero. training pipelines, inference serving, custom CUDA kernels, multi-region Kubernetes. the whole stack.

Tools

i reach for CUDA before Python when latency is the constraint. the kernel is where the real work happens -- everything above it is just scheduling.

for serving: vLLM for standard inference, custom PagedAttention extensions when the access patterns break vLLM's assumptions. TensorRT-LLM for transformer layers where kernel fusion matters.

Kubernetes for orchestration. not because it's simple -- it isn't -- but because the failure modes are documented and the escape hatches exist.

Rust when i need systems-level control without the undefined-behavior tax. Go for the infrastructure glue. C++ when i'm talking directly to the GPU driver.

Prometheus + OpenTelemetry everywhere. a system you can't measure is a system you can't trust.