Work
Four years of building the infrastructure behind AI systems, trading platforms, and ML pipelines.
Founding AI Infrastructure & Systems Engineer
4MINDS · 4minds.ai
Production inference, training pipelines, GPU scheduling across multi-region Kubernetes. Custom CUDA kernels where the off-the-shelf runtimes couldn't hit latency targets.
Python, Kubernetes, PyTorch, Ray, vLLM, TensorRT, TensorRT-LLM, torch.compile, CUDA, Custom CUDA Kernels, TransformerEngine, FlashAttention, Nsight Compute, Nsight Systems, ArgoCD, Helm, Kustomize, Prometheus, OpenTelemetry, Grafana, AWS, Docker, GitOps, CI/CD, GPU Scheduling, Mixed Precision, ONNX Runtime
Machine Learning Engineer
GoodRx
Rearchitected batch systems into real-time streaming. Built an observability platform from scratch and presented it to exec leadership. Optimized SageMaker endpoints until inference costs stopped being a line item anyone questioned.
Apache Airflow, Python, AWS, SageMaker, gRPC, Databricks, Kubernetes, Docker, Helm, Terraform, Prometheus, OpenTelemetry, Distributed Tracing, CI/CD Pipelines, MLflow, Model Serving, ETL Pipelines, SQL, Load Balancing, IAM
ML Engineer, Quantitative Research
Tier-1 Market Making Firm
25TB of market data. Every day. Sub-millisecond latency. I built the tick-level processing system behind $2M+ in annual trading decisions. Cut order execution latency by 78%.
C++, Python, Apache Kafka, Apache Spark, Low-Latency Networking, GPU Profiling, TLS, DNS, Network Optimization, Real-Time Analytics, gRPC, Bash
Data Engineer
VHN
Seven business units with zero interoperability. I wired ML platforms into legacy Teradata and Oracle systems. Cross-system compatibility up 65%. Data quality up 85%.
Python, SQL, Teradata, Oracle, ETL, Data Pipelines, Data Governance, Java
Proprietary Work
Closed source. Patent pending.
WMServe
Production inference for video world models. Custom spatiotemporal PagedAttention. Sub-50ms latency at 10K+ concurrent requests. 99.99% availability. 85%+ GPU utilization. The first system that makes world models fast enough for robotics control loops.
Go, CUDA C++, Python, PagedAttention, FlashAttention, Kubernetes, gRPC, Raft Consensus, OpenTelemetry, GPU Memory Management, Kernel Fusion, Occupancy Optimization, Model Serving Architecture, Quantization (FP16), Nsight Compute
FlowLLM
Custom hypervisor for AI inference. No Linux kernel. No CUDA driver. No Python runtime. Direct GPU control in Rust and Assembly. 95% overhead reduction. 15-70 microsecond stack latency. Boots in 50 microseconds. Linux takes 30 seconds.
Rust, Assembly, CUDA, Bare Metal, GPU Programming, Warp-Level Primitives, GPU Memory Management, Custom CUDA Kernels, Nsight Systems, Profiling
APEX
GPU-native vector database. 3.5M queries per second per GPU. 1.8 microsecond p50 latency. 500K inserts per second. 10x cheaper than cloud vector providers. Built from first principles on tensor cores.
CUDA, Tensor Cores, Rust, NVLink, GPUDirect, Lock-Free Algorithms, GPU FinOps, Kernel Fusion, Occupancy Optimization, Custom CUDA Kernels
SchemaForge
Declarative database infrastructure. No migrations. Bidirectional state convergence with SMT-verified invariants. O(n log n) complexity guarantees. Parallel DDL via dependency graph. Production-tested at FAANG scale.
Rust, SMT Solver, PostgreSQL, Formal Verification, Graph Theory, CI/CD, Distributed Systems
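A rough illustration of "parallel DDL via dependency graph", written in Python rather than SchemaForge's Rust. SchemaForge is closed source, so this is not its implementation: the table names and the `ddl_batches` helper are hypothetical. It only shows the general idea, using the standard-library `graphlib` module: statements whose dependencies are all satisfied can be grouped into a batch and applied concurrently.

```python
from graphlib import TopologicalSorter


def ddl_batches(dependencies):
    """Group objects into batches; everything inside one batch has no
    unmet dependencies and could be applied in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical schema: each key depends on the objects in its set.
deps = {
    "users": set(),
    "products": set(),
    "orders": {"users", "products"},
    "order_items": {"orders", "products"},
}

for batch in ddl_batches(deps):
    print(batch)
```

Here `users` and `products` land in the first batch, `orders` in the second, and `order_items` in the third.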
Open Source
PHANTOM
Multi-agent LLM serving for Apple Silicon's unified memory. Existing systems were designed for discrete GPUs, where weights must be copied over PCIe. On M-series chips, the CPU, GPU, and Neural Engine share one physical pool, so that copy is unnecessary. PHANTOM eliminates it. 10 agents sharing a 50-page document: prefix stored once, not 10 times. DualRadixTree copy-on-write KV cache. MESI coherence formally specified in TLA+. Formally verified scheduler. M0 proven: zero-copy GPU pipeline working end to end.
Rust, Apple Silicon, Metal, Unified Memory, TLA+, Formal Verification, Multi-Agent Systems, KV Cache, Copy-on-Write, Neural Engine
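The "prefix stored once, not 10 times" idea can be sketched with a toy prefix tree. This is a minimal Python illustration, not PHANTOM's DualRadixTree: it uses an uncompressed trie instead of a radix tree, it omits copy-on-write and coherence entirely, and the names `PrefixNode` and `PrefixCache` are hypothetical. Each node stands in for one cached token's KV entry.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode
        self.refcount = 0   # number of sequences passing through this node


class PrefixCache:
    """Toy prefix tree: sequences with a common prefix share its nodes."""

    def __init__(self):
        self.root = PrefixNode()
        self.node_count = 0  # stand-in for the number of stored KV entries

    def insert(self, tokens):
        """Register a sequence; return how many tokens were already cached."""
        node, reused = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = PrefixNode()
                node.children[tok] = child
                self.node_count += 1
            else:
                reused += 1
            child.refcount += 1
            node = child
        return reused


# Ten hypothetical agents share one 1000-token document, then diverge.
document = list(range(1000))
cache = PrefixCache()
reused = [cache.insert(document + [10_000 + agent]) for agent in range(10)]
print(cache.node_count, "entries stored instead of", 10 * 1001)
```

The first agent pays for the whole document; the other nine reuse all 1000 prefix entries and add only their own suffix.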
NEMESIS
Autonomous GPU cluster orchestration. Replaces on-call SRE judgment with a hierarchy of specialized agents that perceive hardware degradation before it becomes failure. Topology-aware scheduling, not just GPU counts. Heals running training jobs without restart using NCCL 2.27 Communicator Shrink. Validated against the Alibaba Cluster Trace dataset. Every benchmark reproducible from a single command.
Rust, Python, NCCL, Kubernetes, GPU Scheduling, Distributed Systems, Multi-Agent Systems, Fault Tolerance
TASFT
Task-Aware Sparse Fine-Tuning. Co-trains LoRA adapters with block-sparse attention gates. 2-5x decode throughput at 70-85% sparsity. 676 tests passing. Cuts inference costs without pretending accuracy doesn't matter.
Python, PyTorch, LoRA/QLoRA, CUDA, FlashAttention-2, Block-Sparse Attention, vLLM, Quantization, Model Compilation, Transformer Architecture Optimization, Mixed Precision, Gradient Checkpointing
KubeBalance
Kubernetes scheduler plugin. Network topology-aware, cost-based, and performance-driven pod placement. The scheduler your cluster should have shipped with.
Go, Kubernetes, Docker, Helm, GPU Scheduling, Cold-Start Optimization, Multi-Region, Ingress, Load Balancing
AirflowLLM
Generate production-ready Airflow DAGs from natural language. 45 tokens/sec on CodeLlama 7B. ~700ms on an M2 Pro. No API calls. No cloud dependency. Your DAGs, your machine.
Python, Apache Airflow, LLMs, Ollama, vLLM, Model Serving
EdgeTrain
Neural network training in the browser. WebGPU compute shaders. No server. No Python. The model trains on your GPU, in your tab.
TypeScript, WebGPU, WGSL
SimTextGuard
AI-generated text detection in C++. Jaccard similarity against known AI responses. Fast enough to run inline on submission.
C++, NLP, Pybind11
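For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below is in Python rather than SimTextGuard's C++, and it assumes lowercase, whitespace-split word sets; the project's actual tokenization (words, shingles, or something else) is not stated here. The helper `max_similarity` is hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets: |A & B| / |A | B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty texts
    return len(set_a & set_b) / len(set_a | set_b)


def max_similarity(submission: str, known_responses: list[str]) -> float:
    """Highest similarity between a submission and any known response."""
    return max((jaccard(submission, r) for r in known_responses), default=0.0)


print(jaccard("the cat sat", "the cat ran"))  # 2 shared words of 4 distinct
```

A detector along these lines would flag a submission when `max_similarity` exceeds a chosen threshold; the threshold value is a tuning decision not given in the description.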
PokerGenius
Poker AI. Monte Carlo tree search, neural hand evaluation, adaptive opponent modeling. Game theory applied to a game most people think is about luck.
Python, Game Theory, Monte Carlo, Neural Networks