Research / Build / Publish
Sri Harsha Gouru
I take systems apart until they explain themselves.
I work on local AI systems, runtime tooling, training experiments, and protocol-heavy software.
Most of what I publish starts with a question I can't let go of. Lately that has meant watching models at inference time, tracing browser AI traffic, and testing what Apple Silicon can actually do when you stop treating it like a black box.
The through-line is simple: understand what is really happening, then build from there.
Writing
I built a local monitor for my own browser traffic and used ChatGPT, Claude, Gemini, and Grok normally for 10 days. What fell out was a more concrete picture of how much of the experience is conversation, and how much is telemetry.
I split the public tool out of my private AI traffic research repo and open-sourced the reusable part: a local CDP-based auditor for browser AI products.
After months of experiments on Apple's NPU, the answer isn't "the biggest model you can squeeze in." It's which architectures match the hardware, and which software overheads you have to cut so they stop wasting your time.
Selected work
Real-time visualization of LLM internals: trace token generation, attention patterns, hidden states, and probability distributions as they happen. Built to understand what's actually going on inside these models.
Experiments for OpenAI's 16MB language model challenge. Focused on compact architectures, training dynamics, evaluation, and practical iteration on tiny models under hard artifact constraints.
Benchmarking and experimentation setup for local LLM inference. Covers KV cache behavior, quantization tradeoffs, batching, FlashAttention, speculative decoding, and serving paths.
Write-up on getting training loops running on Apple's Neural Engine and what it took to make the surrounding CPU pipeline fast enough to matter.
Local tooling for auditing browser-based AI traffic through Chrome DevTools Protocol. Captures requests, classifies telemetry, compares streaming protocols, and turns noisy web app behavior into something queryable.