№ 038 How to make LLMs cheaper without breaking them
Most teams overpay for LLM inference by 10-100×. We benchmarked quantization formats on Llama and Gemma models, deployed W4A16 on GKE, and cut costs to $0.50 per 1M tokens.
Field notes from production — written by the engineers who shipped it.
Fig. 01 — gemini · in production
№ 037 AI agents are in production everywhere. But the framework you pick determines how fast you ship, how much control you have, and whether it survives real workloads. Here's what the numbers say.
№ 036 While benchmarking 6 agent frameworks, Semantic Kernel crashed with a hard ValidationException on AWS Bedrock. Here's exactly what happened, why it happens, and the fix.
A short note that frames how we read the issue. Engineering writing earns its keep when it trades velocity for clarity — when the writer pauses long enough to know what they actually believe.
“How I wired Gemini Enterprise to our internal RMS: five debugging layers, one Cloud Run proxy, and what finally made it work.”
Bhagyashree .S. Bothra Jain, from What the Gemini Enterprise Demo Doesn't Show You