Details
- Latitude's AI Dungeon runs large-scale mixture-of-experts (MoE) models on DeepInfra's platform, cutting costs from $0.20 to $0.05 per million tokens, a 75% reduction.
- DeepInfra pairs NVIDIA Blackwell GPUs with NVFP4 quantization and TensorRT-LLM optimizations, enabling cost reductions of up to 20x for MoE models compared with dense equivalents (sketches of both techniques follow this list).
- AI Dungeon generates narrative text and imagery in real time, lets players select from multiple AI continuations, and scales reliably through traffic spikes.
- Moving from the NVIDIA Hopper platform to Blackwell halved DeepInfra's cost to $0.10 per million tokens; NVFP4 then halved it again to $0.05, a 4x improvement overall (worked arithmetic after this list).
- Open-weight MoE models provide flexibility for Latitude to test and deploy optimal configurations for creative storytelling, speed, and efficiency without infrastructure overhead.
- Broader context: DeepInfra is among providers such as Baseten, Fireworks AI, and Together AI using Blackwell to deliver up to 10x token-cost reductions versus Hopper.
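
The economics of MoE hinge on sparse activation: a learned router sends each token through only a few experts, so most parameters sit idle on any given forward pass. Below is a minimal NumPy sketch of top-k routing; all dimensions, the expert count, and the ReLU expert MLP are illustrative choices, not DeepInfra's implementation.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Toy top-k MoE layer: route each token to its top_k experts.

    x:       (tokens, d) activations
    gate_w:  (d, n_experts) router weights
    experts: list of (w_in, w_out) weight pairs, one per expert
    """
    logits = x @ gate_w                                # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # top_k expert indices
    # Softmax over only the selected experts' logits.
    sel = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                        # per-token routing
        for k in range(top_k):
            w_in, w_out = experts[top[t, k]]
            h = np.maximum(x[t] @ w_in, 0.0)           # expert MLP (ReLU)
            out[t] += weights[t, k] * (h @ w_out)
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 64, 8, 4
x = rng.standard_normal((tokens, d))
gate_w = rng.standard_normal((d, n_experts))
experts = [(rng.standard_normal((d, 4 * d)), rng.standard_normal((4 * d, d)))
           for _ in range(n_experts)]

y = moe_layer(x, gate_w, experts)
# Only 2 of 8 experts run per token: roughly a quarter of the expert FLOPs
# of a dense layer with the same total parameter count.
print(y.shape)
```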
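NVFP4 compresses weights to 4-bit floating point (E2M1) with a shared scale per small block of values. The sketch below simulates that blockwise quantization in NumPy to show where the memory and bandwidth savings come from; the 16-element block matches NVFP4's micro-block, but the float scale and round-to-nearest logic are simplifications of the real format.

```python
import numpy as np

# Positive magnitudes representable in FP4 E2M1 (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize(x, block=16):
    """Simulate blockwise FP4 (E2M1) quantization: each block of 16 values
    shares one scale chosen so the block max maps onto the largest FP4
    value. (Real NVFP4 stores the scale in FP8; kept in float here.)
    """
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)        # avoid divide-by-zero
    scaled = np.abs(x) / scale
    # Round each magnitude to the nearest E2M1 grid point.
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
wq = nvfp4_quantize(w).reshape(-1)
print("max abs error:", np.abs(w - wq).max())
# Weights shrink to ~4 bits each (plus one shared scale per 16 values),
# roughly quartering memory and bandwidth versus FP16.
```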
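The 4x figure is simply two compounding 2x steps; a quick check of the arithmetic from the bullets above:

```python
hopper = 0.20                 # $/1M tokens on Hopper (from the article)
blackwell = hopper / 2        # Blackwell hardware move: 2x cheaper -> $0.10
nvfp4 = blackwell / 2         # NVFP4 quantization: 2x cheaper again -> $0.05
print(nvfp4, hopper / nvfp4)  # 0.05, 4.0x total improvement
```
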
Impact
DeepInfra's deployment of Blackwell for Latitude exemplifies how NVIDIA's hardware-software stack is compressing AI inference economics, pressuring rivals such as AWS Inferentia and Google TPUs that lag in MoE-optimized throughput. By cutting costs 4x from Hopper while preserving accuracy via NVFP4 and TensorRT-LLM, it widens access to frontier open-weight models for consumer apps like AI Dungeon, accelerating adoption in gaming and interactive AI, where margins are thin. This aligns with the surge in agentic workloads (now 50% of queries, per OpenRouter data) that favor low-latency MoE architectures over dense models. Competitors such as Fireworks AI and Together AI mirror these gains, signaling a market shift toward pay-as-you-go inference at sub-$0.10 per million tokens and easing GPU bottlenecks. Over the next 12-24 months, expect intensified R&D into hybrid MoE-Mamba designs such as Nemotron 3 Nano, with funding flowing to open ecosystems and to Rubin platform previews promising another 10x efficiency leap.
