What's Josh Ambati's tech stack?

On the application side: React, Next.js, TypeScript, Python, Java, Spring Boot, and Node.js. For AI: LLM orchestration, multi-model routing, prompt engineering, and inference optimization. Infrastructure includes AWS, Docker, Kubernetes, PostgreSQL, MongoDB, and Redis.

Is Josh Ambati open to relocation?

He is currently based in Chicago, IL, and open to opportunities there as well as remote roles. For the right opportunity, he is absolutely open to relocation.

What kind of roles is Josh Ambati looking for?

Full Stack Developer, AI Engineer, Software Engineer, MLOps, or Backend Engineer roles where he can build production systems and have real ownership over what he ships.

Inkrant is an AI product company built from scratch by Josh Ambati. The flagship product — Schools by Inkrant — is an AI-native operating system for schools, handling admissions, academics, attendance, payments, and communication with 15+ AI-powered features.

Does Josh Ambati have experience with AI/ML in production?

Yes. He has built and shipped multi-model inference systems, dynamic model routing (LLaMA, GPT-4o, Gemini Flash, Mixtral), memory-augmented generation, voice AI agents, and prompt pipelines — all running in production serving real users.

What sets Josh Ambati apart from other candidates?

He built an entire product company from zero, from database design to deployment and from AI architecture to user-facing features. He doesn't just write code; he builds systems. He thinks like an owner, ships fast, and cares deeply about the end result.


Engineering · 7 min read

Multi-Model Routing: Why Your AI Architecture Probably Needs It

February 20, 2026

Most AI products start the same way: pick a model, write some prompts, ship it.

This works until it doesn't. You hit latency issues on simple queries. Costs balloon as usage scales. Complex reasoning tasks get unreliable answers from smaller models.

The solution isn't a bigger model. It's a smarter architecture.

At Peterson Technology Partners, I built a multi-model inference system that dynamically routes requests based on three factors: latency requirements, cost constraints, and reasoning complexity.

Every incoming request gets classified by complexity. Simple lookups, formatting tasks, and straightforward Q&A go to fast, lightweight models. Multi-step reasoning, analysis, and generation tasks route to larger, more capable models.
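To make the classification step concrete, here is a minimal sketch in Python. It is not the production classifier described in this post: the cue words, the word-count threshold, and the names `Tier`, `REASONING_CUES`, and `classify` are all assumptions chosen for illustration, and a real system would more likely use a learned or model-based classifier.

```python
from enum import Enum


class Tier(Enum):
    LIGHT = "light"  # fast, lightweight model
    HEAVY = "heavy"  # larger, more capable model


# Hypothetical cue phrases that hint at multi-step reasoning, analysis, or generation.
REASONING_CUES = ("analyze", "compare", "explain why", "summarize", "step by step", "write")


def classify(prompt: str, max_light_words: int = 40) -> Tier:
    """Assign a request to a model tier using two crude signals.

    Assumption: cue phrases and prompt length stand in for "complexity".
    """
    text = prompt.lower()
    # Any reasoning cue sends the request to the heavier tier.
    if any(cue in text for cue in REASONING_CUES):
        return Tier.HEAVY
    # Long prompts are treated as complex even without a cue phrase.
    if len(text.split()) > max_light_words:
        return Tier.HEAVY
    # Everything else is treated as a simple lookup or formatting task.
    return Tier.LIGHT
```

A short lookup such as "What time does the office open?" lands in the light tier, while "Compare these two candidates and explain why one is stronger" lands in the heavy tier.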

The routing isn't static. It's a scoring system that weighs the three factors differently based on the use case. For real-time interview automation, latency dominates — we use Gemini Flash. For end-of-day summarization, cost and quality dominate — we use Mixtral.
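A weighted scoring router of this kind might look roughly like the sketch below. Everything numeric here is an assumption: the per-model scores are placeholders (not benchmarks), picked only so the example reproduces the two choices named above, and the names `ModelProfile`, `WEIGHTS`, `score`, and `route` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    speed: float    # 0..1, higher means lower latency
    economy: float  # 0..1, higher means cheaper per request
    quality: float  # 0..1, higher means stronger reasoning


# Placeholder scores for illustration only; measure your own models.
MODELS = [
    ModelProfile("gemini-flash", speed=0.9, economy=0.7, quality=0.6),
    ModelProfile("mixtral", speed=0.5, economy=0.9, quality=0.75),
    ModelProfile("gpt-4o", speed=0.4, economy=0.2, quality=0.95),
]

# Hypothetical weight profiles: each use case weighs the three factors differently.
WEIGHTS = {
    "realtime_interview": {"speed": 0.7, "economy": 0.1, "quality": 0.2},
    "daily_summary": {"speed": 0.05, "economy": 0.5, "quality": 0.45},
}


def score(model: ModelProfile, weights: dict[str, float]) -> float:
    """Weighted sum of the three routing factors for one model."""
    return (
        weights["speed"] * model.speed
        + weights["economy"] * model.economy
        + weights["quality"] * model.quality
    )


def route(use_case: str) -> str:
    """Return the name of the highest-scoring model for a use case."""
    weights = WEIGHTS[use_case]
    return max(MODELS, key=lambda m: score(m, weights)).name
```

With these placeholder numbers, `route("realtime_interview")` selects `gemini-flash` and `route("daily_summary")` selects `mixtral`. Changing a weight profile, rather than hard-coding a model per feature, is what keeps the routing from being static.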

The results speak for themselves: faster response times for simple queries, better quality for complex ones, and lower overall compute costs. Not because we found a magical model, but because we matched the right model to the right task.

If you're building AI products at scale, stop thinking about which model to use. Start thinking about when to use which model.
