Ilovedevops

Rate Limiting LLM APIs: Beyond the Basics

Why your token bucket implementation keeps failing, and the production patterns that actually prevent throttling errors at scale

Maxine Meurer
Mar 04, 2026

Your LLM application is getting throttled.

Again.

You implemented rate limiting months ago - token bucket algorithm, exponential backoff, the works.

It worked fine in testing.

But in production, you’re seeing 429 errors, users are complaining about slow responses, your retry logic is hammering the API, and you’re burning API credits on failed requests.

Here’s the problem: basic rate limiting strategies assume APIs work like traditional REST endpoints.

They don’t.

LLM APIs have multi-dimensional rate limits, unpredictable latency, and cost structures that punish naive retry logic.

This is the guide I wish existed when I was debugging production LLM throttling at scale.


If you are working with LLMs beyond demos, this exact problem comes up fast.

I go deeper into how token-based limits, function calling, and guardrails behave in real systems in LLMs for Humans.

👉 LLMs For Humans

It is written for engineers who want to understand what is really happening under the hood, not just prompt tips.

And if you are earlier in your DevOps journey or trying to level up into roles where you are actually responsible for systems like this, I put together the DevOps Career Switch Blueprint.

👉 DevOps Career Switch Blueprint

It covers how to think like an operator, how to talk about real-world failures, and how to position that experience when interviewing.


Why Basic Rate Limiting Fails for LLM APIs

Most engineers implement rate limiting like this:

Simple token bucket

  • Track requests per minute

  • Reject requests when limit hit

  • Wait and retry

This works for normal APIs, but it completely breaks for LLM APIs.
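As a concrete baseline, the naive approach looks something like this - a minimal sketch, with illustrative class and method names, of the requests-per-minute token bucket most of us start with:

```python
import time

class NaiveRateLimiter:
    """The 'simple token bucket' baseline: track requests per
    minute, reject when the bucket is empty, let the caller retry."""

    def __init__(self, requests_per_minute: int):
        self.capacity = requests_per_minute
        self.tokens = float(requests_per_minute)
        self.refill_rate = requests_per_minute / 60.0  # tokens per second
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller is expected to wait and retry
```

Note what this tracks: request counts only. Nothing here knows how many tokens each request consumes, which is exactly where it breaks for LLM APIs.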

The Multi-Dimensional Limit Problem

LLM APIs don’t just limit requests per minute; they limit:

  • Requests per minute (RPM): You can make X API calls per minute

  • Tokens per minute (TPM): You can process Y tokens per minute (input + output)

  • Tokens per day (TPD): Daily token quota

  • Concurrent requests: Maximum simultaneous requests

A single large request can burn your entire token budget for the minute.

So while you’re tracking RPM, you’re hitting TPM limits: your rate limiter says you have capacity, but the API returns 429.

Real world example:

  • Limit: 10,000 RPM, 2,000,000 TPM

  • 50 requests/minute (well under RPM)

  • Each request is 50,000 tokens (prompt + response)

  • 50 requests × 50,000 tokens = 2,500,000 tokens

You’ve exceeded TPM while staying under RPM. The API starts throttling, and your rate limiter has no idea why.
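A limiter has to check both budgets before admitting a request. Here is a minimal sketch using a sliding one-minute window; token counts must be estimated up front (prompt length plus a `max_tokens` ceiling), and all names are illustrative:

```python
import time
from collections import deque

class TokenAwareLimiter:
    """Sketch of a limiter that checks BOTH the request budget (RPM)
    and the token budget (TPM) over a sliding one-minute window."""

    def __init__(self, rpm_limit: int, tpm_limit: int):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.events = deque()  # (timestamp, estimated_tokens)

    def _trim(self, now: float):
        # Drop events older than the 60-second window
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def try_acquire(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        self._trim(now)
        if len(self.events) >= self.rpm_limit:
            return False  # RPM exhausted
        used_tokens = sum(t for _, t in self.events)
        if used_tokens + estimated_tokens > self.tpm_limit:
            return False  # TPM exhausted -- the failure RPM-only limiters miss
        self.events.append((now, estimated_tokens))
        return True
```

Run the article’s numbers through it: with a 10,000 RPM / 2,000,000 TPM limit, requests of 50,000 tokens each get cut off at 40 per minute - long before RPM matters.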

The Latency Unpredictability Problem

Traditional APIs respond in milliseconds, while LLM APIs respond in seconds to minutes. This means the rate limiter can’t predict when capacity will free up.

Real world scenario:

  • Send 10 requests

  • 9 complete in 3 seconds

  • 1 takes 45 seconds (large context, complex query)

  • That one request blocks capacity for the entire window

  • New requests queue up

  • Latency cascades across your entire system

The Retry Amplification Problem

Basic retry logic: if a request fails, wait and retry. With LLMs, this creates a disaster:

What happens:

  • You hit the rate limit and get a 429

  • 100 queued requests all retry after 1 second

  • All 100 hit the rate limit again (you’re still over the limit)

  • All 100 retry after 2 seconds with exponential backoff

  • Thundering herd problem

  • API provider may temporarily ban you for abuse

You’ve now turned one throttling event into a cascading failure. Congratulations!


The rest of this article covers:

  • Token-aware rate limiting (tracking both RPM and TPM)

  • Predictive capacity management

  • Smart retry strategies that don’t amplify load

  • Multi-tier priority queues for different request types

  • Cross-service rate limit coordination

  • Real-world implementation patterns with code examples

  • Monitoring and debugging throttling issues

  • Cost optimization through intelligent batching
