
# Rate Limits in Production

> Proactive throttling, request queuing, concurrency control, and header-based rate limiting for high-volume Mavera integrations

## When to Use This

You're running a production integration that:

* Makes **many requests per minute** (near or above your tier limit)
* Uses **concurrent workers** (background jobs, async handlers)
* Hits **endpoint-specific limits** (e.g. Mave max 10 concurrent, Focus Groups max 5)
* Needs to **avoid 429s** proactively instead of only reacting with retries

This cookbook covers:

* **Proactive throttling** using `X-RateLimit-*` headers
* **Token bucket** style rate limiting to smooth request rate
* **Concurrency limiting** (semaphores) for endpoint-specific caps
* **Request queuing** when you must process many items without bursting
* **Production checklist** and metrics to monitor

***

## Rate Limit Recap

| Tier         | Requests/minute |
| ------------ | --------------- |
| Starter      | 60              |
| Basic        | 120             |
| Professional | 240             |
| Enterprise   | 600             |

**Endpoint-specific concurrency limits:**

| Endpoint          | Max concurrent |
| ----------------- | -------------- |
| `/mave/chat`      | 10             |
| `/focus-groups`   | 5              |
| `/video-analyses` | 3              |

Every response includes:

| Header                  | Meaning                               |
| ----------------------- | ------------------------------------- |
| `X-RateLimit-Limit`     | Max requests per minute for your tier |
| `X-RateLimit-Remaining` | Requests left in current 60s window   |
| `X-RateLimit-Reset`     | Unix timestamp when window resets     |
| `Retry-After`           | (On 429) seconds to wait before retry |
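
The headers in the table above can be parsed into a small structure before making any throttling decision. This is a minimal sketch: `RateLimitInfo` and `parse_rate_limit_headers` are hypothetical names, and returning `None` for a missing or malformed header is an assumption, not documented API behavior.

```python
import time
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    """Parsed rate limit headers; a field is None when its header is absent."""
    limit: Optional[int]
    remaining: Optional[int]
    reset_at: Optional[float]      # Unix timestamp when the window resets
    retry_after: Optional[float]   # Seconds to wait (only sent on 429)

    def seconds_until_reset(self, now: Optional[float] = None) -> float:
        """Time left in the current window, never negative."""
        if self.reset_at is None:
            return 0.0
        now = time.time() if now is None else now
        return max(0.0, self.reset_at - now)


def _to_number(value, cast):
    """Convert a header value, returning None if it is missing or malformed."""
    try:
        return cast(value)
    except (TypeError, ValueError):
        return None


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitInfo:
    # requests and httpx expose case-insensitive header mappings;
    # a plain dict must use exactly these key spellings.
    return RateLimitInfo(
        limit=_to_number(headers.get("X-RateLimit-Limit"), int),
        remaining=_to_number(headers.get("X-RateLimit-Remaining"), int),
        reset_at=_to_number(headers.get("X-RateLimit-Reset"), float),
        retry_after=_to_number(headers.get("Retry-After"), float),
    )
```

With `requests`, this could be called as `parse_rate_limit_headers(resp.headers)` after each response.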

***

## Pattern 1: Proactive Throttling from Headers

**Don't burst to the limit.** After each request, read `X-RateLimit-Remaining`. If it's low (e.g. \< 10), slow down before you hit 429.

<CodeGroup>
  ```python Python theme={"dark"}
  import time
  import requests

  API_KEY = "mvra_live_your_key_here"
  HEADERS = {"Authorization": f"Bearer {API_KEY}"}
  BASE = "https://app.mavera.io/api/v1"

  # Safety margin: start backing off when remaining drops below this
  LOW_REMAINING_THRESHOLD = 10
  MAX_WAIT = 30  # Never sleep longer than this many seconds

  def request_with_throttle(method: str, path: str, **kwargs) -> requests.Response:
      """Make a request, then sleep if we're approaching the limit."""
      resp = requests.request(method, f"{BASE}{path}", headers=HEADERS, **kwargs)

      # Missing headers fall back to "plenty remaining" and a 60s window
      remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
      reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))

      if remaining < LOW_REMAINING_THRESHOLD:
          # Spread remaining requests over the rest of the window;
          # with 0 remaining, wait for the reset itself
          window_left = max(0, reset - time.time())
          wait = min(window_left / max(1, remaining), MAX_WAIT)
          time.sleep(wait)

      return resp
  ```

  ```javascript JavaScript theme={"dark"}
  const BASE = "https://app.mavera.io/api/v1";
  const LOW_REMAINING_THRESHOLD = 10;

  async function requestWithThrottle(path, options = {}) {
    const resp = await fetch(`${BASE}${path}`, {
      ...options,
      headers: {
        Authorization: `Bearer ${process.env.MAVERA_API_KEY}`,
        ...options.headers,
      },
    });

    const remaining = parseInt(resp.headers.get("X-RateLimit-Remaining") || "999", 10);
    const reset = parseInt(resp.headers.get("X-RateLimit-Reset") || "0", 10);

    if (remaining < LOW_REMAINING_THRESHOLD && reset > 0) {
      // Spread remaining requests over the rest of the window;
      // with 0 remaining, wait for the reset itself (capped at 30s)
      const wait = Math.min(
        Math.max(0, (reset * 1000 - Date.now()) / Math.max(1, remaining)),
        30000
      );
      await new Promise((r) => setTimeout(r, wait));
    }

    return resp;
  }
  ```
</CodeGroup>
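
The wait calculation in this pattern is easier to test when pulled out into a pure function. The sketch below uses a hypothetical helper, `throttle_delay`, with the same threshold and 30-second cap as the examples above. Waiting out the window when `remaining` is 0 is an assumption about the behavior you want.

```python
def throttle_delay(
    remaining: int,
    reset_at: float,
    now: float,
    threshold: int = 10,
    max_wait: float = 30.0,
) -> float:
    """Seconds to pause before the next request.

    remaining: value of X-RateLimit-Remaining
    reset_at:  value of X-RateLimit-Reset (Unix timestamp)
    now:       current Unix time, passed in so the function stays pure
    """
    if remaining >= threshold:
        return 0.0  # Plenty of budget left: no throttling

    window_left = max(0.0, reset_at - now)
    if remaining <= 0:
        # Budget exhausted: wait for the window to reset (up to the cap)
        return min(window_left, max_wait)

    # Spread the remaining requests evenly over the rest of the window
    return min(window_left / remaining, max_wait)


# Worked example: 5 requests left and 20 seconds until reset
# gives 20 / 5 = 4 seconds between requests.
delay = throttle_delay(remaining=5, reset_at=1020.0, now=1000.0)
```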

***

## Pattern 2: Token Bucket (Smooth Request Rate)

A **token bucket** lets you maintain a steady request rate instead of bursting. Tokens refill at a fixed rate, and each request consumes one. When the bucket is empty, the caller waits. The bucket's capacity sets the largest burst it allows.

<CodeGroup>
  ```python Python theme={"dark"}
  import time
  import threading
  from typing import Optional

  class TokenBucket:
      """Thread-safe token bucket for rate limiting."""

      def __init__(self, rate: float, capacity: Optional[int] = None):
          """
          rate: tokens per second (e.g. 2.0 for 120/min)
          capacity: max tokens, i.e. the largest burst allowed
              (defaults to rate * 60, a full minute's worth)
          """
          self.rate = rate
          self.capacity = capacity if capacity is not None else int(rate * 60)
          self.tokens = float(self.capacity)
          self.last_refill = time.monotonic()
          self._lock = threading.Lock()

      def acquire(self, tokens: int = 1) -> float:
          """Reserve tokens without blocking. Returns seconds to wait before proceeding."""
          with self._lock:
              now = time.monotonic()
              elapsed = now - self.last_refill
              self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
              self.last_refill = now

              # Reserve immediately. A negative balance is repaid by future
              # refills, so waiting callers queue up instead of double-spending.
              self.tokens -= tokens
              if self.tokens >= 0:
                  return 0.0
              return -self.tokens / self.rate

      def wait_and_acquire(self, tokens: int = 1):
          """Reserve tokens, then sleep until the reservation is covered."""
          wait = self.acquire(tokens)
          if wait > 0:
              time.sleep(wait)


  # Usage: 2 req/sec ≈ 120/min (Basic tier). A small capacity keeps the rate
  # smooth; the default would allow a 120-request burst at startup.
  bucket = TokenBucket(rate=2.0, capacity=10)

  def chat_with_bucket(messages, persona_id):
      # `client` is your configured Mavera SDK client (setup not shown)
      bucket.wait_and_acquire()
      return client.responses.create(
          model="mavera-1",
          input=messages,
          extra_body={"persona_id": persona_id},
      )
  ```

  ```javascript JavaScript theme={"dark"}
  class TokenBucket {
    constructor(rate, capacity = null) {
      this.rate = rate; // tokens per second
      this.capacity = capacity ?? Math.floor(rate * 60);
      this.tokens = this.capacity;
      this.lastRefill = Date.now() / 1000;
    }

    // Reserve tokens and return how many seconds the caller must wait.
    // Synchronous, so no lock is needed on the single-threaded event loop.
    acquire(tokens = 1) {
      const now = Date.now() / 1000;
      const elapsed = now - this.lastRefill;
      this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.rate);
      this.lastRefill = now;

      // A negative balance is a reservation repaid by future refills,
      // so waiting callers queue up instead of double-spending
      this.tokens -= tokens;
      return this.tokens >= 0 ? 0 : -this.tokens / this.rate;
    }

    async waitAndAcquire(tokens = 1) {
      const wait = this.acquire(tokens);
      if (wait > 0) await new Promise((r) => setTimeout(r, wait * 1000));
    }
  }

  // 2 req/sec ≈ 120/min (Basic tier); a small capacity keeps the rate smooth
  const bucket = new TokenBucket(2.0, 10);

  async function chatWithBucket(messages, personaId) {
    await bucket.waitAndAcquire();
    return client.responses.create({
      model: "mavera-1",
      input: messages,
      persona_id: personaId,
    });
  }
  ```
</CodeGroup>
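
How much a token bucket smooths traffic depends on its capacity. The sketch below uses a hypothetical `SimulatedBucket` driven by an explicit clock, so the outcome is reproducible, and compares a minute-sized capacity with a small one at the same rate of 2 tokens per second. It is an illustration of the general technique, not a model of how the Mavera API counts requests.

```python
class SimulatedBucket:
    """Token bucket with an explicit clock, for reasoning about burst behavior."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # largest burst the bucket allows
        self.tokens = capacity    # start full
        self.last = 0.0

    def reserve(self, now: float) -> float:
        """Reserve one token at time `now`; return the delay before it is usable."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= 1  # may go negative: a reservation against future refill
        return max(0.0, -self.tokens / self.rate)


def send_delays(bucket: SimulatedBucket, n: int) -> list:
    """Delays applied to n requests that are all submitted at t=0."""
    return [bucket.reserve(0.0) for _ in range(n)]


# Capacity of a full minute: all 120 requests go out immediately.
burst = send_delays(SimulatedBucket(rate=2.0, capacity=120), 120)

# Capacity of 2: two requests go out at once, the rest are spaced 0.5s apart.
smooth = send_delays(SimulatedBucket(rate=2.0, capacity=2), 6)
```

With the large capacity, a full minute of budget can be spent in the first instant while refills continue at 2 per second, so a smaller capacity is the safer choice when the goal is to stay under a per-minute limit.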

***

## Pattern 3: Concurrency Limiting (Semaphore)

For endpoints with **max concurrent** limits (Mave: 10, Focus Groups: 5, Video: 3), use a semaphore so you never exceed that many in-flight requests.

<CodeGroup>
  ```python Python theme={"dark"}
  import asyncio
  import os
  from typing import Optional

  import httpx

  API_KEY = os.environ["MAVERA_API_KEY"]

  # Mave allows max 10 concurrent
  MAVE_SEMAPHORE = asyncio.Semaphore(10)

  async def mave_chat_with_concurrency_limit(message: str, thread_id: Optional[str] = None):
      # Waits here once 10 requests are already in flight
      async with MAVE_SEMAPHORE:
          async with httpx.AsyncClient(timeout=120.0) as client:
              payload = {"message": message}
              if thread_id:
                  payload["thread_id"] = thread_id
              resp = await client.post(
                  "https://app.mavera.io/api/v1/mave/chat",
                  headers={"Authorization": f"Bearer {API_KEY}"},
                  json=payload,
              )
              resp.raise_for_status()
              return resp.json()
  ```

  ```javascript JavaScript theme={"dark"}
  // Using p-limit (npm install p-limit); recent versions are ESM-only
  import pLimit from "p-limit";

  const maveLimit = pLimit(10);

  async function maveChatWithLimit(message, threadId = null) {
    return maveLimit(async () => {
      const resp = await fetch("https://app.mavera.io/api/v1/mave/chat", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.MAVERA_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(threadId ? { message, thread_id: threadId } : { message }),
      });
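      // Surface HTTP errors (including 429) instead of parsing an error body
      if (!resp.ok) throw new Error(`Mave chat failed: HTTP ${resp.status}`);
      return resp.json();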
    });
  }
  ```
</CodeGroup>

<Tip>
  Combine a semaphore with a token bucket: the semaphore caps how many requests are in flight, and the token bucket caps how fast new requests start. E.g. 10 concurrent Mave requests, but only 4 new Mave requests per minute across all workers.
</Tip>
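
The combination described in the tip above could look like the following asyncio sketch. `AsyncTokenBucket` and `run_limited` are hypothetical names. Each job is assumed to be a zero-argument callable that returns a coroutine, such as a wrapper around your own API call.

```python
import asyncio
import time


class AsyncTokenBucket:
    """Token bucket for asyncio: limits how fast new requests may start."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # largest burst allowed
        self.tokens = capacity
        self.last = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            self.tokens -= 1  # reserve; a negative balance means "wait"
            delay = max(0.0, -self.tokens / self.rate)
        # Sleep outside the lock so other callers can reserve their slots
        if delay > 0:
            await asyncio.sleep(delay)


async def run_limited(jobs, max_concurrent: int, rate: float, burst: float):
    """Run jobs with both a start-rate cap and an in-flight cap."""
    semaphore = asyncio.Semaphore(max_concurrent)
    bucket = AsyncTokenBucket(rate, burst)

    async def guarded(job):
        await bucket.acquire()   # rate: when may this request start?
        async with semaphore:    # concurrency: is there a free slot?
            return await job()

    # gather returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A call such as `asyncio.run(run_limited(jobs, max_concurrent=10, rate=2.0, burst=5))` would keep at most 10 jobs in flight while starting about 2 new jobs per second after an initial burst of 5.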

***

## Pattern 4: Request Queue for Batch Processing

When you have a list of items to process (e.g. 500 chat requests), push them into a queue and process them at a controlled rate. This prevents spikes and keeps you within your limits.

<CodeGroup>
  ```python Python theme={"dark"}
  import queue
  import threading
  import time

  def process_queue_sync(items, process_one, rate_per_minute=60, max_workers=4):
      """
      Process items through a queue with rate limiting.
      process_one(item) -> result for each item.
      Results are returned in completion order, not input order.
      """
      q = queue.Queue()
      for item in items:
          q.put(item)

      # Each worker paces itself, so stretch the per-worker interval by the
      # worker count to keep the combined rate at rate_per_minute
      interval = 60.0 * max_workers / rate_per_minute
      results = []
      lock = threading.Lock()

      def worker():
          while True:
              try:
                  item = q.get_nowait()
              except queue.Empty:
                  break

              start = time.monotonic()
              try:
                  out = process_one(item)
                  with lock:
                      results.append(out)
              except Exception as e:
                  with lock:
                      results.append({"error": str(e)})

              elapsed = time.monotonic() - start
              sleep_for = max(0, interval - elapsed)
              time.sleep(sleep_for)
              q.task_done()

      threads = [threading.Thread(target=worker) for _ in range(max_workers)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()

      return results
  ```

  ```javascript JavaScript theme={"dark"}
  // Using p-queue (npm install p-queue); recent versions are ESM-only
  import PQueue from "p-queue";

  const queue = new PQueue({
    concurrency: 4,
    interval: 60000, // 1 min
    intervalCap: 60, // 60 per interval
  });

  async function processBatch(items, processOne) {
    const results = await Promise.all(
      items.map((item) => queue.add(() => processOne(item)))
    );
    return results;
  }
  ```
</CodeGroup>
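
An alternative to pacing inside each worker is a shared pacer that hands out evenly spaced start slots, so the combined rate stays the same however many workers pull from the queue. This is a sketch with a hypothetical `Pacer` class. The `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting.

```python
import threading
import time


class Pacer:
    """Hands out evenly spaced start slots shared by all worker threads."""

    def __init__(self, rate_per_minute: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rate_per_minute  # seconds between request starts
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()
        self._lock = threading.Lock()

    def wait_turn(self) -> float:
        """Block until this caller's slot; return how long it waited."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)        # first free slot
            self._next_slot = slot + self.interval  # book the one after it
        delay = slot - now
        # Sleep outside the lock so other workers can book their slots
        if delay > 0:
            self._sleep(delay)
        return delay
```

Each worker would call `pacer.wait_turn()` immediately before `process_one(item)`. With `Pacer(rate_per_minute=120)`, request starts are spaced 0.5 seconds apart in aggregate.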

***

## Pattern 5: Exponential Backoff on 429 (Reactive)

When you **do** get a 429, respect `Retry-After` and use exponential backoff. Add jitter so that many clients don't all retry at the same instant (the thundering herd problem).

<CodeGroup>
  ```python Python theme={"dark"}
  import time
  import random

  import requests

  def backoff_on_429(func, max_retries=5):
      """Retry on 429 with exponential backoff and Retry-After.

      func must raise requests.HTTPError on failure
      (e.g. by calling resp.raise_for_status()).
      """
      for attempt in range(max_retries + 1):
          try:
              return func()
          except requests.HTTPError as e:
              # Re-raise anything that isn't a 429, or a 429 on the last attempt
              if e.response.status_code != 429 or attempt == max_retries:
                  raise

              retry_after = int(e.response.headers.get("Retry-After", 30))
              base = retry_after * (2 ** attempt)
              jitter = random.uniform(0, 5)
              wait = min(base + jitter, 300)  # Cap 5 min
              time.sleep(wait)
  ```
</CodeGroup>
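
The delay calculation can also be separated from the retry loop and made tolerant of unexpected header values. The table earlier documents `Retry-After` as seconds. Handling the HTTP-date form that the HTTP specification also permits is a defensive assumption here, and `parse_retry_after` and `backoff_delay` are hypothetical helper names.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(
    value: Optional[str],
    default: float = 30.0,
    now: Optional[datetime] = None,
) -> float:
    """Return the Retry-After delay in seconds, or `default` if unusable."""
    if value is None:
        return default
    value = value.strip()
    try:
        return max(0.0, float(value))  # Common case: delay in seconds
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # Fallback: HTTP-date
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(
    attempt: int,
    retry_after: float,
    cap: float = 300.0,
    max_jitter: float = 5.0,
    rng=random.uniform,
) -> float:
    """Exponential backoff: retry_after * 2^attempt plus jitter, capped."""
    base = retry_after * (2 ** attempt)
    return min(base + rng(0, max_jitter), cap)
```

For example, a `Retry-After` of 10 seconds on the third attempt (`attempt=2`) gives a base delay of 40 seconds before jitter.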

***

## Production Checklist

<AccordionGroup>
  <Accordion title="Know your tier">
    Confirm your rate limit (60/120/240/600 rpm). Design throttling for 80–90% of that to leave headroom.
  </Accordion>

  <Accordion title="Respect endpoint concurrency">
    Mave (10), Focus Groups (5), Video (3). Use semaphores or equivalent.
  </Accordion>

  <Accordion title="Log rate limit headers">
    Log `X-RateLimit-Remaining` and `X-RateLimit-Reset` periodically. Alert when remaining frequently drops below 5.
  </Accordion>

  <Accordion title="Batch when possible">
    Send one chat request with the full conversation history instead of many single-message calls.
  </Accordion>

  <Accordion title="Test under load">
    Run load tests at 90% of your limit. Verify you get headers and throttle correctly.
  </Accordion>
</AccordionGroup>
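
The headroom advice in the checklist translates into a target rate for a throttle or token bucket. This is a small sketch using the tier limits from the recap table. The `TIER_LIMITS_RPM` mapping and `target_rate_per_second` helper are hypothetical names, and the 85% default is just one value in the suggested 80–90% range.

```python
# Requests per minute for each tier, from the Rate Limit Recap table
TIER_LIMITS_RPM = {
    "starter": 60,
    "basic": 120,
    "professional": 240,
    "enterprise": 600,
}


def target_rate_per_second(tier: str, headroom: float = 0.85) -> float:
    """Requests per second to aim for, leaving headroom under the tier limit."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return TIER_LIMITS_RPM[tier] * headroom / 60.0


# Basic tier at 90% headroom: 120 * 0.9 / 60 = 1.8 requests per second
basic_rate = target_rate_per_second("basic", headroom=0.9)
```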

***

## Metrics to Monitor

| Metric                 | What to track                                                      |
| ---------------------- | ------------------------------------------------------------------ |
| `rate_limit_remaining` | From `X-RateLimit-Remaining`; alert if often \< 5                  |
| `rate_limit_reset`     | From `X-RateLimit-Reset`; for dashboards                           |
| `requests_per_minute`  | Your actual throughput                                             |
| `429_count`            | Number of rate limit errors; should be near 0 with good throttling |
| `retry_count`          | Retries due to 429; indicates throttling may need tuning           |
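
The metrics in the table above can be collected with a small in-process recorder before being exported to whatever monitoring system you use. This is a sketch: `RateLimitMetrics` is a hypothetical class, and the 60-second sliding window for throughput is an assumption.

```python
import time
from collections import Counter, deque
from typing import Mapping


class RateLimitMetrics:
    """Tracks the rate limit metrics listed in the table above."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._timestamps = deque()   # send times within the last minute
        self.counts = Counter()      # 429_count, retry_count
        self.rate_limit_remaining = None
        self.rate_limit_reset = None

    def record(self, status_code: int, headers: Mapping[str, str], is_retry: bool = False) -> None:
        """Call once per response."""
        self._timestamps.append(self._clock())
        if status_code == 429:
            self.counts["429_count"] += 1
        if is_retry:
            self.counts["retry_count"] += 1
        remaining = headers.get("X-RateLimit-Remaining")
        reset = headers.get("X-RateLimit-Reset")
        if remaining is not None:
            self.rate_limit_remaining = int(remaining)
        if reset is not None:
            self.rate_limit_reset = int(reset)

    def requests_per_minute(self) -> int:
        """Number of requests recorded in the last 60 seconds."""
        cutoff = self._clock() - 60.0
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

A wrapper around your HTTP client could call `metrics.record(resp.status_code, resp.headers, is_retry=attempt > 0)` after each response.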

***

## See Also

<CardGroup cols={2}>
  <Card title="Rate Limits Guide" icon="gauge" href="/guides/rate-limits">
    Tiers, headers, and basic handling
  </Card>

  <Card title="Error Handling Patterns" icon="exclamation-triangle" href="/cookbooks/error-handling-patterns">
    Retry logic and backoff
  </Card>

  <Card title="Credits" icon="coins" href="/guides/credits">
    Credit usage (separate from rate limits)
  </Card>

  <Card title="Contact Sales" icon="envelope" href="mailto:sales@mavera.io">
    Higher limits for Enterprise
  </Card>
</CardGroup>
