# Rate Limits in Production

> Proactive throttling, request queuing, concurrency control, and header-based rate limiting for high-volume Mavera integrations

## When to Use This

You're running a production integration that:

* Makes **many requests per minute** (near or above your tier limit)
* Uses **concurrent workers** (background jobs, async handlers)
* Hits **endpoint-specific limits** (e.g. Mave max 10 concurrent, Focus Groups max 5)
* Needs to **avoid 429s** proactively instead of only reacting with retries

This cookbook covers:

* **Proactive throttling** using `X-RateLimit-*` headers
* **Token bucket** style rate limiting to smooth request rate
* **Concurrency limiting** (semaphores) for endpoint-specific caps
* **Request queuing** when you must process many items without bursting
* **Production checklist** and metrics to monitor

***

## Rate Limit Recap

| Tier         | Requests/minute |
| ------------ | --------------- |
| Starter      | 60              |
| Basic        | 120             |
| Professional | 240             |
| Enterprise   | 600             |

**Endpoint-specific concurrency limits:**

| Endpoint          | Max concurrent |
| ----------------- | -------------- |
| `/mave/chat`      | 10             |
| `/focus-groups`   | 5              |
| `/video-analyses` | 3              |

Every response includes:

| Header                  | Meaning                               |
| ----------------------- | ------------------------------------- |
| `X-RateLimit-Limit`     | Max requests per minute for your tier |
| `X-RateLimit-Remaining` | Requests left in current 60s window   |
| `X-RateLimit-Reset`     | Unix timestamp when window resets     |
| `Retry-After`           | (On 429) seconds to wait before retry |

***

## Pattern 1: Proactive Throttling from Headers

**Don't burst to the limit.** After each request, read `X-RateLimit-Remaining`. If it's low (e.g. < 10), slow down before you hit a 429.
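The pacing rule itself can be isolated as a pure function, which makes it easy to unit test before wiring it into a request wrapper. The sketch below is illustrative only: `pacing_delay` and its parameters are hypothetical names, and the threshold of 10 and the 30-second cap are example values rather than API requirements.

```python
def pacing_delay(remaining: int, reset_epoch: float, now: float,
                 threshold: int = 10, max_wait: float = 30.0) -> float:
    """Seconds to sleep before the next request.

    Once `remaining` drops below `threshold`, spread the remaining
    requests evenly over the time left in the current window.
    """
    if remaining >= threshold:
        return 0.0  # plenty of budget left; no need to slow down
    seconds_left = max(0.0, reset_epoch - now)
    # Treat remaining == 0 as 1 so the caller waits out the window.
    return min(seconds_left / max(1, remaining), max_wait)


# 4 requests left, 20 seconds until reset: wait 5 seconds between requests.
print(pacing_delay(remaining=4, reset_epoch=1020.0, now=1000.0))  # 5.0
```

Passing `now` explicitly keeps the function deterministic; a real caller would pass `time.time()` and the parsed `X-RateLimit-Reset` value.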
```python
import time

import requests

API_KEY = "mvra_live_your_key_here"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = "https://app.mavera.io/api/v1"

# Safety margin: start backing off when remaining drops below this
LOW_REMAINING_THRESHOLD = 10
MIN_INTERVAL = 0.5  # Minimum seconds between requests while throttling


def request_with_throttle(method: str, path: str, **kwargs) -> requests.Response:
    """Make a request, then sleep if we're approaching the limit."""
    resp = requests.request(method, f"{BASE}{path}", headers=HEADERS, **kwargs)
    remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
    if remaining < LOW_REMAINING_THRESHOLD:
        # Spread the remaining requests over the rest of the window.
        # With remaining == 0 this waits out the window (up to the cap).
        reset = float(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        wait = (reset - time.time()) / max(1, remaining)
        wait = max(wait, MIN_INTERVAL)
        wait = min(wait, 30)  # Cap at 30s
        time.sleep(wait)
    return resp
```

```javascript
const BASE = "https://app.mavera.io/api/v1";
const LOW_REMAINING_THRESHOLD = 10;

async function requestWithThrottle(path, options = {}) {
  const resp = await fetch(`${BASE}${path}`, {
    ...options,
    headers: {
      Authorization: `Bearer ${process.env.MAVERA_API_KEY}`,
      ...options.headers,
    },
  });
  const remaining = parseInt(resp.headers.get("X-RateLimit-Remaining") || "999", 10);
  const reset = parseInt(resp.headers.get("X-RateLimit-Reset") || "0", 10);
  if (remaining < LOW_REMAINING_THRESHOLD && reset > 0) {
    // Spread the remaining requests over the rest of the window, capped at 30s.
    const wait = Math.min(
      Math.max(0, (reset * 1000 - Date.now()) / Math.max(1, remaining)),
      30000
    );
    await new Promise((r) => setTimeout(r, wait));
  }
  return resp;
}
```

***

## Pattern 2: Token Bucket (Smooth Request Rate)

A **token bucket** lets you maintain a steady request rate instead of bursting. Tokens refill at a fixed rate, and each request consumes one. If the bucket is empty, the caller waits until a token is available.
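To size a bucket for your tier, convert the per-minute limit from the tier table into tokens per second and leave some headroom. This is a minimal sketch, assuming a headroom factor of 0.85 (the production checklist suggests designing for 80–90% of the limit); `TIER_RPM` and `bucket_rate` are hypothetical helper names, not part of any Mavera SDK.

```python
# Requests per minute by tier, copied from the Rate Limit Recap table.
TIER_RPM = {"starter": 60, "basic": 120, "professional": 240, "enterprise": 600}


def bucket_rate(tier: str, headroom: float = 0.85) -> float:
    """Tokens per second for a bucket that targets `headroom` of the tier limit."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return TIER_RPM[tier] * headroom / 60.0


for tier in TIER_RPM:
    print(f"{tier}: {bucket_rate(tier):.2f} req/s")
```

On the Basic tier this gives roughly 1.7 requests per second instead of the full 2.0, which leaves room for retries and for other clients sharing the same key.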
```python
import threading
import time
from typing import Optional


class TokenBucket:
    """Thread-safe token bucket for rate limiting."""

    def __init__(self, rate: float, capacity: Optional[int] = None):
        """
        rate: tokens per second (e.g. 2.0 for 120/min)
        capacity: max tokens (defaults to rate * 60, i.e. a one-minute burst)
        """
        self.rate = rate
        self.capacity = capacity if capacity is not None else int(rate * 60)
        self.tokens = float(self.capacity)
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self, tokens: int = 1) -> float:
        """Reserve tokens without blocking. Returns seconds the caller must wait."""
        with self._lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            # Always deduct. A negative balance records tokens already promised
            # to waiting callers, so later callers queue behind them.
            self.tokens -= tokens
            if self.tokens >= 0:
                return 0.0
            return -self.tokens / self.rate

    def wait_and_acquire(self, tokens: int = 1) -> None:
        """Block until tokens are available, then consume."""
        wait = self.acquire(tokens)
        if wait > 0:
            time.sleep(wait)


# Usage: 2 req/sec ≈ 120/min (Basic tier).
# Pass a small capacity (e.g. capacity=5) if you want to limit the initial burst.
bucket = TokenBucket(rate=2.0)


def chat_with_bucket(messages, persona_id):
    # `client` is your already-configured API client (setup not shown here).
    bucket.wait_and_acquire()
    return client.responses.create(
        model="mavera-1",
        input=messages,
        extra_body={"persona_id": persona_id},
    )
```

```javascript
class TokenBucket {
  constructor(rate, capacity = null) {
    this.rate = rate;
    this.capacity = capacity ?? Math.floor(rate * 60);
    this.tokens = this.capacity;
    this.lastRefill = Date.now() / 1000;
  }

  // Reserve tokens without waiting; returns seconds the caller must wait.
  acquire(tokens = 1) {
    const now = Date.now() / 1000;
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.rate);
    this.lastRefill = now;
    // Always deduct. A negative balance queues later callers behind earlier ones.
    this.tokens -= tokens;
    return this.tokens >= 0 ? 0 : -this.tokens / this.rate;
  }

  async waitAndAcquire(tokens = 1) {
    const wait = this.acquire(tokens);
    if (wait > 0) await new Promise((r) => setTimeout(r, wait * 1000));
  }
}

const bucket = new TokenBucket(2.0);

async function chatWithBucket(messages, personaId) {
  // `client` is your already-configured API client (setup not shown here).
  await bucket.waitAndAcquire();
  return client.responses.create({
    model: "mavera-1",
    input: messages,
    persona_id: personaId,
  });
}
```

***

## Pattern 3: Concurrency Limiting (Semaphore)

For endpoints with **max concurrent** limits (Mave: 10, Focus Groups: 5, Video: 3), use a semaphore so you never exceed that many in-flight requests.
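The effect of a semaphore can be checked without calling any API. The sketch below is a stand-alone illustration: `fake_request` is a hypothetical stand-in for a network call, and the limit of 3 mirrors the `/video-analyses` cap. It runs 12 tasks and records the peak number in flight.

```python
import asyncio


async def run_with_limit(n_tasks: int, limit: int) -> int:
    """Run n_tasks fake requests under a semaphore; return peak concurrency."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network latency
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(n_tasks)))
    return peak


peak = asyncio.run(run_with_limit(n_tasks=12, limit=3))
print(f"peak in-flight requests: {peak}")
```

The reported peak never exceeds the semaphore's limit, no matter how many tasks are scheduled at once. A similar counter in production code is a cheap way to confirm that a concurrency cap is actually being respected.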
```python
import asyncio
from typing import Optional

import httpx

API_KEY = "mvra_live_your_key_here"

# Mave allows max 10 concurrent
MAVE_SEMAPHORE = asyncio.Semaphore(10)


async def mave_chat_with_concurrency_limit(message: str, thread_id: Optional[str] = None):
    # The semaphore is held for the full request, so at most 10 run at once.
    async with MAVE_SEMAPHORE:
        async with httpx.AsyncClient(timeout=120.0) as client:
            payload = {"message": message}
            if thread_id:
                payload["thread_id"] = thread_id
            resp = await client.post(
                "https://app.mavera.io/api/v1/mave/chat",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=payload,
            )
            resp.raise_for_status()
            return resp.json()
```

```javascript
// Using p-limit (npm install p-limit). Recent versions are ESM-only, so use import.
import pLimit from "p-limit";

const maveLimit = pLimit(10);

async function maveChatWithLimit(message, threadId = null) {
  return maveLimit(async () => {
    const resp = await fetch("https://app.mavera.io/api/v1/mave/chat", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.MAVERA_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(threadId ? { message, thread_id: threadId } : { message }),
    });
    // Surface 429s and other errors instead of parsing an error body as success.
    if (!resp.ok) throw new Error(`Mave request failed: ${resp.status}`);
    return resp.json();
  });
}
```

Combine the semaphore with a token bucket: the semaphore caps in-flight requests, and the token bucket caps the overall request rate. For example, allow up to 10 concurrent Mave requests while starting new requests no faster than your tier's per-minute limit allows across all workers.

***

## Pattern 4: Request Queue for Batch Processing

When you have a list of items to process (e.g. 500 chat requests), push them into a queue and process them at a controlled rate. This prevents spikes and respects limits.

```python
import queue
import threading
import time


def process_queue_sync(items, process_one, rate_per_minute=60, max_workers=4):
    """
    Process items through a queue with rate limiting.
    process_one(item) -> result for each item.
    Results are appended in completion order, not input order.
    """
    q = queue.Queue()
    for item in items:
        q.put(item)

    # Each worker paces itself so that all workers together stay at
    # rate_per_minute: per-worker interval = 60 * max_workers / rate_per_minute.
    interval = 60.0 * max_workers / rate_per_minute
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                break
            start = time.monotonic()
            try:
                out = process_one(item)
                with lock:
                    results.append(out)
            except Exception as e:
                with lock:
                    results.append({"error": str(e)})
            finally:
                q.task_done()
            elapsed = time.monotonic() - start
            time.sleep(max(0, interval - elapsed))

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

```javascript
// Using p-queue (npm install p-queue). Recent versions are ESM-only, so use import.
import PQueue from "p-queue";

const queue = new PQueue({
  concurrency: 4,
  interval: 60000, // 1 min
  intervalCap: 60, // 60 per interval
});

async function processBatch(items, processOne) {
  const results = await Promise.all(
    items.map((item) => queue.add(() => processOne(item)))
  );
  return results;
}
```

***

## Pattern 5: Exponential Backoff on 429 (Reactive)

When you **do** get a 429, respect `Retry-After` and use exponential backoff. Add jitter to avoid a thundering herd.

```python
import random
import time

import requests


def backoff_on_429(func, max_retries=5):
    """Retry on 429 with exponential backoff and Retry-After.

    func must raise requests.HTTPError on error responses,
    e.g. by calling resp.raise_for_status().
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except requests.HTTPError as e:
            if e.response.status_code != 429 or attempt == max_retries:
                raise
            # Start from the server's hint and double it on each attempt.
            retry_after = int(e.response.headers.get("Retry-After", 30))
            base = retry_after * (2 ** attempt)
            jitter = random.uniform(0, 5)
            wait = min(base + jitter, 300)  # Cap 5 min
            time.sleep(wait)
```

***

## Production Checklist

* Confirm your tier's rate limit (60/120/240/600 rpm). Design throttling for 80–90% of that to leave headroom.
* Enforce the endpoint concurrency limits: Mave (10), Focus Groups (5), Video (3). Use semaphores or an equivalent.
* Log `X-RateLimit-Remaining` and `X-RateLimit-Reset` periodically. Alert when remaining is frequently < 5.
* Batch where possible: send a single chat with the full conversation history instead of many single-message calls.
* Run load tests at 90% of your limit. Verify that you receive the headers and that throttling behaves correctly.

***

## Metrics to Monitor

| Metric                 | What to track                                                      |
| ---------------------- | ------------------------------------------------------------------ |
| `rate_limit_remaining` | From `X-RateLimit-Remaining`; alert if often < 5                   |
| `rate_limit_reset`     | From `X-RateLimit-Reset`; for dashboards                           |
| `requests_per_minute`  | Your actual throughput                                             |
| `429_count`            | Number of rate limit errors; should be near 0 with good throttling |
| `retry_count`          | Retries due to 429; indicates throttling may need tuning           |

***

## See Also

* Tiers, headers, and basic handling
* Retry logic and backoff
* Credit usage (separate from rate limits)
* Higher limits for Enterprise