
LLM-Orc-Station

Multi-provider LLM orchestrator built for a school-grade question answering app. Routes student queries to the cheapest model that can handle them, rotates API keys automatically, and simulates 1,000 concurrent users in your terminal.

Status
Completed

Technology Stack

TypeScript
Bun
Express
Google Gemini



Architecture

```
Student prompt
      │
      ▼
classifier.ts     ← classifies prompt complexity: simple / medium / complex
      │
      ▼
router.ts         ← picks model based on policy + complexity
      │                cost:     simple→mock, medium→flash, complex→pro
      │                latency:  always pick fastest
      │                fallback: pick by health score
      ▼
keymngr.ts        ← picks best available key (round-robin, skips open breakers)
      │
      ▼
dispatcher.ts     ← calls Gemini REST API (or mock if no real key)
      │
      ▼
metrics.ts        ← logs: timestamp, userId, model, keyId, latencyMs, ok
      │
      ▼
orchestrator.ts   ← retries on failure (up to 2 more attempts with fallback policy)
```

File Map

| File | Responsibility |
|------|----------------|
| types.ts | All TypeScript interfaces (ApiKey, Model, LogEntry, etc.) |
| classifier.ts | Complexity scoring: simple / medium / complex |
| registry.ts | Model catalog, key storage, and key selection |
| keymngr.ts | Key rotation lifecycle and circuit breaker logic |
| router.ts | Routing policies: cost, latency, fallback |
| dispatcher.ts | Actual HTTP call to Gemini (or mock response) |
| orchestrator.ts | Single-query flow with retry logic |
| metrics.ts | In-memory store: P95, per-model stats, time buckets |
| simulator.ts | 1000-user CLI simulator with concurrency pool |
| server.ts | Express HTTP API |
| index.ts | Entry point: server mode or simulate mode |

Quick Start

Prerequisites

```shell
bun install
```

Run the simulator (no API key needed — uses mock responses)

```shell
# Default settings
bun run src/index.ts simulate

# Or with custom settings:
bun run src/index.ts simulate 1000 50 cost
#                             │    │  └─ policy
#                             │    └─ concurrency
#                             └─ total users
```

Run the HTTP server

```shell
bun run src/index.ts
```

Then in another terminal:

```shell
# Single query
curl -X POST http://localhost:3000/query \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 7 times 8?", "userId": "u1", "persona": "grade-school"}'

# View metrics
curl http://localhost:3000/stats

# View key state
curl http://localhost:3000/keys

# Manually rotate keys for flash model
curl -X POST http://localhost:3000/rotate/flash

# Start 1000-user simulation via API
curl -X POST http://localhost:3000/simulate \
  -H "Content-Type: application/json" \
  -d '{"users": 1000, "policy": "cost", "concurrency": 50}'
```

Use a real Gemini API key (optional)

```shell
# .env file
GEMINI_API_KEY=AIzaSy...
```

Note: Without a real key, the dispatcher automatically returns varied mock responses. Routing, rotation, the circuit breaker, and metrics all still work correctly.


How Each Part Works

1. Classifier (classifier.ts)

Analyses the prompt text using heuristics (no LLM call needed — that would be recursive!):

| Signal | Result |
|--------|--------|
| Pure arithmetic, ≤ 8 words | simple |
| Over 120 words | complex |
| Academic verbs: "analyse", "compare", "evaluate" | complex |
| Two or more ? in prompt | complex |
| Over 40 words | medium |
| Explanation verbs: "explain", "describe", "summarise" | medium |
| Code keywords: "code", "function", "algorithm" | medium |
| Default | simple |

Why heuristics and not an LLM? Calling a model to decide which model to call adds latency and cost on every request. Heuristics are deterministic, instantaneous, and easy to tune.

Edge case: short but complex prompts. "Prove the Riemann hypothesis" is only 4 words but clearly complex. The COMPLEX_VERBS list catches "prove" → complexity = complex.
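The heuristic table above can be sketched as an ordered rule chain. This is an illustrative sketch, not the actual classifier.ts source — the function name and word lists are assumed for demonstration:

```typescript
// Hypothetical sketch of the heuristic rules; classify() and the word
// lists are illustrative names, not the real classifier.ts exports.
type Complexity = "simple" | "medium" | "complex";

const COMPLEX_VERBS = ["analyse", "compare", "evaluate", "prove"];
const MEDIUM_VERBS = ["explain", "describe", "summarise"];
const CODE_WORDS = ["code", "function", "algorithm"];

function classify(prompt: string): Complexity {
  const text = prompt.toLowerCase();
  const words = text.split(/\s+/).filter(Boolean);
  const has = (list: string[]) => list.some((w) => text.includes(w));

  // Pure arithmetic, ≤ 8 words → simple. Requiring a digit (and no
  // academic verb) keeps "Prove the Riemann hypothesis" out of this rule.
  if (words.length <= 8 && /\d/.test(text) && !has(COMPLEX_VERBS)) return "simple";
  if (words.length > 120) return "complex";
  if (has(COMPLEX_VERBS)) return "complex";
  if ((text.match(/\?/g) ?? []).length >= 2) return "complex";
  if (words.length > 40) return "medium";
  if (has(MEDIUM_VERBS)) return "medium";
  if (has(CODE_WORDS)) return "medium";
  return "simple";
}
```

Because the rules run in priority order, the short-but-complex edge case is handled before the word-count defaults kick in.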


2. Model Registry (registry.ts)

Three models:

| Model | Tier | Cost/1M tokens | Avg Latency | RPM |
|-------|------|----------------|-------------|-----|
| mock | Free | $0 | 25ms | — |
| gemini-flash | Budget | $0.075 | 800ms | 100 |
| gemini-pro | Capable | $3.50 | 2000ms | 30 |

Each model starts with 2 keys. pickKey() sorts by usage ascending (least-used key wins). This is round-robin in practice without needing a separate counter.

Edge case: all keys revoked. getUsableKeys() returns empty → pickKey() returns null → router escalates to the next model tier → if all models are exhausted, the orchestrator returns a 503-style response.
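The least-used selection and the null escalation path can be sketched as follows. The ApiKey shape and function names are assumptions for illustration, not the exact registry.ts API:

```typescript
// Illustrative sketch of least-used key selection (not the exact
// registry.ts implementation).
interface ApiKey {
  id: string;
  status: "active" | "deprecated" | "revoked";
  usage: number;
}

function getUsableKeys(keys: ApiKey[]): ApiKey[] {
  // Deprecated keys still serve in-flight traffic; revoked keys never do.
  // (The real version also skips keys whose circuit breaker is open.)
  return keys.filter((k) => k.status !== "revoked");
}

function pickKey(keys: ApiKey[]): ApiKey | null {
  const usable = getUsableKeys(keys);
  if (usable.length === 0) return null; // router escalates to next tier
  // Sort by usage ascending: the least-used key wins, giving
  // round-robin behaviour without a separate counter.
  return [...usable].sort((a, b) => a.usage - b.usage)[0];
}
```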


3. Key Rotation (keymngr.ts)

Time-based rotation (every 5 minutes by default):

```
t=0    key-A active,         key-B active
t=5m   key-C added (active), key-A → deprecated
t=7m   key-A → revoked (grace period elapsed)
t=10m  key-D added (active), key-B → deprecated
```

Why keep deprecated keys alive? Any in-flight request that already selected key-A must be allowed to finish. The 2-minute grace period covers even slow Gemini Pro calls. Revocation only happens after the grace period so no request gets a mid-flight key error.

Usage-based rotation (every 100 successful requests per key): Prevents any single key from burning its quota limit. Checked in the same background sweep as time-based rotation.

Circuit Breaker states per key:

```
closed ──[3 consecutive fails]──► open ──[30s cooldown]──► half ──[success]──► closed
                                    ▲                        │
                                    └─────────[fail]─────────┘
```

  • closed: Normal operation
  • open: Key is skipped entirely by getUsableKeys()
  • half: One "probe" request is allowed through to test recovery

Edge case: last key standing. rotateKeys() checks activeKeys.length === 0 before proceeding. If somehow all keys are deprecated/revoked, rotation is skipped rather than leaving you keyless.


4. Router (router.ts)

cost policy (default for school app):

  • simple → tries mock first, then flash, then pro
  • medium → tries flash first, then pro, then mock
  • complex → tries pro first, then flash, then mock

If the chosen model has no usable key, the router escalates down the list automatically.

latency policy:

  • Sorts by avgLatency ascending (mock=25ms first)
  • Returns first model with a usable key

fallback policy:

  • Scores each model: usableKeys × (1 - errorRate)
  • Sorts descending → healthiest model wins
  • Used automatically on retries in orchestrator
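The fallback policy's health score can be sketched directly from the formula above. The ModelHealth shape and function name are assumptions for illustration:

```typescript
// Sketch of the fallback policy: score = usableKeys × (1 - errorRate),
// healthiest model wins (names are illustrative, not router.ts exports).
interface ModelHealth {
  name: string;
  usableKeys: number; // keys not revoked and not breaker-open
  errorRate: number;  // 0..1, taken from metrics
}

function pickByHealth(models: ModelHealth[]): ModelHealth | null {
  const scored = models
    .map((m) => ({ m, score: m.usableKeys * (1 - m.errorRate) }))
    .filter((s) => s.score > 0) // no usable keys → never pick
    .sort((a, b) => b.score - a.score); // descending: healthiest first
  return scored[0]?.m ?? null;
}
```

A model with zero usable keys scores zero and is excluded outright, which is why this policy pairs well with retries.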

5. Orchestrator (orchestrator.ts)

Flow for each query:

  1. Classify prompt → complexity
  2. Route(complexity, policy) → model + key
  3. Dispatch call → response or error
  4. On error: record failure (circuit breaker), log, retry with "fallback" policy
  5. Max 2 retries (3 total attempts)
  6. Log to metrics regardless of outcome

Why retry with "fallback" not the original policy? If "cost" policy chose gemini-flash and it failed, retrying with "cost" picks the same model again (same problem). "Fallback" picks the healthiest different model, maximising chance of recovery.
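The retry flow above can be sketched as a small loop. The function signature is illustrative — the real orchestrator.ts also trips the breaker and logs metrics on each failure:

```typescript
// Sketch of the orchestrator retry loop: first attempt uses the caller's
// policy, retries switch to "fallback" (names are illustrative).
type Policy = "cost" | "latency" | "fallback";

async function runQuery(
  prompt: string,
  policy: Policy,
  dispatch: (prompt: string, policy: Policy) => Promise<string>,
  maxRetries = 2, // up to 2 retries → 3 total attempts
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // Retries deliberately use "fallback" so a different, healthier
    // model is chosen instead of repeating the one that just failed.
    const effective: Policy = attempt === 0 ? policy : "fallback";
    try {
      return await dispatch(prompt, effective);
    } catch (err) {
      lastError = err; // real code also records the breaker failure here
    }
  }
  throw lastError; // all attempts failed → 503-style response upstream
}
```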


6. Metrics (metrics.ts)

Tracked per-request: Timestamp, userId, persona, complexity, model, keyId, latencyMs, ok/fail.

Aggregated:

  • Total requests / errors / error rate
  • Average latency and P95 latency (95th percentile of all latency values)
  • Per-model: requests, errors, errorRate, avgLatency
  • Per-key: usage count
  • Time-series buckets (5s granularity, 10min window) for charts

P95 vs Average: Average latency hides tail latency. If 940 requests take 100ms and 60 take 5000ms, the average is 394ms but P95 = 5000ms. Students experiencing the P95 case are the ones filing bug reports. P95 is what you should optimise.
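A minimal nearest-rank percentile helper illustrates the point; metrics.ts may compute it differently, so treat this as a sketch:

```typescript
// Nearest-rank percentile: the value at position ceil(p/100 × n)
// in the sorted list (a sketch, not necessarily the metrics.ts method).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}
```

With 940 requests at 100ms and 60 at 5000ms, the rank-950 value already lands in the slow tail, so P95 reports 5000ms while the average sits near 394ms.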


7. Simulator (simulator.ts)

Concurrency pool pattern:

```
pool size = 50
[user-001] → query → query → query → done
[user-002] → query → query → done
[user-003] → query → query → query → query → done
...as each user finishes, the next one starts
```

This prevents the "thundering herd" problem where 1000 simultaneous connections overwhelm even a local server.
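The pool pattern boils down to a fixed number of workers draining a shared queue. This is a generic sketch of the technique, with assumed names, not the literal simulator.ts code:

```typescript
// Sketch of a concurrency pool: at most `size` tasks run at once;
// each finishing worker claims the next task (names are illustrative).
async function runPool<T>(
  tasks: (() => Promise<T>)[],
  size: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // single-threaded JS: claim is race-free
      results[i] = await tasks[i]();
    }
  }

  // Start `size` workers; together they drain the whole queue.
  await Promise.all(
    Array.from({ length: Math.min(size, tasks.length) }, worker),
  );
  return results;
}
```

In the simulator, each "task" would be one user's full query session, so 1000 users flow through 50 slots instead of opening 1000 connections at once.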

Personas and their prompt banks:

| Persona | Complexity mix | Example prompt |
|---------|----------------|----------------|
| grade-school | ~80% simple | "What is 7 times 8?" |
| middle-school | ~60% medium | "Explain how photosynthesis works." |
| high-school | ~70% complex | "Analyse the themes of power in Macbeth." |
| teacher | ~50% complex | "Design a rubric for evaluating student essays." |

This naturally creates a realistic distribution without hardcoding percentages.

Think time (500ms–2000ms between queries) prevents unrealistically rapid-fire requests from a single user.


Configuration

Environment variables

```shell
# .env file
GEMINI_API_KEY=AIzaSy...   # Optional. Without it, mock responses are used.
PORT=3000                  # Default: 3000
```

Rotation timing (edit keymngr.ts)

```typescript
const ROTATION_INTERVAL_MS = 5 * 60 * 1000;  // 5 minutes
const GRACE_PERIOD_MS = 2 * 60 * 1000;       // 2 minutes
const MAX_USAGE_BEFORE_ROTATE = 100;         // per key
const BREAKER_FAIL_THRESHOLD = 3;            // consecutive fails
const BREAKER_COOLDOWN_MS = 30_000;          // 30 seconds
```

Simulator settings

```shell
# Simulator syntax:
# bun run src/index.ts simulate [totalUsers] [concurrency] [policy]

bun run src/index.ts simulate 500  25  latency
bun run src/index.ts simulate 2000 100 cost
```

Sample Output

```
╔══════════════════════════════════════════════════════╗
║         LLM-Orc-Station Simulator Starting           ║
╚══════════════════════════════════════════════════════╝
  Users:        1000
  Queries/user: ~4
  Concurrency:  50
  Policy:       cost

[████████████████████████████████████░░░░] 92.3% (3692/4000)  elapsed: 38.2s

╔══════════════════════════════════════════════════════╗
║                  SIMULATION COMPLETE                 ║
╚══════════════════════════════════════════════════════╝
  Duration:        41.5s
  Total Requests:  4000
  Total Errors:    0 (0.00%)
  Avg Latency:     287ms
  P95 Latency:     812ms

  Model Distribution:
    mock               ▓▓▓▓▓▓▓▓▓▓▓▓  1580 reqs  err:0.0%  avg:26ms
    gemini-flash       ▓▓▓▓▓▓▓▓▓▓▓▓  1610 reqs  err:0.0%  avg:812ms
    gemini-pro         ▓▓▓▓▓▓        810 reqs   err:0.0%  avg:2003ms

  Key Usage:
    3f8a1b2c-4e5d…  792 uses
    7a2c9d1e-3f8b…  788 uses
    ...

  Recent Rotation Events:
    [2026-02-23T10:05:00.000Z] flash: +3f8a1b2c deprecated:7a2c9d1e
```

Known Limitations and Stretch Goals

  • Redis not integrated — Metrics live in memory and reset on restart. For multi-instance or persistent metrics, replace the metrics object with Redis calls.
  • Real key creation — rotateKeys() adds a fake UUID key. In production, call your provider's key management API and inject the real secret.
  • Rate limiting — The RPM field on each model is tracked but not enforced as a hard cap. Add a token bucket or leaky bucket per model to enforce it.
  • Prometheus export — getStats() returns JSON. Wrapping it with a /metrics endpoint in Prometheus text format would enable Grafana dashboards.

Developed by Akhil
© 2026. All rights reserved.