██████╗  █████╗  █████╗ ██████╗
██╔══██╗██╔══██╗██╔══██╗██╔══██╗
██████╔╝███████║███████║██████╔╝
██╔══██╗██╔══██║██╔══██║██╔══██╗
██████╔╝██║  ██║██║  ██║██║  ██╗
╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═╝
v0.5.0 · pre-flight budget enforcement for LLM agents
$ baar --about
Stop LLM API calls before they happen.
Not after.
Hard local kill-switch. Estimates cost before every request.
Budget gone → exception raised locally. No DNS. No TCP. $0 charged.
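To make the idea concrete, here is a minimal sketch of a pre-flight check. It assumes a crude 4-characters-per-token estimate and per-1,000-token prices supplied by the caller. `PreflightGuard`, `PreflightBlocked` and the price parameters are hypothetical names chosen for illustration, not baar-core's API.

```python
from dataclasses import dataclass


class PreflightBlocked(Exception):
    """Raised locally when a request's estimated cost exceeds the remaining budget."""

    def __init__(self, estimated: float, remaining: float) -> None:
        super().__init__(f"estimated ${estimated:.5f} > remaining ${remaining:.5f}")
        self.estimated = estimated
        self.remaining = remaining


@dataclass
class PreflightGuard:
    budget: float             # hard cap in USD
    price_in_per_1k: float    # assumed USD per 1,000 prompt tokens
    price_out_per_1k: float   # assumed USD per 1,000 completion tokens
    spent: float = 0.0

    @property
    def remaining(self) -> float:
        return self.budget - self.spent

    def estimate(self, prompt: str, max_output_tokens: int) -> float:
        # Crude heuristic: ~4 characters per token. A real tokenizer would be tighter.
        prompt_tokens = max(1, len(prompt) // 4)
        return (prompt_tokens * self.price_in_per_1k
                + max_output_tokens * self.price_out_per_1k) / 1000

    def check(self, prompt: str, max_output_tokens: int = 256) -> float:
        # Pure arithmetic, no I/O: if this raises, no request was ever built or sent.
        est = self.estimate(prompt, max_output_tokens)
        if est > self.remaining:
            raise PreflightBlocked(est, self.remaining)
        return est

    def record(self, actual_cost: float) -> None:
        # Call after a successful response with the provider-reported cost.
        self.spent += actual_cost
```

Because `check()` is pure arithmetic that runs before any client call, a blocked request costs nothing. The trade-off is that a pessimistic estimate (here, charging the full `max_output_tokens`) can block a call that would actually have fit.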
$ pip install baar-core
$ python agent.py
[*] agent_loop starting — task: "answer user queries"
[*] model: gpt-4o · no budget limit set
>>> call #1
prompt : "what time is it?"
model : gpt-4o
response : "It's 2:14 AM."
tokens : 847 cost: $0.054 total: $0.054
>>> call #2
prompt : "what time is it?"
model : gpt-4o
tokens : 847 cost: $0.054 total: $0.108
——— 844 more identical calls · 8 hours later ———
>>> call #847
prompt : "what time is it?"
tokens : 847 cost: $0.054 total: $47.23
[!] BILL RECEIVED: $47.23
calls: 847 · tokens: 20,841 · runtime: 8h 07m
no kill-switch active. provider already charged.
$ python agent.py
[✓] baar-core active · budget: $0.10 · routing: ON
>>> call #1
prompt : "what time is it?"
pre-flight : estimated $0.054 remaining $0.10 → PASS
routing : complexity 0.02 → cheap tier (gpt-4o-mini)
tokens : 12 cost: $0.0001 total: $0.0001
>>> call #2
prompt : "what time is it?"
pre-flight : estimated $0.054 remaining $0.099 → PASS
routing : complexity 0.02 → cheap tier
tokens : 12 cost: $0.0001 total: $0.0002
>>> call #3
prompt : "what time is it?"
pre-flight : estimated $0.054 remaining $0.098 → FAIL
[✓] BudgetExhausted raised locally
no DNS lookup · no TCP connection · $0 charged on call #3
total spent: $0.0002 · $47.23 saved
$ baar --explain-routing
         User task
             │
             ▼
┌───────────────────────────────────────┐
│ Pre-flight budget check               │ ← estimated cost > remaining budget?
│ (local, zero network calls)           │   yes → raise BudgetExhausted (call blocked)
└────────────┬──────────────────────────┘
             │ affordable
             ▼
┌───────────────────────────────────────┐
│ Semantic complexity router            │ ← cheap LLM scores task 0.0–1.0
│ (gpt-4o-mini, ~$0.000015/call)        │   "what time is it?"  → 0.02
└────────────┬──────────────────────────┘   "write CUDA matmul" → 0.94
             │
      ┌──────┴───────┐
      │              │
   simple         complex
      │              │
      ▼              ▼
 Cheap model   Budget check
  (fast, $)    ├─ affordable → Capable model ($$$)
               └─ too close  → downgrade to cheap ($)
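The routing step in the diagram reduces to a threshold on the complexity score plus one more budget check. Below is a minimal sketch assuming a 0.5 threshold, hypothetical tier constants, and that "too close" means the capable tier's estimated cost exceeds the remaining budget (baar-core's actual rule may use a safety margin). The scorer is injected as any callable, which keeps the sketch testable offline; in baar-core the score comes from a cheap LLM.

```python
from typing import Callable

# Hypothetical tier names and threshold; baar-core's real values may differ.
CHEAP, CAPABLE = "gpt-4o-mini", "gpt-4o"
COMPLEXITY_THRESHOLD = 0.5


def choose_model(
    task: str,
    score: Callable[[str], float],  # returns complexity in [0.0, 1.0]
    capable_cost: float,            # estimated USD if sent to the capable tier
    remaining: float,               # budget left in USD
    threshold: float = COMPLEXITY_THRESHOLD,
) -> str:
    complexity = score(task)
    if complexity < threshold:
        return CHEAP    # simple task: cheap tier
    if capable_cost <= remaining:
        return CAPABLE  # complex and affordable: capable tier
    return CHEAP        # complex but too close to the cap: downgrade
```

Usage with a stub scorer: `choose_model("write CUDA matmul", lambda t: 0.94, capable_cost=0.05, remaining=0.10)` picks the capable tier, while the same call with `remaining=0.03` falls back to the cheap tier.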
$ baar-bench --dataset all --limit 200 --mock --seed 42
| dataset | strategy | routed-cheap | total-cost | savings |
|---|---|---|---|---|
| MMLU | always-big | 0% | $1.0005 | baseline |
| MMLU | baar-core | 81% | $0.157 | 84.3% ↓ |
| GSM8K | always-big | 0% | $1.0005 | baseline |
| GSM8K | baar-core | 87% | $0.129 | 87.1% ↓ |
| HumanEval | always-big | 0% | $1.0005 | baseline |
| HumanEval | baar-core | 39% | $0.614 | 38.6% ↓ |
HumanEval routes fewer tasks to the cheap tier because coding questions score high on complexity. That is the intended behaviour.
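The savings column is 1 − (baar-core cost / always-big cost), with always-big at $1.0005 for every dataset. Recomputing it from the benchmark's own figures reproduces the printed percentages:

```python
# Figures copied from the baar-bench output (mock run, --limit 200, --seed 42).
BASELINE = 1.0005  # always-big total cost, identical for every dataset

baar_core_cost = {"MMLU": 0.157, "GSM8K": 0.129, "HumanEval": 0.614}


def savings_pct(cost: float, baseline: float = BASELINE) -> float:
    """Percent saved versus sending every task to the big model."""
    return round(100 * (1 - cost / baseline), 1)


for name, cost in baar_core_cost.items():
    print(f"{name:<10} {savings_pct(cost):5.1f}%")  # 84.3, 87.1, 38.6
```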
0% max cost reduction (live benchmark, MMLU) · $0 charged per blocked call · 0 lines to integrate
$ cat examples/quickstart.py
from baar import BAARRouter, BudgetExhausted

router = BAARRouter(budget=0.10)  # hard cap: $0.10 total

reply = router.chat("Explain recursion")  # routes cheap/capable automatically
print(f"Spent: ${router.spent:.5f} / Remaining: ${router.remaining:.5f}")

# budget exhausted → BudgetExhausted raised locally, zero API calls made
try:
    router.chat("Another expensive call")
except BudgetExhausted as e:
    print(f"Blocked locally. Remaining: ${e.remaining:.5f}")
from baar import BAARRouter
from baar.core.stores import SQLiteBudgetStore

def router_for(user_id: str) -> BAARRouter:
    # one shared SQLite file, one budget namespace per user
    return BAARRouter(
        budget=0.10,
        store=SQLiteBudgetStore("budgets.db", namespace=user_id),
    )

alice = router_for("alice")
bob = router_for("bob")

alice.chat("Summarise this document")  # deducted from Alice's $0.10 only
bob.chat("Translate to French")        # deducted from Bob's $0.10; Alice's spend never touches it

# concurrent writes are TOCTOU-safe (WAL mode + exclusive transaction)
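The TOCTOU-safety claim rests on reading the balance and deducting from it inside a single write transaction. A minimal sketch of that pattern with Python's standard `sqlite3` module follows, assuming a hypothetical one-row-per-namespace table; `NamespacedBudget` is illustrative and is not `SQLiteBudgetStore`'s actual implementation. It uses `BEGIN IMMEDIATE`, which SQLite treats the same as `EXCLUSIVE` in WAL mode.

```python
import sqlite3


class NamespacedBudget:
    """Hypothetical per-namespace budget ledger (not baar-core's SQLiteBudgetStore)."""

    def __init__(self, path: str, namespace: str, budget: float) -> None:
        self.namespace = namespace
        # isolation_level=None: autocommit mode, so BEGIN/COMMIT are issued explicitly.
        self.conn = sqlite3.connect(path, isolation_level=None, timeout=10)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS budgets ("
            " namespace TEXT PRIMARY KEY, budget REAL NOT NULL, spent REAL NOT NULL)"
        )
        self.conn.execute(
            "INSERT OR IGNORE INTO budgets VALUES (?, ?, 0.0)", (namespace, budget)
        )

    def try_spend(self, amount: float) -> bool:
        """Atomically deduct `amount` if it fits; return False rather than overspend."""
        # BEGIN IMMEDIATE takes the write lock up front, so no other connection can
        # change `spent` between our read and our write (no check-then-act window).
        self.conn.execute("BEGIN IMMEDIATE")
        try:
            budget, spent = self.conn.execute(
                "SELECT budget, spent FROM budgets WHERE namespace = ?",
                (self.namespace,),
            ).fetchone()
            if spent + amount > budget:
                self.conn.execute("ROLLBACK")
                return False
            self.conn.execute(
                "UPDATE budgets SET spent = spent + ? WHERE namespace = ?",
                (amount, self.namespace),
            )
            self.conn.execute("COMMIT")
            return True
        except BaseException:
            self.conn.execute("ROLLBACK")
            raise

    def spent(self) -> float:
        return self.conn.execute(
            "SELECT spent FROM budgets WHERE namespace = ?", (self.namespace,)
        ).fetchone()[0]
```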
from baar import BAARRouter
from baar.middleware import baar_guard

router = BAARRouter(budget=1.00)

def expensive_api(query: str) -> str:
    return f"result for {query}"  # placeholder for your real tool / external API call

@baar_guard(router, max_calls=10, cost_per_call=0.002)
def run_tool(query: str) -> str:
    return expensive_api(query)

run_tool("query")  # fine
run_tool("query")  # fine
# call 11 → GuardExceeded raised before the function executes
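A decorator like `@baar_guard` can enforce its limits before the wrapped function ever runs. Below is a minimal, single-threaded sketch of that idea; `call_guard`, `SimpleBudget` and `ToolGuardExceeded` are hypothetical stand-ins, not baar-core's implementation.

```python
import functools
from typing import Any, Callable, TypeVar, cast

F = TypeVar("F", bound=Callable[..., Any])


class ToolGuardExceeded(Exception):
    """Raised before the wrapped tool runs when its call limit or budget is hit."""


class SimpleBudget:
    """Minimal stand-in for a router's budget: tracks remaining USD."""

    def __init__(self, budget: float) -> None:
        self.remaining = budget


def call_guard(budget: SimpleBudget, max_calls: int, cost_per_call: float) -> Callable[[F], F]:
    def decorator(fn: F) -> F:
        calls = 0

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            nonlocal calls
            # Both checks happen before fn executes, so a blocked call has no side effects.
            if calls >= max_calls:
                raise ToolGuardExceeded(f"{fn.__name__}: call limit {max_calls} reached")
            if cost_per_call > budget.remaining:
                raise ToolGuardExceeded(f"{fn.__name__}: budget exhausted")
            calls += 1
            budget.remaining -= cost_per_call
            return fn(*args, **kwargs)

        return cast(F, wrapper)

    return decorator
```

In this sketch the cost is deducted before the tool runs, so a tool that raises still uses up its slot. A production version might refund on failure and would need a lock for multi-threaded callers.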
from baar import BAARRouter
from baar.middleware import BaarMiddleware
from langgraph.graph import StateGraph

middleware = BaarMiddleware(
    router=BAARRouter(budget=0.50),
    max_steps=20,
)

# AgentState (your state schema) and agent_node (your node function) are defined elsewhere
graph = StateGraph(AgentState)
graph.add_node("agent", middleware.wrap(agent_node))
# step limit + budget gate enforced on every LangGraph step
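The same gate can be sketched without LangGraph at all. LangGraph nodes are plain callables that take the current state and return an update, so a wrapper only has to count invocations and check the budget before delegating. `StepGate` and `GateClosed` are hypothetical names; baar-core's `BaarMiddleware` may be implemented differently.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Node = Callable[[State], State]


class GateClosed(Exception):
    """Raised before a node runs once the step limit or budget is exhausted."""


class StepGate:
    """Hypothetical step + budget gate; not baar-core's BaarMiddleware."""

    def __init__(self, max_steps: int, remaining_budget: Callable[[], float]) -> None:
        self.max_steps = max_steps
        self.remaining_budget = remaining_budget  # e.g. lambda: router.remaining
        self.steps = 0

    def wrap(self, node: Node) -> Node:
        def gated(state: State) -> State:
            # Both checks run before the node body, so a runaway loop stops here
            # instead of issuing another LLM call.
            if self.steps >= self.max_steps:
                raise GateClosed(f"step limit of {self.max_steps} reached")
            if self.remaining_budget() <= 0:
                raise GateClosed("budget exhausted")
            self.steps += 1
            return node(state)

        return gated
```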
$ baar --compare-alternatives
| feature | baar-core | RouteLLM | LiteLLM | Portkey |
|---|---|---|---|---|
| Hard local kill-switch | ✓ | ✗ | ✗ | ✗ |
| Zero network calls on block | ✓ | ✗ | ✗ | ✗ |
| Prevents Denial-of-Wallet (OWASP LLM10) | ✓ | ✗ | ✗ | ✗ |
| Fully offline | ✓ | ✗ | ✗ | ✗ |
| Per-user namespaced budgets | ✓ (SQLite) | ✗ | ✗ (proxy required) | ✗ (cloud only) |
| Cross-process TOCTOU-safe | ✓ | ✗ | ✗ | N/A |
| LangGraph step middleware | ✓ | ✗ | ✗ | ✗ |
| Tool execution guards | ✓ | ✗ | ✗ | ✗ |
| Semantic complexity routing | ✓ | ✓ | ✓ | ✓ |
| No proxy / no server | ✓ | ✓ | ✗ | ✗ |
| Open source (MIT) | ✓ | ✓ | ✓ | ✗ |
Every alternative routes and tracks spend. baar-core prevents overspend: the exception is raised before a single byte leaves your machine.
$ baar --features --verbose
--kill-switch
local pre-flight check — zero network calls when budget is exceeded
--semantic-route
cheap LLM scores complexity 0.0–1.0, auto-picks cheap vs capable tier
--per-user-quota
SQLite-backed namespaced budgets, survives restarts, multi-process safe
--offline
works fully air-gapped — budget enforcement never touches the network
--owasp-llm10
direct mitigation for Denial-of-Wallet attacks (OWASP LLM10:2025, Unbounded Consumption)
--tool-guard
@baar_guard decorator: per-function call limits + cost deduction
--langgraph
BaarMiddleware: step limits + budget gate for any LangGraph agent
--telemetry
JSONL audit log; inspect with baar-telemetry telemetry.jsonl
--open-source
MIT license — fork freely
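The telemetry log is JSONL, one JSON object per line, so it is easy to post-process. The record fields are not documented here, so this sketch assumes hypothetical `model` and `cost` keys; adjust them to whatever `baar-telemetry` actually reads. It tallies spend per model:

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def summarise(lines: Iterable[str]) -> Dict[str, float]:
    """Sum cost per model from JSONL records (field names are assumptions)."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        totals[record["model"]] += float(record.get("cost", 0.0))
    return dict(totals)
```

Pass it an open file handle, e.g. `summarise(open("telemetry.jsonl", encoding="utf-8"))`, since iterating a text file yields one line at a time.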