v0.5.0  ·  pre-flight budget enforcement for LLM agents

$ baar --about

Stop LLM API calls before they happen.
Not after.

Hard local kill-switch. Estimates cost before every request.
Budget gone → exception raised locally. No DNS. No TCP. $0 charged.

$ pip install baar-core

agent.py — without baar-core
$ python agent.py
[*] agent_loop starting — task: "answer user queries"
[*] model: gpt-4o    no budget limit set
>>> call #1
prompt    : "what time is it?"
model     : gpt-4o
response : "It's 2:14 AM."
tokens    : 847   cost: $0.054   total: $0.054
>>> call #2
prompt    : "what time is it?"
model     : gpt-4o
tokens    : 847   cost: $0.054   total: $0.108
 ——— 844 more identical calls  ·  8 hours later ———
>>> call #847
prompt    : "what time is it?"
tokens    : 847   cost: $0.054   total: $45.74
[!] BILL RECEIVED: $45.74
calls: 847  ·  tokens: 717,409  ·  runtime: 8h 07m
no kill-switch active. provider already charged.

agent.py — with baar-core
$ python agent.py
[✓] baar-core active — budget: $0.10   routing: ON
>>> call #1
prompt     : "what time is it?"
pre-flight : estimated $0.054   remaining $0.10   → PASS
routing    : complexity 0.02 → cheap tier (gpt-4o-mini)
tokens     : 12   cost: $0.0001   total: $0.0001
>>> call #2
prompt     : "what time is it?"
pre-flight : estimated $0.054   remaining $0.0999   → PASS
routing    : complexity 0.02 → cheap tier
tokens     : 12   cost: $0.0001   total: $0.0002
 ——— 459 more calls routed cheap ———
>>> call #462
prompt     : "what time is it?"
pre-flight : estimated $0.054   remaining $0.0539   → FAIL
[✓] BudgetExhausted raised locally
no DNS lookup  ·  no TCP connection  ·  $0 charged on call #462
total spent: $0.0461  ·  $45.69 saved
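The pre-flight gate above reduces to a local estimate-and-compare. A minimal sketch, assuming a rough token heuristic and a flat capable-tier price; the class names, heuristic, and price are illustrative, not baar-core's internals:

```python
# Minimal sketch of a local pre-flight budget gate; class names, the token
# heuristic, and the price are illustrative, not baar-core's internals.
PRICE_PER_1K_TOKENS = 0.005  # assumed capable-tier price

class BudgetExhausted(Exception):
    def __init__(self, remaining: float):
        super().__init__(f"budget exhausted: ${remaining:.5f} remaining")
        self.remaining = remaining

class Budget:
    def __init__(self, limit: float):
        self.limit = limit
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.limit - self.spent

    def preflight(self, prompt: str) -> float:
        # Worst-case estimate: rough token count times capable-tier price.
        est_tokens = len(prompt) / 4 + 512  # prompt plus response headroom
        est_cost = est_tokens / 1000 * PRICE_PER_1K_TOKENS
        if est_cost > self.remaining:
            # Raised locally: no DNS lookup, no TCP connection, $0 charged.
            raise BudgetExhausted(self.remaining)
        return est_cost

budget = Budget(limit=0.10)
budget.preflight("what time is it?")  # passes: estimate well under $0.10
budget.spent = 0.0999                 # nearly exhausted
try:
    budget.preflight("what time is it?")
except BudgetExhausted:
    print("blocked locally")          # prints: blocked locally
```

The point is that the comparison happens entirely in-process: the exception fires before any client library is even invoked.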

$ baar --explain-routing

User task
    │
    ▼
┌───────────────────────────────────────┐
│  Pre-flight budget check              │  ← estimated cost > remaining budget?
│  (local, zero network calls)          │    raise BudgetExhausted — blocked
└────────────┬──────────────────────────┘
             │ affordable
             ▼
┌───────────────────────────────────────┐
│  Semantic complexity router           │  ← cheap LLM scores task 0.0–1.0
│  (gpt-4o-mini, ~$0.000015/call)       │    "what time is it?"  → 0.02
└────────────┬──────────────────────────┘    "write CUDA matmul" → 0.94
             │
      ┌──────┴───────┐
      │              │
   simple         complex
      │              │
      ▼              ▼
 Cheap model    Budget check
 (fast, $)      ├─ affordable → Capable model ($$$)
                └─ too close  → downgrade to cheap ($)
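Assuming the scorer returns a 0.0–1.0 complexity and a fixed threshold splits the tiers, the flow above collapses to a small decision function. Model names match the demo; the threshold and cost constant are illustrative assumptions:

```python
# Sketch of the two-tier decision shown in the diagram; the threshold and
# cost constant are illustrative assumptions, not baar-core's internals.
CHEAP, CAPABLE = "gpt-4o-mini", "gpt-4o"
CAPABLE_EST_COST = 0.054  # assumed worst-case cost of one capable-tier call

def route(complexity: float, remaining: float, threshold: float = 0.5) -> str:
    if complexity < threshold:
        return CHEAP        # simple task: cheap tier, no budget question
    if CAPABLE_EST_COST <= remaining:
        return CAPABLE      # complex and affordable: capable tier
    return CHEAP            # complex but budget too close: downgrade to cheap

print(route(0.02, remaining=0.10))  # gpt-4o-mini  ("what time is it?")
print(route(0.94, remaining=0.10))  # gpt-4o       ("write CUDA matmul")
print(route(0.94, remaining=0.01))  # gpt-4o-mini  (downgraded: too close)
```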

$ baar-bench --dataset all --limit 200 --mock --seed 42

dataset      strategy       routed-cheap    total-cost    savings
──────────   ────────────   ────────────   ──────────   ─────────
MMLU         always-big     0%             $1.0005       —
MMLU         baar-core      81%            $0.157        84.3% ↓
──────────   ────────────   ────────────   ──────────   ─────────
GSM8K        always-big     0%             $1.0005       —
GSM8K        baar-core      87%            $0.129        87.1% ↓
──────────   ────────────   ────────────   ──────────   ─────────
HumanEval    always-big     0%             $1.0005       —
HumanEval    baar-core      39%            $0.614        38.6% ↓
        

HumanEval routes fewer tasks cheap — coding questions score high complexity. Correct behaviour.
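The savings column follows directly from the cost columns: savings = 1 − (routed cost / always-big cost). A quick check against the table:

```python
# The savings column is 1 - (routed cost / always-big cost), using the
# $1.0005 always-big baseline from the table above.
baseline = 1.0005
for dataset, cost in [("MMLU", 0.157), ("GSM8K", 0.129), ("HumanEval", 0.614)]:
    print(f"{dataset}: {1 - cost / baseline:.1%} saved")
# MMLU: 84.3% saved
# GSM8K: 87.1% saved
# HumanEval: 38.6% saved
```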

84.3% max cost reduction — live benchmark, MMLU  ·  $0 charged per blocked call  ·  2 lines to integrate

$ cat examples/quickstart.py

from baar import BAARRouter, BudgetExhausted

router = BAARRouter(budget=0.10)           # hard cap: $0.10 total
reply  = router.chat("Explain recursion")    # routes cheap/capable automatically

print(f"Spent: ${'{'}router.spent:.5f{'}'} / Remaining: ${'{'}router.remaining:.5f{'}'}")

# budget exhausted → BudgetExhausted raised locally, zero API calls made
try:
    router.chat("Another expensive call")
except BudgetExhausted as e:
    print(f"Blocked locally. Remaining: ${'{'}e.remaining:.5f{'}'}")

per-user budgets — one SQLite-backed quota per user

from baar import BAARRouter
from baar.core.stores import SQLiteBudgetStore

def router_for(user_id: str) -> BAARRouter:
    return BAARRouter(
        budget=0.10,
        store=SQLiteBudgetStore("budgets.db", namespace=user_id),
    )

alice = router_for("alice")
bob   = router_for("bob")

alice.chat("Summarise this document")  # deducted from Alice's $0.10 only
bob.chat("Translate to French")        # Bob's quota untouched
# concurrent writes are TOCTOU-safe (WAL mode + exclusive transaction)
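The TOCTOU-safety claim rests on doing the balance check and the deduction inside one exclusive transaction, so no other process can spend between the read and the write. A minimal stdlib sqlite3 sketch of that pattern; the table name and schema are assumptions, not baar-core's actual store:

```python
import sqlite3

# Sketch of a cross-process-safe check-and-deduct. BEGIN IMMEDIATE takes the
# write lock *before* the balance is read, so no other process can deduct
# between our check and our update; WAL keeps concurrent readers unblocked.
def try_deduct(db_path: str, user: str, amount: float) -> bool:
    conn = sqlite3.connect(db_path, isolation_level=None)  # manual transactions
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS budgets (user TEXT PRIMARY KEY, remaining REAL)"
    )
    try:
        conn.execute("BEGIN IMMEDIATE")  # exclusive write transaction
        row = conn.execute(
            "SELECT remaining FROM budgets WHERE user = ?", (user,)
        ).fetchone()
        if row is None or row[0] < amount:
            conn.execute("ROLLBACK")
            return False                 # would overspend: blocked
        conn.execute(
            "UPDATE budgets SET remaining = remaining - ? WHERE user = ?",
            (amount, user),
        )
        conn.execute("COMMIT")
        return True
    finally:
        conn.close()
```

Seed one row per user and every process can call `try_deduct` concurrently; the check and the deduction can never interleave with another writer.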

tool execution guards — @baar_guard caps calls and cost per function

from baar import BAARRouter
from baar.middleware import baar_guard

router = BAARRouter(budget=1.00)

@baar_guard(router, max_calls=10, cost_per_call=0.002)
def run_tool(query: str) -> str:
    return expensive_api(query)   # expensive_api: your existing tool call

run_tool("query")   # fine
run_tool("query")   # fine
# call 11 → GuardExceeded raised before the function executes
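Conceptually, a guard like this is a closure that counts calls and checks funds before the wrapped function ever runs. A hypothetical re-implementation under those assumptions, not the shipped decorator:

```python
import functools

class GuardExceeded(Exception):
    pass

# Hypothetical re-implementation of a call guard; the shipped @baar_guard
# deducts from the shared router budget rather than from a bare list.
def guard(max_calls: int, cost_per_call: float, budget: list):
    def decorator(fn):
        calls = 0
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            nonlocal calls
            if calls >= max_calls or budget[0] < cost_per_call:
                # Raised before fn executes: the expensive call never happens.
                raise GuardExceeded(f"{fn.__name__}: limit reached")
            calls += 1
            budget[0] -= cost_per_call
            return fn(*args, **kwargs)
        return wrapper
    return decorator

budget = [1.00]

@guard(max_calls=2, cost_per_call=0.002, budget=budget)
def run_tool(query: str) -> str:
    return query.upper()

run_tool("query")  # fine
run_tool("query")  # fine
# a third call raises GuardExceeded before run_tool executes
```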

LangGraph middleware — step limit + budget gate on every node

from baar.middleware import BaarMiddleware
from langgraph.graph import StateGraph

middleware = BaarMiddleware(
    router=BAARRouter(budget=0.50),
    max_steps=20,
)

graph = StateGraph(AgentState)                        # AgentState: your graph's state type
graph.add_node("agent", middleware.wrap(agent_node))  # agent_node: your existing node fn
# step limit + budget gate enforced on every LangGraph step
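The step-limiting half of `wrap` can be pictured as a counter around the node function; a hypothetical sketch, since the real BaarMiddleware also runs the pre-flight budget gate on every step:

```python
# Hypothetical sketch of a step-limiting wrap(); the real BaarMiddleware
# also applies the pre-flight budget gate on every step.
class StepLimitExceeded(Exception):
    pass

class StepLimiter:
    def __init__(self, max_steps: int):
        self.max_steps = max_steps
        self.steps = 0

    def wrap(self, node):
        def guarded(state):
            self.steps += 1
            if self.steps > self.max_steps:
                # Raised before the node runs: the agent loop is cut short.
                raise StepLimitExceeded(f"step {self.steps} > {self.max_steps}")
            return node(state)
        return guarded

limiter = StepLimiter(max_steps=2)
node = limiter.wrap(lambda state: state)
node({})
node({})
# a third invocation raises StepLimitExceeded before the node executes
```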

$ baar --compare-alternatives

feature                           baar-core    RouteLLM / LiteLLM / Portkey
───────────────────────────────   ──────────   ────────────────────────────
Hard local kill-switch            ✓            ✗
Zero network calls on block       ✓            ✗
Prevents DoW (OWASP LLM10)        ✓            ✗
Fully offline                     ✓            ✗
Per-user namespaced budgets       ✓ (SQLite)   proxy required / cloud only
Cross-process TOCTOU-safe         ✓            N/A
LangGraph step middleware         ✓            ✗
Tool execution guards             ✓            ✗
Semantic complexity routing       ✓            ✓
No proxy / no server              ✓            ✗
Open source (MIT)                 ✓            varies

Every alternative routes and tracks. baar-core prevents — the exception is raised before a single byte leaves your machine.

$ baar --features --verbose

--kill-switch      local pre-flight check — zero network calls when budget is exceeded
--semantic-route   cheap LLM scores complexity 0.0–1.0, auto-picks cheap vs capable tier
--per-user-quota   SQLite-backed namespaced budgets, survives restarts, multi-process safe
--offline          works fully air-gapped — budget enforcement never touches the network
--owasp-llm10      direct mitigation for OWASP LLM10:2025 Denial-of-Wallet attacks
--tool-guard       @baar_guard decorator: per-function call limits + cost deduction
--langgraph        BaarMiddleware: step limits + budget gate for any LangGraph agent
--telemetry        JSONL audit log — inspect with baar-telemetry telemetry.jsonl
--open-source      MIT license — fork freely
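Because the audit log is JSONL, one JSON object per line, it can also be inspected with a few lines of stdlib Python. The field names below are hypothetical, not baar-core's documented telemetry schema:

```python
import io
import json

# Hypothetical JSONL audit lines; baar-core's documented fields may differ.
log = io.StringIO(
    '{"call": 1, "model": "gpt-4o-mini", "cost": 0.0001}\n'
    '{"call": 2, "model": "gpt-4o-mini", "cost": 0.0001}\n'
)
total = sum(json.loads(line)["cost"] for line in log)
print(f"total spent: ${total:.4f}")  # total spent: $0.0002
```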