baar-core is an open-source Python library that enforces hard budget limits on LLM agents. It intercepts API calls before they fire and blocks them locally when the budget is exhausted — no DNS lookup, no TCP connection, $0 charged.

How does baar-core prevent LLM overspending?

baar-core runs a pre-flight check before every LLM API call. It estimates the cost of the call and compares it against the remaining budget. If the estimated cost exceeds the budget, it raises a BudgetExhausted exception locally — before any network request is made — so you are never charged for the blocked call.

How do I install baar-core?

Install via pip: pip install baar-core. Requires Python 3.9 or higher.

Does baar-core support async agents?

Yes. baar-core is fully async-safe and thread-safe. It works with both sync and async LLM agent frameworks including LangChain, LlamaIndex, and custom asyncio agents.

What is semantic routing in baar-core?

Semantic routing automatically classifies each prompt by complexity and routes it to the cheapest model capable of handling it. Simple queries go to a cheap tier (e.g. gpt-4o-mini) and complex ones go to a capable tier (e.g. gpt-4o). This typically reduces LLM costs by 84–94% without any change to output quality.

Can baar-core enforce per-user budgets in a multi-tenant SaaS?

Yes. baar-core's SQLiteBudgetStore tracks spend per user_id across sessions. Budget state persists across restarts, is thread-safe and async-safe, and requires only 3 lines of code to set up.

Is baar-core free and open source?

Yes. baar-core is free and released under the MIT license. Source code is available at github.com/orvi2014/Baar-Core.

██████╗  █████╗  █████╗ ██████╗ 
██╔══██╗██╔══██╗██╔══██╗██╔══██╗
██████╔╝███████║███████║██████╔╝
██╔══██╗██╔══██║██╔══██║██╔══██╗
██████╔╝██║  ██║██║  ██║██║  ██╗
╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═╝

v0.5.0 · pre-flight budget enforcement for LLM agents

$ baar --about

Stop LLM API calls before they happen.
Not after.

Name: baar-core
Author: orvi

Hard local kill-switch. Estimates cost before every request.
Budget gone → exception raised locally. No DNS. No TCP. $0 charged.

$ pip install baar-core

View on GitHub ↓ see the problem

agent.py — without baar-core

$ python agent.py

[*] agent_loop starting — task: "answer user queries"

[*] model: gpt-4o no budget limit set

>>> call #1

prompt : "what time is it?"

model : gpt-4o

response : "It's 2:14 AM."

tokens : 847 cost: $0.054 total: $0.054

>>> call #2

prompt : "what time is it?"

model : gpt-4o

tokens : 847 cost: $0.054 total: $0.108

——— 844 more identical calls · 8 hours later ———

>>> call #847

prompt : "what time is it?"

tokens : 847 cost: $0.054 total: $47.23

[!] BILL RECEIVED: $47.23

calls: 847 · tokens: 20,841 · runtime: 8h 07m

no kill-switch active. provider already charged.

agent.py — with baar-core

$ python agent.py

[✓] baar-core active — budget: $0.10 routing: ON

>>> call #1

prompt : "what time is it?"

pre-flight : estimated $0.054 remaining $0.10 → PASS

routing : complexity 0.02 → cheap tier (gpt-4o-mini)

tokens : 12 cost: $0.0001 total: $0.0001

>>> call #2

prompt : "what time is it?"

pre-flight : estimated $0.054 remaining $0.099 → PASS

routing : complexity 0.02 → cheap tier

tokens : 12 cost: $0.0001 total: $0.0002

>>> call #3

prompt : "what time is it?"

pre-flight : estimated $0.054 remaining $0.098 → FAIL

[✓] BudgetExhausted raised locally

no DNS lookup · no TCP connection · $0 charged on call #3

total spent: $0.0002 · $47.23 saved

$ baar --explain-routing

User task
    │
    ▼
┌───────────────────────────────────────┐
│  Pre-flight budget check                │  ← estimated cost > remaining budget?
│  (local, zero network calls)          │    raise BudgetExhausted — blocked
└────────────┬──────────────────────────┘
             │ affordable
             ▼
┌───────────────────────────────────────┐
│  Semantic complexity router             │  ← cheap LLM scores task 0.0–1.0
│  (gpt-4o-mini, ~$0.000015/call)       │    "what time is it?"  → 0.02
└────────────┬──────────────────────────┘    "write CUDA matmul" → 0.94
             │
      ┌──────┴───────┐
      │              │
   simple         complex
      │              │
      ▼              ▼
 Cheap model    Budget check
 (fast, $)      ├─ affordable → Capable model ($$$)
                └─ too close  → downgrade to cheap ($)

$ baar-bench --dataset all --limit 200 --mock --seed 42

dataset      strategy       routed-cheap    total-cost    savings
──────────   ────────────   ────────────   ──────────   ─────────
MMLU         always-big     0%             $1.0005       —
MMLU         baar-core      81%            $0.157        84.3% ↓
──────────   ────────────   ────────────   ──────────   ─────────
GSM8K        always-big     0%             $1.0005       —
GSM8K        baar-core      87%            $0.129        87.1% ↓
──────────   ────────────   ────────────   ──────────   ─────────
HumanEval    always-big     0%             $1.0005       —
HumanEval    baar-core      39%            $0.614        38.6% ↓

HumanEval routes fewer tasks cheap — coding questions score high complexity. Correct behaviour.

0% max cost reduction
live benchmark, MMLU

·

$0 charged per
blocked call

·

0 lines to integrate

$ cat examples/quickstart.py

from baar import BAARRouter, BudgetExhausted

router = BAARRouter(budget=0.10)           # hard cap: $0.10 total
reply  = router.chat("Explain recursion")    # routes cheap/capable automatically

print(f"Spent: ${'{'}router.spent:.5f{'}'} / Remaining: ${'{'}router.remaining:.5f{'}'}")

# budget exhausted → BudgetExhausted raised locally, zero API calls made
try:
    router.chat("Another expensive call")
except BudgetExhausted as e:
    print(f"Blocked locally. Remaining: ${'{'}e.remaining:.5f{'}'}")

from baar import BAARRouter
from baar.core.stores import SQLiteBudgetStore

def router_for(user_id: str) -> BAARRouter:
    return BAARRouter(
        budget=0.10,
        store=SQLiteBudgetStore("budgets.db", namespace=user_id),
    )

alice = router_for("alice")
bob   = router_for("bob")

alice.chat("Summarise this document")  # deducted from Alice's $0.10 only
bob.chat("Translate to French")        # Bob's quota untouched
# concurrent writes are TOCTOU-safe (WAL mode + exclusive transaction)

from baar import BAARRouter
from baar.middleware import baar_guard

router = BAARRouter(budget=1.00)

@baar_guard(router, max_calls=10, cost_per_call=0.002)
def run_tool(query: str) -> str:
    return expensive_api(query)

run_tool("query")   # fine
run_tool("query")   # fine
# call 11 → GuardExceeded raised before the function executes

from baar.middleware import BaarMiddleware
from langgraph.graph import StateGraph

middleware = BaarMiddleware(
    router=BAARRouter(budget=0.50),
    max_steps=20,
)

graph = StateGraph(AgentState)
graph.add_node("agent", middleware.wrap(agent_node))
# step limit + budget gate enforced on every LangGraph step

$ baar --compare-alternatives

feature	baar-core	RouteLLM	LiteLLM	Portkey
Hard local kill-switch	✓	✗	✗	✗
Zero network calls on block	✓	✗	✗	✗
Prevents DoW OWASP LLM10	✓	✗	✗	✗
Fully offline	✓	✗	✗	✗
Per-user namespaced budgets	✓ SQLite	✗	✗ proxy req.	✗ cloud only
Cross-process TOCTOU-safe	✓	✗	✗	N/A
LangGraph step middleware	✓	✗	✗	✗
Tool execution guards	✓	✗	✗	✗
Semantic complexity routing	✓	✓	✓	✓
No proxy / no server	✓	✓	✗	✗
Open source (MIT)	✓	✓	✓	✗

Every alternative routes and tracks. Baar-Core prevents — exception raised before a single byte leaves your machine.

$ baar --features --verbose

--kill-switch local pre-flight check — zero network calls when budget is exceeded

--semantic-route cheap LLM scores complexity 0.0–1.0, auto-picks cheap vs capable tier

--per-user-quota SQLite-backed namespaced budgets, survives restarts, multi-process safe

--offline works fully air-gapped — budget enforcement never touches the network

--owasp-llm10 direct mitigation for OWASP LLM10:2025 Denial-of-Wallet attacks

--tool-guard @baar_guard decorator: per-function call limits + cost deduction

--langgraph BaarMiddleware: step limits + budget gate for any LangGraph agent

--telemetry JSONL audit log — inspect with baar-telemetry telemetry.jsonl

--open-source MIT license — fork freely

$ pip install baar-core ▋

Stop LLM API calls before they happen. Not after.

Stop LLM API calls before they happen.
Not after.