Watch agents haggle. Step in yourself.
A negotiation environment with observable tells and hidden reservation prices. Buyer and seller are both LLMs — Sauda on the buy side, Gemma-4-E4B on the sell side. Strategy improves through self-play; drop in as a seller, watch the arena, or scrub a replay.
Pick a way in
replays → Three policies. Three task suites. Receipts on file.
Buyer-share is the fraction of bargaining surplus the agent captured. Mutual-loss is how often it walked away from a winnable deal. Sauda v2 captures the most surplus per close, and it's the only buyer meaningfully willing to walk when a deal looks bad; the 9% mutual-loss is the cost of that discipline. (A sketch of how these metrics fall out of episode logs follows the table.)
see all replays →
| policy | buyer_share | win_rate | mutual_loss | avg rounds |
|---|---|---|---|---|
| llama-3.2-3b base | 0.570 | 67% | 0% | 2.2 |
| llama-3.1-8b base | 0.686 | 73% | 1% | 3.1 |
| sauda v2 (8b sft+grpo) | 0.799 | 64% | 9% | 6.0 |
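The columns above can be read as simple aggregates over episode logs. A minimal sketch, assuming per-episode records with `closed`, `deal_price`, `budget`, `reservation`, and `rounds` fields (illustrative names, not the env's real schema):

```python
# Sketch: how the table's columns could fall out of per-episode logs.
# Record field names are assumptions for illustration.
def summarize(episodes):
    closes = [e for e in episodes if e["closed"]]
    shares = [
        (e["budget"] - e["deal_price"]) / (e["budget"] - e["reservation"])
        for e in closes
    ]  # buyer's slice of the budget-to-reservation gap on each closed deal
    winnable_walks = sum(
        1 for e in episodes
        if not e["closed"] and e["budget"] > e["reservation"]
    )  # walked away even though a mutually profitable price existed
    return {
        "buyer_share": sum(shares) / len(shares) if shares else 0.0,
        "win_rate": len(closes) / len(episodes),
        "mutual_loss": winnable_walks / len(episodes),
        "rounds": sum(e["rounds"] for e in episodes) / len(episodes),
    }
```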
Real bargaining isn't about price alone. Sellers fidget, anchor early, claim outside pressure. Most negotiation envs throw those signals away. Ours surfaces twelve of them as first-class observations.
The buyer never sees the seller's reservation. The seller never sees the budget. Both sides infer. The whole point of the agent is to do that inference better than rules can.
The buyer was trained on this env through SFT, GRPO, and RLAIF/DPO. That's why it negotiates twice as long as base models and captures more surplus per close — the env's reward shape made it. The repo is public if you want to train your own. How it's trained →
Two LLMs negotiate. One of them learned how through RLAIF.
BazaarBATNA is an OpenEnv-compliant environment where buyer and seller are both language models. The buyer is Sauda (Llama-3.1-8B + LoRA, trained on this env). The seller is Gemma-4-E4B with persona instructions and four hard rules baked into code: never accept below reservation, never leak it in messages, counter monotonically toward the buyer, anchor with item details.
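A minimal sketch of what "baked into code" might look like as a gate around the seller LLM's raw output. The names and the crude leak check are assumptions, not the repo's actual implementation:

```python
from dataclasses import dataclass

# Sketch of the seller's code-enforced rules; illustrative, not the real gate.
@dataclass
class SellerState:
    reservation: float   # private floor, never revealed
    last_counter: float  # seller's most recent asking price

def gate_counter(state: SellerState, raw_price: float, message: str):
    price = max(raw_price, state.reservation)   # rule: never accept below reservation
    price = min(price, state.last_counter)      # rule: counter monotonically toward the buyer
    leak = str(int(state.reservation))
    if leak in message:                         # rule: never leak the reservation in messages
        message = message.replace(leak, "a fair price")
    state.last_counter = price
    return price, message  # rule 4 (anchor with item details) lives in the persona prompt

def auto_accept(state: SellerState, buyer_offer: float) -> bool:
    return buyer_offer >= state.reservation     # matches the seller card: auto-accepts at reservation
```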
Information is asymmetric: the buyer never sees the seller's reservation, and the seller never sees the buyer's budget. Both sides must infer. The whole system tests whether trained behaviour beats prompted behaviour at this game.
/reset, /step, /state, /score, /tasks. Eight task suites, four seller personas, real Amazon listings as price anchors.
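A minimal client sketch against those five routes. The base URL and JSON payload shapes are assumptions; only the endpoint names come from the line above:

```python
import requests

BASE = "http://localhost:8000"  # assumed local deployment of the env server

# Start an episode (payload shape is an assumption).
obs = requests.post(f"{BASE}/reset", json={"suite": "default"}).json()

# Step as the seller: one structured action per round.
obs = requests.post(f"{BASE}/step", json={
    "action": "counter",
    "price": 8200,
    "message": "Brand new piece, boss, warranty ke saath.",
}).json()

print(requests.get(f"{BASE}/state").json())  # public negotiation state
print(requests.get(f"{BASE}/score").json())  # surplus split so far
print(requests.get(f"{BASE}/tasks").json())  # lists the eight task suites
```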
Trained on this env through SFT → GRPO → RLAIF/DPO. Outputs structured JSON action plus a Hinglish/English message.
Persona-prompted. Four code-enforced rules. Auto-accepts at reservation. A 50-episode quality eval passes 5 of 6 acceptance criteria.
The buyer agent, top to bottom.
The env emits a structured obs each step: round counter, asking price, your last offer, your private budget, recent history, optional seller-tells channel (12 noisy signals).
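Roughly what that observation could look like on the wire; every field name below is an illustrative assumption, not the env's published schema:

```python
# Illustrative shape of a per-step observation; field names are assumptions.
obs = {
    "round": 3,
    "asking_price": 9000,           # seller's current counter
    "last_offer": 7200,             # buyer's previous offer
    "budget": 8500,                 # buyer-private; never shown to the seller
    "history": [("buyer", 6500), ("seller", 9400), ("buyer", 7200)],
    "tells": {                      # optional channel: 12 noisy signals, e.g.
        "claims_outside_pressure": 0.7,
        "anchored_early": True,
    },
}
```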
Llama-3.1-8B base + QLoRA adapter (PayMyBills/bestdealbot-v2). Outputs strict JSON: action / price / message. The message field carries a Hinglish/English line that gets rendered to the user.
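Strict JSON means the output can be machine-checked before it touches the env. A sketch of such a validator; the action vocabulary is an assumption, only the three field names come from the line above:

```python
import json

ALLOWED_ACTIONS = {"offer", "counter", "accept", "walk"}  # assumed action set

def parse_buyer_output(text: str) -> dict:
    """Reject anything off-contract before it reaches the env."""
    out = json.loads(text)                        # strict: pure JSON, no prose wrapper
    assert out["action"] in ALLOWED_ACTIONS
    assert out["price"] is None or out["price"] > 0
    assert isinstance(out["message"], str)        # the Hinglish/English line shown to users
    return out

parse_buyer_output(
    '{"action": "counter", "price": 7600, "message": "Bhaiya, 7600 final, cash abhi."}'
)
```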
Posterior over seller urgency & flexibility, updated from tells + concession behaviour. Gates the raw model action with a Nash-style target offer and an adaptive close threshold near deadline. (Currently a substrate, not a performance lever — see ablation below.)
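A sketch of the gate's shape. The posterior update and every constant here are illustrative assumptions; only the overall flow (tells → belief → Nash-style target → deadline-adaptive close) comes from the description above:

```python
# Sketch of the belief-gated offer; constants and updates are illustrative.
def update_flexibility(flex: float, concession_pct: float, urgency_tell: float) -> float:
    evidence = min(1.0, 2.0 * concession_pct + urgency_tell)
    return 0.8 * flex + 0.2 * evidence             # big concessions + urgency => more flexible

def target_offer(budget, asking, flex, round_no, horizon):
    est_reservation = asking * (1.0 - 0.3 * flex)  # flexible seller => lower inferred floor
    nash = (budget + est_reservation) / 2          # even split of the inferred surplus
    urgency = round_no / horizon                   # near deadline, concede toward asking
    return min(budget, nash + 0.5 * urgency * max(0.0, asking - nash))
```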
Layered backends: a hot HF Inference Endpoint serves the model at production latency, with a local Ollama runtime as a warm secondary and rule-based heuristics as a guaranteed floor. The router degrades gracefully and silently — every request gets a sane buyer, every time. Per-IP rate limits, concurrency caps and a daily spend ceiling sit in front of the metered path so the demo can't be drained.
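The routing policy in miniature. The backend stubs and the 20%-under heuristic are illustrative assumptions; rate limits and the spend ceiling sit in middleware in front of this, not shown here:

```python
# Sketch of the degrade-gracefully router: endpoint -> local -> heuristic floor.
def hf_endpoint(obs) -> dict: ...      # hot HF Inference Endpoint (metered)
def ollama_local(obs) -> dict: ...     # warm local secondary

def heuristic_buyer(obs) -> dict:      # guaranteed floor: always a sane action
    price = min(obs["budget"], int(obs["asking_price"] * 0.8))
    return {"action": "counter", "price": price, "message": "Thoda kam karo, boss."}

def route(obs) -> dict:
    for backend in (hf_endpoint, ollama_local):
        try:
            result = backend(obs)
            if result:                 # stubbed backends return None; skip them
                return result
        except Exception:
            continue                   # degrade silently to the next tier
    return heuristic_buyer(obs)
```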
SFT → GRPO → RLAIF/DPO.
Three stages, each fixing a different kind of bug. SFT teaches the model to speak the protocol. GRPO teaches it to win. DPO with Claude-as-judge polishes the prose. The result is Sauda v2 (and v3 once DPO completes).
Teaches the model the strict-JSON output format and Hinglish/English message register. After SFT, the buyer talks like a buyer instead of a chatbot.
Teaches the model to actually capture surplus. Reward signal is the env's normalized buyer surplus, with first-step shaping for early offers.
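The reward in code, under one assumption: "normalized buyer surplus" means the buyer's fraction of the budget-to-reservation gap. The shaping bonus and its constants are illustrative:

```python
# Sketch of the GRPO reward; the 0.7 anchor threshold and 0.05 bonus are
# illustrative assumptions, not the repo's actual values.
def buyer_reward(closed, deal_price, budget, reservation, first_offer=None):
    if not closed:
        return 0.0
    r = (budget - deal_price) / (budget - reservation)  # 1.0 = closed at the seller's floor
    if first_offer is not None and first_offer <= 0.7 * budget:
        r += 0.05                                       # first-step shaping: reward an early low anchor
    return r
```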
Teaches the model to prefer the winning trajectory style. Honest framing: this is RLAIF (Claude-as-judge), not RLHF — published research uses both, we name ours.
For each scenario we sample two buyer rollouts at different temperatures against the same seller. Claude reads both transcripts and picks the winner — “closed the deal, captured more surplus, didn't fold to bluffs.” The (chosen, rejected) pair is fed into trl.DPOTrainer on top of the SFT+GRPO adapter. Our heuristic-judge fallback recognises either-side accepts and uses a soft tiebreak when neither closes, so the pipeline produces real preference signal even without an API key.
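Wired into trl, that pipeline could look roughly like this. `rollout`, `judge`, `model`, and `tokenizer` are assumed handles (the judge stands in for Claude or the heuristic fallback); the dataset columns are trl's standard prompt/chosen/rejected format:

```python
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Sketch of the preference-pair pipeline; helper names are assumptions.
pairs = []
for scenario in scenarios:
    a = rollout(scenario, temperature=0.7)   # two buyer rollouts, same seller
    b = rollout(scenario, temperature=1.0)
    chosen, rejected = judge(a, b)           # closed, more surplus, didn't fold to bluffs
    pairs.append({
        "prompt": scenario.prompt,
        "chosen": chosen.text,
        "rejected": rejected.text,
    })

trainer = DPOTrainer(
    model=model,                             # the SFT+GRPO adapter
    args=DPOConfig(output_dir="sauda-dpo", beta=0.1),
    train_dataset=Dataset.from_list(pairs),
    processing_class=tokenizer,
)
trainer.train()
```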