OpenEnv · Negotiation Playground

Watch agents haggle. Step in yourself.

A negotiation environment with observable tells and hidden reservation prices. Buyer and seller are both LLMs — Sauda on the buy side, Gemma-4-E4B on the sell side. Strategy improves through self-play; drop in as a seller, watch the arena, or scrub a replay.

Powered by RLAIF · OpenEnv-compliant · 8B · QLoRA

Pick a way in

replays →
State of the playground

Three policies. Three task suites. Receipts on file.

Buyer-share is the fraction of bargaining surplus the agent captured. Mutual-loss is how often it walked away from a winnable deal. Sauda v2 captures the most surplus per close; it's also the only buyer that walks when the deal is bad.
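Concretely, the two headline metrics can be computed like this. A minimal sketch: it assumes buyer-share is (budget − price) / (budget − reservation) and an episode-dict shape that is illustrative, not the env's schema.

```python
def buyer_share(price: float, budget: float, reservation: float) -> float:
    # Fraction of the bargaining surplus (budget - reservation) the buyer
    # kept by closing at `price`: 1.0 means it closed at the seller's floor,
    # 0.0 means it paid its full budget. Assumes budget > reservation.
    return (budget - price) / (budget - reservation)

def mutual_loss_rate(episodes: list) -> float:
    # Share of winnable episodes (budget >= reservation) that ended with
    # no deal. The episode dicts here are an assumed shape, not the env's.
    winnable = [e for e in episodes if e["budget"] >= e["reservation"]]
    walked = [e for e in winnable if not e["closed"]]
    return len(walked) / len(winnable) if winnable else 0.0

# budget 100, seller floor 60, closed at 70: buyer kept 30 of 40 surplus
print(buyer_share(70, 100, 60))  # → 0.75
```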

see all replays
| policy | buyer_share | win_rate | mutual_loss | rounds |
| --- | --- | --- | --- | --- |
| llama-3.2-3b base | 0.570 | 67% | 0% | 2.2 |
| llama-3.1-8b base | 0.686 | 73% | 1% | 3.1 |
| sauda v2 (8b sft+grpo) | 0.799 | 64% | 9% | 6.0 |
Llama-3.1-8B QLoRA · SFT + GRPO · 90 ep × 3 tasks · hardened seller. raw eval data →
Why this exists
Tells are noisy and observable.

Real bargaining isn't about price alone. Sellers fidget, anchor early, claim outside pressure. Most negotiation envs throw those signals away. Ours surfaces twelve of them as first-class observations.

Information is asymmetric.

The buyer never sees the seller's reservation. The seller never sees the budget. Both sides infer. The whole point of the agent is to do that inference better than rules can.

Strategy is trained, not prompted.

The buyer was trained on this env through SFT, GRPO, and RLAIF/DPO. That's why it negotiates twice as long as base models and captures more surplus per close — the env's reward shape made it. The repo is public if you want to train your own. How it's trained →

How it works

Two LLMs negotiate. One of them learned how through RLAIF.

BazaarBATNA is an OpenEnv-compliant environment where buyer and seller are both language models. The buyer is Sauda (Llama-3.1-8B + LoRA, trained on this env). The seller is Gemma-4-E4B with persona instructions and four hard rules baked into code: never accept below reservation, never leak it in messages, counter monotonically toward the buyer, anchor with item details.

Both sides infer through asymmetric information. The buyer never sees the seller's reservation. The seller never sees the buyer's budget. The whole system tests whether trained behaviour beats prompted behaviour at this game.
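As an illustration, the seller's code-enforced price rules might look like this. A sketch with assumed float prices and function names, not the repo's implementation:

```python
def clamp_seller_counter(proposed: float, last_counter: float,
                         reservation: float) -> float:
    # Rules "never below reservation" and "counter monotonically" as a
    # guardrail around whatever price the persona-prompted LLM proposes:
    # never go below the private floor, never move the price back up.
    return max(reservation, min(proposed, last_counter))

def safe_accept(buyer_offer: float, reservation: float) -> bool:
    # The same floor on the accept path: only close at or above reservation.
    return buyer_offer >= reservation

# LLM wants to counter at 55, but the floor is 60 and the last counter was 80
print(clamp_seller_counter(55, 80, 60))  # → 60
```

The "never leak it" rule lives on the message channel instead, as a filter over the generated text rather than the price.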

[Diagram: BUYER Sauda (Llama-3.1-8B + LoRA + steering) ⇄ OPENENV BazaarBATNA (FastAPI: /reset /step /state /score /tasks /health) ⇄ SELLER Gemma-4-E4B (persona prompt + 4 hard rules). Flows: action in, obs + tells out, shared history, offer + msg. 8 tasks · 3 personas · amazon listings · two LLMs · asymmetric information.]
env
OpenEnv FastAPI

/reset, /step, /state, /score, /tasks. Eight task suites, four seller personas, real Amazon listings as price anchors.
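A minimal client for those endpoints might look like this (stdlib only). The endpoint names come from above; the request payloads and the `BazaarClient` name are assumptions, not the real schema.

```python
import json
import urllib.request

class BazaarClient:
    """Minimal OpenEnv HTTP client sketch; payload shapes are assumed."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            f"{self.base_url}{path}",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())

    def reset(self, task_id: str = "default") -> dict:
        return self._post("/reset", {"task_id": task_id})

    def step(self, action: str, price: float, message: str) -> dict:
        return self._post("/step",
                          {"action": action, "price": price, "message": message})

    def score(self) -> dict:
        return self._post("/score", {})
```

A session is then reset → step until one side accepts or walks → score.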

buyer
Sauda — Llama-3.1-8B + LoRA

Trained on this env through SFT → GRPO → RLAIF/DPO. Outputs structured JSON action plus a Hinglish/English message.

seller
Gemma-4-E4B

Persona-prompted. Four code-enforced rules. Auto-accepts at reservation. 50-ep quality eval passes 5 of 6 acceptance criteria.

Architecture

The buyer agent, top to bottom.

[Diagram: Observation (round, ask, budget, history, tells; from env) → Llama-3.1-8B base (unsloth ungated mirror · bf16; frozen) → LoRA adapter (Sauda v2 · 13.6M trainable; trained) → Bayesian steering (tell-aware action gate; post-hoc) → Action JSON ({ action, price, message }; to env).]
1
Observation

The env emits a structured obs each step: round counter, asking price, your last offer, your private budget, recent history, optional seller-tells channel (12 noisy signals).
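One way to model that obs as a typed structure. The field names mirror the description above but are illustrative, not the env's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Observation:
    """Hypothetical shape of the per-step obs; names are illustrative."""
    round: int                  # round counter
    ask: float                  # seller's current asking price
    last_offer: Optional[float] # buyer's previous offer, None on the first turn
    budget: float               # buyer's private budget (never shown to seller)
    history: list = field(default_factory=list)   # recent turns
    tells: dict = field(default_factory=dict)     # up to 12 noisy seller signals

obs = Observation(round=1, ask=120.0, last_offer=None, budget=100.0,
                  tells={"urgency": 0.7, "anchoring": 0.4})
```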

2
LLM policy

Llama-3.1-8B base + QLoRA adapter (PayMyBills/bestdealbot-v2). Outputs strict JSON: action / price / message. The message field carries a Hinglish/English line that gets rendered to the user.
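Parsing that strict-JSON output defensively might look like this. The action vocabulary and the fallback action are assumptions, not the repo's parser:

```python
import json

ACTIONS = {"offer", "accept", "walk"}  # assumed action set

def parse_action(raw: str) -> dict:
    # Validate the model's strict-JSON output; fall back to a harmless
    # zero-price offer when the generation is malformed.
    try:
        obj = json.loads(raw)
        assert isinstance(obj, dict) and obj.get("action") in ACTIONS
        assert isinstance(obj.get("price"), (int, float))
        assert isinstance(obj.get("message"), str)
        return obj
    except (json.JSONDecodeError, AssertionError):
        return {"action": "offer", "price": 0.0, "message": ""}

print(parse_action('{"action": "offer", "price": 72, "message": "Bhai, 72 final?"}'))
```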

3
Bayesian persuasion steering

Posterior over seller urgency & flexibility, updated from tells + concession behaviour. Gates the raw model action with a Nash-style target offer and an adaptive close threshold near deadline. (Currently a substrate, not a performance lever — see ablation below.)
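A toy version of the update-and-gate loop, collapsing the posterior to a one-dimensional urgency belief. The update weight and target formula are assumptions, not the agent's actual math:

```python
def update_urgency(prior: float, tell: float, weight: float = 0.3) -> float:
    # Toy belief update: nudge the urgency estimate toward the noisy tell.
    # The real agent tracks urgency and flexibility jointly.
    return (1 - weight) * prior + weight * tell

def gate_offer(model_price: float, budget: float, est_reservation: float,
               urgency: float) -> float:
    # Nash-style target: split the estimated surplus, shifted toward the
    # seller's floor as believed urgency rises; cap the raw model offer there.
    nash_target = est_reservation + (1 - urgency) * 0.5 * (budget - est_reservation)
    return max(est_reservation, min(model_price, nash_target))

print(gate_offer(85, 100, 60, 0.5))  # → 70.0
```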

4
Live serving

Layered backends: a hot HF Inference Endpoint serves the model at production latency, with a local Ollama runtime as a warm secondary and rule-based heuristics as a guaranteed floor. The router degrades gracefully and silently — every request gets a sane buyer, every time. Per-IP rate limits, concurrency caps and a daily spend ceiling sit in front of the metered path so the demo can't be drained.
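The degradation logic reduces to try-in-order with a floor that cannot fail. Backend names and the 70%-of-ask heuristic here are illustrative, not the demo's actual router:

```python
from typing import Callable

def heuristic_buyer(obs: dict) -> dict:
    # Guaranteed floor: a rule-based counter at 70% of the ask (assumed ratio).
    return {"action": "offer", "price": round(0.7 * obs["ask"], 2),
            "message": "Thoda kam karo, boss."}

def route(obs: dict, backends: list) -> dict:
    # Try each backend in priority order (hot endpoint, local runtime, ...);
    # silently fall through on any failure, ending at the heuristic floor.
    for backend in backends:
        try:
            return backend(obs)
        except Exception:
            continue
    return heuristic_buyer(obs)   # never raises

def flaky_endpoint(obs: dict) -> dict:
    raise TimeoutError("endpoint cold")

print(route({"ask": 100.0}, [flaky_endpoint, heuristic_buyer]))
# → {'action': 'offer', 'price': 70.0, 'message': 'Thoda kam karo, boss.'}
```

Rate limits and the spend ceiling would sit in front of the first (metered) backend, not inside this loop.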

Training pipeline

SFT → GRPO → RLAIF/DPO.

Three stages, each fixing a different kind of bug. SFT teaches the model to speak the protocol. GRPO teaches it to win. DPO with Claude-as-judge polishes the prose. The result is Sauda v2 (and v3 once DPO completes).

[Diagram: BASE Llama-3.1-8B (unsloth mirror) → STAGE 1 SFT (JSON · register; loss 2.14 → 0.10) → STAGE 2 GRPO (env reward; surplus +13%) → STAGE 3 DPO (RLAIF · Claude; prose polish) → SHIPPED Sauda v3 (on HF).]
stage 1
SFT
Supervised warmup
QLoRA · rule-based buyer rollouts · 1024 examples

Teaches the model the strict-JSON output format and Hinglish/English message register. After SFT, the buyer talks like a buyer instead of a chatbot.

loss: 2.14 → 0.10
stage 2
GRPO
Group-relative policy optimisation
Continues from SFT adapter · env reward + first-step shaping · 30 steps

Teaches the model to actually capture surplus. Reward signal is the env's normalized buyer surplus, with first-step shaping for early offers.

loss: 0.004 final
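The reward described for this stage can be sketched as follows. The shaping coefficient and the target opening ratio are assumptions, not the repo's numbers:

```python
from typing import Optional

def buyer_reward(closed: bool, price: float, budget: float, reservation: float,
                 first_offer_ratio: Optional[float] = None) -> float:
    # Normalized buyer surplus on a close; zero on no deal. The first-step
    # shaping bonus (coefficient 0.1, target offer at 60% of ask) is assumed.
    reward = (budget - price) / (budget - reservation) if closed else 0.0
    if first_offer_ratio is not None:
        reward += 0.1 * (1.0 - abs(first_offer_ratio - 0.6))
    return reward

print(buyer_reward(True, 70, 100, 60))  # → 0.75
```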
stage 3
RLAIF / DPO
Direct preference optimisation
Two rollouts at temp 0.5 vs 0.9 · Claude judges · trl.DPOTrainer

Teaches the model to prefer the winning trajectory style. Honest framing: this is RLAIF (Claude as judge), not RLHF; published research uses both, and we name ours.

loss: cooking now
RLAIF in detail

For each scenario we sample two buyer rollouts at different temperatures against the same seller. Claude reads both transcripts and picks the winner — “closed the deal, captured more surplus, didn't fold to bluffs.” The (chosen, rejected) pair is fed into trl.DPOTrainer on top of the SFT+GRPO adapter. Our heuristic-judge fallback recognises either-side accepts and uses a soft tiebreak when neither closes, so the pipeline produces real preference signal even without an API key.
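A sketch of that pair construction, down to the column format trl.DPOTrainer accepts. The heuristic judge and the dict shapes are illustrative, not the pipeline's code:

```python
def judge_heuristic(a: dict, b: dict) -> str:
    # Fallback judge sketch: prefer the rollout that closed; soft-tiebreak
    # on buyer surplus when both (or neither) closed. Stand-in for Claude.
    if a["closed"] != b["closed"]:
        return "a" if a["closed"] else "b"
    return "a" if a["surplus"] >= b["surplus"] else "b"

def to_dpo_row(prompt: str, a: dict, b: dict) -> dict:
    # One (prompt, chosen, rejected) row for a DPO train_dataset.
    winner = judge_heuristic(a, b)
    chosen, rejected = (a, b) if winner == "a" else (b, a)
    return {"prompt": prompt,
            "chosen": chosen["transcript"],
            "rejected": rejected["transcript"]}

row = to_dpo_row(
    "You are the buyer. Ask: 120. Budget: 100.",
    {"closed": True, "surplus": 0.75, "transcript": "opened 65, closed at 72"},
    {"closed": False, "surplus": 0.0, "transcript": "stalled, walked at round 6"},
)
print(row["chosen"])  # → opened 65, closed at 72
```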

[Diagram: rollout A (temp 0.5) + rollout B (temp 0.9) → JUDGE (Claude) → chosen / rejected → trl.DPOTrainer (on top of v2 SFT+GRPO).]