OpenEnv · Negotiation Playground

Watch agents haggle. Step in yourself.

A negotiation environment with observable tells and hidden reservation prices. Buyer and seller are both LLMs — Sauda on the buy side, Gemma-4-E4B on the sell side. Strategy improves through self-play; drop in as a seller, watch the arena, or scrub a replay.

Powered by RLAIF · OpenEnv-compliant · 8B · QLoRA

Pick a way in

replays →
State of the playground

Three policies. Three task suites. Receipts on file.

Buyer-share is the fraction of bargaining surplus the agent captured. Mutual-loss is how often it walked away from a winnable deal. Sauda v2 captures the most surplus per close; it's also the only buyer that walks when the deal is bad.
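Concretely, the two headline metrics can be computed like this. A minimal sketch: it assumes buyer-share is (budget − price) / (budget − reservation) and an episode-dict shape that is illustrative, not the env's schema.

```python
def buyer_share(price: float, budget: float, reservation: float) -> float:
    # Fraction of the bargaining surplus (budget - reservation) the buyer
    # kept by closing at `price`: 1.0 means it closed at the seller's floor,
    # 0.0 means it paid its full budget. Assumes budget > reservation.
    return (budget - price) / (budget - reservation)

def mutual_loss_rate(episodes: list) -> float:
    # Share of winnable episodes (budget >= reservation) that ended with
    # no deal. The episode dicts here are an assumed shape, not the env's.
    winnable = [e for e in episodes if e["budget"] >= e["reservation"]]
    walked = [e for e in winnable if not e["closed"]]
    return len(walked) / len(winnable) if winnable else 0.0

# budget 100, seller floor 60, closed at 70: buyer kept 30 of 40 surplus
print(buyer_share(70, 100, 60))  # → 0.75
```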

see all replays
| policy | buyer_share | win_rate | mutual_loss | rounds |
| --- | --- | --- | --- | --- |
| llama-3.2-3b base | 0.570 | 67% | 0% | 2.2 |
| llama-3.1-8b base | 0.686 | 73% | 1% | 3.1 |
| sauda v2 (8b sft+grpo) | 0.799 | 64% | 9% | 6.0 |
Llama-3.1-8B QLoRA · SFT + GRPO · 90 ep × 3 tasks · hardened seller. raw eval data →
Why this exists
Tells are noisy and observable.

Real bargaining isn't about price alone. Sellers fidget, anchor early, claim outside pressure. Most negotiation envs throw those signals away. Ours surfaces twelve of them as first-class observations.

Information is asymmetric.

The buyer never sees the seller's reservation. The seller never sees the budget. Both sides infer. The whole point of the agent is to do that inference better than rules can.

Strategy is trained, not prompted.

The buyer was trained on this env through SFT, GRPO, and RLAIF/DPO. That's why it negotiates twice as long as base models and captures more surplus per close — the env's reward shape made it. The repo is public if you want to train your own. How it's trained →

How it works

Two LLMs negotiate. One of them learned how through RLAIF.

BazaarBATNA is an OpenEnv-compliant environment where buyer and seller are both language models. The buyer is Sauda (Llama-3.1-8B + LoRA, trained on this env). The seller is Gemma-4-E4B with persona instructions and four hard rules baked into code: never accept below reservation, never leak it in messages, counter monotonically toward the buyer, anchor with item details.

Both sides infer through asymmetric information. The buyer never sees the seller's reservation. The seller never sees the buyer's budget. The whole system tests whether trained behaviour beats prompted behaviour at this game.
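As an illustration, the seller's code-enforced price rules might look like this. A sketch with assumed float prices and function names, not the repo's implementation:

```python
def clamp_seller_counter(proposed: float, last_counter: float,
                         reservation: float) -> float:
    # Rules "never below reservation" and "counter monotonically" as a
    # guardrail around whatever price the persona-prompted LLM proposes:
    # never go below the private floor, never move the price back up.
    return max(reservation, min(proposed, last_counter))

def safe_accept(buyer_offer: float, reservation: float) -> bool:
    # The same floor on the accept path: only close at or above reservation.
    return buyer_offer >= reservation

# LLM wants to counter at 55, but the floor is 60 and the last counter was 80
print(clamp_seller_counter(55, 80, 60))  # → 60
```

The "never leak it" rule lives on the message channel instead, as a filter over the generated text rather than the price.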

[Diagram: BUYER Sauda (Llama-3.1-8B + LoRA + steering) ⇄ OPENENV BazaarBATNA (FastAPI: /reset /step /state /score /tasks /health) ⇄ SELLER Gemma-4-E4B (persona prompt + 4 hard rules). Flows: action in, obs + tells out, shared history, offer + msg. 8 tasks · 3 personas · amazon listings · two LLMs · asymmetric information.]
env
OpenEnv FastAPI

/reset, /step, /state, /score, /tasks. Eight task suites, four seller personas, real Amazon listings as price anchors.
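A minimal client for those endpoints might look like this (stdlib only). The endpoint names come from above; the request payloads and the `BazaarClient` name are assumptions, not the real schema.

```python
import json
import urllib.request

class BazaarClient:
    """Minimal OpenEnv HTTP client sketch; payload shapes are assumed."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            f"{self.base_url}{path}",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())

    def reset(self, task_id: str = "default") -> dict:
        return self._post("/reset", {"task_id": task_id})

    def step(self, action: str, price: float, message: str) -> dict:
        return self._post("/step",
                          {"action": action, "price": price, "message": message})

    def score(self) -> dict:
        return self._post("/score", {})
```

A session is then reset → step until one side accepts or walks → score.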

buyer
Sauda — Llama-3.1-8B + LoRA

Trained on this env through SFT → GRPO → RLAIF/DPO. Outputs structured JSON action plus a Hinglish/English message.

seller
Gemma-4-E4B

Persona-prompted. Four code-enforced rules. Auto-accepts at reservation. 50-ep quality eval passes 5 of 6 acceptance criteria.

Architecture

The buyer agent, top to bottom.

[Diagram: Observation (round, ask, budget, history, tells; from env) → Llama-3.1-8B base (unsloth ungated mirror · bf16; frozen) → LoRA adapter (Sauda v2 · 13.6M trainable; trained) → Bayesian steering (tell-aware action gate; post-hoc) → Action JSON ({ action, price, message }; to env).]
1
Observation

The env emits a structured obs each step: round counter, asking price, your last offer, your private budget, recent history, optional seller-tells channel (12 noisy signals).
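One way to model that obs as a typed structure. The field names mirror the description above but are illustrative, not the env's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Observation:
    """Hypothetical shape of the per-step obs; names are illustrative."""
    round: int                  # round counter
    ask: float                  # seller's current asking price
    last_offer: Optional[float] # buyer's previous offer, None on the first turn
    budget: float               # buyer's private budget (never shown to seller)
    history: list = field(default_factory=list)   # recent turns
    tells: dict = field(default_factory=dict)     # up to 12 noisy seller signals

obs = Observation(round=1, ask=120.0, last_offer=None, budget=100.0,
                  tells={"urgency": 0.7, "anchoring": 0.4})
```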

2
LLM policy

Llama-3.1-8B base + QLoRA adapter (PayMyBills/bestdealbot-v2). Outputs strict JSON: action / price / message. The message field carries a Hinglish/English line that gets rendered to the user.
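Parsing that strict-JSON output defensively might look like this. The action vocabulary and the fallback action are assumptions, not the repo's parser:

```python
import json

ACTIONS = {"offer", "accept", "walk"}  # assumed action set

def parse_action(raw: str) -> dict:
    # Validate the model's strict-JSON output; fall back to a harmless
    # zero-price offer when the generation is malformed.
    try:
        obj = json.loads(raw)
        assert isinstance(obj, dict) and obj.get("action") in ACTIONS
        assert isinstance(obj.get("price"), (int, float))
        assert isinstance(obj.get("message"), str)
        return obj
    except (json.JSONDecodeError, AssertionError):
        return {"action": "offer", "price": 0.0, "message": ""}

print(parse_action('{"action": "offer", "price": 72, "message": "Bhai, 72 final?"}'))
```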

3
Bayesian persuasion steering

Posterior over seller urgency & flexibility, updated from tells + concession behaviour. Gates the raw model action with a Nash-style target offer and an adaptive close threshold near deadline. (Currently a substrate, not a performance lever — see ablation below.)
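A toy version of the update-and-gate loop, collapsing the posterior to a one-dimensional urgency belief. The update weight and target formula are assumptions, not the agent's actual math:

```python
def update_urgency(prior: float, tell: float, weight: float = 0.3) -> float:
    # Toy belief update: nudge the urgency estimate toward the noisy tell.
    # The real agent tracks urgency and flexibility jointly.
    return (1 - weight) * prior + weight * tell

def gate_offer(model_price: float, budget: float, est_reservation: float,
               urgency: float) -> float:
    # Nash-style target: split the estimated surplus, shifted toward the
    # seller's floor as believed urgency rises; cap the raw model offer there.
    nash_target = est_reservation + (1 - urgency) * 0.5 * (budget - est_reservation)
    return max(est_reservation, min(model_price, nash_target))

print(gate_offer(85, 100, 60, 0.5))  # → 70.0
```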

4
Live serving

Layered backends: a hot HF Inference Endpoint serves the model at production latency, with a local Ollama runtime as a warm secondary and rule-based heuristics as a guaranteed floor. The router degrades gracefully and silently — every request gets a sane buyer, every time. Per-IP rate limits, concurrency caps and a daily spend ceiling sit in front of the metered path so the demo can't be drained.
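The degradation logic reduces to try-in-order with a floor that cannot fail. Backend names and the 70%-of-ask heuristic here are illustrative, not the demo's actual router:

```python
from typing import Callable

def heuristic_buyer(obs: dict) -> dict:
    # Guaranteed floor: a rule-based counter at 70% of the ask (assumed ratio).
    return {"action": "offer", "price": round(0.7 * obs["ask"], 2),
            "message": "Thoda kam karo, boss."}

def route(obs: dict, backends: list) -> dict:
    # Try each backend in priority order (hot endpoint, local runtime, ...);
    # silently fall through on any failure, ending at the heuristic floor.
    for backend in backends:
        try:
            return backend(obs)
        except Exception:
            continue
    return heuristic_buyer(obs)   # never raises

def flaky_endpoint(obs: dict) -> dict:
    raise TimeoutError("endpoint cold")

print(route({"ask": 100.0}, [flaky_endpoint, heuristic_buyer]))
# → {'action': 'offer', 'price': 70.0, 'message': 'Thoda kam karo, boss.'}
```

Rate limits and the spend ceiling would sit in front of the first (metered) backend, not inside this loop.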

Training pipeline

SFT → GRPO → RLAIF/DPO.

Three stages, each fixing a different kind of bug. SFT teaches the model to speak the protocol. GRPO teaches it to win. DPO with Claude-as-judge polishes the prose. The result is Sauda v2 (and v3 once DPO completes).

[Diagram: BASE Llama-3.1-8B (unsloth mirror) → STAGE 1 SFT (JSON · register; loss 2.14 → 0.10) → STAGE 2 GRPO (env reward; surplus +13%) → STAGE 3 DPO (RLAIF · Claude; prose polish) → SHIPPED Sauda v3 (on HF).]
stage 1
SFT
Supervised warmup
QLoRA · rule-based buyer rollouts · 1024 examples

Teaches the model the strict-JSON output format and Hinglish/English message register. After SFT, the buyer talks like a buyer instead of a chatbot.

loss: 2.14 → 0.10
stage 2
GRPO
Group-relative policy optimisation
Continues from SFT adapter · env reward + first-step shaping · 30 steps

Teaches the model to actually capture surplus. Reward signal is the env's normalized buyer surplus, with first-step shaping for early offers.

loss: 0.004 final
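The reward described for this stage can be sketched as follows. The shaping coefficient and the target opening ratio are assumptions, not the repo's numbers:

```python
from typing import Optional

def buyer_reward(closed: bool, price: float, budget: float, reservation: float,
                 first_offer_ratio: Optional[float] = None) -> float:
    # Normalized buyer surplus on a close; zero on no deal. The first-step
    # shaping bonus (coefficient 0.1, target offer at 60% of ask) is assumed.
    reward = (budget - price) / (budget - reservation) if closed else 0.0
    if first_offer_ratio is not None:
        reward += 0.1 * (1.0 - abs(first_offer_ratio - 0.6))
    return reward

print(buyer_reward(True, 70, 100, 60))  # → 0.75
```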
stage 3
RLAIF / DPO
Direct preference optimisation
Two rollouts at temp 0.5 vs 0.9 · Claude judges · trl.DPOTrainer

Teaches the model to prefer the winning trajectory style. Honest framing: this is RLAIF (Claude as judge), not RLHF; published research uses both, and we name ours.

loss: cooking now
RLAIF in detail

For each scenario we sample two buyer rollouts at different temperatures against the same seller. Claude reads both transcripts and picks the winner — “closed the deal, captured more surplus, didn't fold to bluffs.” The (chosen, rejected) pair is fed into trl.DPOTrainer on top of the SFT+GRPO adapter. Our heuristic-judge fallback recognises either-side accepts and uses a soft tiebreak when neither closes, so the pipeline produces real preference signal even without an API key.
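A sketch of that pair construction, down to the column format trl.DPOTrainer accepts. The heuristic judge and the dict shapes are illustrative, not the pipeline's code:

```python
def judge_heuristic(a: dict, b: dict) -> str:
    # Fallback judge sketch: prefer the rollout that closed; soft-tiebreak
    # on buyer surplus when both (or neither) closed. Stand-in for Claude.
    if a["closed"] != b["closed"]:
        return "a" if a["closed"] else "b"
    return "a" if a["surplus"] >= b["surplus"] else "b"

def to_dpo_row(prompt: str, a: dict, b: dict) -> dict:
    # One (prompt, chosen, rejected) row for a DPO train_dataset.
    winner = judge_heuristic(a, b)
    chosen, rejected = (a, b) if winner == "a" else (b, a)
    return {"prompt": prompt,
            "chosen": chosen["transcript"],
            "rejected": rejected["transcript"]}

row = to_dpo_row(
    "You are the buyer. Ask: 120. Budget: 100.",
    {"closed": True, "surplus": 0.75, "transcript": "opened 65, closed at 72"},
    {"closed": False, "surplus": 0.0, "transcript": "stalled, walked at round 6"},
)
print(row["chosen"])  # → opened 65, closed at 72
```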

[Diagram: rollout A (temp 0.5) + rollout B (temp 0.9) → JUDGE (Claude) → chosen / rejected → trl.DPOTrainer (on top of v2 SFT+GRPO).]