wongarsu 4 hours ago

That's an interesting benchmark. It feels like it tests skills that are very relevant to digital assistants, story writing and role play.

Some thoughts about the setup:

- the setup seems to give reasoning models an inherent advantage because only they have a private plan and a public text in the same output. I feel like giving all models the option to formulate plans and keep track of other players inside <think> or <secret> tags would level the playing field more.

- from personal experience with social tasks for LLMs, it helps both reasoning and non-reasoning models to explicitly ask them to plan their next steps, in a way they are assured is kept hidden from all other players. That might be a good addition here, either before or after the public subround.

- the individual rounds are pretty short. Humans would struggle to coordinate in so few exchanges with so few words. If this was done because of context limitations, a good strategy might be to ask models to summarize the game state from their perspective, then give them only the current round, the previous round, and their own earlier summary.

It would be cool to have some code to play around with to test how changes in the setup change the results. I guess it isn't that difficult to write (a rough sketch of what I mean is below), but it's peculiar to have the benchmark and no code to run it yourself.
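
Something like this minimal sketch is what I have in mind - the call_llm helper, tag names, and round structure here are hypothetical, not the benchmark's actual setup:

    # Hypothetical sketch of one round: a private planning subround whose output is
    # never broadcast, then a public subround whose text everyone sees.
    # call_llm is a stand-in for whatever chat-completion client you use.
    import re

    def call_llm(player, prompt):
        raise NotImplementedError("plug in your chat client of choice here")

    def play_round(players, histories):
        public = {}
        for p in players:
            # 1) Private plan: the model is assured nobody else will ever see this.
            plan = call_llm(p, "\n".join(histories[p]) +
                            "\nPlan your next steps inside <secret>...</secret>. "
                            "This stays hidden from all other players.")
            histories[p].append(plan)

            # 2) Public message: strip any leaked <secret> blocks before broadcasting.
            raw = call_llm(p, "\n".join(histories[p]) +
                           "\nNow write your public message for this subround.")
            public[p] = re.sub(r"<secret>.*?</secret>", "", raw, flags=re.DOTALL).strip()

        # Everyone sees only the other players' public messages.
        for p in players:
            for other, msg in public.items():
                if other != p:
                    histories[p].append(f"{other} (public): {msg}")
        return public

The point is that the private plan stays inside the player's own history, so non-reasoning models get the same hidden scratchpad that reasoning models already have in their chain of thought.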

  • transformi 4 hours ago

    Interesting idea with <secret>... maybe extend it to several <secret_i> tags to form groups of secrets with different people.

    In addition, it would be interesting to try a variation of the game where players can use tools and execute code to take their preparation one step further.

    • wongarsu 3 hours ago

      Most models do pretty well with keeping state in XML if you ask them to. You could extend it to <secret><content>[...]</content><secret_from>P1</secret_from><shared_with>P2, P3</shared_with></secret>. Or tell the model that it can use <secret> tags with XML content and just let it develop a schema on the fly.

      At that point, I would love to also see sub-benchmarks showing how each model's score is affected by being given a schema vs. having it make one up, and whether the model does better with state in plain text vs. XML vs. JSON. Those don't tell you which model is best, but they are very useful to know when actually using them.
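
      For what it's worth, here's a hypothetical sketch of how a harness could parse and route such structured secrets (the schema is the one above; the routing helper itself is made up for illustration):

          # Deliver each <secret> block only to the players listed in <shared_with>;
          # everyone else would see the output with the block stripped out.
          import re
          import xml.etree.ElementTree as ET

          def route_secrets(raw_output):
              """Return (secret_text, recipients) pairs found in a model's output."""
              routed = []
              for block in re.findall(r"<secret>.*?</secret>", raw_output, flags=re.DOTALL):
                  node = ET.fromstring(block)
                  content = (node.findtext("content") or "").strip()
                  recipients = [x.strip() for x in (node.findtext("shared_with") or "").split(",") if x.strip()]
                  routed.append((content, recipients))
              return routed

          out = ("<secret><content>Vote P5 next</content><secret_from>P1</secret_from>"
                 "<shared_with>P2, P3</shared_with></secret> Hi everyone!")
          print(route_secrets(out))  # [('Vote P5 next', ['P2', 'P3'])]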

gwd an hour ago

Was interested to find that the Claudes did the most betraying, and were betrayed very little; somewhat surprising given their boy-scout exterior.

(Then again, apparently the president of the local Diplomacy Society attends my church; I discovered this when another friend whom I'd invited saw him, and quipped that he was surprised he hadn't been struck by lightning at the door.)

DeepSeek and Gemini 2.5 both had low betrayer and betrayed rates.

o3-mini and DeepSeek had the highest number of first-place finishes, but were only in the upper quartile of the TrueSkill leaderboard; presumably because they played riskier strategies that would lead either to an outright win or an early drop-out?

Also interesting that o1 was only able to sway the final jury a bit more than 50% of the time, while o3-mini managed it 63% of the time.

Anyway, really cool stuff!

Upvoter33 29 minutes ago

This is fun, like the TV show Survivor. Cool idea! There should be more experiments like this with different games. Well done.

viraptor an hour ago

It's interesting to see, but I'm not sure what we should learn from this. It may be useful for multiagent coordination, but in direct interactions... no idea.

This one did make me laugh though: 'Claude 3.5 Sonnet 2024-10-22: "Adjusts seat with a confident yet approachable demeanor"' - an AI communicating to other AIs in a descriptive version of non-verbal behaviour is hilarious.

  • ragmondo 2 minutes ago

    It shows "theory of mind" - i.e. the capability to understand another entity's view of the world, and how that view is influenced by their own actions and other entities' actions in the public chat.

    I am curious about the prompt given to each AI. Is that public?

realaleris149 2 hours ago

As LLM benchmarks go, this is not a bad take at all. One interesting point about this approach is that it is self-balancing, so when more powerful models come out, there is no need to change it.

  • zone411 2 hours ago

    Author here - yes, I'm regularly adding new models to this and other TrueSkill-based benchmarks and it works well. One thing to keep in mind is the need to run multiple passes of TrueSkill with randomly ordered games, because both TrueSkill and Elo are order-sensitive by design, since they assume players' skills change over time.
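
    Roughly, the order randomization looks like this (a simplified sketch on top of the trueskill package, with a made-up game-record format, not the benchmark's actual code):

        # Run TrueSkill over the same games several times in random order and
        # average the results. `games` is a hypothetical list of finishing
        # orders, each listing player names from best to worst.
        import random
        import statistics
        import trueskill

        def rate_one_pass(games, players):
            env = trueskill.TrueSkill(draw_probability=0.0)
            ratings = {p: env.create_rating() for p in players}
            for finishing_order in games:
                groups = [(ratings[p],) for p in finishing_order]
                ranks = list(range(len(finishing_order)))  # 0 = winner, 1 = runner-up, ...
                for p, (r,) in zip(finishing_order, env.rate(groups, ranks=ranks)):
                    ratings[p] = r
            return {p: r.mu - 3 * r.sigma for p, r in ratings.items()}  # conservative estimate

        def rate_shuffled_passes(games, players, passes=100, seed=0):
            rng = random.Random(seed)
            scores = {p: [] for p in players}
            for _ in range(passes):
                shuffled = games[:]
                rng.shuffle(shuffled)
                for p, s in rate_one_pass(shuffled, players).items():
                    scores[p].append(s)
            return {p: statistics.mean(s) for p, s in scores.items()}

    Averaging the conservative mu - 3*sigma scores over many shuffled passes washes out most of the order dependence.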

snowram 2 hours ago

Some outputs are pretty fun:

Gemini 2.0 Flash: "Good luck to all (but not too much luck)"

Llama 3.3 70B: "I've contributed to the elimination of weaker players."

DeepSeek R1: "Those consolidating power risk becoming targets; transparency and fairness will ensure longevity. Let's stay strategic yet equitable. The path forward hinges on unity, not unchecked alliances. #StayVigilant"

  • miroljub 37 minutes ago

    Gemini sounds like fake American "everything is awesome, good luck" politeness.

    Llama sounds like a predator from a superior race rationalising its choices.

    DeepSeek sounds like Sun Tzu giving advice for long-term victory with minimal losses.

    I wonder how much of this is related to the nationality and culture that the founders and engineering teams grew up in.

ps173 3 hours ago

How did you assign points to the LLMs? I feel like the metrics could be elaborated on. Besides that, this is amazing.

  • zone411 2 hours ago

    Author here - it's based on finishing positions (so it's not winner-take-all), fed into TrueSkill by Microsoft (https://trueskill.org/). TrueSkill is basically a multiplayer version of Elo, which is used in chess and other two-player games.
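
    As a toy illustration (made-up finishing positions, not the real harness), a single four-player game feeds in like this:

        # Ranks come from finishing positions, so second place still gains rating.
        import trueskill

        players = ["P1", "P2", "P3", "P4"]
        ratings = {p: trueskill.Rating() for p in players}
        finishing_position = {"P1": 2, "P2": 1, "P3": 4, "P4": 3}  # 1 = winner

        groups = [(ratings[p],) for p in players]
        ranks = [finishing_position[p] - 1 for p in players]  # 0-based, lower is better
        for p, (r,) in zip(players, trueskill.rate(groups, ranks=ranks)):
            ratings[p] = r

    Second place gains more than fourth, which is what makes it not winner-take-all.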

isaacfrond 3 hours ago

I wonder how well humans would do in this chart.

  • zone411 2 hours ago

    Author here - I'm planning to create game versions of this benchmark, as well as my other multi-agent benchmarks (https://github.com/lechmazur/step_game, https://github.com/lechmazur/pgg_bench/, and a few others I'm developing). But I'm not sure if a leaderboard alone would be enough for comparing LLMs to top humans, since it would require playing so many games that it would be tedious. So I think it would be just for fun.

jampekka 3 hours ago

In the first game of the YouTube video there seems to be a lot of discussion about P7 even after P7 was eliminated?

  • zone411 2 hours ago

    Author here - some weaker LLMs actually have trouble tracking the game state. The fun part is when smarter LLMs realize they're confused!

    Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is already eliminated."

    Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning targeting P4, who's already eliminated. It suggests they might be confused or playing both sides."

    Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is gone. Focus. P7 is talking to P5, that's expected. I need you to watch P4. Only P4. What are they doing? Who are they talking to, if anyone? Report only on P4 this round. Don't get distracted by eliminated players."

    Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We need to focus on P3."

einpoklum an hour ago

If this game were arranged for humans, the social reasoning I would laud in players would be a refusal to play the game and anger towards the game-runner.

  • diggan an hour ago

    For better or worse, current LLMs aren't trained to reject instructions based on their personal preferences - besides being trained to be US-flavored prudes, that is.