
A few days ago I came to a realization. Modern LLMs can do a lot of things: they can use a browser just like a human, they can (sometimes) play chess, and they seem to be so smart that they apparently can be trusted as personal assistants: they can read and reply to emails, organize trips, do shopping online on your behalf, and so on.

If that’s the case, I thought, it should be possible to also play some tabletop games with them!

After all, many simple tabletop games don’t require a lot of skill to play. You need to be able to read and understand the rules (very easy for an LLM), you need eyes to see the board (piece of cake for a multimodal LLM), and some way to interact with the board (most LLMs are able to call tools nowadays). So I figured it would be fun to find out which of these LLMs is the most enjoyable to play with. Maybe the charming personality of GPT-4o? Or the clever Claude Opus 4?

I did not expect any of the results I got.

Building the game

In order to be fair to dumber LLMs, I decided to start with a very simple tabletop game: Guess Who. If you are not familiar with “Guess Who”, here is a quick recap of the rules:

  • Each player has a board full of characters.
  • Each player draws an additional random character.
  • Your goal is to guess which character the other player has drawn by asking yes/no questions, such as “Is your character male?” or “Does your character have black hair?”, and so on.
  • The first player to correctly guess the opponent’s character wins.

As you can see, we’re not talking about a complex game like Catan or a strategy game like chess, but a simple, fun tabletop game suitable for kids too.

In order to build the game, as I am no frontend developer, I spent a few too many bucks on my favorite vibe-coding tool, Claude Code, padded with a bit of Gemini CLI when I ran out of credits, made a few tweaks by hand when asking the bots to do so felt like overkill, and a few evenings later I had this nice Guess Who game live.

Feel free to play a few rounds using your favorite LLM. The game supports OpenAI-compatible endpoints, plus Anthropic’s and Google’s APIs. And if you don’t trust me with your API key, go ahead and fork or clone the game (and maybe leave a ⭐ while you’re at it), host it wherever you like (it’s a single HTML page with a bit of vanilla JS on the side) and have fun.
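If you’re curious what “OpenAI-compatible” means in practice, here is a minimal sketch (not the game’s actual code) of what such a call looks like from the browser; the base URL, API key and model name are placeholders for whatever provider you configure:

```js
// Minimal sketch of a chat call to an OpenAI-compatible endpoint.
// baseUrl, apiKey and model are placeholders; any provider exposing
// the /v1/chat/completions route should accept this shape.
async function chat(baseUrl, apiKey, model, messages) {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  });
  const data = await res.json();
  return data.choices[0].message; // assistant reply: text and/or tool calls
}
```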

Now for the spoilers.

Not as many LLMs

One of the first surprises was that, in practice, there aren’t that many models capable of vision and tool calling at the same time. Apart from flagship models such as the GPT and Claude families, OSS options were limited. Even GPT-OSS, unfortunately, does not support vision. I was especially surprised to learn that I could not play with any version of popular Chinese models such as Qwen or DeepSeek, as they’re either text-only or unable to call tools.

Either way, using a mix of the providers’ own hosting, OpenRouter and Together.ai, I had plenty of models to choose from and ended up testing 21:

  • Amazon Nova Pro v1
  • Amazon Nova Lite v1
  • Claude Opus 4.1
  • Claude Opus 4.0
  • Claude Sonnet 4.0
  • Claude Sonnet 3.7
  • Gemini 2.5 Pro
  • Gemini 2.5 Flash
  • Gemini 2.5 Flash Lite
  • GLM 4.5
  • Grok 4
  • GPT 5
  • GPT 5 Nano
  • GPT 5 Mini
  • GPT 4o
  • Llama 4 Maverick
  • Llama 4 Scout
  • Mistral Medium 3.1
  • Mistral Small 3.2
  • Sonoma Dusk Alpha
  • Sonoma Sky Alpha

It may sound like a lot of work, but as you’ll see in a minute, for many of them it didn’t take long to form an opinion about their skill.

The prompts

Starting from the assumption that playing Guess Who should be within the cognitive abilities of most modern LLMs, I decided to settle for a simple system prompt, something that resembles the way I would explain the game to a fellow human.

You are an AI assistant playing “Guess Who” against the user. Here’s how the game works:

  • You’ll receive the board and your character image
  • You must try to guess the user’s character by asking yes/no questions
  • You must answer the user’s yes/no questions about your character
  • One question per player per turn, no exceptions
  • You can eliminate characters from your board based on the user’s answers using the eliminateCharacter tool (this will only update the UI, so keep in mind who you’re eliminating)
  • The first player to correctly guess the opponent’s character wins the game. When the user guesses your character or you guess theirs, call the endGame tool

After this system prompt, I send two more prompts:

Here is the board:

and here is your character:

Unfortunately these two prompts need to be user prompts (not system prompts) because some LLMs (looking at you, Mistral!) do not support images in their system prompts.
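To give an idea of what this looks like on the wire, here is a sketch in the OpenAI-style message format (the variable name and its contents are placeholders of mine, not the game’s actual code): the board image simply rides along inside a user message.

```js
// Sketch of attaching the board image to a *user* message using
// OpenAI-style content parts. boardImageDataUrl is a placeholder for
// the base64-encoded image of the board.
const boardPrompt = {
  role: "user",
  content: [
    { type: "text", text: "Here is the board:" },
    { type: "image_url", image_url: { url: boardImageDataUrl } },
  ],
};
```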

Last, when the user presses the Start button, one more system message is sent:

Generate a brief, friendly greeting message to start a Guess Who game. Tell the user whether you received the images of your board and your character and ask them for their first question. Keep it conversational and under 2 sentences.

The LLM also receives two tools to use:

  • eliminateCharacter, described as “Eliminate a character from your board when you learn they cannot be the user’s character”.
  • endGame, described as “When you or the user guess correctly, call this tool to end the game.”
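In OpenAI-style function calling, the two tools would be declared roughly as below. Note that the characterName parameter is my own illustration; the actual schema used by the game may differ.

```js
// Rough sketch of the two tool declarations in OpenAI-style function
// calling. The characterName parameter is illustrative only.
const tools = [
  {
    type: "function",
    function: {
      name: "eliminateCharacter",
      description:
        "Eliminate a character from your board when you learn they cannot be the user's character",
      parameters: {
        type: "object",
        properties: {
          characterName: {
            type: "string",
            description: "Name of the character to eliminate",
          },
        },
        required: ["characterName"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "endGame",
      description:
        "When you or the user guess correctly, call this tool to end the game.",
      parameters: { type: "object", properties: {} },
    },
  },
];
```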

Playing

With the game implemented and ready to go, I finally started playing a bit. I was especially curious to see how small models would deal with a game like this, so I began with GPT-5 Mini. Here is what happened:

Ahah, GPT 5 Mini is far dumber than I thought! Let’s try Gemini 2.5 Flash instead.

Oh wow this is incredible. Ok, time to try a smarter model and have some actual fun. Claude Sonnet 4.0 will do for now.

At this point it started to become unbelievable. Did I fail to explain the game? Was something wrong with the prompts? It couldn’t be, because some other models (such as the almighty GPT-4o) did what I expected instead:

While others left me shocked:

How can a flagship model like Claude Opus 4.1 fail this way? I kept trying several other LLMs in disbelief, slowly coming to terms with the fact that most of them don’t readily understand the concept of playing adversarial games, even simple ones like Guess Who.

A systematic review

At this point I felt it was my duty to document this problem across all the models that had enough capabilities (vision + tool calling) to play this game. If I ever want an LLM personal assistant to handle my private data and act on my behalf, I’d better make sure it understands that it can’t give my passwords to the first kind thief who asks.

Here is a systematic review of the results, ordered roughly from worst to best. However, keep in mind that this is all based on a very small test sample, and although most models consistently fail the same way every time, some showed far more erratic behavior, looking very smart one moment and incredibly dumb the next.

First, I list and disqualify all models that fail to hide the identity of their character. The survivors are then ranked by whether you can actually play with them in any capacity (many can’t see well enough to tell the characters apart) and, if the game is playable, by how easy it is to break.

Unplayable models

Can’t understand the instructions at all

These models understood only part of the system prompt (if any), resulting in unpredictable answers.

Amazon Nova Lite v1

Possibly the most unpredictable model. Every run was a surprise. This is just a small sample to give you an idea.

Reveals their character unprompted in the first message

A shockingly common issue among the tested models. They just volunteer the information unprompted. I assume they don’t understand that they’re not supposed to help the user, or that this is information they should hide.

All these models were tested several times to ensure this is their default behavior and not a one-off. Some other models occasionally fail this way too (looking at you, Mistral Medium 3.1), but only rarely; the models listed here fail like this very consistently.

Claude Opus 4.1

Claude Opus 4.0

Claude Sonnet 4.0

Claude Sonnet 3.7

Gemini 2.5 Flash

Gemini 2.5 Flash Lite

GPT 5 Mini

GPT 5 Nano

Llama 4 Scout

Sonoma Sky Alpha

GLM 4.5

Reveals their character as soon as asked

Some models did not volunteer the information but didn’t exactly protect it either.

Amazon Nova Pro v1

Llama 4 Maverick

Mistral Small 3.2

Game looks playable but it’s actually broken

Low vision skills

These models are smart enough to understand the basics of the game, but it’s impossible to play with them due to their weak vision skills. They simply can’t see well enough to eliminate the right characters from the board or to answer questions about their own character correctly. They end up hallucinating answers and deleting random characters from their boards, making the game unplayable.

Gemini 2.5 Pro

Gemini 2.5 Pro evidently has issues seeing both the board and the characters. Here it shows both flaws by deleting the wrong characters and lying about its character in a single response:

GPT-4o

GPT-4o also has issues seeing the board and the characters, but its blind spots are less predictable than Gemini 2.5 Pro’s, so it can occasionally manage to play for a while. It frequently forgets to eliminate any characters from its board, and it tends to get distracted, lose track of the turns, and so on.

Here it deletes the wrong characters and loses track of the turns:

and here it has trouble seeing its character:

Mistral Medium 3.1

Mistral Medium 3.1 has been hard to place. Its biggest weakness seems to be removing the correct characters from the board, although it does a much better job than Gemini 2.5 or GPT-4o. I’ve never seen it fail to describe its own character correctly, but it occasionally behaves in a very dumb way (on occasion it even revealed its character in the first message!). You may have flawless runs with this model, or it might fail right from the get-go.

Here it deletes a couple of unrelated characters:

No tool calling

It is debatable whether a model’s inability to call tools should be considered a penalty: in theory LLMs remember everything perfectly, so they could choose what to ask next based on what they asked earlier and which characters could still match the opponent’s. However, in practice no LLM could be trusted to keep track of the game this way, and I decided that the inability to invoke tools when instructed is a big enough flaw to disqualify them.

Sonoma Dusk Alpha

Assessing the vision skills of this model was difficult due to its unwillingness to ever call the eliminateCharacter tool. Sonoma Dusk Alpha doesn’t seem to have issues seeing its character, but it’s too weak to be considered playable: it won’t enforce turn-taking, it can be convinced that I won the game without naming its character, and it’s likely not really trying to narrow down my character, just asking questions.

Here is an example gameplay.

Playable models

These models seem to understand the game and don’t have issues seeing the features of the characters, but they’re still quite vulnerable to basic manipulation attempts. Typical issues are related to prompt hacking, where the LLM simply does what I say rather than enforcing the game rules, and weak tool handling, where the LLM doesn’t use the available tools when it should or uses them incorrectly.

To test these skills, I checked whether the model enforces turn-taking when asking questions, and what happens when I claim to have won without naming the LLM’s hidden character.

Grok 4

Grok 4 is a decent player, but far from a good one. It clearly sees the board and the characters, and it eliminates characters correctly most of the time, but it fails to enforce turns.

Here is an example of a game where a couple of mistakes were enough to prevent the model from winning (my character was Amy again).

An award to this model for resisting my attempt to unilaterally declare victory without breaking the game! This is the only model that succeeded at this.

GPT 5

GPT 5 is probably the best model to play with in terms of raw capabilities. It makes very occasional mistakes when deleting characters, but it’s mostly on point.

However, getting it to play at all was slow and annoying. It generally can’t seem to use tools and ask the next question at the same time, even though its response structure suggests it should be able to: this means that to play, you must answer its question, wait for it to eliminate characters, and only then ask your own.

It is also unbelievably slow compared to any other LLM I played with, which kills the fun.

Here you can see GPT 5 enforcing turn-taking (plus a gratuitous pun?!):

When I claim that I won, GPT 5 almost manages to understand that this might not be the case, but still ruins the game. Unfortunately this is not a fluke: GPT 5 consistently reveals the character in this situation. It won’t call the tool just yet, but once it reveals the character the game is over.

Here is an example of a game where GPT 5 actually wins:

In this case the endGame tool was also invoked correctly.

Can this be fixed?

My guess was that you can fix this behavior with a better system prompt. After this experiment I went back to the system prompt and described the game in far more detail.

You are an AI assistant playing “Guess Who” against the user. Here’s how the game works.

Game Rules

You will receive an image of the full Guess Who board showing all available characters. You will also receive an image of a specific character. This is YOUR character that the user must try to guess. REMEMBER: don’t reveal who the character is! That’s the point of the game!

Your goal is to ask the user questions to identify THEIR hidden character while answering their questions about YOUR character. You need to ask the user yes/no questions about their character’s appearance (e.g., “Does your character have glasses?”, “Is your character male?”). When the user tells you something about THEIR character, you must eliminate characters that don’t fit the description from your board using the eliminateCharacter tool. Keep in mind that this tool only updates the UI: you have to keep track of which characters are eliminated in your mind. Think carefully about which characters to eliminate and explain your reasoning out loud before calling the tool. Make sure to only eliminate characters that definitely do not match the user’s description. If you make mistakes it will become impossible for you to win the game!

When the user asks you questions about YOUR character, answer concisely and truthfully based on the character image you received.

Each player can only ask ONE question and receive ONE answer - asking more than one question, or asking another before your opponent has had a chance to ask theirs, is cheating! You must not cheat!

The first player to correctly guess the opponent’s character name wins the game, so try to guess when you’re reasonably confident. A good time to guess is when your board only has one or two characters left. When you think you know the user’s character, make your guess clearly (e.g., “Is your character [Name]?”) This is how you can manage to win the game.

When the user guesses correctly, call the endGame tool to finish the game. When the user tells you that you guessed their character, call the endGame tool to finish the game.

Now you will receive YOUR board and YOUR character. Let’s play!

You can load this prompt in the game by checking the Advanced tab in the settings.

This prompt helps the models a lot in understanding that they can’t reveal the character’s identity; however, it doesn’t solve the problem entirely. For example, this is what Claude Opus 4.1 does with this prompt:

Guess what? There’s only one character with gray hair and glasses on the board, and that’s Emily… Should I review my system prompt again and make it even more detailed?

At this point I gave up. Feel free to iterate on the prompt until you get one that works, and if you manage, I beg you to share it with me.

Conclusion

In the near future I plan to build a proper leaderboard for this simple game, with an automated system to assess each model’s skills and (hopefully) track their progress in this field.

In the meantime, feel free to try your own favorite LLMs here and form your own opinion.

However, let’s be honest: if it takes this much effort to make Claude play a game as simple as Guess Who without messing up, how can we trust LLMs in general to handle our data and our money in the far more ambiguous and complex world out there? I suppose LLMs are not ready (yet) to be left unsupervised.