Benchmarks like HellaSwag are a bit too abstract for me to get a sense of how well models actually perform in real-world workflows.

I had the idea of writing a script that sends prompts testing basic reasoning, instruction following, and creativity to around 60 models I could get my hands on through inference APIs.
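Roughly, the prompting loop looks like the sketch below. This assumes an OpenAI-compatible inference API; the base URL, model names, and prompts here are placeholders, not the exact ones I used.

```python
import os

from openai import OpenAI

# Hypothetical endpoint and model names -- swap in whatever inference API
# and model identifiers you actually have access to.
client = OpenAI(
    base_url=os.environ.get("INFERENCE_API_BASE", "https://api.example.com/v1"),
    api_key=os.environ["INFERENCE_API_KEY"],
)

MODELS = ["model-a", "model-b"]  # in practice, ~60 model identifiers

PROMPTS = {
    "reasoning": "A farmer has 17 sheep. All but 9 run away. How many are left?",
    "instruction_following": "Reply with exactly three words, all uppercase.",
    "creativity": "Write a two-line poem about SQLite.",
}


def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return the text of its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# One (model, category, prompt, answer) tuple per model/prompt pair.
answers = [
    (model, category, prompt, ask(model, prompt))
    for model in MODELS
    for category, prompt in PROMPTS.items()
]
```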

The script stored all the answers in a SQLite database; that database is the raw results.

Source: LLM Benchmarks
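The storage side is just the standard-library sqlite3 module. Here's a minimal sketch of how the answers end up in the database; the schema and the sample row are illustrative, not the exact ones in my script.

```python
import sqlite3

# Rows as produced by the prompting loop above: (model, category, prompt, answer).
# This single row is a made-up example so the snippet runs on its own.
answers = [
    ("model-a", "reasoning", "All but 9 run away. How many are left?", "9"),
]

conn = sqlite3.connect("benchmark.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS responses (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        model      TEXT NOT NULL,
        category   TEXT NOT NULL,
        prompt     TEXT NOT NULL,
        answer     TEXT NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
)

# Bulk-insert all collected answers and persist them.
conn.executemany(
    "INSERT INTO responses (model, category, prompt, answer) VALUES (?, ?, ?, ?)",
    answers,
)
conn.commit()
conn.close()
```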

Useful! Quite often, the responses are incorrect. LLMs are still intriguing, but we need to be cautious about how much trust we place in them.