Benchmarks like HellaSwag are a bit too abstract to give me a sense of how well models perform in real-world workflows.
So I had the idea of writing a script that sends prompts testing basic reasoning, instruction following, and creativity to around 60 models I could get my hands on through inference APIs.
The script stored all the answers in a SQLite database, and those are the raw results.
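The core loop is simple. Here is a minimal sketch of that kind of script, assuming an OpenAI-compatible inference API via the `openai` Python package; the model names, prompts, and table schema below are illustrative placeholders, not the ones from my actual run.

```python
import sqlite3
from openai import OpenAI

# Hypothetical model list and prompts -- the real script covers ~60 models
# across reasoning, instruction following, and creativity.
MODELS = ["gpt-4o-mini", "llama-3.1-8b-instruct"]  # placeholder names
PROMPTS = {
    "reasoning": "If Alice is older than Bob, and Bob is older than Carol, who is youngest?",
    "instruction_following": "Reply with exactly the word 'pong' and nothing else.",
    "creativity": "Write a two-line poem about a lighthouse.",
}

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

db = sqlite3.connect("results.sqlite")
db.execute("""
    CREATE TABLE IF NOT EXISTS answers (
        model    TEXT,
        category TEXT,
        prompt   TEXT,
        answer   TEXT
    )
""")

# Ask every model every prompt and store the raw answer.
for model in MODELS:
    for category, prompt in PROMPTS.items():
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        db.execute(
            "INSERT INTO answers VALUES (?, ?, ?, ?)",
            (model, category, prompt, resp.choices[0].message.content),
        )
        db.commit()

db.close()
```

Keeping the raw answers in SQLite rather than grading on the fly means the responses can be re-scored or queried later without re-running the (slow, paid) API calls.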
Source: LLM Benchmarks
Useful! But quite often, the responses are incorrect. LLMs are still intriguing, but we need to be cautious about how much trust we place in them.