Every Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. Although the puzzles are written to be solvable without too much specialized knowledge, the brainteasers are usually challenging even for skilled contestants.
That's why some experts see them as a promising way to test the limits of AI's problem-solving abilities. In a recent study, a team of researchers hailing from Wellesley College, The University of Texas at Austin, Northeastern University, Oberlin College, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says its test uncovered surprising insights, such as that reasoning models, including OpenAI's o1, sometimes give up and provide answers they know aren't correct.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills like competency on PhD-level math and science questions. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point. The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge.
Arjun Guha, one of the researchers behind the study, stated:
"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it — that's when everything clicks together all at once. That requires a combination of insight and a process of elimination."
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.
He added:
"New questions are released every week, and we can expect the latest questions to be truly unseen. We intend to keep the benchmark fresh and track how model performance changes over time."
On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models like o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
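For readers curious what scoring models on a riddle benchmark like this can look like in practice, here is a minimal, hypothetical sketch. The dataset file name, the expected JSON fields, and the `ask_model` helper are illustrative assumptions, not the researchers' actual code or data format.

```python
import json

# Hypothetical helper: send a riddle to a model and return its answer.
# In a real harness this would call the relevant model provider's API.
def ask_model(model_name: str, riddle: str) -> str:
    raise NotImplementedError("Plug in your own model client here.")

def normalize(text: str) -> str:
    """Lowercase and strip whitespace so answers compare fairly."""
    return text.strip().lower()

def evaluate(model_name: str, dataset_path: str = "sunday_puzzle.jsonl") -> float:
    """Return the fraction of riddles the model answers exactly correctly."""
    correct = 0
    total = 0
    with open(dataset_path) as f:
        for line in f:
            item = json.loads(line)  # assumed fields: {"riddle": ..., "answer": ...}
            prediction = ask_model(model_name, item["riddle"])
            correct += normalize(prediction) == normalize(item["answer"])
            total += 1
    return correct / total if total else 0.0
```

Exact-match scoring like this is a simplification; a real harness would need to handle answer variants and, as the researchers note, fold in the latest unseen puzzles as they are released each week.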