The version of the model that hit the 87.5% ARC-AGI score costs roughly $5,000 in compute per prompt and works on standardized logic problems. By OpenAI's own admission, it was also trained ahead of time on 75% of the benchmark's public training set, and it likely did the same with the Codeforces questions. Each question in either evaluation required careful handcrafted prompts (aka semantic computer programming) to get an answer.
The ARC-AGI eval looks like this:
[Attached images: two example ARC-AGI puzzle grids]
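For anyone who hasn't dug into the dataset: each ARC-AGI task is a small JSON object with "train" demonstration pairs and a "test" input, where grids are just nested lists of color codes 0-9, and you have to infer the transformation from the demos alone. Here's a toy sketch (the task and the mirror-the-rows rule are made up for illustration, not taken from the actual eval):

```python
# Toy ARC-style task. Real tasks use the same shape: "train" holds
# demonstration input/output grid pairs, "test" holds inputs to solve.
task = {
    "train": [
        {"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]},
        {"input": [[5, 0, 7]],      "output": [[7, 0, 5]]},
    ],
    "test": [{"input": [[8, 9], [0, 6]]}],
}

def solve(grid):
    # The rule the demos imply here: mirror each row left-to-right.
    return [list(reversed(row)) for row in grid]

# Sanity-check the inferred rule against every demonstration pair.
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(task["test"][0]["input"]))  # [[9, 8], [6, 0]]
```

The point is that these are closed-form pattern puzzles with a known answer format, which is exactly what makes them a standardized test rather than a measure of open-ended work.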
This is only AGI by redefining the word AGI to mean "really good at standardized tests." Actual AGI would require a highly autonomous system capable of doing any work a human could do. This is not that. o3 is, at best, a semantic programming language/interpreter for solving standardized problems with a massive amount of compute.