AI models keep scoring differently on the same test. A new reporting standard wants to fix that. | type0