HongShan launches new benchmark tool for AI

Tech in Asia·2025-05-26 17:00

HongShan (formerly Sequoia China) has launched xbench, a new benchmarking tool to evaluate the practical utility of AI.

After two years of development and testing, the platform is now available to the AI community, along with a research paper explaining its methodology.

Unlike traditional benchmarks, xbench aims to map AI performance against measurable business outcomes, featuring test sets that adapt to ongoing advancements in AI.

The public release includes two evaluations: xbench-ScienceQA, which measures academic knowledge and reasoning, and xbench-DeepSearch, which focuses on information gathering in Chinese-language environments. These evaluations are updated monthly and refreshed quarterly.

HongShan plans to expand Profession Aligned evaluations to sectors such as finance, law, and sales. The tool also incorporates item response theory (IRT) to track improvements over time, providing a framework for measuring AI progress and market alignment.
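
The article does not spell out how xbench applies IRT, but in a standard two-parameter logistic (2PL) formulation, each test item carries a difficulty and a discrimination parameter while each model gets a latent ability score, which lets progress be tracked on a common scale even as individual items are swapped out. Below is a minimal sketch under that assumption; the parameter values and names are illustrative, not xbench's actual implementation.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) IRT model: probability that a model
    with latent ability theta answers an item with discrimination a and
    difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Illustrative only: predicted success on a hard item (b = 1.0, a = 1.2)
# as a model's estimated ability improves across successive releases.
for theta in (0.0, 0.5, 1.0, 1.5):
    print(f"theta={theta:.1f} -> P(correct)={p_correct(theta, a=1.2, b=1.0):.3f}")
```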

🔗 Source: HSG

🧠 Food for thought

1️⃣ The benchmarking obsolescence cycle plagues AI evaluation

The rapid obsolescence of AI benchmarks has been a persistent industry challenge that xbench specifically aims to address with its “evergreen evaluation” approach.

This problem is widespread. BetterBench’s assessment of 24 AI benchmarks found significant quality variations and widespread issues, including lack of statistical significance reporting and insufficient maintenance processes 1.

The cycle unfolds predictably: new benchmarks are created, AI models quickly master them, and the benchmarks become obsolete. This matches xbench's own experience of having to replace test suites within months as models maxed out their scores.

This pattern mirrors earlier technology evaluation challenges seen in computer vision benchmarks, where evaluation methods required constant updating as deep learning capabilities rapidly advanced after 2010 2.

2️⃣ AI evaluation is shifting from theoretical to practical, real-world metrics

xbench’s dual-track system reflects a broader industry transition from theoretical capabilities testing toward measuring real-world, professional utility—a shift appearing across multiple evaluation frameworks.

The Berkeley Function-Calling Leaderboard (BFCL) exemplifies this trend, having evolved through multiple iterations to assess increasingly complex real-world scenarios rather than abstract capabilities 3.

Similarly, τ-bench focuses specifically on real-world interactions in dynamic environments with domain-specific policies, moving beyond simplified test cases to evaluate how agents perform in complex situations with practical constraints 3.

Industry experts now emphasize that effective evaluation must combine both automated benchmarks and human assessments to comprehensively gauge model effectiveness across diverse real-world applications 4.

3️⃣ Standardization remains an elusive goal in AI agent evaluation

Despite the emergence of numerous evaluation frameworks, the field still lacks widely accepted standards for comparing AI agents, a challenge xbench aims to address with its comprehensive approach.

The proliferation of specialized frameworks—including DeepEval, MLFlow LLM Evaluate, RAGAs, Deepchecks, and Arize AI Phoenix—creates a fragmented evaluation landscape with each focusing on different aspects of AI performance 5.

This fragmentation makes it difficult for organizations to make informed decisions about which AI systems best meet their needs, particularly as models become more capable of complex, multi-step tasks requiring tool use and planning 3.

The Allen Institute for AI has responded with initiatives like the OLMES Standard for reproducible evaluations, specifically designed to facilitate comparisons across different model sizes and architectures 6.

Measurement challenges are further complicated by the need to evaluate not just accuracy but also factors like coherence, bias, hallucination resistance, and toxicity—metrics that often require different evaluation methodologies and lack standardized assessment approaches 4.

Recent Sequoia developments

……

Read full article on Tech in Asia