🔜 Coming Soon

  • [2025.9] Provide mainstream agent evaluation capabilities, including evaluation of the DeepSeek V3.1 Search/Code Agent.

  • [2025.10] Support plug-and-play integration of cutting-edge test benchmarks into the AISBench framework, to keep pace with the industry's increasingly complex and diverse testing tasks.

  • [2025.11] Provide cutting-edge multimodal evaluation capabilities.

  • [x] [2025.8] Support performance evaluation of multi-turn dialogue datasets such as ShareGPT and BFCL.

  • [x] [2025.8] Optimize the computational efficiency of the evaluation phase in performance testing, reduce the tool's memory usage, and expand the tool usage guidelines.

  • [x] [2025.7] For custom datasets used in performance evaluation, support setting a maximum output length for each data entry.

🤝 Acknowledgments

  • This project's code is built on and extends 🔗 OpenCompass.

  • Some of this project's datasets and prompt implementations are adapted from simple-evals.

  • The performance metrics tracked in this project's code are aligned with the vLLM Benchmark.

  • This project's function-calling evaluation feature is implemented on top of the Berkeley Function Calling Leaderboard (BFCL).