🔜 Coming Soon
[2025.9] Provide industry-mainstream Agent evaluation capabilities, including evaluation of the DeepSeek V3.1 Search/Code Agent.
[2025.10] Support plug-and-play integration of cutting-edge benchmarks into the AISBench framework to handle the industry's increasingly complex and diverse testing tasks.
[2025.11] Provide industry-leading multimodal evaluation capabilities.
[x] [2025.8] Support performance evaluation of multi-turn dialogue datasets such as ShareGPT and BFCL.
[x] [2025.8] Improve the computational efficiency of the evaluation phase in performance testing, reduce the tool's memory usage, and expand the tool usage specifications.
[x] [2025.7] Support setting a maximum output length for each individual data entry in custom datasets used in performance evaluation scenarios.
🤝 Acknowledgments
The code of this project is built on and extends 🔗 OpenCompass.
Some datasets and prompt implementations in this project are adapted from simple-evals.
The performance metrics tracked in this project's code are aligned with the vLLM Benchmark.
The BFCL function-calling evaluation feature of this project is implemented based on the Berkeley Function Calling Leaderboard (BFCL).