# 🔜 Coming Soon

- **[2025.9]** Provide industry-mainstream Agent evaluation capabilities, including evaluation of the DeepSeek V3.1 Search/Code Agent.
- **[2025.10]** Support plug-and-play integration of cutting-edge benchmarks under the AISBench framework, to handle the industry's increasingly complex and diverse testing tasks.
- **[2025.11]** Provide cutting-edge multimodal evaluation capabilities.
- [x] **[2025.8]** Support performance evaluation on multi-turn dialogue datasets such as ShareGPT and BFCL.
- [x] **[2025.8]** Optimize the computing efficiency of the evaluation phase in performance testing, reduce the tool's memory usage, and supplement the tool usage specifications.
- [x] **[2025.7]** For custom datasets used in performance evaluation scenarios, support setting a maximum output length for each data entry.

# 🤝 Acknowledgments

- The code of this project is extended and developed from 🔗 [OpenCompass](https://github.com/open-compass/opencompass).
- Some datasets and prompt implementations in this project are adapted from [simple-evals](https://github.com/openai/simple-evals).
- The performance metrics tracked in this project's code are aligned with [vLLM Benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks).
- The BFCL function-calling evaluation feature of this project is built on the [Berkeley Function Calling Leaderboard (BFCL)](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard).