Service-Oriented Accuracy Evaluation
In a service-oriented deployment, model accuracy is evaluated in real serving scenarios by sending standardized requests and comparing the model outputs with reference answers. Multiple datasets and backend configurations are supported, which makes it easy to compare model accuracy across different service-oriented solutions.
Preconditions for Service-Oriented Accuracy Evaluation
Before performing service-oriented inference, the following conditions must be met:
Accessible service-oriented model service: Ensure the service process can be directly accessed in the current environment.
Dataset task preparation:
Open-source datasets: Select a dataset from Open-Source Datasets, and choose the dataset task to execute from the "detailed introduction" document for that dataset. Prepare the dataset files as described in the "detailed introduction" document of the selected dataset task. It is recommended to place the open-source dataset manually in the default directory ais_bench/datasets/; the program will automatically load the dataset files during task execution.
Custom datasets: No need to specify a dataset task; refer to Custom Dataset for other configurations.
Model task preparation: Select the model task to execute from Service-Oriented Inference Backend.
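As a quick pre-flight check for the first precondition, the sketch below tests whether a TCP connection to the service can be opened from the current environment. The function name `service_reachable` and the address used in the example are illustrative assumptions and are not part of ais_bench. A successful connection only shows that the port is open, not that the model endpoint is healthy.

```python
import socket


def service_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Hypothetical address; replace it with the address of your own service.
    print(service_reachable("127.0.0.1", 8080))
```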
Main Functional Scenarios
Single-Task Evaluation
Please refer to Quick Start on the homepage for details; no further elaboration here.
Multi-Task Evaluation
Multiple model tasks or dataset tasks can be configured at once and evaluated in batch with a single command. This is suitable for side-by-side comparison of many models or for comparing accuracy across multiple datasets.
Command Description
Users can specify multiple configuration tasks via the --models and --datasets parameters. The number of subtasks is the product of the number of tasks given to --models and the number given to --datasets; that is, each pair of one model configuration and one dataset configuration forms a subtask. Example command:
ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt
The above command specifies 2 model tasks (vllm_api_general_chat, vllm_api_stream_chat) and 2 dataset tasks (gsm8k_gen_4_shot_cot_str, aime2024_gen_0_shot_chat_prompt), and will execute the following 4 combined accuracy test tasks:
vllm_api_general_chat model task + gsm8k_gen_4_shot_cot_str dataset task
vllm_api_general_chat model task + aime2024_gen_0_shot_chat_prompt dataset task
vllm_api_stream_chat model task + gsm8k_gen_4_shot_cot_str dataset task
vllm_api_stream_chat model task + aime2024_gen_0_shot_chat_prompt dataset task
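The pairing rule above is a Cartesian product, which can be sketched in a few lines of Python. This only illustrates how the four combinations arise; it makes no claim about the order in which ais_bench actually schedules them.

```python
from itertools import product

models = ["vllm_api_general_chat", "vllm_api_stream_chat"]
datasets = ["gsm8k_gen_4_shot_cot_str", "aime2024_gen_0_shot_chat_prompt"]

# One subtask per (model task, dataset task) pair.
subtasks = list(product(models, datasets))

for model, dataset in subtasks:
    print(f"{model} model task + {dataset} dataset task")

print(len(subtasks))  # 2 model tasks x 2 dataset tasks = 4 subtasks
```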
Modify Configuration Files Corresponding to Tasks
The actual paths of the configuration files for model tasks and dataset tasks can be queried by executing the command with the --search parameter:
ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt --search
The command lists the configuration files that need to be modified:
╒════════════╤═════════════════════════════════╤═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Task Type  │ Task Name                       │ Config File Path                                                                                                │
╞════════════╪═════════════════════════════════╪═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ --models   │ vllm_api_general_chat           │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py             │
├────────────┼─────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ --models   │ vllm_api_stream_chat            │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py              │
├────────────┼─────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ --datasets │ gsm8k_gen_4_shot_cot_str        │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py           │
├────────────┼─────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ --datasets │ aime2024_gen_0_shot_chat_prompt │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_chat_prompt.py │
╘════════════╧═════════════════════════════════╧═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛
Refer to Service-Oriented Inference Backend Configuration Parameter Description to configure the configuration files corresponding to the model tasks vllm_api_general_chat and vllm_api_stream_chat according to the actual situation.
Refer to Configure Open-Source Datasets to configure the configuration files corresponding to the dataset tasks gsm8k_gen_4_shot_cot_str and aime2024_gen_0_shot_chat_prompt according to the actual situation. Note: If the dataset is placed in the default directory ais_bench/datasets/, no configuration is generally required.
Execute the Evaluation Command
Execute the command:
ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt
During execution, a timestamp directory is created under the path specified by --work-dir (default: outputs/default/) to store execution details.
After the task completes, the results are printed on screen, for example:
dataset    version    metric    mode      vllm-api-general-chat    vllm-api-stream-chat
---------  ---------  --------  ------  -----------------------  ----------------------
gsm8k      84f965     accuracy  gen                       56.70                   55.97
aime2024   604a78     accuracy  gen                       50.00                   50.00
At the same time, the final generated directory structure is as follows:
# Under outputs/default
20250628_172032/  # Output directory corresponding to the task creation time
├── configs  # A combined configuration file of the configuration files for model tasks, dataset tasks, and structure presentation tasks
│   └── 20250628_172032_4469.py
├── logs  # Logs including inference and accuracy evaluation phases
│   ├── eval  # Logs of the accuracy calculation phase
│   │   ├── vllm-api-general-chat
│   │   │   ├── aime2024.out
│   │   │   └── gsm8k.out
│   │   └── vllm-api-stream-chat
│   │       ├── aime2024.out
│   │       └── gsm8k.out
│   └── infer  # Logs of the inference phase
│       ├── vllm-api-general-chat
│       │   ├── aime2024.out
│       │   └── gsm8k.out
│       └── vllm-api-stream-chat
│           ├── aime2024.out
│           └── gsm8k.out
├── predictions  # Inference result files, recording the input of each request, model output, and reference answers (for accuracy calculation)
│   ├── vllm-api-general-chat
│   │   ├── aime2024.json
│   │   └── gsm8k.json
│   └── vllm-api-stream-chat
│       ├── aime2024.json
│       └── gsm8k.json
├── results  # Accuracy evaluation results generated based on predictions
│   ├── vllm-api-general-chat
│   │   ├── aime2024.json
│   │   └── gsm8k.json
│   └── vllm-api-stream-chat
│       ├── aime2024.json
│       └── gsm8k.json
└── summary  # Summary view of accuracy results, including CSV, Markdown, and TXT formats
    ├── summary_20250628_172032.csv
    ├── summary_20250628_172032.md
    └── summary_20250628_172032.txt
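When post-processing results in a script, it can be handy to locate the newest timestamp directory and its CSV summary. The sketch below relies only on the directory layout and file naming shown in the tree above; the helper names `latest_run_dir` and `summary_csv` are hypothetical and not part of ais_bench.

```python
import re
from pathlib import Path
from typing import Optional

TIMESTAMP_DIR = re.compile(r"^\d{8}_\d{6}$")  # e.g. 20250628_172032


def latest_run_dir(work_dir: str) -> Optional[Path]:
    """Return the newest timestamp directory under work_dir, or None."""
    root = Path(work_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir() and TIMESTAMP_DIR.match(p.name)]
    # The YYYYMMDD_HHMMSS format sorts chronologically as plain text.
    return max(runs, key=lambda p: p.name, default=None)


def summary_csv(run_dir: Path) -> Path:
    """Path of the CSV summary, following the naming shown in the tree."""
    return run_dir / "summary" / f"summary_{run_dir.name}.csv"


if __name__ == "__main__":
    run = latest_run_dir("outputs/default")
    if run is not None:
        print(summary_csv(run))
```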
Multi-Task Parallel Evaluation
By default, multiple subtasks are executed serially. Within a single task, Continuous Batch is enabled by default: multiple processes are launched to send and process requests up to the maximum concurrency configured by the user, so large concurrency settings are supported. When the concurrency of a single task is low, subtasks can be run in parallel by setting the --max-num-workers parameter. For example:
ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt --max-num-workers 4
In the example above, the maximum number of concurrent tasks is set to 4, so four subtasks will be executed simultaneously. This can be viewed on the command line dashboard:
Base path of result&log : outputs/default/20251106_113926
Task Progress Table (Updated at: 2025-11-06 11:39:58)
Page: 1/1 Total 5 rows of data
Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit
+--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+
| Task Name                      | Process   | Progress                                           | Time Cost   | Status      | Log Path                                      | Extend Parameters                                 |
+================================+===========+====================================================+=============+=============+===============================================+===================================================+
| vllm-api-general-chat/gsm8k    | 1250142   | [                              ] 5/1319 [5.0 it/s] | 0:00:07     | inferencing | logs/infer/vllm-api-general-chat/gsm8k.out    | {'POST': 10, 'RECV': 5, 'FINISH': 5, 'FAIL': 0}   |
+--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+
| vllm-api-general-chat/aime2024 | 1250139   | [#####                         ] 5/30 [5.0 it/s]   | 0:00:07     | inferencing | logs/infer/vllm-api-general-chat/aime2024.out | {'POST': 10, 'RECV': 5, 'FINISH': 5, 'FAIL': 0}   |
+--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+
| vllm-api-stream-chat/gsm8k     | 1250143   | [                              ] 5/1319 [5.0 it/s] | 0:00:07     | inferencing | logs/infer/vllm-api-stream-chat/gsm8k.out     | {'POST': 10, 'RECV': 5, 'FINISH': 5, 'FAIL': 0}   |
+--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+
| vllm-api-stream-chat/aime2024  | 1250138   | [###############               ] 15/30 [5.0 it/s]  | 0:00:07     | inferencing | logs/infer/vllm-api-stream-chat/aime2024.out  | {'POST': 20, 'RECV': 15, 'FINISH': 15, 'FAIL': 0} |
+--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+
The generated result is consistent with the example in Multi-Task Evaluation.
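The effect of a worker limit can be pictured with a generic bounded pool. The sketch below is only an analogy and not the ais_bench implementation: it uses threads and a `time.sleep` stand-in for real inference, and the names `run_subtask` and `run_all` are hypothetical. With a limit of 1 the subtasks run one after another; with a limit of 4 all four run at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import product

models = ["vllm_api_general_chat", "vllm_api_stream_chat"]
datasets = ["gsm8k_gen_4_shot_cot_str", "aime2024_gen_0_shot_chat_prompt"]


def run_subtask(pair):
    """Stand-in for one model + dataset subtask; real work is replaced by a sleep."""
    model, dataset = pair
    time.sleep(0.1)
    return f"{model}/{dataset}"


def run_all(max_num_workers: int):
    """Run every subtask with at most max_num_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_num_workers) as pool:
        # map() returns results in input order, whatever the completion order.
        return list(pool.map(run_subtask, product(models, datasets)))


if __name__ == "__main__":
    for workers in (1, 4):
        start = time.perf_counter()
        finished = run_all(workers)
        print(workers, len(finished), f"{time.perf_counter() - start:.2f}s")
```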
Resumption After Interruption & Retesting of Failed Cases
If an inference task fails because of an unexpected interruption or a server exception during the evaluation, the breakpoint management function can be enabled via --reuse to resume the task. It also retests only the failed cases automatically, without re-running all tasks. For example:
Assume the user first runs the inference evaluation with the following command, and the task is interrupted by an abnormal exit or some requests fail because of server exceptions:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt
At this point, some inference results have been saved, and the following files are generated under the --work-dir directory:
# Under outputs/default
20250628_151326/  # Timestamp directory created by the test task
├── configs  # A combined configuration file of the configuration files for model tasks, dataset tasks, and structure presentation tasks
│   └── 20250628_151326_29317.py
├── logs  # Logs during execution; if --debug is added to the command, no process logs will be saved to disk (all will be printed directly)
│   └── infer  # Logs of the inference phase
└── predictions  # Directory for inference results, recording the input of each request, model output, and answers (for accuracy evaluation)
    └── vllm-api-general-chat
        └── tmp_demo_gsm8k  # Inference output of completed requests
            └── tmp_0_2766386_1749107195.json  # Cache file, named in the format: tmp_{task_process_ID}_{process_number}_{timestamp}.json
Resume the inference by specifying the task timestamp directory via the --reuse parameter:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --reuse 20250628_151326
The following content will be printed in the log, indicating that the resumption task has started:
02/20 13:14:15 - AISBench - INFO - Found 10 tmp items, run infer task from the last interrupted position
After the resumption is completed, the accuracy results of all requests will be recalculated and printed, and the generated results are consistent with the example in Quick Start.
Note: Resumption after interruption and retesting of failed cases may change the order of requests, which may cause slight fluctuations in results.
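Before resuming, it can be useful to see which subtasks left cache files behind. The sketch below walks the predictions directory using only the layout shown in the tree above (predictions/<model>/tmp_<dataset>/tmp_*.json). The helper name `find_tmp_caches` is hypothetical, and the content of the cache files is deliberately not parsed because its format is not described here.

```python
from pathlib import Path
from typing import Dict, List


def find_tmp_caches(work_dir: str, timestamp: str) -> Dict[str, List[Path]]:
    """Map each predictions/<model>/tmp_* directory to the cache files in it."""
    predictions = Path(work_dir) / timestamp / "predictions"
    caches: Dict[str, List[Path]] = {}
    if not predictions.is_dir():
        return caches
    for tmp_dir in sorted(predictions.glob("*/tmp_*")):
        if tmp_dir.is_dir():
            key = f"{tmp_dir.parent.name}/{tmp_dir.name}"
            caches[key] = sorted(tmp_dir.glob("tmp_*.json"))
    return caches


if __name__ == "__main__":
    found = find_tmp_caches("outputs/default", "20250628_151326")
    for key, files in found.items():
        print(key, len(files))
```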
Tip: Multi-Task Evaluation also supports resumption after interruption and retesting of failed cases for all or part of the tasks. For example, if an interruption occurs when executing the following multi-task evaluation command:
ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt
Resume all tasks after interruption in the following way:
ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt --reuse 20250628_151326
You can also resume only part of the tasks in the following ways:
# Resume only the task of vllm_api_general_chat + gsm8k_gen_4_shot_cot_str
ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_4_shot_cot_str --reuse 20250628_151326
# Resume the two tasks of vllm_api_general_chat + gsm8k_gen_4_shot_cot_str and vllm_api_general_chat + aime2024_gen_0_shot_chat_prompt
ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt --reuse 20250628_151326
# Resume the two tasks of vllm_api_general_chat + aime2024_gen_0_shot_chat_prompt and vllm_api_stream_chat + aime2024_gen_0_shot_chat_prompt
ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets aime2024_gen_0_shot_chat_prompt --reuse 20250628
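When resume commands are generated from a script, building the argument list programmatically avoids copy-paste mistakes in task names. The sketch below uses only the flags that appear in this document (--models, --datasets, --reuse, --max-num-workers); the helper name `build_command` is hypothetical, and the sketch prints the command instead of running it.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    models: Sequence[str],
    datasets: Sequence[str],
    reuse: Optional[str] = None,
    max_num_workers: Optional[int] = None,
) -> List[str]:
    """Assemble an ais_bench argument list from flags shown in this document."""
    argv = ["ais_bench", "--models", *models, "--datasets", *datasets]
    if reuse is not None:
        argv += ["--reuse", reuse]
    if max_num_workers is not None:
        argv += ["--max-num-workers", str(max_num_workers)]
    return argv


if __name__ == "__main__":
    argv = build_command(
        ["vllm_api_general_chat"],
        ["gsm8k_gen_4_shot_cot_str"],
        reuse="20250628_151326",
    )
    print(shlex.join(argv))
    # To execute it for real: subprocess.run(argv, check=True)
```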
Merging Sub-dataset Inference
Some datasets are divided into sub-datasets, which are split into multiple subtasks during inference. Examples include MMLU and CEVAL. AISBench Benchmark supports merging datasets that consist of multiple small-scale datasets into a single task for unified evaluation. An example command is as follows:
ais_bench --models vllm_api_general --datasets ceval_gen --merge-ds
Note: In merge mode, only the overall result is generated, and the accuracy of individual sub-datasets is no longer listed separately. Additionally, to resume an interrupted inference or re-run failed cases for a task that was started in merge mode, you must also add --merge-ds to the resume command.
Multiple Independent Repeat Inference
After enabling this feature, the dataset size (and with it the number of requests) is multiplied at the data point level, which significantly increases inference time and memory usage. Please read Accuracy Evaluation Scenario: Interpretation of Evaluation Metrics first, and confirm that this feature is necessary for your current scenario before enabling it.
This scenario aims to explore model capabilities along several dimensions such as reliability, stability, and overall accuracy. To enable it, set the num_return_sequences parameter in the generation_kwargs hyperparameters of the service-oriented inference backend configuration. Refer to the following example for the format (the value provided is for reference only):
models = [
    dict(
        # ... other parameters
        generation_kwargs=dict(
            num_return_sequences=5,  # For specific functions and constraints, refer to the document accuracy_metric.md
            # ... other parameters
        ),
        # ...
    )
]
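To estimate the extra load before enabling repeat inference, a back-of-the-envelope calculation is usually enough. The sketch below assumes that every data point is sent once per repetition, so the request count grows linearly with num_return_sequences; that is an assumption to confirm against the metrics document. The dataset sizes are the totals shown in the progress table of the parallel evaluation example, and the helper name `total_requests` is hypothetical.

```python
def total_requests(num_data_points: int, num_return_sequences: int) -> int:
    """Requests sent when each data point is inferred num_return_sequences times."""
    if num_data_points < 0 or num_return_sequences < 1:
        raise ValueError("expected num_data_points >= 0 and num_return_sequences >= 1")
    return num_data_points * num_return_sequences


# Dataset sizes as shown in the task progress table (5/1319 and 5/30).
for name, size in {"gsm8k": 1319, "aime2024": 30}.items():
    print(name, total_requests(size, 5))
# gsm8k 6595
# aime2024 150
```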
After the accuracy evaluation phase is completed, the results will be recorded in the log and printed in the running window. The format is as shown in the following example (data is for reference only):
| dataset | version | metric | mode | vllm-api-stream-chat |
| --------- | --------- | ------------------------- | ---- | -------------------- |
| aime2024 | 604a78 | accuracy (5 runs average) | gen | 18.00 |
| aime2024 | 604a78 | avg@5 | gen | 18.00 |
| aime2024 | 604a78 | pass@5 | gen | 53.33 |
| aime2024 | 604a78 | cons@5 | gen | 13.33 |
For the interpretation of the metrics and the parameter constraints in the table above, please refer to Accuracy Evaluation Scenario: Interpretation of Evaluation Metrics.
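For readers unfamiliar with these metric names, the sketch below computes them on toy data under commonly used definitions, which are assumptions here: avg@k as the mean per-sample accuracy, pass@k as the share of problems with at least one correct sample, and cons@k as the share of problems whose most frequent answer is correct. The authoritative definitions used by AISBench are those in the metrics document referenced above; the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def avg_at_k(preds: Sequence[Sequence[str]], refs: Sequence[str]) -> float:
    """Mean per-sample accuracy over all problems and samples, in percent."""
    total = sum(len(p) for p in preds)
    correct = sum(ans == ref for p, ref in zip(preds, refs) for ans in p)
    return 100.0 * correct / total


def pass_at_k(preds: Sequence[Sequence[str]], refs: Sequence[str]) -> float:
    """Share of problems with at least one correct sample, in percent."""
    hits = sum(any(ans == ref for ans in p) for p, ref in zip(preds, refs))
    return 100.0 * hits / len(refs)


def cons_at_k(preds: Sequence[Sequence[str]], refs: Sequence[str]) -> float:
    """Share of problems whose most frequent answer is correct, in percent."""
    hits = 0
    for p, ref in zip(preds, refs):
        # Ties are broken in favor of the answer seen first (Counter ordering).
        majority, _ = Counter(p).most_common(1)[0]
        hits += majority == ref
    return 100.0 * hits / len(refs)


if __name__ == "__main__":
    # Toy data: 3 problems, 3 sampled answers each (not real results).
    preds = [["4", "4", "5"], ["7", "8", "9"], ["1", "2", "2"]]
    refs = ["4", "9", "1"]
    print(f"avg@3  {avg_at_k(preds, refs):.2f}")   # 44.44
    print(f"pass@3 {pass_at_k(preds, refs):.2f}")  # 100.00
    print(f"cons@3 {cons_at_k(preds, refs):.2f}")  # 33.33
```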
Other Functional Scenarios
Re-evaluation of Inference Results
The execution process of evaluation tasks in main functional scenarios includes a complete workflow of inference → evaluation → summarization:
graph LR;
A[Perform inference based on the given dataset] --> B((Inference results))
B --> C[Evaluate based on inference results]
C --> D((Accuracy data))
D --> E[Generate a summary report based on accuracy data]
E --> F((Present results))
Each stage of this workflow is decoupled from the others, so inference results can be re-evaluated as often as needed. If the accuracy data obtained from the first accuracy evaluation is wrong (for example, the answer was not extracted correctly from the response), you can modify the answer extraction method and re-evaluate the existing inference results. The specific operations are as follows:
Assume the command used for the previous accuracy evaluation was:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt
And the timestamp of the saved results is 20250628_151326. However, the accuracy data for 8 cases is incorrect, showing a score of 0:
dataset                  version   metric    mode    vllm_api_general_chat
-----------------------  --------  --------  -----  ----------------------
demo_gsm8k               401e4c    accuracy  gen                     00.00
Suppose that checking 20250628_151326/predictions/vllm-api-general-chat/gsm8k.json shows that the inference results actually contain the correct answers. In that case, modify the configuration file corresponding to the gsm8k_gen_4_shot_cot_chat_prompt dataset task. Use the --search parameter to query the path of the corresponding configuration file:
ais_bench --datasets gsm8k_gen_4_shot_cot_chat_prompt --search
The configuration file path will be displayed as follows:
╒════════════╤══════════════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Task Type  │ Task Name                        │ Config File Path                                                                               │
╞════════════╪══════════════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════╡
│ --datasets │ gsm8k_gen_4_shot_cot_chat_prompt │ /your_workspace/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_chat_prompt.py │
╘════════════╧══════════════════════════════════╧════════════════════════════════════════════════════════════════════════════════════════════════╛
Open gsm8k_gen_4_shot_cot_chat_prompt.py and replace or modify the answer extraction function:
# ......
from ais_bench.benchmark.datasets import GSM8KDataset, gsm8k_postprocess, gsm8k_dataset_postprocess, Gsm8kEvaluator

gsm8k_reader_cfg = dict(input_columns=['question'], output_column='answer')
# ......
gsm8k_eval_cfg = dict(
    evaluator=dict(type=Gsm8kEvaluator),
    pred_role='BOT',
    pred_postprocessor=dict(type=gsm8k_postprocess),  # Replace or modify the implementation of the answer extraction function
    dataset_postprocessor=dict(type=gsm8k_dataset_postprocess),
)
# ......
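As an illustration of what a replacement answer extraction function might look like, the sketch below returns the last number found in a response. It assumes that a prediction postprocessor receives the raw response text and returns the extracted answer as a string; check this against the existing gsm8k_postprocess before relying on it. The function name `extract_last_number` is hypothetical, and the rule is deliberately simple (for example, a hyphen directly before a digit is read as a minus sign).

```python
import re


def extract_last_number(text: str) -> str:
    """Return the last number in a model response, or '' if none is found."""
    # Drop thousands separators so "1,250" is read as one number.
    cleaned = text.replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return numbers[-1] if numbers else ""


if __name__ == "__main__":
    response = "She pays 3 * 6 = 18 dollars. The answer is 18."
    print(extract_last_number(response))  # 18
```

If the assumption about the interface holds, such a function would be referenced from pred_postprocessor in place of gsm8k_postprocess.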
To re-evaluate, add --mode eval and --reuse {timestamp of the inference results to be reused} to the command used for the first accuracy evaluation. This can be repeated as many times as needed:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode eval --reuse 20250628_151326