Explanation of Running Modes
Accuracy Evaluation Scenarios
All Mode
In All Mode, the evaluation tool executes the complete workflow of Inference → Evaluation → Summary:
graph LR;
A[Execute inference based on the given dataset] --> B((Inference Results))
B --> C[Perform evaluation based on inference results]
C --> D((Accuracy Data))
D --> E[Generate a summary report based on accuracy data]
E --> F((Display Results))
Command Example:
ais_bench --models vllm_api_general --datasets gsm8k_gen --mode all
Generated Directory Structure:
outputs/default/
├── 20250220_120000/ # Each experiment corresponds to a timestamp folder
├── 20250220_183030/
│ ├── configs/ # Dumped configuration files (may include configs for multiple experiments)
│ ├── logs/
│ │ ├── eval/ # Logs of the evaluation phase
│ │ └── infer/ # Logs of the inference phase
│ ├── predictions/ # Inference result data
│ ├── results/ # Evaluation results for each task
│ └── summary/ # Summary report of a single experiment
└── ...
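Because each experiment gets its own timestamp folder, scripts that post-process results usually need to find the newest one first. The following is a minimal sketch of that lookup, not part of the tool itself. It assumes folder names follow the `YYYYMMDD_HHMMSS` pattern seen in the listing above; `latest_run_dir` is a hypothetical helper name.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed naming pattern, inferred from folder names such as 20250220_183030.
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def latest_run_dir(work_dir: Path) -> Optional[Path]:
    """Return the newest timestamp folder under work_dir, or None if there is none."""
    runs = []
    for child in work_dir.iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # ignore folders whose names are not timestamps
        runs.append((stamp, child))
    if not runs:
        return None
    # Pick the folder with the largest (most recent) timestamp.
    return max(runs, key=lambda item: item[0])[1]
```

For example, `latest_run_dir(Path("outputs/default"))` would return the path of the `20250220_183030/` folder for the listing above.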
Infer Mode
In Infer Mode, only the inference phase is executed, and the output results are saved:
graph LR;
A[Execute inference based on the given dataset] --> B((Inference Results))
Command Example:
ais_bench --models vllm_api_general --datasets gsm8k_gen --mode infer
Generated Directory Structure:
outputs/default/
├── 20250220_120000/
├── 20250220_183030/
│ ├── configs/
│ ├── logs/
│ │ └── infer/
│ └── predictions/ # Contains only inference results
└── ...
Eval Mode
In Eval Mode, evaluation and report generation are performed based on existing inference results. The --reuse parameter is required:
graph LR;
B((Inference Results)) --> C[Perform evaluation based on inference results]
C --> D((Accuracy Data))
D --> E[Generate a summary report based on accuracy data]
E --> F((Display Results))
Command Example:
ais_bench --models vllm_api_general --datasets gsm8k_gen --mode eval --reuse
Generated Directory Structure:
outputs/default/
├── 20250220_120000/
├── 20250220_183030/
│ ├── configs/
│ ├── logs/
│ │ ├── eval/ # Newly added eval logs
│ │ └── infer/
│ ├── predictions/
│ └── results/ # Newly added evaluation result files
└── ...
Viz Mode
In Viz Mode, only a summary report is generated and displayed based on existing accuracy data. The --reuse parameter is also required:
graph LR;
D((Accuracy Data)) --> E[Generate a summary report based on accuracy data]
E --> F((Display Results))
Command Example:
ais_bench --models vllm_api_general --datasets gsm8k_gen --mode viz --reuse
Generated Directory Structure:
outputs/default/
├── 20250220_120000/
├── 20250220_183030/
│ ├── configs/
│ ├── logs/
│ │ ├── eval/
│ │ └── infer/
│ ├── predictions/
│ ├── results/
│ └── summary/ # Newly added summary report (output of viz mode)
└── ...
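The directory listings for Infer, Eval, and Viz modes differ only in which sub-folders are present, so a run folder can be inspected to see which phases have already produced output. The sketch below illustrates this under one assumption: a non-empty `predictions/`, `results/`, or `summary/` folder means the corresponding phase finished. The tool does not document that rule, and `completed_phases` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Dict

# Folder names come from the directory listings above. Treating a non-empty
# folder as "phase finished" is an assumption of this sketch.
PHASE_DIRS = {
    "infer": "predictions",
    "eval": "results",
    "viz": "summary",
}


def completed_phases(run_dir: Path) -> Dict[str, bool]:
    """Map each phase to whether its output folder exists and contains entries."""
    status = {}
    for phase, sub_dir in PHASE_DIRS.items():
        folder = run_dir / sub_dir
        status[phase] = folder.is_dir() and any(folder.iterdir())
    return status
```

A run folder with `predictions/` filled but no `results/` would report `infer` as done and `eval` as pending, which suggests `--mode eval --reuse` as the next step.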
Performance Evaluation Scenarios
Perf Mode
In Perf Mode, the evaluation tool executes the complete workflow of Performance Sampling → Calculation → Summary and generates a visualization report:
graph LR;
A[Execute inference based on the given dataset] --> B((Performance Sampling Data))
B --> C[Calculate metrics based on sampling data]
C --> D((Performance Data))
D --> E[Generate a summary report based on performance data]
E --> F((Display Results))
⚠️ Note: In the performance evaluation scenario, --models only supports streaming service-oriented inference APIs (refer to Service-Oriented Inference Backend), such as vllm_api_general_stream.
Command Example:
ais_bench --models vllm_api_general_stream --datasets synthetic_gen --mode perf
Example of Generated Directory Structure:
outputs/default/
├── 20250220_120000/
├── 20250220_183030/
│ ├── configs/
│ ├── logs/
│ │ └── performance/ # Performance evaluation logs
│ └── performance/ # Performance evaluation results
│     └── vllm-api-general-stream/
│         ├── syntheticdataset.csv # Performance data of single inference requests
│         ├── syntheticdataset.json # End-to-end performance data
│         ├── syntheticdataset_details.h5 # Full sampling ITL (Inter-Token Latency) data
│         ├── syntheticdataset_details.json # Detailed full sampling data
│         └── syntheticdataset_plot.html # Real-time concurrency and request visualization page
└── ...
Performance sampling is based on syntheticdataset.csv and syntheticdataset.json.
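Before reusing a performance run, it can help to confirm that the expected result files are present. The sketch below checks for the five files shown in the listing above. The file suffixes are taken from that listing; `missing_perf_files` is a hypothetical helper, and the model folder name and dataset stem are passed in rather than guessed.

```python
from pathlib import Path
from typing import List

# Suffixes copied from the example listing (syntheticdataset.csv, ..._plot.html).
PERF_SUFFIXES = [".csv", ".json", "_details.h5", "_details.json", "_plot.html"]


def missing_perf_files(run_dir: Path, model_dir: str, dataset_stem: str) -> List[str]:
    """Return the names of expected performance files that do not exist."""
    base = run_dir / "performance" / model_dir
    expected = [dataset_stem + suffix for suffix in PERF_SUFFIXES]
    return [name for name in expected if not (base / name).is_file()]
```

For the listing above, the call would be `missing_perf_files(run_dir, "vllm-api-general-stream", "syntheticdataset")`. An empty list means all five files were found.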
Perf_Viz Mode
In Perf_Viz Mode, only a summary report is generated and displayed based on existing performance data. The --reuse parameter is required:
graph LR;
D((Performance Data)) --> E[Generate a summary report based on performance data]
E --> F((Display Results))
Command Example:
ais_bench --models vllm_api_general_stream --datasets synthetic_gen --mode perf_viz --reuse
Explanation:
perf_viz reads syntheticdataset.csv and syntheticdataset.json from the most recent experiment folder and generates the visualization results from the performance metrics in those files.
For details on interpreting the performance evaluation results, see Explanation of Performance Evaluation Results.
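The six modes differ on the command line only in the `--mode` value and in whether `--reuse` is added. As a rough summary, the sketch below assembles the argument list for each mode using only the flags shown in the command examples. `build_command` is a hypothetical helper; it constructs the command and does not run it.

```python
from typing import List

# Mode names and the --reuse requirement are taken from the sections above.
ALL_MODES = {"all", "infer", "eval", "viz", "perf", "perf_viz"}
REUSE_MODES = {"eval", "viz", "perf_viz"}  # modes that work on existing results


def build_command(model: str, dataset: str, mode: str) -> List[str]:
    """Build the ais_bench argument list for one mode, adding --reuse when required."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    command = ["ais_bench", "--models", model, "--datasets", dataset, "--mode", mode]
    if mode in REUSE_MODES:
        command.append("--reuse")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where `ais_bench` is installed.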