Service-Oriented Steady-State Performance Testing
Basic Introduction
Concept Explanation
Steady-state performance testing (hereinafter "steady-state testing") simulates the real-world business scenarios of an inference service and measures its performance while the service is in a stable state.
A steady state is a state in which the number of concurrent requests handled by the inference service has reached its maximum and the service continues to process those requests stably.
Differences from Conventional Performance Testing
The only difference between steady-state testing and conventional service-oriented performance testing lies in the method of calculating performance data:
graph LR;
A[Execute inference based on the given dataset] --> B((performance logging data))
B --> C[Calculate metrics based on logging data]:::redNode
C --> D((Performance data)):::redNode
D --> E[Generate summary report based on performance data]
E --> F((Present results))
classDef redNode fill:#ff4b4b,stroke:#ff4b4b,stroke-width:1px;
Explanation of Performance Data Calculation in the Steady Stage
The steady-stage performance data calculated by AISBench is derived from all requests processed during this stage.
When the number of concurrent requests being processed by the inference service reaches the maximum concurrency level, the system can be considered to be in the steady stage. The ideal trend of the number of concurrent requests processed by the inference service over test time is shown in the figure below:

Traffic Ramp-Up Stage: The number of clients connected to the inference service increases continuously, and the number of concurrent requests processed by the service also increases accordingly.
Actual Steady Stage: The number of concurrent requests processed by the inference service reaches the maximum concurrency level.
Calculated Steady Stage: The period from the moment (t2) when the number of concurrent requests processed by the inference service first reaches the maximum concurrency level (i.e., the sending time of the first request after reaching max concurrency) to the moment (t4) when the number of concurrent requests is last at the maximum concurrency level. The tool treats all requests whose start time falls within this period as "steady-stage requests".
The "Benchmark Duration" in the performance metrics is the length of this stage.
Note: Because Benchmark Duration is used to calculate throughput, the resulting throughput may contain errors. These errors arise from differences in computing resource usage between 1) requests from t0 to t2 (not included in the steady stage) and 2) requests from t4 to t5 (incorrectly included in the steady stage). This discrepancy may also cause the computed "Concurrency" metric to exceed "Max Concurrency"; in that case the "Concurrency" value is still displayed as "Max Concurrency". The calculated throughput is sufficiently reliable only if the maximum single-request end-to-end latency (E2EL) during the entire test is less than 1/3 of the Benchmark Duration.
Request Sending Stage: During this stage, the tool continuously sends requests to the inference service. After this stage, the tool waits for all requests to return.
Traffic Ramp-Down Stage: The number of concurrent requests processed by the inference service decreases continuously until all requests are finally returned.
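The stage definitions above can be illustrated with a small Python sketch. It is not AISBench's implementation: the `Request` record, the function names, and the choice to compute request throughput as steady-stage requests divided by Benchmark Duration are assumptions made for illustration. It locates t2 and t4 from request start and end times, selects the steady-stage requests, and applies the 1/3 reliability rule.

```python
from dataclasses import dataclass


@dataclass
class Request:
    start: float  # time the request was sent, in seconds (hypothetical field)
    end: float    # time the response finished, in seconds (hypothetical field)


def steady_window(requests, max_concurrency):
    """Return (t2, t4), or None if max concurrency is never reached."""
    events = []
    for r in requests:
        events.append((r.start, 1))   # request enters the service
        events.append((r.end, -1))    # request leaves the service
    # Tuples sort so that, at equal timestamps, completions (-1) come first.
    events.sort()
    in_flight = 0
    t2 = t4 = None
    for t, delta in events:
        if delta == -1 and in_flight == max_concurrency:
            t4 = t  # concurrency drops below the maximum at this moment
        in_flight += delta
        if t2 is None and in_flight == max_concurrency:
            t2 = t  # concurrency reaches the maximum for the first time
    return None if t2 is None else (t2, t4)


def steady_summary(requests, max_concurrency):
    window = steady_window(requests, max_concurrency)
    if window is None:
        return None
    t2, t4 = window
    # Steady-stage requests are those whose start time falls inside [t2, t4].
    steady = [r for r in requests if t2 <= r.start <= t4]
    duration = t4 - t2  # plays the role of "Benchmark Duration", in seconds
    max_e2el = max(r.end - r.start for r in requests)  # over the entire test
    return {
        "steady_requests": len(steady),
        "benchmark_duration": duration,
        "request_throughput": len(steady) / duration,  # assumed formula
        "reliable": max_e2el < duration / 3,
    }


requests = [Request(0, 4), Request(1, 5), Request(4, 8), Request(5, 9), Request(8, 10)]
summary = steady_summary(requests, max_concurrency=2)
print(summary)
```

In this toy trace the window is t2 = 1 s to t4 = 9 s and four requests start inside it. The longest request takes 4 s, which is more than one third of the 8 s window, so the rule flags the throughput as not sufficiently reliable.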
Quick Start for Steady-State Testing
Command Explanation
The commands for steady-state testing are the same as those explained in Quick Start for Service-Oriented Performance Evaluation / Command Meaning. On this basis, you need to specify --summarizer stable_stage to calculate performance data in the steady-state manner. Take the following AISBench command as an example:
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer stable_stage --mode perf
Where:
--models specifies the model task, i.e., the vllm_api_stream_chat model task.
--datasets specifies the dataset task, i.e., the demo_gsm8k_gen_4_shot_cot_chat_prompt dataset task.
--summarizer specifies the result presentation task, i.e., the stable_stage result presentation task.
Preparations Before Running the Command
--models: To use the vllm_api_stream_chat model task, you need to prepare an inference service that supports the v1/chat/completions sub-service. You can refer to Start an OpenAI-Compatible Server with VLLM to launch the inference service.
--datasets: To use the demo_gsm8k_gen_4_shot_cot_chat_prompt dataset task, you need to prepare the GSM8K dataset. You can download it from the GSM8K dataset zip package provided by OpenCompass. Extract the gsm8k/ folder and place it in the ais_bench/datasets folder in the root directory of the AISBench tool.
Modifying Configuration Files for Corresponding Tasks
Each model task, dataset task, and result presentation task corresponds to a configuration file. These files need to be modified before running the command. You can query the paths of these configuration files by adding --search to the original AISBench command. For example:
# Whether to add "--mode perf" and "--pressure" to the search command does not affect the search results
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer stable_stage --mode perf --pressure --search
Note: Executing the command with --search prints the absolute paths of the configuration files corresponding to the tasks.
Executing the query command will yield results similar to the following:
06/28 11:52:25 - AISBench - INFO - Searching configs...
╒══════════════╤═══════════════════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Task Type    │ Task Name                             │ Config File Path                                                                                   │
╞══════════════╪═══════════════════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ --models     │ vllm_api_stream_chat                  │ /your_workspace/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py                │
├──────────────┼───────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ --datasets   │ demo_gsm8k_gen_4_shot_cot_chat_prompt │ /your_workspace/ais_bench/benchmark/configs/datasets/demo/demo_gsm8k_gen_4_shot_cot_chat_prompt.py │
├──────────────┼───────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ --summarizer │ stable_stage                          │ /your_workspace/ais_bench/benchmark/configs/summarizers/perf/stable_stage.py                       │
╘══════════════╧═══════════════════════════════════════╧════════════════════════════════════════════════════════════════════════════════════════════════════╛
The dataset task configuration file demo_gsm8k_gen_4_shot_cot_chat_prompt.py in this quick start does not require additional modifications. For an introduction to the content of dataset task configuration files, refer to Open-Source Datasets.
The model configuration file vllm_api_stream_chat.py contains configuration settings related to model operation and needs to be modified according to actual conditions. The content to be modified in this quick start is marked with comments:
from ais_bench.benchmark.models import VLLMCustomAPIChatStream

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr="vllm-api-stream-chat",
        path="",  # Specify the absolute path to the model's serialized vocabulary file (usually the path to the model weight folder)
        model="DeepSeek-R1",  # Specify the name of the model loaded on the server; configure it according to the actual model name pulled by the VLLM inference service (leave empty to auto-detect)
        request_rate=0,  # Invalid in stress testing scenarios
        retry=2,
        host_ip="localhost",  # Specify the IP address of the inference service
        host_port=8080,  # Specify the port of the inference service
        max_out_len=512,  # Maximum number of tokens output by the inference service
        batch_size=3,  # Maximum concurrency for request sending
        generation_kwargs=dict(
            temperature=0.5,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=True,  # The inference service ignores EOS (end-of-sequence token), so the output length will always reach max_out_len
        ),
    )
]
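Before running the test, it can help to confirm that the service is reachable at the configured host_ip and host_port and to see which model name it reports. The following is a minimal sketch, not part of AISBench: the helper names are hypothetical, and it assumes the service exposes the OpenAI-compatible GET /v1/models endpoint (as a vLLM OpenAI-compatible server does) over plain HTTP.

```python
import json
from urllib.request import urlopen


def models_url(host_ip: str, host_port: int) -> str:
    """Build the URL of the OpenAI-compatible model listing endpoint."""
    return f"http://{host_ip}:{host_port}/v1/models"


def served_model_names(host_ip: str, host_port: int, timeout: float = 5.0):
    """Return the model ids reported by the service (performs a network call)."""
    with urlopen(models_url(host_ip, host_port), timeout=timeout) as resp:
        payload = json.load(resp)
    # OpenAI-compatible servers answer with {"object": "list", "data": [{"id": ...}, ...]}.
    return [entry["id"] for entry in payload.get("data", [])]


# Example call, using the values from the configuration file above:
# print(served_model_names("localhost", 8080))
```

If the call succeeds, one of the returned ids is the value to put in the model field of the configuration file.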
Executing the Command
After modifying the configuration files, execute the following command to start the service-oriented performance test. For the first execution, it is recommended to add --debug so that detailed logs are printed to the screen, which helps troubleshoot errors during request inference:
# Add --debug to the command line
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer stable_stage --mode perf --debug
Viewing Performance Results
An example of the on-screen performance result output is as follows:
06/05 20:22:24 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset:
╒════════════════════════╤════════╤══════════════════╤══════════════════╤══════════════════╤══════════════════╤══════════════════╤══════════════════╤══════════════════╤═══╕
│ Performance Parameters │ Stage  │ Average          │ Min              │ Max              │ Median           │ P75              │ P90              │ P99              │ N │
╞════════════════════════╪════════╪══════════════════╪══════════════════╪══════════════════╪══════════════════╪══════════════════╪══════════════════╪══════════════════╪═══╡
│ E2EL                   │ stable │ 2048.2945 ms     │ 1729.7498 ms     │ 3450.96 ms       │ 2491.8789 ms     │ 2750.85 ms       │ 3184.9186 ms     │ 3424.4354 ms     │ 8 │
├────────────────────────┼────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼───┤
│ TTFT                   │ stable │ 50.332 ms        │ 50.6244 ms       │ 52.0585 ms       │ 50.3237 ms       │ 50.5872 ms       │ 50.7566 ms       │ 50.0551 ms       │ 8 │
├────────────────────────┼────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼───┤
│ TPOT                   │ stable │ 10.6965 ms       │ 10.061 ms        │ 10.8805 ms       │ 10.7495 ms       │ 10.7818 ms       │ 10.808 ms        │ 10.8582 ms       │ 8 │
├────────────────────────┼────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼───┤
│ ITL                    │ stable │ 10.6965 ms       │ 7.3583 ms        │ 13.7707 ms       │ 10.7513 ms       │ 10.8009 ms       │ 10.8358 ms       │ 10.9322 ms       │ 8 │
├────────────────────────┼────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼───┤
│ InputTokens            │ stable │ 1512.5           │ 1481.0           │ 1566.0           │ 1511.5           │ 1520.25          │ 1536.6           │ 1563.06          │ 8 │
├────────────────────────┼────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼───┤
│ OutputTokens           │ stable │ 287.375          │ 200.0            │ 407.0            │ 280.0            │ 322.75           │ 374.8            │ 403.78           │ 8 │
├────────────────────────┼────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼───┤
│ OutputTokenThroughput  │ stable │ 115.9216 token/s │ 107.6555 token/s │ 116.5352 token/s │ 117.6448 token/s │ 118.2426 token/s │ 118.3765 token/s │ 118.6388 token/s │ 8 │
╘════════════════════════╧════════╧══════════════════╧══════════════════╧══════════════════╧══════════════════╧══════════════════╧══════════════════╧══════════════════╧═══╛
╒══════════════════════════╤════════╤════════════════════╕
│ Common Metric            │ Stage  │ Value              │
╞══════════════════════════╪════════╪════════════════════╡
│ Benchmark Duration       │ stable │ 19897.8505 ms      │
├──────────────────────────┼────────┼────────────────────┤
│ Total Requests           │ stable │ 8                  │
├──────────────────────────┼────────┼────────────────────┤
│ Failed Requests          │ stable │ 0                  │
├──────────────────────────┼────────┼────────────────────┤
│ Success Requests         │ stable │ 8                  │
├──────────────────────────┼────────┼────────────────────┤
│ Concurrency              │ stable │ 0.9972             │
├──────────────────────────┼────────┼────────────────────┤
│ Max Concurrency          │ stable │ 1                  │
├──────────────────────────┼────────┼────────────────────┤
│ Request Throughput       │ stable │ 0.4021 req/s       │
├──────────────────────────┼────────┼────────────────────┤
│ Total Input Tokens       │ stable │ 12100              │
├──────────────────────────┼────────┼────────────────────┤
│ Prefill Token Throughput │ stable │ 17014.3123 token/s │
├──────────────────────────┼────────┼────────────────────┤
│ Total generated tokens   │ stable │ 2299               │
├──────────────────────────┼────────┼────────────────────┤
│ Input Token Throughput   │ stable │ 608.7438 token/s   │
├──────────────────────────┼────────┼────────────────────┤
│ Output Token Throughput  │ stable │ 115.7835 token/s   │
├──────────────────────────┼────────┼────────────────────┤
│ Total Token Throughput   │ stable │ 723.5273 token/s   │
╘══════════════════════════╧════════╧════════════════════╛
06/05 20:22:24 - AISBench - INFO - Performance Result files locate in outputs/default/20250605_202220/performances/vllm-api-stream-chat.
For the meaning of specific performance parameters, please refer to Explanation of Performance Test Results
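As a worked check, the 1/3 reliability rule from the steady-stage explanation can be applied to the sample output above. This is only an illustrative sketch: it uses the stable-stage Max E2EL as a stand-in for the maximum E2EL of the entire test (which the rule actually refers to), and it assumes that Request Throughput is the number of requests divided by Benchmark Duration.

```python
# Values copied from the sample on-screen output above.
benchmark_duration_ms = 19897.8505
max_e2el_ms = 3450.96        # stable-stage maximum, used here as an approximation
total_requests = 8

# Reliability rule: the maximum E2EL must be below one third of Benchmark Duration.
threshold_ms = benchmark_duration_ms / 3
reliable = max_e2el_ms < threshold_ms

# Assumed formula: requests per second over the Benchmark Duration.
request_throughput = total_requests / (benchmark_duration_ms / 1000)

print(f"threshold = {threshold_ms:.1f} ms, reliable = {reliable}")
print(f"request throughput = {request_throughput:.4f} req/s")
```

Here the threshold is about 6632.6 ms, so a maximum latency of 3450.96 ms satisfies the rule, and the computed request throughput rounds to the 0.4021 req/s shown in the table.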
Viewing Performance Details
After the AISBench command finishes, more details about the task execution are saved to the default output path. This output path is shown in the on-screen log during runtime, for example:
06/28 15:13:26 - AISBench - INFO - Current exp folder: outputs/default/20250628_151326
This log indicates that the task execution details are saved in outputs/default/20250628_151326 under the directory where the command is executed.
After the command execution is completed, the task execution details in outputs/default/20250628_151326 are as follows:
20250628_151326                             # Unique directory generated based on timestamp for each experiment
├── configs                                 # Automatically stored dumped configuration files
├── logs                                    # Runtime logs; if --debug is added to the command, no runtime logs will be saved to disk (all logs are printed directly to the screen)
│   └── performance/                        # Log files of the inference stage
└── performance                             # Performance test results
    └── vllm-api-stream-chat/               # Name of the "service-oriented model configuration", corresponding to the `abbr` parameter of `models` in the model task configuration file
        ├── gsm8kdataset.csv                # Per-request performance output (CSV), consistent with the "Performance Parameters" table in the on-screen performance results
        ├── gsm8kdataset.json               # End-to-end performance output (JSON), consistent with the "Common Metric" table in the on-screen performance results
        ├── gsm8kdataset_details.h5         # ITL data from complete logging
        ├── gsm8kdataset_details.json       # Detailed complete logging information
        └── gsm8kdataset_plot.html          # Request concurrency visualization report (HTML)
It is recommended to open gsm8kdataset_plot.html (the request concurrency visualization report) in a browser such as Chrome or Edge. The report shows the latency of each request and the number of concurrent requests being processed by the service, as perceived by the client, at each moment.
For instructions on how to view the charts in this HTML file, please refer to Instructions for Using Performance Test Visualization Concurrency Charts
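The CSV and JSON result files can also be loaded for further processing. The sketch below is a hypothetical helper, not an AISBench API: it relies only on the file names shown in the directory listing and makes no assumption about the exact keys or column names inside the files, which should be checked against real output.

```python
import csv
import json
from pathlib import Path


def load_perf_results(result_dir, dataset_abbr="gsm8kdataset"):
    """Load the JSON summary and the CSV table from one result folder."""
    result_dir = Path(result_dir)
    with open(result_dir / f"{dataset_abbr}.json", encoding="utf-8") as f:
        common_metrics = json.load(f)
    with open(result_dir / f"{dataset_abbr}.csv", newline="", encoding="utf-8") as f:
        parameter_rows = list(csv.DictReader(f))  # one dict per CSV row
    return common_metrics, parameter_rows


# Example call, using the folder from the listing above:
# common, rows = load_perf_results(
#     "outputs/default/20250628_151326/performance/vllm-api-stream-chat")
# print(common)
# print(rows[0])
```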
Other Functional Scenarios
Recalculating Performance Results
Refer to Recalculation of Performance Results
Configuration Differences
Modify the configuration file stable_stage.py corresponding to the stable_stage result presentation task specified by --summarizer.
In the recalculation command, also specify --summarizer stable_stage.
Results obtained from conventional performance tests can also be recalculated directly by specifying --summarizer stable_stage.
Enabling Steady-State Testing with Stress Testing
If the dataset used for your performance test is too small to bring the service into a steady state, you can use the stress testing capability of the AISBench tool to make the tested service reach a steady state.
Request Sending Method for Performance Stress Testing
Stress testing aims to simulate multiple clients sending requests continuously. It increases the test pressure by gradually increasing the number of clients. When the number of clients finally reaches the maximum concurrency level, the inference service officially enters a steady state (as shown in the figure below). The entire stress testing process lasts for a fixed period, during which the dataset content is continuously polled to construct requests, ensuring that the steady state is maintained for a certain duration.
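The ramp-up and dataset polling described above can be modeled with a few lines of Python. This is a simplified sketch rather than the tool's scheduler: the function names are hypothetical, and it assumes a single process in which the first client starts at time 0 and one further client is added every 1/request_rate seconds until the maximum concurrency is reached.

```python
from itertools import cycle, islice


def clients_at(t: float, request_rate: float, max_concurrency: int) -> int:
    """Number of active clients t seconds after the stress test starts."""
    if t < 0:
        return 0
    # One client at t = 0, then request_rate new clients per second, capped.
    return min(max_concurrency, int(t * request_rate) + 1)


def poll_dataset(prompts, n):
    """Return the first n prompts when the dataset is polled in a loop."""
    return list(islice(cycle(prompts), n))


# With request_rate = 2 and max concurrency 3, the steady state starts at t = 1 s.
print([clients_at(t, request_rate=2, max_concurrency=3) for t in (0, 0.5, 1, 10)])
# A 3-item dataset is reused in order to build 7 requests.
print(poll_dataset(["q1", "q2", "q3"], 7))
```

Cycling through the dataset is what lets a small dataset sustain the steady state for the whole stress testing duration.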

Quick Start for Stress Testing
The process of stress testing is basically the same as that of Quick Start for Steady-State Testing, with the main differences in the following two aspects:
Stress Testing Parameter Description
Specify the duration of stress testing through the command-line parameter --pressure-time. The stress testing duration cannot exceed 86400 seconds (24 hours).
Specify the frequency of adding new threads (clients) per process by configuring the request_rate parameter in the model configuration file. The larger the value of this parameter, the greater the deviation in the actual frequency of adding new threads (clients) (the deviation is related to the single-core processing capability of the CPU).
Specify the number of processes used in stress testing by modifying the WORKERS_NUM parameter in the global constants configuration file to improve the concurrency capability of stress testing.
Adding Stress Testing Command
Add --pressure to the command line:
# Add --pressure and --pressure-time to the command line
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer stable_stage --mode perf --pressure --pressure-time 30
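When stress tests are launched from scripts, the 86400-second limit can be checked before the command is run. The helper below is a hypothetical wrapper, not part of AISBench; it only assembles the command line shown above, and the requirement that the duration be positive is an added assumption.

```python
import shlex

MAX_PRESSURE_TIME_S = 86400  # 24 hours, the upper bound stated above


def pressure_command(pressure_time_s: int, debug: bool = False) -> str:
    """Build the stress testing command line after validating the duration."""
    # The lower bound (> 0) is an assumption; the documented limit is the upper bound.
    if not 0 < pressure_time_s <= MAX_PRESSURE_TIME_S:
        raise ValueError(f"--pressure-time must be in (0, {MAX_PRESSURE_TIME_S}] seconds")
    args = [
        "ais_bench",
        "--models", "vllm_api_stream_chat",
        "--datasets", "demo_gsm8k_gen_4_shot_cot_chat_prompt",
        "--summarizer", "stable_stage",
        "--mode", "perf",
        "--pressure",
        "--pressure-time", str(pressure_time_s),
    ]
    if debug:
        args.append("--debug")
    return shlex.join(args)


print(pressure_command(30))
```

The resulting string can be passed to a shell or split again with shlex.split for subprocess.run.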