Explanation of Performance Evaluation Results
The performance evaluation results consist of two parts: performance results for individual inference requests and end-to-end performance results. The parameters in each part are described below.
1. Performance Output Results for Individual Inference Requests
Explanations of key statistical indicators are as follows:
P75 / P90 / P99: Taking TPOT as an example, these are the TPOT values at the 75th, 90th, and 99th percentiles across all requests, respectively.
E2EL (End-to-End Latency): The total latency of a single request from sending the request to receiving the complete response.
TTFT (Time To First Token): The latency for the first token to be returned.
TPOT (Time Per Output Token): The average generation latency per token during the output phase (excluding the first token).
ITL (Inter-token Latency): The average interval latency between adjacent tokens (excluding the first token).
InputTokens: The number of input tokens in the request.
OutputTokens: The number of output tokens generated by the request.
OutputTokenThroughput: The throughput of output tokens (in tokens per second, Token/s).
Tokenizer: The time consumed for Tokenizer encoding.
Detokenizer: The time consumed for Detokenizer decoding.
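
The latency definitions above can be sketched for a single streamed request as follows. This is a minimal illustration and not the benchmark tool's own implementation: the function name `request_metrics`, its arguments, and the assumption that every streamed chunk carries exactly one token are all hypothetical.

```python
from statistics import mean


def request_metrics(send_time, token_times):
    """Derive per-request latency metrics (in seconds).

    send_time   -- timestamp at which the request was sent (hypothetical input)
    token_times -- arrival timestamp of each output token, in order
                   (assumes one token per streamed chunk)
    """
    if not token_times:
        raise ValueError("at least one output token is required")

    e2el = token_times[-1] - send_time  # send -> complete response
    ttft = token_times[0] - send_time   # send -> first token
    n_out = len(token_times)

    # Gaps between adjacent tokens; the first token is excluded by construction.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0

    # Decode time spread over the tokens generated after the first one.
    tpot = (e2el - ttft) / (n_out - 1) if n_out > 1 else 0.0

    return {
        "E2EL": e2el,
        "TTFT": ttft,
        "TPOT": tpot,
        "ITL": itl,
        "OutputTokens": n_out,
    }
```

Under the one-token-per-chunk assumption, TPOT and ITL come out equal. They can differ in practice, for example when a streamed chunk carries several tokens.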

| Performance Parameters | Stage | Average | Max | Min | Median | P75 | P90 | P99 | N |
|---|---|---|---|---|---|---|---|---|---|
| E2EL | Stage for this parameter | Average request latency | Maximum request latency | Minimum request latency | Median request latency | 75th-percentile request latency | 90th-percentile request latency | 99th-percentile request latency | Test data volume (from input parameters) |
| TTFT | Stage for this parameter | Average latency of first token | Maximum latency of first token | Minimum latency of first token | Median latency of first token | 75th-percentile latency of first token | 90th-percentile latency of first token | 99th-percentile latency of first token | Test data volume (from input parameters) |
| TPOT | Stage for this parameter | Average latency of Decode stage | Maximum latency of Decode stage | Minimum latency of Decode stage | Median latency of Decode stage | 75th-percentile latency of Decode stage | 90th-percentile average latency of Decode stage per request | 99th-percentile latency of Decode stage | Test data volume (from input parameters) |
| ITL | Stage for this parameter | Average inter-token latency | Maximum inter-token latency | Minimum inter-token latency | Median inter-token latency | 75th-percentile inter-token latency | 90th-percentile inter-token latency | 99th-percentile inter-token latency | Test data volume (from input parameters) |
| InputTokens | Stage for this parameter | Average length of input tokens | Maximum length of input tokens | Minimum length of input tokens | Median length of input tokens | 75th-percentile length of input tokens | 90th-percentile length of input tokens | 99th-percentile length of input tokens | Test data volume (from input parameters) |
| OutputTokens | Stage for this parameter | Average length of output tokens | Maximum length of output tokens | Minimum length of output tokens | Median length of output tokens | 75th-percentile length of output tokens | 90th-percentile length of output tokens | 99th-percentile length of output tokens | Test data volume (from input parameters) |
| OutputTokenThroughput | Stage for this parameter | Average output throughput | Maximum output throughput | Minimum output throughput | Median output throughput | 75th-percentile output throughput | 90th-percentile output throughput | 99th-percentile output throughput | Test data volume (from input parameters) |

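
The statistical columns of the table can be reproduced from a list of per-request values with a short sketch like the one below. The helper names `percentile` and `summarize` are hypothetical, and two choices are assumptions: percentiles use linear interpolation (the benchmark tool may use a different method), and N is simply taken as the number of samples.

```python
from statistics import mean, median


def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list, p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(values):
    """Build one table row (without the Stage column) from per-request values."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "Average": mean(values),
        "Max": max(values),
        "Min": min(values),
        "Median": median(values),
        "P75": percentile(values, 75),
        "P90": percentile(values, 90),
        "P99": percentile(values, 99),
        "N": len(values),  # assumption: N equals the sample count
    }
```

For example, `summarize(tpot_values)` over the TPOT of every request would yield the TPOT row. A P99 far above the median points to a small number of slow requests.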
2. End-to-End Performance Output Results

| Parameter | Description |
|---|---|
| Benchmark Duration | Total execution time of the test task |
| Total Requests | Total number of requests |
| Failed Requests | Number of failed requests (including unresponsive requests or empty responses) |
| Success Requests | Number of successfully returned requests (including empty and non-empty responses) |
| Concurrency | Actual average concurrency |
| Max Concurrency | Configured maximum concurrency |
| Request Throughput | Request-level throughput (requests per second, Requests/s) |
| Total Input Tokens | Total number of input tokens across all requests |
| Prefill Token Throughput | Token throughput during the Prefill stage (Token/s) |
| Total Output Tokens | Total number of output tokens generated across all requests |
| Input Token Throughput | Input token throughput (Token/s) |
| Output Token Throughput | Output token throughput (Token/s) |
| Total Token Throughput | Total token throughput (input + output) (Token/s) |
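
As a rough illustration of how the end-to-end figures relate to one another, the sketch below derives several of them from the benchmark duration and per-request records. The function name `end_to_end_summary` and the record keys are hypothetical. The formulas are assumptions rather than the tool's documented definitions: throughputs are totals divided by Benchmark Duration, and average concurrency is the sum of request latencies divided by Benchmark Duration. Prefill Token Throughput is left out because it needs Prefill-stage timing that is not modeled here.

```python
def end_to_end_summary(duration_s, requests):
    """Aggregate end-to-end metrics.

    duration_s -- Benchmark Duration in seconds
    requests   -- list of dicts with hypothetical keys "e2el" (seconds),
                  "input_tokens" and "output_tokens", one per returned request
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")

    total_in = sum(r["input_tokens"] for r in requests)
    total_out = sum(r["output_tokens"] for r in requests)
    busy_time = sum(r["e2el"] for r in requests)

    return {
        "Request Throughput": len(requests) / duration_s,        # Requests/s
        "Total Input Tokens": total_in,
        "Total Output Tokens": total_out,
        "Input Token Throughput": total_in / duration_s,         # Token/s
        "Output Token Throughput": total_out / duration_s,       # Token/s
        "Total Token Throughput": (total_in + total_out) / duration_s,
        # Assumption: average number of requests in flight over the run.
        "Concurrency": busy_time / duration_s,
    }
```

With this definition, a Concurrency value well below Max Concurrency would suggest that the client could not keep the configured number of requests in flight for the whole run.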