Explanation of Performance Evaluation Results

The performance evaluation results consist of two parts: per-request results for individual inference requests, and end-to-end results for the whole test run. The parameters are described below.

1. Performance Output Results for Individual Inference Requests

Explanations of key statistical indicators are as follows:

  • P75 / P90 / P99: The 75th, 90th, and 99th percentile values of a metric across all requests. Taking TPOT as an example, P90 is the TPOT value that 90% of requests do not exceed.

  • E2EL (End-to-End Latency): The total latency of a single request from sending the request to receiving the complete response.

  • TTFT (Time To First Token): The latency from sending the request to receiving the first output token.

  • TPOT (Time Per Output Token): The average generation latency per token during the output phase (excluding the first token).

  • ITL (Inter-token Latency): The average interval latency between adjacent tokens (excluding the first token).

  • InputTokens: The number of input tokens in the request.

  • OutputTokens: The number of output tokens generated by the request.

  • OutputTokenThroughput: The throughput of output tokens (in tokens per second, Token/s).

  • Tokenizer: The time consumed for Tokenizer encoding.

  • Detokenizer: The time consumed for Detokenizer decoding.
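
The latency definitions above can be illustrated with a minimal Python sketch. It assumes each request is recorded as a send time plus one arrival timestamp per output token; the `RequestTrace` type, its field names, and the per-request throughput formula (output tokens divided by E2EL) are hypothetical choices for illustration, not the tool's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (all times in seconds)."""
    send_time: float          # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def per_request_metrics(trace: RequestTrace) -> dict:
    """Derive E2EL, TTFT, TPOT, ITL, and output throughput for one request."""
    n_out = len(trace.token_times)
    if n_out == 0:
        raise ValueError("trace has no output tokens")

    e2el = trace.token_times[-1] - trace.send_time  # send -> complete response
    ttft = trace.token_times[0] - trace.send_time   # send -> first token

    # Gaps between adjacent tokens; the first token is excluded by construction.
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0

    # Output-phase time averaged over the tokens after the first one.
    tpot = (e2el - ttft) / (n_out - 1) if n_out > 1 else 0.0

    return {
        "E2EL": e2el,
        "TTFT": ttft,
        "TPOT": tpot,
        "ITL": itl,
        "OutputTokens": n_out,
        # Assumption: per-request throughput is output tokens over E2EL.
        "OutputTokenThroughput": n_out / e2el,
    }


metrics = per_request_metrics(RequestTrace(0.0, [0.5, 0.6, 0.7, 0.9]))
```

Because this sketch records exactly one token per arrival event, TPOT and ITL come out equal. They can differ in a real measurement, for example when a streamed chunk carries more than one token.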

| Performance Parameters | Stage | Average | Max | Min | Median | P75 | P90 | P99 | N |
|---|---|---|---|---|---|---|---|---|---|
| E2EL | Stage for this parameter | Average request latency | Maximum request latency | Minimum request latency | Median request latency | 75th-percentile request latency | 90th-percentile request latency | 99th-percentile request latency | Test data volume (from input parameters) |
| TTFT | Stage for this parameter | Average latency of first token | Maximum latency of first token | Minimum latency of first token | Median latency of first token | 75th-percentile latency of first token | 90th-percentile latency of first token | 99th-percentile latency of first token | Test data volume (from input parameters) |
| TPOT | Stage for this parameter | Average latency of Decode stage | Maximum latency of Decode stage | Minimum latency of Decode stage | Median latency of Decode stage | 75th-percentile latency of Decode stage | 90th-percentile latency of Decode stage | 99th-percentile latency of Decode stage | Test data volume (from input parameters) |
| ITL | Stage for this parameter | Average inter-token latency | Maximum inter-token latency | Minimum inter-token latency | Median inter-token latency | 75th-percentile inter-token latency | 90th-percentile inter-token latency | 99th-percentile inter-token latency | Test data volume (from input parameters) |
| InputTokens | Stage for this parameter | Average length of input tokens | Maximum length of input tokens | Minimum length of input tokens | Median length of input tokens | 75th-percentile length of input tokens | 90th-percentile length of input tokens | 99th-percentile length of input tokens | Test data volume (from input parameters) |
| OutputTokens | Stage for this parameter | Average length of output tokens | Maximum length of output tokens | Minimum length of output tokens | Median length of output tokens | 75th-percentile length of output tokens | 90th-percentile length of output tokens | 99th-percentile length of output tokens | Test data volume (from input parameters) |
| OutputTokenThroughput | Stage for this parameter | Average output throughput | Maximum output throughput | Minimum output throughput | Median output throughput | 75th-percentile output throughput | 90th-percentile output throughput | 99th-percentile output throughput | Test data volume (from input parameters) |
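
The statistics columns above (Average through P99, plus N) can be reproduced for any per-request metric with a short Python sketch. The `percentile` and `summarize` helpers are hypothetical names. The percentile uses linear interpolation between closest ranks, which is one common convention and may not match the method the tool uses. N is taken here as the number of samples, whereas the table states that the tool reads it from the input parameters.

```python
import statistics
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks (q in [0, 100])."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0   # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Build one row of summary statistics for a per-request metric."""
    return {
        "Average": statistics.fmean(values),
        "Max": max(values),
        "Min": min(values),
        "Median": statistics.median(values),
        "P75": percentile(values, 75),
        "P90": percentile(values, 90),
        "P99": percentile(values, 99),
        "N": len(values),  # assumption: sample count stands in for the configured data volume
    }


# Example: TPOT values (seconds) collected from five requests.
tpot_row = summarize([0.021, 0.019, 0.025, 0.030, 0.022])
```

With small sample counts, P99 sits very close to the maximum, so high percentiles are only meaningful when N is large enough.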

2. End-to-End Performance Output Results

| Parameter | Description |
|---|---|
| Benchmark Duration | Total execution time of the test task |
| Total Requests | Total number of requests |
| Failed Requests | Number of failed requests (including unresponsive requests or empty responses) |
| Success Requests | Number of successfully returned requests (including empty and non-empty responses) |
| Concurrency | Actual average concurrency |
| Max Concurrency | Configured maximum concurrency |
| Request Throughput | Request-level throughput (requests per second, Requests/s) |
| Total Input Tokens | Total number of input tokens across all requests |
| Prefill Token Throughput | Token throughput during the Prefill stage (Token/s) |
| Total Output Tokens | Total number of output tokens generated across all requests |
| Input Token Throughput | Input token throughput (Token/s) |
| Output Token Throughput | Output token throughput (Token/s) |
| Total Token Throughput | Total token throughput (input + output) (Token/s) |
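
As a rough illustration of how several of these end-to-end figures relate to each other, the following Python sketch aggregates per-request records over a test run. It is not the tool's implementation: the `RequestResult` type and its fields are hypothetical, and the sketch assumes every throughput is the corresponding total divided by Benchmark Duration, with Request Throughput counting successful requests only. Concurrency and Prefill Token Throughput are left out because the descriptions above do not give enough detail to define them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestResult:
    """Hypothetical per-request record; field names are illustrative."""
    success: bool
    input_tokens: int
    output_tokens: int


def end_to_end_summary(results: List[RequestResult], duration_s: float) -> dict:
    """Aggregate per-request records into end-to-end figures."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")

    succeeded = sum(1 for r in results if r.success)
    total_in = sum(r.input_tokens for r in results)
    total_out = sum(r.output_tokens for r in results)

    return {
        "Benchmark Duration": duration_s,
        "Total Requests": len(results),
        "Failed Requests": len(results) - succeeded,
        "Success Requests": succeeded,
        # Assumption: throughputs are totals divided by the benchmark duration.
        "Request Throughput": succeeded / duration_s,
        "Total Input Tokens": total_in,
        "Total Output Tokens": total_out,
        "Input Token Throughput": total_in / duration_s,
        "Output Token Throughput": total_out / duration_s,
        "Total Token Throughput": (total_in + total_out) / duration_s,
    }


summary = end_to_end_summary(
    [
        RequestResult(True, 10, 20),
        RequestResult(True, 30, 40),
        RequestResult(False, 5, 0),
    ],
    duration_s=10.0,
)
```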