Accuracy Evaluation Scenarios: Analysis of Evaluation Metrics
I. Relationship between n, k in Formulas and num_return_sequences in API Configuration Files
1. Calculation Logic of pass@k
Only the pass@k metric is briefly described here for reference. For formulas of other metrics, please refer to Definitions and Relationships of pass@k, cons@k, avg@n.
pass@k is a core evaluation metric for code generation tasks, measuring the probability that at least one of the k candidate solutions generated by the model passes all test cases. Its calculation adopts an unbiased estimation method to avoid variance issues caused by direct sampling. The specific logic is as follows:
Sample Generation and Correctness Statistics:
Generate n candidate solutions for each problem (n ≥ k), where c solutions pass the tests (i.e., are functionally correct).
Example: Generate n=100 samples, with c=20 correct ones. The single-sample pass rate is \(P_{pass} = \frac{c}{n} = 0.2\).
Combinatorial Math Formula:
Calculate the probability that all fail when randomly selecting k samples from n: \(\frac{\binom{n-c}{k}}{\binom{n}{k}}\)
pass@k is the probability of at least one success: \(pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\)
Optimized Calculation: To avoid factorial overflow, the code uses numerical optimization:
pass@k = 1 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
Significance of Unbiased Estimation:
In the current implementation, k and n can only be configured to the same value via num_return_sequences. Thus, we only discuss the advantages of unbiased estimation here.
Directly sampling k times leads to high variance (especially for large k). Generating n samples (n >> k) and estimating via combinatorial formulas significantly improves statistical stability.
Example: If n=5, c=3, k=2:
Probability of all failures = C(2,2)/C(5,2) = 1/10 = 0.1
pass@2 = 1 - 0.1 = 0.9 (i.e., a 90% probability of at least one success in 2 attempts).
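The formula and the worked example above can be checked with a short sketch. It uses only the Python standard library (math.prod in place of np.prod), and the function names pass_at_k_sketch and pass_at_k_comb are hypothetical helpers for illustration, not the project's own implementation. The early return for n - c < k is an assumption added here so that the product form never divides into a non-positive range.

```python
import math


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (hypothetical helper).

    n: samples generated, c: samples that passed, k: samples drawn.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    # Product form of C(n-c, k) / C(n, k); avoids large factorials.
    all_fail = math.prod(1.0 - k / j for j in range(n - c + 1, n + 1))
    return 1.0 - all_fail


def pass_at_k_comb(n: int, c: int, k: int) -> float:
    """Direct combinatorial form, used here only as a cross-check."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Worked example from the text: n=5, c=3, k=2.
print(round(pass_at_k_sketch(5, 3, 2), 6))  # 0.9
```

Both forms agree up to floating-point rounding; the product form is preferable when n is large because it never builds the binomial coefficients explicitly.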
2. Relationship between n, k, and num_return_sequences
All three must be positive integers.
| Configurable? | Parameter | Explanation | Definition Location | Constraints |
|---|---|---|---|---|
| No | n | Number of replicas (i.e., | Derived from | Must satisfy |
| No | k | Number of samples randomly selected for evaluation, determining | Derived from | Must satisfy |
| Yes | num_return_sequences | Number of independent repeated inferences per request | API model configuration file, default: | - |
3. Summary
pass@k Logic: Unbiased estimation based on combinatorial mathematics solves the high variance issue of direct sampling.
Parameter Relationships and Constraints in the current implementation:
n and k are not configurable separately; only num_return_sequences can be specified in the API configuration file.
n = k = num_return_sequences
The k or n in pass@k, cons@k, and avg@n all refer to num_return_sequences.
Although n and k are only used in metric calculations during evaluation, and num_return_sequences is used during inference, their values are derived from num_return_sequences in the API configuration file. Thus, when executing the evaluation phase (--mode eval), ensure that num_return_sequences in the reused inference results matches the current num_return_sequences value.
II. Definitions and Relationships of pass@k, cons@k, avg@n
1. Background Introduction
In reinforcement learning evaluation for large language models and multimodal understanding, pass@k, cons@k, and avg@n are core metrics that measure model performance across multiple inferences from different dimensions. These metrics apply to tasks requiring multiple independent inferences (e.g., code generation, mathematical reasoning, reinforcement learning), providing statistically meaningful multidimensional evaluations of model performance.
2. Metric Definitions and Calculations
2.1 Metric Definition Table
| Metric | Mathematical Definition | Calculation Logic | Evaluation Goal | Value Range |
|---|---|---|---|---|
| pass@k | \(1-\prod_{j=n-c+1}^{n} (1-\frac{k}{j})\) | Probability of at least one correct result (unbiased estimation) | Reliability of problem-solving ability | [0, 1] |
| cons@k | \(\frac{1}{N} \sum_{i=1}^{N} I(c_i > k/2)\) | Estimated probability of majority correctness | Stability of output results | [0, 1] |
| avg@n | \(\frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{n}\) | Average sample accuracy | Overall accuracy of predictions | [0, 1] |
Where:
N: Total number of problems (i.e., number of questions in the dataset).
n: Number of repeated inferences per problem (total generated samples), corresponding to the n parameter in code.
k: Number of samples for evaluation, used in pass@k and cons@k, corresponding to the k parameter in code.
\(c_i\): Number of correct results for problem i (i.e., samples passing tests for that problem).
\(I(\cdot)\): Indicator function (1 if condition is met, 0 otherwise).
The product term in the formula indexes j from n-c+1 to n to ensure numerical stability. The pass@k formula is written for a single problem (c is that problem's \(c_i\)); the reported value is the average over all N problems, as in the calculation example in 2.3.
2.2 Detailed Calculation Logic
pass@k: Uses an unbiased estimation method to avoid variance from direct sampling. Calculated in code via compute_pass_at_k(n, c, k), where n is the total samples per problem, c is correct samples, and k is sampled count. The formula is equivalent to the combinatorial form \(1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\) but optimized using a product form.
cons@k: Represents "consistency" or "stability" of model outputs, i.e., the proportion of problems where most samples are correct. For each problem, if correct samples \(c\) exceed \(k/2\), it counts as 1; otherwise 0. The average across all problems reflects majority voting accuracy.
avg@n: Represents the average sample-level accuracy across all problems. For each problem, calculate c / n (accuracy), then average across all problems to reflect overall prediction accuracy.
2.3 Calculation Example (num_return_sequences=3, i.e., n = k = 3)
Problem 1: Predictions [A, A, X] → Correct count = 2
Problem 2: Predictions [B, C, B] → Correct count = 2
Problem 3: Predictions [X, X, C] → Correct count = 1
Problem 4: Predictions [X, X, X] → Correct count = 0
pass@3 = (1.0 + 1.0 + 1.0 + 0.0)/4 = 0.75 (Problems 1, 2, 3 have at least one correct; Problem 4 has none)
avg@3 = (2/3 + 2/3 + 1/3 + 0/3)/4 = (0.6667 + 0.6667 + 0.3333 + 0.0)/4 ≈ 0.4167
cons@3 = (1 + 1 + 0 + 0)/4 = 0.5 (Problems 1 and 2 have majority correct; Problems 3 and 4 do not)
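The example above can be reproduced from the per-problem correct counts alone. The following is a minimal sketch, assuming the definitions in the metric table; the names pass_at_k and summarize are hypothetical and are not the project's API.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem unbiased pass@k (product form)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / j for j in range(n - c + 1, n + 1))


def summarize(correct_counts, n: int, k: int) -> dict:
    """Dataset-level pass@k, cons@k and avg@n from correct counts c_i."""
    num_problems = len(correct_counts)
    return {
        "pass@k": sum(pass_at_k(n, c, k) for c in correct_counts) / num_problems,
        # cons@k: share of problems with a strict majority of correct samples.
        "cons@k": sum(1 for c in correct_counts if c > k / 2) / num_problems,
        # avg@n: mean per-problem sample accuracy c_i / n.
        "avg@n": sum(c / n for c in correct_counts) / num_problems,
    }


# Correct counts for Problems 1-4 in the example, with n = k = 3.
result = summarize([2, 2, 1, 0], n=3, k=3)
print({name: round(value, 4) for name, value in result.items()})
```

With these inputs the sketch yields 0.75 for pass@3, 0.5 for cons@3 and roughly 0.4167 for avg@3, matching the hand calculation.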
3. cons@k vs avg@n
3.1 Analysis of Magnitude Relationships
Due to their statistical definitions (with n = k, as in the current implementation), pass@k is always greater than or equal to avg@n and cons@k. It is impossible for pass@k to be smaller than the other two metrics, so we do not compare pass@k with the others here.
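The ordering claimed for pass@k can be verified by brute force under the current constraint n = k. This is a sketch with hypothetical helper names; it checks the per-problem inequalities, which carry over to the dataset-level averages because each metric is a mean of per-problem values.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem unbiased pass@k (product form)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / j for j in range(n - c + 1, n + 1))


def pass_dominates(max_n: int = 12) -> bool:
    """Check pass@k >= avg@n and pass@k >= cons@k per problem when n == k."""
    for n in range(1, max_n + 1):
        k = n  # assumption: n = k = num_return_sequences
        for c in range(n + 1):
            p = pass_at_k(n, c, k)
            avg = c / n
            cons = 1.0 if c > k / 2 else 0.0
            if p < avg or p < cons:
                return False
    return True


print(pass_dominates())  # True
```

The check is restricted to n = k on purpose: if k were allowed to be smaller than n, a problem with c > k/2 could still have pass@k below 1, so the comparison with cons@k would no longer be guaranteed.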
The relationship between cons@k and avg@n is uncertain and depends on the model's prediction patterns. Below are common scenarios:
Scenario 1: cons@k > avg@n
Context: Model predictions tend to be highly consistent but not perfectly correct (i.e., most problems have a strict majority of correct votes, but accuracy is not 100%).
Example: Let k=3 with 2 problems:
Problem 1: Predictions [A, A, B], true answer A → Correct count 2, accuracy 2/3 ≈ 0.667; majority correct (A appears 2 times > 1.5), so cons contributes 1.
Problem 2: Predictions [B, B, C], true answer B → Accuracy 2/3 ≈ 0.667; majority correct, cons contributes 1.
avg@n = (0.667 + 0.667) / 2 = 0.667
cons@k = (1 + 1) / 2 = 1.0
Thus, cons@k > avg@n.
Scenario 2: cons@k < avg@n
Context: Model predictions are scattered with no majority, but some samples are still correct (i.e., correct predictions are spread across problems but lack consistency).
Example: Let k=3 with 2 problems:
Problem 1: Predictions [A, B, C], true answer A → Correct count 1, accuracy 1/3 ≈ 0.333; no strict majority (all counts ≤ 1.5), so cons contributes 0.
Problem 2: Predictions [A, B, C], true answer B → Accuracy 1/3 ≈ 0.333; no strict majority, cons contributes 0.
avg@n = (0.333 + 0.333) / 2 = 0.333
cons@k = (0 + 0) / 2 = 0
Thus, cons@k < avg@n.
Scenario 3: cons@k ≈ avg@n
Context: Model predictions are nearly perfect or completely wrong, or the distribution makes majority accuracy close to average accuracy.
Example: Let k=3 with 2 problems:
Problem 1: Predictions [A, A, A], true answer A → Accuracy 1.0; majority correct, cons contributes 1.
Problem 2: Predictions [B, B, B], true answer C → Accuracy 0.0; majority wrong, cons contributes 0.
avg@n = (1.0 + 0.0) / 2 = 0.5
cons@k = (1 + 0) / 2 = 0.5
Thus, cons@k = avg@n.
3.2 General Trends
When model predictions are highly consistent (i.e., most problems have a strict majority of correct votes), cons@k may exceed avg@n because cons@k only requires a majority to be correct, while avg@n is dragged down by incorrect predictions.
When predictions are scattered (i.e., most problems have no strict majority) but partially correct, avg@n may exceed cons@k because avg@n rewards partial correctness, while cons@k requires a majority.
In ideal cases (all predictions correct or all wrong), the two metrics are similar.
In practical applications (e.g., multi-turn inference in reinforcement learning), cons@k is typically used to evaluate stability, while avg@n evaluates overall accuracy. They are complementary with no fixed magnitude relationship.
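The three scenarios can be reproduced directly from predictions and true answers. This is a minimal sketch: cons_and_avg is a hypothetical helper, a sample counts as correct when it equals the true answer, and k is taken to be the number of predictions per problem.

```python
def cons_and_avg(problems):
    """Return (cons@k, avg@n) for a list of (predictions, true_answer) pairs."""
    cons_total = 0
    avg_total = 0.0
    for predictions, true_answer in problems:
        k = len(predictions)
        correct = sum(1 for p in predictions if p == true_answer)
        # Strict majority of correct samples is required for cons@k.
        cons_total += 1 if correct > k / 2 else 0
        avg_total += correct / k
    count = len(problems)
    return cons_total / count, avg_total / count


scenario_1 = [(["A", "A", "B"], "A"), (["B", "B", "C"], "B")]
scenario_2 = [(["A", "B", "C"], "A"), (["A", "B", "C"], "B")]
scenario_3 = [(["A", "A", "A"], "A"), (["B", "B", "B"], "C")]

for name, data in [("1", scenario_1), ("2", scenario_2), ("3", scenario_3)]:
    cons, avg = cons_and_avg(data)
    print(f"Scenario {name}: cons@k={cons:.3f}, avg@n={avg:.3f}")
```

Scenario 1 gives cons@k above avg@n, Scenario 2 gives cons@k below avg@n, and Scenario 3 gives equal values, as described above.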
4. Summary and Recommendations
Metric Selection Principles
Prioritize pass@k to evaluate model potential.
Use cons@k to verify stability.
Use avg@n to measure overall performance.
Common Interpretation Pitfalls
Focusing only on pass@1: Ignores the model's potential with multiple attempts.
Neglecting cons@k: May lead to instability in production environments.
Using avg@n alone: Fails to distinguish consistency and fault tolerance.
Metric Applications and Decision Guidance
The following thresholds are hypothetical: High (>0.8), Medium (0.5-0.8), Low (<0.5). The following decision-related content is for reference only.
Recommended Application Scenarios
| Scenario Type | Core Metric | Auxiliary Metrics | Target Value |
|---|---|---|---|
| Reliability Priority (medical diagnosis, financial analysis) | cons@k | pass@k | cons@k > 0.8, pass@k > 0.9 |
| Fault Tolerance Priority (code generation, exploration tasks) | pass@k | avg@n | pass@k > 0.8, avg@n > 0.7 |
| Balanced Evaluation (general AI assistants) | avg@n | cons@k + pass@k | avg@n > 0.75 |
Decision Guidance Matrix
| Metric Combination | Model Status | Improvement Direction |
|---|---|---|
| High pass@k, Medium avg@n, Low cons@k | High potential but poor stability | Enhance consistency (temperature penalty, voting mechanisms) |
| Medium pass@k, Medium avg@n, Medium cons@k | Balanced but improvable | Comprehensive optimization (data augmentation, prompt engineering) |
| Low pass@k, Low avg@n, High cons@k | Systematic bias | Inspect data/prompt engineering/model bias |
| Low pass@k, Low avg@n, Low cons@k | Nearly ineffective | Retrain or replace model architecture |
By combining these three metrics, one can comprehensively evaluate the performance characteristics of large language models, providing statistically meaningful scientific bases for model optimization and application deployment.
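As an illustration of how the decision guidance matrix above might be applied, the sketch below maps metric values to the hypothetical High/Medium/Low bands and looks up the matching row. The names level, GUIDANCE and diagnose are assumptions made for this example, and combinations that the matrix does not list simply return None.

```python
def level(value: float) -> str:
    """Map a metric value to the hypothetical bands used in the text."""
    if value > 0.8:
        return "High"
    if value >= 0.5:
        return "Medium"
    return "Low"


# Keys are (pass@k level, avg@n level, cons@k level), copied from the matrix.
GUIDANCE = {
    ("High", "Medium", "Low"): (
        "High potential but poor stability",
        "Enhance consistency (temperature penalty, voting mechanisms)",
    ),
    ("Medium", "Medium", "Medium"): (
        "Balanced but improvable",
        "Comprehensive optimization (data augmentation, prompt engineering)",
    ),
    ("Low", "Low", "High"): (
        "Systematic bias",
        "Inspect data/prompt engineering/model bias",
    ),
    ("Low", "Low", "Low"): (
        "Nearly ineffective",
        "Retrain or replace model architecture",
    ),
}


def diagnose(pass_k: float, avg_n: float, cons_k: float):
    """Return (model status, improvement direction), or None if unlisted."""
    return GUIDANCE.get((level(pass_k), level(avg_n), level(cons_k)))


print(diagnose(0.9, 0.6, 0.3))
```

Because the thresholds are explicitly hypothetical, the bands in level are the first thing to adjust when applying this to a real task.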
5. Notes
Currently, not all dataset configuration files use evaluators that support calculating these three metrics. If the Evaluator specified in eval_cfg of a dataset configuration file does not return the required metrics, the results will fall back to calculating only the original accuracy metrics.
III. Difference Analysis between accuracy (n runs average) and avg@n
Dataset Example: textvqa
The mathematical formulas below reflect the evaluation logic implemented in the TEXTEvaluator class for the textvqa dataset.
1. Metric Definitions
accuracy (n runs average): Average of replica-level soft accuracy (similarity)
Formula: \(\frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{D} \sum_{i=1}^{D} \text{avg_acc}_{ij} \right)\)
n: Number of replicas
D: Number of data points
\(\text{avg_acc}_{ij}\): Soft correctness value (continuous 0-1) for data point i in replica j
avg@n: Average of data point-level hard accuracy (whether similarity exceeds threshold 0.5)
Formula:
\(\frac{1}{D} \sum_{i=1}^{D} (\frac{1}{n} \sum_{j=1}^{n} H_{ij})\)
\(H_{ij}\): Hard correctness flag (binary: 0 or 1)
\(H_{ij} = \begin{cases} 1 & \text{if } \text{avg_acc}_{ij} > 0.5 \\ 0 & \text{otherwise} \end{cases}\)
2. Fundamental Reason for Value Differences
The two metrics use different correctness measures:
Soft Correctness (accuracy):
Continuous value (0-1) reflecting partial matching between predictions and reference answers
\(\text{avg_acc}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \text{match_score}_k\) (K is the number of reference answers)
Hard Correctness (avg@n):
Binary (0/1) determined by threshold: \(H_{ij} = \mathbb{I}(\text{avg_acc}_{ij} > 0.5)\)
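The two correctness measures for a single prediction can be sketched as follows. The function names and the example match scores are hypothetical; the sketch only mirrors the two formulas above and is not the TEXTEvaluator code.

```python
def soft_score(match_scores):
    """avg_acc: mean match score over the K reference answers."""
    return sum(match_scores) / len(match_scores)


def hard_flag(avg_acc: float, threshold: float = 0.5) -> int:
    """H: 1 if the soft score strictly exceeds the threshold, else 0."""
    return 1 if avg_acc > threshold else 0


# Hypothetical match scores of one prediction against K = 3 references.
scores = [1.0, 1.0, 0.0]
avg_acc = soft_score(scores)
print(round(avg_acc, 3), hard_flag(avg_acc))
```

Note that the comparison is strict, so a soft score of exactly 0.5 maps to a hard flag of 0.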
3. Mathematical Difference Mechanism
For a data point with \(\text{avg_acc}\) values across n replicas:
\([a_1,a_2,...,a_n]\)
Then:
\(accuracy=\frac{1}{n} \sum_{j=1}^{n}a_j\)
\(avg@n=\frac{1}{n}\sum_{j=1}^{n} \mathbb{I}(a_j > 0.5)\)
Conditions for Differences
When \(a_j \in (0,0.5) \cup (0.5,1)\) (i.e., intermediate values other than 0/1), the two metrics can differ, because thresholding replaces each \(a_j\) with 0 or 1.
Example Calculation
Assume predictions for a data point across n=3 replicas:
- Replica 1: avg_acc = 0.6 → correct = True
- Replica 2: avg_acc = 0.4 → correct = False
- Replica 3: avg_acc = 0.6 → correct = True
Calculations:
accuracy (n runs average):
Replica-level accuracy = (0.6 + 0.4 + 0.6)/3 ≈ 0.533
Global value = 0.533 (averaged across multiple data points if applicable)
avg@n:
Accuracy for this data point = 2/3 ≈ 0.667 (since 2 are correct=True)
Global value = 0.667 (averaged across multiple data points if applicable)
→ The metrics differ (0.533 vs 0.667)
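The example can be reproduced with a small sketch that follows the two formulas from the metric definitions. The input is a D x n table of soft scores, avg_acc[i][j], for data point i in replica j; the function names are hypothetical.

```python
def accuracy_n_runs(avg_acc):
    """Mean over replicas of the per-replica mean soft score."""
    num_points = len(avg_acc)
    num_replicas = len(avg_acc[0])
    per_replica = [
        sum(avg_acc[i][j] for i in range(num_points)) / num_points
        for j in range(num_replicas)
    ]
    return sum(per_replica) / num_replicas


def avg_at_n(avg_acc, threshold: float = 0.5):
    """Mean over data points of the share of replicas above the threshold."""
    num_points = len(avg_acc)
    per_point = [
        sum(1 for a in row if a > threshold) / len(row) for row in avg_acc
    ]
    return sum(per_point) / num_points


# One data point, three replicas, as in the example.
table = [[0.6, 0.4, 0.6]]
print(round(accuracy_n_runs(table), 3), round(avg_at_n(table), 3))
```

For this table the sketch gives about 0.533 for accuracy (n runs average) and about 0.667 for avg@n.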
4. Reasonableness of Differences
Different Evaluation Goals:
accuracy: Measures average matching quality of predictions (fine-grained evaluation)
avg@n: Measures the proportion of predictions exceeding the threshold (consistency evaluation)
Task Adaptability:
The Evaluator for the textvqa dataset (used in the example) requires soft correctness (other datasets may have reasonable variants)
Hard correctness is used for model robustness analysis
Mathematical Validity:
Both metrics are correctly calculated under their definitions
Differences stem from input data properties (soft vs. hard), not calculation errors
5. Special Cases of Equal Values
When all \(\text{avg_acc}_{ij} \in \{0, 1\}\) (i.e., perfect matches or complete mismatches):
accuracy = avg@n
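The equality in the binary case can be confirmed exhaustively for small n. This is a sketch with hypothetical helper names: when every soft score is 0 or 1, thresholding at 0.5 leaves each score unchanged, so the two averages are the same.

```python
from itertools import product


def soft_mean(scores):
    """Mean of the soft scores (accuracy for one data point)."""
    return sum(scores) / len(scores)


def hard_mean(scores, threshold: float = 0.5):
    """Share of scores strictly above the threshold (avg@n for one data point)."""
    return sum(1 for s in scores if s > threshold) / len(scores)


def equal_on_binary(n: int) -> bool:
    """Check soft_mean == hard_mean for every 0/1 score vector of length n."""
    return all(
        soft_mean(v) == hard_mean(v) for v in product([0.0, 1.0], repeat=n)
    )


print(all(equal_on_binary(n) for n in range(1, 8)))  # True
```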
6. Example Summary
| Feature | accuracy (n runs average) | avg@n |
|---|---|---|
| Measure Type | Soft correctness (continuous) | Hard correctness (binary) |
| Calculation Level | Data point average first → then replica average | Replica average first → then data point average |
| Core Formula | \(\frac{1}{nD} \sum \text{avg_acc}\) | \(\frac{1}{D} \sum\frac{correct}{n}\) |
| Application Scenario | Fine-grained quality evaluation | Consistency/robustness evaluation |
Difference Attribution
From the above analysis of metrics for the textvqa dataset, differences between avg@n and accuracy (n runs average) arise when the calculation logic for accuracy (or other precision metrics like pass@1 in the livecodebench dataset) in the dataset's evaluation method (i.e., the score function implemented in the Evaluator class specified in eval_cfg) differs from the judgment logic in the details field. Such differences are reasonable, statistically meaningful, and not caused by code errors.