Accuracy Evaluation Scenarios: Analysis of Evaluation Metrics

I. Relationship between `n`, `k` in Formulas and `num_return_sequences` in API Configuration Files

1. Calculation Logic of `pass@k`

Only the pass@k metric is briefly described here for reference. For the formulas of the other metrics, see Section II, Definitions and Relationships of pass@k, cons@k, avg@n.

pass@k is a core evaluation metric for code generation tasks, measuring the probability that at least one of the k candidate solutions generated by the model passes all test cases. Its calculation adopts an unbiased estimation method to avoid variance issues caused by direct sampling. The specific logic is as follows:

Sample Generation and Correctness Statistics:
- Generate n candidate solutions for each problem (n ≥ k), where c solutions pass the tests (i.e., are functionally correct).
- Example: Generate n=100 samples, with c=20 correct ones. The single-sample pass rate is \(P_{pass} = \frac{c}{n} = 0.2\).
Combinatorial Math Formula:
- Calculate the probability that all fail when randomly selecting k samples from n: \(\frac{\binom{n-c}{k}}{\binom{n}{k}}\)
- pass@k is the probability of at least one success: \(pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\)
- Optimized Calculation: To avoid factorial overflow, the code uses numerical optimization:
```
# n: total samples, c: correct samples, k: samples drawn (np is numpy)
pass_at_k = 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```
Significance of Unbiased Estimation:

In the current implementation, k and n can only be set to the same value, via num_return_sequences. The advantage of unbiased estimation is therefore described here only for reference.
- Directly sampling k times leads to high variance (especially for large k). Generating n samples (n >> k) and estimating via combinatorial formulas significantly improves statistical stability.
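
As a sanity check on the unbiasedness claim, the sketch below computes the exact expectation of the combinatorial estimate when the number of correct samples c follows a Binomial(n, p) distribution, and compares it with 1 - (1 - p)^k. The function names and the values n=20, k=5, p=0.2 are illustrative assumptions, and the general n ≥ k case is used even though the current implementation fixes n = k.

```python
from math import comb

def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Combinatorial estimate 1 - C(n-c, k) / C(n, k) for one problem."""
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

def expected_estimate(n: int, k: int, p: float) -> float:
    """Exact expectation of the estimate when c ~ Binomial(n, p)."""
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * pass_at_k_estimate(n, c, k)
        for c in range(n + 1)
    )

n, k, p = 20, 5, 0.2            # illustrative values only
true_value = 1 - (1 - p) ** k   # chance that k independent samples contain a success
print(expected_estimate(n, k, p), true_value)
```

The two printed numbers should agree up to floating-point rounding, which is what "unbiased" means here.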

Example: If n=5, c=3, k=2:

Probability of all failures = C(2,2)/C(5,2) = 1/10 = 0.1

pass@2 = 1 - 0.1 = 0.9 (i.e., a 90% probability of at least one success in 2 attempts).
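
The worked example above can be reproduced with a small function. This is a minimal sketch that reuses the name compute_pass_at_k(n, c, k) mentioned later in this document; the body is an assumption written with the standard library rather than a copy of the actual implementation.

```python
from math import prod

def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Product-form pass@k estimate for one problem (n samples, c correct)."""
    if n - c < k:
        # Fewer than k failures: any draw of k samples contains a success.
        return 1.0
    return 1.0 - prod(1.0 - k / j for j in range(n - c + 1, n + 1))

print(round(compute_pass_at_k(5, 3, 2), 6))     # the n=5, c=3, k=2 example
print(round(compute_pass_at_k(100, 20, 1), 6))  # k=1 reduces to c / n
```

With k=1 the estimate reduces to the single-sample pass rate c/n, which matches the n=100, c=20 example earlier in this section.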

2. Relationship between `n`, `k`, and `num_return_sequences`

All three must be positive integers.

Configurable?	Parameter	Explanation	Definition Location	Constraints
No	`n`	Number of replicas generated per problem, i.e., total samples per problem	Derived from `num_return_sequences` (not configurable separately)	Must satisfy `n ≥ k`; this always holds because `n` and `k` cannot be set separately
No	`k`	Number of samples randomly selected for evaluation, determining `pass@k` scale	Derived from `num_return_sequences` (not configurable separately)	Must satisfy `n ≥ k`; this always holds because `n` and `k` cannot be set separately
Yes	`num_return_sequences`	Number of independent repeated inferences per request	API model configuration file, default: `1`	-

3. Summary

pass@k Logic: Unbiased estimation based on combinatorial mathematics solves the high variance issue of direct sampling.
Parameter Relationships and Constraints in the current implementation:
- n and k are not configurable separately; only num_return_sequences can be specified in the API configuration file.
- n = k = num_return_sequences
- The k or n in pass@k, cons@k, and avg@n all refer to num_return_sequences.

n and k are used only in metric calculation during evaluation, while num_return_sequences is used during inference; all three take their value from num_return_sequences in the API configuration file. Thus, when running the evaluation phase on its own (--mode eval), ensure that the num_return_sequences used to produce the reused inference results matches the current num_return_sequences value.

II. Definitions and Relationships of pass@k, cons@k, avg@n

1. Background Introduction

In reinforcement learning evaluation for large language models and multimodal understanding, pass@k, cons@k, and avg@n are core metrics that measure model performance across multiple inferences from different dimensions. These metrics apply to tasks requiring multiple independent inferences (e.g., code generation, mathematical reasoning, reinforcement learning), providing statistically meaningful multidimensional evaluations of model performance.

2. Metric Definitions and Calculations

2.1 Metric Definition Table

Metric	Mathematical Definition	Calculation Logic	Evaluation Goal	Value Range
pass@k	\(\frac{1}{N} \sum_{i=1}^{N} \left[1-\prod_{j=n-c_i+1}^{n} \left(1-\frac{k}{j}\right)\right]\)	Average over problems of the probability of at least one correct result (unbiased estimation)	Reliability of problem-solving ability	[0, 1]
cons@k	\(\frac{1}{N} \sum_{i=1}^{N} I(c_i > k/2)\)	Estimated probability of majority correctness	Stability of output results	[0, 1]
avg@n	\(\frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{n}\)	Average sample accuracy	Overall accuracy of predictions	[0, 1]

Where:

N: Total number of problems (i.e., number of questions in the dataset).

n: Number of repeated inferences per problem (total generated samples), corresponding to the n parameter in code.

k: Number of samples for evaluation, used in pass@k and cons@k, corresponding to the k parameter in code.

\(c_i\): Number of correct results for problem i (i.e., samples passing tests for that problem).

\(I(⋅)\): Indicator function (1 if condition is met, 0 otherwise).

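The product in the pass@k formula runs over j from n−c+1 to n, where c is the correct count of the problem in question. This form avoids evaluating large binomial coefficients and therefore stays numerically stable.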
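
For reference, the product form follows from the combinatorial form by expanding the binomial coefficients. A short derivation for a single problem with c correct samples, assuming n − c ≥ k (otherwise the pass@k value is simply 1):

\(\frac{\binom{n-c}{k}}{\binom{n}{k}} = \frac{(n-c)!\,(n-k)!}{n!\,(n-c-k)!} = \frac{\prod_{j=n-c+1}^{n}(j-k)}{\prod_{j=n-c+1}^{n} j} = \prod_{j=n-c+1}^{n}\left(1-\frac{k}{j}\right)\)

The middle step uses \(\frac{(n-k)!}{(n-c-k)!} = \prod_{j=n-c+1}^{n}(j-k)\) and \(\frac{n!}{(n-c)!} = \prod_{j=n-c+1}^{n} j\). Under the stated assumption every factor lies between 0 and 1, so no large intermediate value appears.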

2.2 Detailed Calculation Logic

pass@k: Uses an unbiased estimation method to avoid variance from direct sampling. Calculated in code via compute_pass_at_k(n, c, k), where n is the total samples per problem, c is correct samples, and k is sampled count. The formula is equivalent to the combinatorial form \(1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\) but optimized using a product form.
cons@k: Represents “consistency” or “stability” of model outputs, i.e., the proportion of problems where most samples are correct. For each problem, if correct samples \(c\) exceed \(k/2\), it counts as 1; otherwise 0. The average across all problems reflects majority voting accuracy.
avg@n: Represents the average sample-level accuracy across all problems. For each problem, calculate c / n (accuracy), then average across all problems to reflect overall prediction accuracy.

2.3 Calculation Example (`num_return_sequences=3`, i.e., `n`, `k` = `3`)

Problem 1: Predictions [A, A, X] → Correct count = 2
Problem 2: Predictions [B, C, B] → Correct count = 2
Problem 3: Predictions [X, X, C] → Correct count = 1
Problem 4: Predictions [X, X, X] → Correct count = 0

pass@3 = (1.0 + 1.0 + 1.0 + 0.0)/4 = 0.75 (Problems 1, 2, 3 have at least one correct; Problem 4 has none)
avg@3 = (2/3 + 2/3 + 1/3 + 0/3)/4 = (0.6667 + 0.6667 + 0.3333 + 0.0)/4 ≈ 0.4167
cons@3 = (1 + 1 + 0 + 0)/4 = 0.5 (Problems 1 and 2 have majority correct; Problems 3 and 4 do not)
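
The three values above can be recomputed from the per-problem correct counts. The helper names below (pass_at_k, summarize) are hypothetical and only mirror the formulas in the definition table; they are not taken from the actual evaluator code.

```python
from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem pass@k in product form."""
    if n - c < k:
        return 1.0
    return 1.0 - prod(1.0 - k / j for j in range(n - c + 1, n + 1))

def summarize(correct_counts, n: int, k: int) -> dict:
    """Average the per-problem values over all N problems."""
    N = len(correct_counts)
    return {
        "pass@k": sum(pass_at_k(n, c, k) for c in correct_counts) / N,
        "cons@k": sum(1 for c in correct_counts if c > k / 2) / N,
        "avg@n": sum(c / n for c in correct_counts) / N,
    }

# Correct counts of Problems 1-4 from the example, with n = k = 3.
metrics = summarize([2, 2, 1, 0], n=3, k=3)
print({name: round(value, 4) for name, value in metrics.items()})
```

This should print 0.75 for pass@k, 0.5 for cons@k, and 0.4167 for avg@n, matching the hand calculation.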

3. `cons@k` vs `avg@n`

3.1 Analysis of Magnitude Relationships

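With n = k, as in the current implementation, pass@k is always greater than or equal to both avg@n and cons@k: any problem with at least one correct sample contributes 1 to pass@k, while it contributes at most 1 to the other two. pass@k therefore cannot be smaller than the other two metrics, so it is not compared with them here.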

The relationship between cons@k and avg@n is uncertain and depends on the model’s prediction patterns. Below are common scenarios:

Scenario 1: `cons@k` > `avg@n`

Context: Model predictions tend to be highly consistent but not perfectly correct (i.e., most problems have a strict majority of correct votes, but accuracy is not 100%).
Example: Let k=3 with 2 problems:
- Problem 1: Predictions [A, A, B], true answer A → Correct count 2, accuracy 2/3 ≈ 0.667; majority correct (A appears 2 times > 1.5), so cons contributes 1.
- Problem 2: Predictions [B, B, C], true answer B → Accuracy 2/3 ≈ 0.667; majority correct, cons contributes 1.
- avg@n = (0.667 + 0.667) / 2 = 0.667
- cons@k = (1 + 1) / 2 = 1.0
- Thus, cons@k > avg@n.

Scenario 2: `cons@k` < `avg@n`

Context: Model predictions are scattered with no majority, yet some individual predictions are correct (i.e., correct predictions are spread across problems but never reach a majority).
Example: Let k=3 with 2 problems:
- Problem 1: Predictions [A, B, C], true answer A → Correct count 1, accuracy 1/3 ≈ 0.333; no strict majority (all counts ≤ 1.5), so cons contributes 0.
- Problem 2: Predictions [A, B, C], true answer B → Accuracy 1/3 ≈ 0.333; no strict majority, cons contributes 0.
- avg@n = (0.333 + 0.333) / 2 = 0.333
- cons@k = (0 + 0) / 2 = 0
- Thus, cons@k < avg@n.

Scenario 3: `cons@k` ≈ `avg@n`

Context: Model predictions are nearly perfect or completely wrong, or the distribution makes majority accuracy close to average accuracy.
Example: Let k=3 with 2 problems:
- Problem 1: Predictions [A, A, A], true answer A → Accuracy 1.0; majority correct, cons contributes 1.
- Problem 2: Predictions [B, B, B], true answer C → Accuracy 0.0; majority wrong, cons contributes 0.
- avg@n = (1.0 + 0.0) / 2 = 0.5
- cons@k = (1 + 0) / 2 = 0.5
- Thus, cons@k = avg@n.
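
The three scenarios can be replayed directly from the listed predictions and true answers. This is a minimal sketch with a hypothetical helper, cons_and_avg, that counts a sample as correct when it equals the true answer and applies the strict-majority rule c > k/2 from the definition table.

```python
def cons_and_avg(problems):
    """Return (cons@k, avg@n) for a list of (predictions, true_answer) pairs."""
    indicators, accuracies = [], []
    for predictions, truth in problems:
        k = len(predictions)                      # n = k in the setting discussed here
        c = sum(1 for p in predictions if p == truth)
        indicators.append(1 if c > k / 2 else 0)  # strict majority of correct samples
        accuracies.append(c / k)
    return sum(indicators) / len(problems), sum(accuracies) / len(problems)

scenarios = {
    "Scenario 1": [(["A", "A", "B"], "A"), (["B", "B", "C"], "B")],
    "Scenario 2": [(["A", "B", "C"], "A"), (["A", "B", "C"], "B")],
    "Scenario 3": [(["A", "A", "A"], "A"), (["B", "B", "B"], "C")],
}
for name, problems in scenarios.items():
    cons, avg = cons_and_avg(problems)
    print(f"{name}: cons@k={cons:.3f}, avg@n={avg:.3f}")
```

The output reproduces the three orderings: 1.000 vs 0.667, 0.000 vs 0.333, and 0.500 vs 0.500.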

3.2 General Trends

When model predictions are highly consistent (i.e., most problems have a strict majority of correct votes), cons@k may exceed avg@n because cons@k only requires a majority to be correct, while avg@n is dragged down by incorrect predictions.
When predictions are scattered (i.e., most problems have no strict majority) but some individual predictions are still correct, avg@n may exceed cons@k because avg@n rewards partial correctness, while cons@k requires a majority.
In extreme cases (for every problem, all predictions correct or all wrong), the two metrics coincide.
In practical applications (e.g., multi-turn inference in reinforcement learning), cons@k is typically used to evaluate stability, while avg@n evaluates overall accuracy. They are complementary with no fixed magnitude relationship.

4. Summary and Recommendations

Metric Selection Principles
- Prioritize pass@k to evaluate model potential.
- Use cons@k to verify stability.
- Use avg@n to measure overall performance.
Common Interpretation Pitfalls
- Focusing only on pass@1: Ignores the model’s potential with multiple attempts.
- Neglecting cons@k: May lead to instability in production environments.
- Using avg@n alone: Fails to distinguish consistency and fault tolerance.

Metric Applications and Decision Guidance

The following thresholds are hypothetical: High (>0.8), Medium (0.5-0.8), Low (<0.5). The decision-related content below is for reference only.

Recommended Application Scenarios

Scenario Type	Core Metric	Auxiliary Metrics	Target Value
Reliability Priority (medical diagnosis, financial analysis)	cons@k	pass@k	cons@k > 0.8, pass@k > 0.9
Fault Tolerance Priority (code generation, exploration tasks)	pass@k	avg@n	pass@k > 0.8, avg@n > 0.7
Balanced Evaluation (general AI assistants)	avg@n	cons@k + pass@k	avg@n > 0.75

Decision Guidance Matrix

Metric Combination	Model Status	Improvement Direction
High pass@k, Medium avg@n, Low cons@k	High potential but poor stability	Enhance consistency (temperature penalty, voting mechanisms)
Medium pass@k, Medium avg@n, Medium cons@k	Balanced but improvable	Comprehensive optimization (data augmentation, prompt engineering)
Low pass@k, Low avg@n, High cons@k	Systematic bias	Inspect data/prompt engineering/model bias
Low pass@k, Low avg@n, Low cons@k	Nearly ineffective	Retrain or replace model architecture
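
To look up a row of the matrix for measured values, the hypothetical thresholds stated above can be turned into a small helper. The function names are illustrative, and assigning the boundary values 0.5 and 0.8 to "Medium" is an assumption, since the text does not say which side they fall on.

```python
def level(value: float) -> str:
    """Map a metric value to High / Medium / Low using the hypothetical thresholds."""
    if value > 0.8:
        return "High"
    if value >= 0.5:
        return "Medium"
    return "Low"

def combination(pass_k: float, avg_n: float, cons_k: float) -> str:
    """Build the 'Metric Combination' label used in the matrix."""
    return f"{level(pass_k)} pass@k, {level(avg_n)} avg@n, {level(cons_k)} cons@k"

print(combination(0.92, 0.61, 0.35))
```

For these made-up values the label is "High pass@k, Medium avg@n, Low cons@k", which corresponds to the first row of the matrix.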

Taken together, the three metrics characterize the potential, stability, and overall accuracy of a large language model, and give a statistically grounded basis for model optimization and deployment decisions.

5. Notes

Not all dataset configuration files currently use evaluators that support these three metrics. If the Evaluator specified in eval_cfg of a dataset configuration file does not return the required metrics, the results fall back to the original accuracy metrics only.

III. Difference Analysis between `accuracy (n runs average)` and `avg@n`

Dataset Example: textvqa

The mathematical formulas below reflect the evaluation logic implemented in the TEXTEvaluator class for the textvqa dataset.

1. Metric Definitions

accuracy (n runs average)

Average of replica-level soft accuracy (similarity)

Formula: \(\frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{D} \sum_{i=1}^{D} \text{avg_acc}_{ij} \right)\)
- n: Number of replicas
- D: Number of data points
- \(\text{avg_acc}_{ij}\): Soft correctness value (continuous 0-1) for data point i in replica j
avg@n

Average of data point-level hard accuracy (whether similarity exceeds threshold 0.5)

Formula:

\(\frac{1}{D} \sum_{i=1}^{D} (\frac{1}{n} \sum_{j=1}^{n} H_{ij})\)
- \(H_{ij}\): Hard correctness flag (binary: 0 or 1)
- \(H_{ij} = \begin{cases} 1 & \text{if } \text{avg_acc}_{ij} > 0.5 \\ 0 & \text{otherwise} \end{cases}\)

2. Fundamental Reason for Value Differences

The two metrics use different correctness measures:

Soft Correctness (accuracy):

Continuous value (0-1) reflecting partial matching between predictions and reference answers

\(\text{avg_acc}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \text{match_score}\) (K is the number of reference answers)
Hard Correctness (avg@n):

Binary (0/1) determined by threshold: \(H_{ij} = \mathbb{I}(\text{avg_acc}_{ij} > 0.5)\)

3. Mathematical Difference Mechanism

For a data point with \(\text{avg_acc}\) values across n replicas:

\([a_1,a_2,...,a_n]\)

Then:

\(accuracy=\frac{1}{n} \sum_{j=1}^{n}a_j\)
\(avg@n=\frac{1}{n}\sum_{j=1}^{n} \mathbb{I}(a_j > 0.5)\)

Conditions for Differences

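When some \(a_j∈(0,1)\) (i.e., intermediate values other than 0/1), the two metrics generally differ, because thresholding replaces \(a_j\) with 0 or 1. The per-replica gaps can occasionally cancel, e.g., \([0.75, 0.25]\) gives 0.5 under both metrics.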

Example Calculation

Assume predictions for a data point across n=3 replicas:

- Replica 1: avg_acc = 0.6 → correct = True
- Replica 2: avg_acc = 0.4 → correct = False
- Replica 3: avg_acc = 0.6 → correct = True

Calculations:

accuracy (n runs average):
- Replica-level accuracy = (0.6 + 0.4 + 0.6)/3 ≈ 0.533
- Global value = 0.533 (averaged across multiple data points if applicable)
avg@n:
- Accuracy for this data point = 2/3 ≈ 0.667 (since 2 are correct=True)
- Global value = 0.667 (averaged across multiple data points if applicable)

⇒ The metrics differ (0.533 vs 0.667)
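
The same numbers can be reproduced in code. The sketch below follows the two formulas from the Metric Definitions subsection; the function names and the nested-list layout (soft[i][j] for data point i and replica j) are assumptions for illustration and are not taken from the TEXTEvaluator class.

```python
def accuracy_n_runs(soft):
    """Mean over replicas of the per-replica mean soft score."""
    D, n = len(soft), len(soft[0])
    per_replica = [sum(soft[i][j] for i in range(D)) / D for j in range(n)]
    return sum(per_replica) / n

def avg_at_n(soft, threshold=0.5):
    """Mean over data points of the fraction of replicas above the threshold."""
    n = len(soft[0])
    per_point = [sum(1 for a in row if a > threshold) / n for row in soft]
    return sum(per_point) / len(soft)

soft = [[0.6, 0.4, 0.6]]  # one data point, three replicas, as in the example
print(round(accuracy_n_runs(soft), 3), round(avg_at_n(soft), 3))
```

This should print 0.533 and 0.667. Replacing the scores with values in {0, 1} only makes the two functions return the same number.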

4. Reasonableness of Differences

Different Evaluation Goals:
- accuracy: Measures average matching quality of predictions (fine-grained evaluation)
- avg@n: Measures the proportion of predictions exceeding the threshold (consistency evaluation)
Task Adaptability:
- The Evaluator for the textvqa dataset (used in the example) requires soft correctness (other datasets may have reasonable variants)
- Hard correctness is used for model robustness analysis
Mathematical Validity:
- Both metrics are correctly calculated under their definitions
- Differences stem from input data properties (soft vs. hard), not calculation errors

5. Special Cases of Equal Values

When all \(\text{avg_acc}_{ij} \in \{0, 1\}\) (i.e., perfect matches or complete mismatches):

accuracy = avg@n

6. Example Summary

Feature	`accuracy (n runs average)`	`avg@n`
Measure Type	Soft correctness (continuous)	Hard correctness (binary)
Calculation Level	Data point average first → then replica average	Replica average first → then data point average
Core Formula	\(\frac{1}{nD} \sum \text{avg_acc}\)	\(\frac{1}{D} \sum\frac{correct}{n}\)
Application Scenario	Fine-grained quality evaluation	Consistency/robustness evaluation

Difference Attribution

The analysis above for the textvqa dataset shows where the difference between avg@n and accuracy (n runs average) comes from. It arises when the calculation logic for accuracy (or for other precision metrics, such as pass@1 in the livecodebench dataset) in the dataset's evaluation method, i.e., the score function implemented in the Evaluator class specified in eval_cfg, differs from the judgment logic in the details field. Such differences are reasonable, statistically meaningful, and not caused by code errors.