Accuracy Evaluation Scenarios: Analysis of Evaluation Metrics

I. Relationship between n, k in Formulas and num_return_sequences in API Configuration Files

1. Calculation Logic of pass@k

Only the pass@k metric is briefly described here for reference. For the formulas of the other metrics, see Section II, Definitions and Relationships of pass@k, cons@k, avg@n.

pass@k is a core evaluation metric for code generation tasks, measuring the probability that at least one of the k candidate solutions generated by the model passes all test cases. Its calculation adopts an unbiased estimation method to avoid variance issues caused by direct sampling. The specific logic is as follows:

  1. Sample Generation and Correctness Statistics:

    • Generate n candidate solutions for each problem (n ≥ k), where c solutions pass the tests (i.e., are functionally correct).

    • Example: Generate n=100 samples, with c=20 correct ones. The single-sample pass rate is \(P_{pass} = \frac{c}{n} = 0.2\).

  2. Combinatorial Math Formula:

    • Calculate the probability that all fail when randomly selecting k samples from n: \(\frac{\binom{n-c}{k}}{\binom{n}{k}}\)

    • pass@k is the probability of at least one success: \(pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\)

    • Optimized Calculation: To avoid factorial overflow, the code uses numerical optimization:

      pass@k = 1 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
      
  3. Significance of Unbiased Estimation:

    In the current implementation, n and k cannot be set independently: both take the value of num_return_sequences. The general advantage of unbiased estimation is therefore described here for reference only.

    • Directly sampling k times leads to high variance (especially for large k). Generating n samples (n >> k) and estimating via combinatorial formulas significantly improves statistical stability.

Example: If n=5, c=3, k=2:

  • Probability of all failures = C(2,2)/C(5,2) = 1/10 = 0.1

  • pass@2 = 1 - 0.1 = 0.9 (i.e., a 90% probability of at least one success in 2 attempts).
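The product form and the combinatorial form above can be checked against each other with a short sketch. This is a pure-Python analogue of the NumPy expression shown earlier, not the project's own code; the function names `pass_at_k_product` and `pass_at_k_combinatorial` are hypothetical and chosen only for illustration.

```python
from math import comb, prod


def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate in product form (hypothetical helper)."""
    if not 1 <= k <= n or not 0 <= c <= n:
        raise ValueError("require 1 <= k <= n and 0 <= c <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    # prod_{j=n-c+1}^{n} (1 - k/j) equals C(n-c, k) / C(n, k)
    return 1.0 - prod(1.0 - k / j for j in range(n - c + 1, n + 1))


def pass_at_k_combinatorial(n: int, c: int, k: int) -> float:
    """Same quantity via exact binomial coefficients (hypothetical helper)."""
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Worked example from the text: n=5, c=3, k=2
    print(round(pass_at_k_product(5, 3, 2), 6))        # 0.9
    print(round(pass_at_k_combinatorial(5, 3, 2), 6))  # 0.9
```

The product form never builds large factorials, which is why it is preferred when n is in the hundreds.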

2. Relationship between n, k, and num_return_sequences

All three must be positive integers.

| Configurable? | Parameter | Explanation | Definition Location | Constraints |
|---|---|---|---|---|
| No | n | Number of replicas (i.e., n replicas generated per problem), or total samples | Derived from num_return_sequences (not configurable separately) | Must satisfy n ≥ k; needs no attention because separate configuration is unsupported |
| No | k | Number of samples randomly selected for evaluation, determining pass@k scale | Derived from num_return_sequences (not configurable separately) | Must satisfy n ≥ k; needs no attention because separate configuration is unsupported |
| Yes | num_return_sequences | Number of independent repeated inferences per request | API model configuration file, default: 1 | - |

3. Summary

  • pass@k Logic: Unbiased estimation based on combinatorial mathematics solves the high variance issue of direct sampling.

  • Parameter Relationships and Constraints in the current implementation:

    • n and k are not configurable separately; only num_return_sequences can be specified in the API configuration file.

    • n = k = num_return_sequences

    • The k or n in pass@k, cons@k, and avg@n all refer to num_return_sequences.

Although n and k are only used in metric calculations during evaluation, and num_return_sequences is used during inference, their values are derived from num_return_sequences in the API configuration file. Thus, when executing the evaluation phase (--mode eval), ensure that num_return_sequences in the reused inference results matches the current num_return_sequences value.
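One consequence of n = k = num_return_sequences is worth making explicit: the combinatorial pass@k formula reduces to a 0/1 indicator of "at least one correct sample" for each problem. The sketch below illustrates this under the assumption that the per-problem formula is exactly the one given in Section I; the helper name `pass_at_k_when_n_equals_k` is hypothetical.

```python
from math import comb


def pass_at_k_when_n_equals_k(n: int) -> list[float]:
    """Per-problem pass@k for every possible correct count c = 0..n, with k = n."""
    k = n  # in the current implementation both equal num_return_sequences
    return [1.0 - comb(n - c, k) / comb(n, k) for c in range(n + 1)]


if __name__ == "__main__":
    # Index c of the list is the number of correct samples for one problem.
    print(pass_at_k_when_n_equals_k(3))  # [0.0, 1.0, 1.0, 1.0]
```

With c = 0 the estimate is 0, and with any c ≥ 1 it is 1, so the dataset-level pass@k is simply the fraction of problems with at least one correct sample.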


II. Definitions and Relationships of pass@k, cons@k, avg@n

1. Background Introduction

In reinforcement learning evaluation for large language models and multimodal understanding, pass@k, cons@k, and avg@n are core metrics that measure model performance across multiple inferences from different dimensions. These metrics apply to tasks requiring multiple independent inferences (e.g., code generation, mathematical reasoning, reinforcement learning), providing statistically meaningful multidimensional evaluations of model performance.

2. Metric Definitions and Calculations

2.1 Metric Definition Table

| Metric | Mathematical Definition | Calculation Logic | Evaluation Goal | Value Range |
|---|---|---|---|---|
| pass@k | \(1-\prod_{j=n-c+1}^{n} (1-\frac{k}{j})\) | Probability of at least one correct result (unbiased estimation) | Reliability of problem-solving ability | [0, 1] |
| cons@k | \(\frac{1}{N} \sum_{i=1}^{N} I(c_i > k/2)\) | Estimated probability of majority correctness | Stability of output results | [0, 1] |
| avg@n | \(\frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{n}\) | Average sample accuracy | Overall accuracy of predictions | [0, 1] |

Where:

  • N: Total number of problems (i.e., number of questions in the dataset).

  • n: Number of repeated inferences per problem (total generated samples), corresponding to the n parameter in code.

  • k: Number of samples for evaluation, used in pass@k and cons@k, corresponding to the k parameter in code.

  • \(c_i\): Number of correct results for problem i (i.e., samples passing tests for that problem).

  • \(I(\cdot)\): Indicator function (1 if condition is met, 0 otherwise).

  • The product term in the formula indexes j from n-c+1 to n to ensure numerical stability.

2.2 Detailed Calculation Logic

  • pass@k: Uses an unbiased estimation method to avoid variance from direct sampling. Calculated in code via compute_pass_at_k(n, c, k), where n is the total samples per problem, c is correct samples, and k is sampled count. The formula is equivalent to the combinatorial form \(1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\) but optimized using a product form.

  • cons@k: Represents "consistency" or "stability" of model outputs, i.e., the proportion of problems where most samples are correct. For each problem, if correct samples \(c\) exceed \(k/2\), it counts as 1; otherwise 0. The average across all problems reflects majority voting accuracy.

  • avg@n: Represents the average sample-level accuracy across all problems. For each problem, calculate c / n (accuracy), then average across all problems to reflect overall prediction accuracy.

2.3 Calculation Example (num_return_sequences=3, i.e., n, k = 3)

Problem 1: Predictions [A, A, X] → Correct count = 2
Problem 2: Predictions [B, C, B] → Correct count = 2
Problem 3: Predictions [X, X, C] → Correct count = 1
Problem 4: Predictions [X, X, X] → Correct count = 0

pass@3 = (1.0 + 1.0 + 1.0 + 0.0)/4 = 0.75 (Problems 1, 2, 3 have at least one correct; Problem 4 has none)
avg@3 = (2/3 + 2/3 + 1/3 + 0/3)/4 = (0.6667 + 0.6667 + 0.3333 + 0.0)/4 ≈ 0.4167
cons@3 = (1 + 1 + 0 + 0)/4 = 0.5 (Problems 1 and 2 have majority correct; Problems 3 and 4 do not)
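The arithmetic of this example can be reproduced with a minimal sketch that takes only the per-problem correct counts. It assumes the definitions from the metric table in 2.1; `summarize_metrics` is a hypothetical helper name, not a function of the evaluated tool.

```python
from math import comb


def summarize_metrics(correct_counts: list[int], n: int, k: int) -> dict[str, float]:
    """Compute pass@k, cons@k and avg@n from per-problem correct counts."""
    num_problems = len(correct_counts)
    # pass@k: per-problem unbiased estimate, averaged over problems
    pass_k = sum(1.0 - comb(n - c, k) / comb(n, k) for c in correct_counts) / num_problems
    # cons@k: fraction of problems with a strict majority of correct samples
    cons_k = sum(1 for c in correct_counts if c > k / 2) / num_problems
    # avg@n: mean per-problem accuracy c / n
    avg_n = sum(c / n for c in correct_counts) / num_problems
    return {"pass@k": pass_k, "cons@k": cons_k, "avg@n": avg_n}


if __name__ == "__main__":
    # Correct counts for Problems 1 to 4 with num_return_sequences = 3
    result = summarize_metrics([2, 2, 1, 0], n=3, k=3)
    print({name: round(value, 4) for name, value in result.items()})
    # {'pass@k': 0.75, 'cons@k': 0.5, 'avg@n': 0.4167}
```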

3. cons@k vs avg@n

3.1 Analysis of Magnitude Relationships

Due to their statistical definitions, pass@k is always greater than or equal to avg@n and cons@k. It is impossible for pass@k to be smaller than the other two metrics, so we do not compare pass@k with the others here.

The relationship between cons@k and avg@n is uncertain and depends on the model's prediction patterns. Below are common scenarios:

Scenario 1: cons@k > avg@n
  • Context: Model predictions tend to be highly consistent but not perfectly correct (i.e., most problems have a strict majority of correct votes, but accuracy is not 100%).

  • Example: Let k=3 with 2 problems:

    • Problem 1: Predictions [A, A, B], true answer A → Correct count 2, accuracy 2/3 ≈ 0.667; majority correct (A appears 2 times > 1.5), so cons contributes 1.

    • Problem 2: Predictions [B, B, C], true answer B → Accuracy 2/3 ≈ 0.667; majority correct, cons contributes 1.

    • avg@n = (0.667 + 0.667) / 2 = 0.667

    • cons@k = (1 + 1) / 2 = 1.0

    • Thus, cons@k > avg@n.

Scenario 2: cons@k < avg@n
  • Context: Model predictions are scattered with no majority, yet some individual samples are correct, so average accuracy stays above zero (i.e., correct predictions are spread across problems but lack consistency).

  • Example: Let k=3 with 2 problems:

    • Problem 1: Predictions [A, B, C], true answer A → Correct count 1, accuracy 1/3 ≈ 0.333; no strict majority (all counts ≤ 1.5), so cons contributes 0.

    • Problem 2: Predictions [A, B, C], true answer B → Accuracy 1/3 ≈ 0.333; no strict majority, cons contributes 0.

    • avg@n = (0.333 + 0.333) / 2 = 0.333

    • cons@k = (0 + 0) / 2 = 0

    • Thus, cons@k < avg@n.

Scenario 3: cons@k ≈ avg@n
  • Context: Model predictions are nearly perfect or completely wrong, or the distribution makes majority accuracy close to average accuracy.

  • Example: Let k=3 with 2 problems:

    • Problem 1: Predictions [A, A, A], true answer A → Accuracy 1.0; majority correct, cons contributes 1.

    • Problem 2: Predictions [B, B, B], true answer C → Accuracy 0.0; majority wrong, cons contributes 0.

    • avg@n = (1.0 + 0.0) / 2 = 0.5

    • cons@k = (1 + 0) / 2 = 0.5

    • Thus, cons@k = avg@n.
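The three scenarios can be compared side by side with a small sketch that works on the per-problem correct counts only (2 and 2, 1 and 1, 3 and 0). The helper `cons_and_avg` is a hypothetical name, and the code assumes the cons@k and avg@n definitions from Section 2.1.

```python
from math import isclose


def cons_and_avg(correct_counts: list[int], n: int, k: int) -> tuple[float, float]:
    """Return (cons@k, avg@n) for a list of per-problem correct counts."""
    num_problems = len(correct_counts)
    cons = sum(1 for c in correct_counts if c > k / 2) / num_problems
    avg = sum(c / n for c in correct_counts) / num_problems
    return cons, avg


# Correct counts per problem, taken from the three scenarios in the text
SCENARIOS = {
    "Scenario 1": [2, 2],
    "Scenario 2": [1, 1],
    "Scenario 3": [3, 0],
}

if __name__ == "__main__":
    for name, counts in SCENARIOS.items():
        cons, avg = cons_and_avg(counts, n=3, k=3)
        relation = "=" if isclose(cons, avg) else (">" if cons > avg else "<")
        print(f"{name}: cons@3 = {cons:.3f} {relation} avg@3 = {avg:.3f}")
```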

4. Summary and Recommendations

  1. Metric Selection Principles

    • Prioritize pass@k to evaluate model potential.

    • Use cons@k to verify stability.

    • Use avg@n to measure overall performance.

  2. Common Interpretation Pitfalls

    • Focusing only on pass@1: Ignores the model's potential with multiple attempts.

    • Neglecting cons@k: May lead to instability in production environments.

    • Using avg@n alone: Fails to distinguish consistency and fault tolerance.

  3. Metric Applications and Decision Guidance

    The following thresholds are hypothetical: High (>0.8), Medium (0.5-0.8), Low (<0.5). The following decision-related content is for reference only.

    • Recommended Application Scenarios

      | Scenario Type | Core Metric | Auxiliary Metrics | Target Value |
      |---|---|---|---|
      | Reliability Priority (medical diagnosis, financial analysis) | cons@k | pass@k | cons@k > 0.8, pass@k > 0.9 |
      | Fault Tolerance Priority (code generation, exploration tasks) | pass@k | avg@n | pass@k > 0.8, avg@n > 0.7 |
      | Balanced Evaluation (general AI assistants) | avg@n | cons@k + pass@k | avg@n > 0.75 |

    • Decision Guidance Matrix

      | Metric Combination | Model Status | Improvement Direction |
      |---|---|---|
      | High pass@k, Medium avg@n, Low cons@k | High potential but poor stability | Enhance consistency (temperature penalty, voting mechanisms) |
      | Medium pass@k, Medium avg@n, Medium cons@k | Balanced but improvable | Comprehensive optimization (data augmentation, prompt engineering) |
      | Low pass@k, Low avg@n, High cons@k | Systematic bias | Inspect data/prompt engineering/model bias |
      | Low pass@k, Low avg@n, Low cons@k | Nearly ineffective | Retrain or replace model architecture |
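As a rough illustration of how the decision guidance might be applied, the sketch below buckets each metric with the hypothetical thresholds stated above and looks up the matching row. The names `level`, `DECISION_MATRIX` and `diagnose` are assumptions made for this example; combinations not listed in the matrix return None.

```python
from typing import Optional


def level(value: float) -> str:
    """Bucket a metric using the hypothetical thresholds from the text."""
    if value > 0.8:
        return "High"
    if value >= 0.5:
        return "Medium"
    return "Low"


# Keys are (pass@k level, avg@n level, cons@k level), one entry per matrix row
DECISION_MATRIX = {
    ("High", "Medium", "Low"): "High potential but poor stability",
    ("Medium", "Medium", "Medium"): "Balanced but improvable",
    ("Low", "Low", "High"): "Systematic bias",
    ("Low", "Low", "Low"): "Nearly ineffective",
}


def diagnose(pass_k: float, avg_n: float, cons_k: float) -> Optional[str]:
    """Return the model status for a metric triple, or None if no row matches."""
    return DECISION_MATRIX.get((level(pass_k), level(avg_n), level(cons_k)))


if __name__ == "__main__":
    print(diagnose(pass_k=0.9, avg_n=0.6, cons_k=0.3))
```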

By combining these three metrics, one can comprehensively evaluate the performance characteristics of large language models and obtain a statistically grounded basis for model optimization and deployment.

5. Notes

Currently, not all dataset configuration files use evaluators that support calculating these three metrics. If the Evaluator specified in eval_cfg of a dataset configuration file does not return the required metrics, the results fall back to the dataset's original accuracy metrics only.


III. Difference Analysis between accuracy (n runs average) and avg@n

Dataset Example: textvqa

The mathematical formulas below reflect the evaluation logic implemented in the TEXTEvaluator class for the textvqa dataset.

1. Metric Definitions

  • accuracy (n runs average)

    Average of replica-level soft accuracy (similarity)

    Formula: \(\frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{D} \sum_{i=1}^{D} \text{avg_acc}_{ij} \right)\)

    • n: Number of replicas

    • D: Number of data points

    • \(\text{avg_acc}_{ij}\): Soft correctness value (continuous 0-1) for data point i in replica j

  • avg@n

    Average of data point-level hard accuracy (whether similarity exceeds threshold 0.5)

    Formula:

    \(\frac{1}{D} \sum_{i=1}^{D} (\frac{1}{n} \sum_{j=1}^{n} H_{ij})\)

    • \(H_{ij}\): Hard correctness flag (binary: 0 or 1)

    • \(H_{ij} = \begin{cases} 1 & \text{if } \text{avg_acc}_{ij} > 0.5 \\ 0 & \text{otherwise} \end{cases}\)

2. Fundamental Reason for Value Differences

The two metrics use different correctness measures:

  • Soft Correctness (accuracy):

    Continuous value (0-1) reflecting partial matching between predictions and reference answers

    \(\text{avg_acc}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \text{match_score}\) (K is the number of reference answers)

  • Hard Correctness (avg@n):

    Binary (0/1) determined by threshold: \(H_{ij} = \mathbb{I}(\text{avg_acc}_{ij} > 0.5)\)

3. Mathematical Difference Mechanism

For a data point with \(\text{avg_acc}\) values across n replicas:

\([a_1,a_2,...,a_n]\)

Then:

  • \(accuracy=\frac{1}{n} \sum_{j=1}^{n}a_j\)

  • \(avg@n=\frac{1}{n}\sum_{j=1}^{n} \mathbb{I}(a_j > 0.5)\)

Conditions for Differences

When any \(a_j \in (0, 1)\) (i.e., an intermediate value other than 0 or 1), the soft score \(a_j\) and the hard flag \(\mathbb{I}(a_j > 0.5)\) are unequal, so the two metrics generally differ.

Example Calculation

Assume predictions for a data point across n=3 replicas:

- Replica 1: avg_acc = 0.6 → correct = True
- Replica 2: avg_acc = 0.4 → correct = False
- Replica 3: avg_acc = 0.6 → correct = True

Calculations:

  • accuracy (n runs average):

    • Replica-level accuracy = (0.6 + 0.4 + 0.6)/3 ≈ 0.533

    • Global value = 0.533 (averaged across multiple data points if applicable)

  • avg@n:

    • Accuracy for this data point = 2/3 ≈ 0.667 (since 2 are correct=True)

    • Global value = 0.667 (averaged across multiple data points if applicable)

⇒ The metrics differ (0.533 vs 0.667)
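The single-data-point example above can be reproduced with a minimal sketch. It assumes the threshold of 0.5 stated in the metric definitions; `soft_accuracy` and `hard_accuracy` are hypothetical helper names, not functions of TEXTEvaluator.

```python
def soft_accuracy(scores: list[float]) -> float:
    """Mean of the continuous avg_acc values (accuracy, n runs average)."""
    return sum(scores) / len(scores)


def hard_accuracy(scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of replicas whose avg_acc strictly exceeds the threshold (avg@n)."""
    return sum(1 for a in scores if a > threshold) / len(scores)


if __name__ == "__main__":
    replica_scores = [0.6, 0.4, 0.6]  # avg_acc for replicas 1 to 3
    print(f"accuracy (n runs average) = {soft_accuracy(replica_scores):.3f}")  # 0.533
    print(f"avg@n                     = {hard_accuracy(replica_scores):.3f}")  # 0.667
```

The soft score keeps the partial credit of the 0.4 replica, while the hard score rounds each replica to 0 or 1 before averaging, which is where the gap comes from.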

4. Reasonableness of Differences

  1. Different Evaluation Goals:

    • accuracy: Measures average matching quality of predictions (fine-grained evaluation)

    • avg@n: Measures the proportion of predictions exceeding the threshold (consistency evaluation)

  2. Task Adaptability:

    • The Evaluator for the textvqa dataset (used in the example) requires soft correctness (other datasets may have reasonable variants)

    • Hard correctness is used for model robustness analysis

  3. Mathematical Validity:

    • Both metrics are correctly calculated under their definitions

    • Differences stem from input data properties (soft vs. hard), not calculation errors

5. Special Cases of Equal Values

When all \(\text{avg_acc}_{ij} \in \{0, 1\}\) (i.e., perfect matches or complete mismatches):

accuracy = avg@n
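The equality in the binary case, and the gap in the general case, can be checked over a small D × n score matrix. The sketch below follows the two formulas in Section 1 (replica-first averaging for accuracy, data-point-first averaging for avg@n); the function names and the sample matrices are assumptions made for illustration.

```python
from math import isclose


def accuracy_n_runs(scores: list[list[float]]) -> float:
    """scores[i][j] is avg_acc for data point i in replica j."""
    num_points, num_replicas = len(scores), len(scores[0])
    # Average over data points within each replica, then over replicas
    per_replica = [sum(row[j] for row in scores) / num_points for j in range(num_replicas)]
    return sum(per_replica) / num_replicas


def avg_at_n(scores: list[list[float]], threshold: float = 0.5) -> float:
    """Average over replicas of the hard flag per data point, then over data points."""
    per_point = [sum(1 for a in row if a > threshold) / len(row) for row in scores]
    return sum(per_point) / len(per_point)


if __name__ == "__main__":
    binary_scores = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # only 0/1 values
    mixed_scores = [[0.6, 0.4, 0.6], [0.9, 0.2, 0.7]]   # intermediate values
    print(isclose(accuracy_n_runs(binary_scores), avg_at_n(binary_scores)))  # True
    print(isclose(accuracy_n_runs(mixed_scores), avg_at_n(mixed_scores)))    # False
```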

6. Example Summary

| Feature | accuracy (n runs average) | avg@n |
|---|---|---|
| Measure Type | Soft correctness (continuous) | Hard correctness (binary) |
| Calculation Level | Data point average first → then replica average | Replica average first → then data point average |
| Core Formula | \(\frac{1}{nD} \sum_{j=1}^{n} \sum_{i=1}^{D} \text{avg_acc}_{ij}\) | \(\frac{1}{D} \sum_{i=1}^{D} \frac{1}{n} \sum_{j=1}^{n} H_{ij}\) |
| Application Scenario | Fine-grained quality evaluation | Consistency/robustness evaluation |

Difference Attribution

The analysis above of the textvqa metrics shows when avg@n and accuracy (n runs average) differ: the calculation logic for accuracy (or for other precision metrics, such as pass@1 in the livecodebench dataset) in the dataset's evaluation method (i.e., the score function implemented in the Evaluator class specified in eval_cfg) differs from the judgment logic in the details field. Such differences are reasonable, statistically meaningful, and not caused by code errors.