# Accuracy Evaluation Scenarios: Analysis of Evaluation Metrics

## I. Relationship between `n`, `k` in Formulas and `num_return_sequences` in API Configuration Files

### 1. Calculation Logic of `pass@k`

> Only the `pass@k` metric is briefly described here for reference. For formulas of other metrics, please refer to [Definitions and Relationships of pass@k, cons@k, avg@n](#ii-definitions-and-relationships-of-passk-consk-avgn).

`pass@k` is a core evaluation metric for code generation tasks, measuring the probability that at least one of the `k` candidate solutions generated by the model passes all test cases. Its calculation adopts an **unbiased estimation method** to avoid variance issues caused by direct sampling. The specific logic is as follows:

1. **Sample Generation and Correctness Statistics**:
   - Generate `n` candidate solutions for each problem (`n ≥ k`), where `c` solutions pass the tests (i.e., are functionally correct).
   - Example: Generate `n=100` samples, with `c=20` correct ones. The single-sample pass rate is $P_{pass} = \frac{c}{n} = 0.2$.

2. **Combinatorial Math Formula**:
   - Calculate the probability that **all fail** when randomly selecting `k` samples from `n`:
     $\frac{\binom{n-c}{k}}{\binom{n}{k}}$
   - `pass@k` is the probability of at least one success:
     $pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$
   - **Optimized Calculation**: To avoid factorial overflow, the code uses a numerically stable product form:
     ```python
     # Product form of 1 - C(n-c, k) / C(n, k); avoids computing large factorials
     pass_at_k = 1 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
     ```

3. **Significance of Unbiased Estimation**:
   > Since the current implementation only supports setting `k` and `n` to the same value via `num_return_sequences`, only the general advantage of unbiased estimation is discussed here.
   - Directly sampling `k` times leads to high variance (especially for large `k`). Generating `n` samples (`n >> k`) and estimating via the combinatorial formula significantly improves statistical stability.

> **Example**: If `n=5`, `c=3`, `k=2`:
> - Probability of all failures = `C(2,2)/C(5,2) = 1/10 = 0.1`
> - `pass@2 = 1 - 0.1 = 0.9` (i.e., a 90% probability of at least one success in 2 attempts).
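
As a cross-check, here is a minimal runnable sketch of the estimator (`pass_at_k` is an illustrative name, not necessarily the function used in the codebase) that reproduces this example:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed in product form."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-draw contains a success
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=5, c=3, k=2))  # 0.9, matching the combinatorial result above
```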

### 2. Relationship between `n`, `k`, and `num_return_sequences`

> **All three must be positive integers**

| **Configurable?** | **Parameter**               | **Explanation**                                                                 | **Definition Location**                          | **Constraints**                                                |
|-------------------|------------------------------|---------------------------------------------------------------------------------|--------------------------------------------------|----------------------------------------------------------------|
| **No**            | **`n`**                      | Total number of samples generated per problem (i.e., the number of replicas)   | Derived from `num_return_sequences` (not configurable separately) | Must satisfy `n ≥ k`; holds automatically since separate configuration is unsupported |
| **No**            | **`k`**                      | Number of samples randomly selected for evaluation, which sets the scale of `pass@k` | Derived from `num_return_sequences` (not configurable separately) | Must satisfy `n ≥ k`; holds automatically since separate configuration is unsupported |
| **Yes**           | **`num_return_sequences`**   | **Number of independent repeated inferences per request**                       | API model configuration file, default: `1`       | -                                                              |

### 3. Summary

- **pass@k Logic**: Unbiased estimation based on combinatorial mathematics solves the high variance issue of direct sampling.
- **Parameter Relationships and Constraints** in the current implementation:
  - `n` and `k` are not configurable separately; only `num_return_sequences` can be specified in the **API configuration file**.
  - `n` = `k` = **`num_return_sequences`**
  - The `k` or `n` in `pass@k`, `cons@k`, and `avg@n` all refer to `num_return_sequences`.

> **Although `n` and `k` are used only in metric calculations during evaluation, while `num_return_sequences` is used during inference, all three values derive from `num_return_sequences` in the API configuration file. Therefore, when running the evaluation phase (`--mode eval`) on reused inference results, ensure that the `num_return_sequences` those results were generated with matches the current `num_return_sequences` value.**

---

## II. Definitions and Relationships of pass@k, cons@k, avg@n

### 1. Background Introduction

In the evaluation of large language models and multimodal understanding (including reinforcement learning settings), `pass@k`, `cons@k`, and `avg@n` are core metrics that measure model performance across multiple inferences from different dimensions. They apply to tasks requiring **multiple independent inferences** (e.g., code generation, mathematical reasoning, reinforcement learning) and provide statistically meaningful, multidimensional evaluations of model performance.

### 2. Metric Definitions and Calculations

#### 2.1 Metric Definition Table

| **Metric** | **Mathematical Definition**                | **Calculation Logic**                          | **Evaluation Goal**               | **Value Range** |
|------------|--------------------------------------------|------------------------------------------------|-----------------------------------|-----------------|
| **pass@k** | $\frac{1}{N} \sum_{i=1}^{N} \left( 1 - \prod_{j=n-c_i+1}^{n} \left( 1 - \frac{k}{j} \right) \right)$ | Probability of at least one correct result (unbiased estimation), averaged over problems | Reliability of problem-solving ability | [0, 1]          |
| **cons@k** | $\frac{1}{N} \sum_{i=1}^{N} I(c_i > k/2)$  | Estimated probability of majority correctness  | Stability of output results       | [0, 1]          |
| **avg@n**  | $\frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{n}$ | Average sample accuracy                        | Overall accuracy of predictions   | [0, 1]          |

> Where:
> - N: Total number of problems (i.e., number of questions in the dataset).
> - n: Number of repeated inferences per problem (total generated samples), corresponding to the `n` parameter in code.
> - k: Number of samples for evaluation, used in `pass@k` and `cons@k`, corresponding to the `k` parameter in code.
> - $c_i$: Number of correct results for problem i (i.e., samples passing tests for that problem).
> - $I(⋅)$: Indicator function (1 if condition is met, 0 otherwise).
> - The product form indexes j from $n-c_i+1$ to $n$; it is equivalent to the combinatorial form but numerically stable (no large factorials).

#### 2.2 Detailed Calculation Logic

- **pass@k**: Uses an unbiased estimation method to avoid variance from direct sampling. Calculated in code via `compute_pass_at_k(n, c, k)`, where `n` is the total samples per problem, `c` is correct samples, and `k` is sampled count. The formula is equivalent to the combinatorial form $1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ but optimized using a product form.
- **cons@k**: Represents "consistency" or "stability" of model outputs, i.e., the proportion of problems where most samples are correct. For each problem, if correct samples $c$ exceed $k/2$, it counts as 1; otherwise 0. The average across all problems reflects majority voting accuracy.
- **avg@n**: Represents the average sample-level accuracy across all problems. For each problem, calculate `c / n` (accuracy), then average across all problems to reflect overall prediction accuracy.
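
Below is a minimal sketch of how the three metrics can be aggregated from per-problem correct counts. `compute_pass_at_k` follows the product form described above; `compute_metrics` and its key names are illustrative, not necessarily those used in the actual code:

```python
import numpy as np

def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem unbiased pass@k: 1 - C(n-c, k) / C(n, k), in product form."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def compute_metrics(correct_counts: list[int], n: int, k: int) -> dict[str, float]:
    """Aggregate pass@k, cons@k, and avg@n over N problems.

    correct_counts[i] is c_i, the number of correct samples for problem i.
    """
    N = len(correct_counts)
    return {
        f"pass@{k}": sum(compute_pass_at_k(n, c, k) for c in correct_counts) / N,
        f"cons@{k}": sum(c > k / 2 for c in correct_counts) / N,  # I(c_i > k/2)
        f"avg@{n}": sum(c / n for c in correct_counts) / N,       # mean of c_i / n
    }
```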

#### 2.3 Calculation Example (`num_return_sequences=3`, i.e., `n = k = 3`)

```plaintext
Problem 1: Predictions [A, A, X] → Correct count = 2
Problem 2: Predictions [B, C, B] → Correct count = 2
Problem 3: Predictions [X, X, C] → Correct count = 1
Problem 4: Predictions [X, X, X] → Correct count = 0

pass@3 = (1.0 + 1.0 + 1.0 + 0.0)/4 = 0.75 (Problems 1, 2, 3 have at least one correct; Problem 4 has none)
avg@3 = (2/3 + 2/3 + 1/3 + 0/3)/4 = (0.6667 + 0.6667 + 0.3333 + 0.0)/4 ≈ 0.4167
cons@3 = (1 + 1 + 0 + 0)/4 = 0.5 (Problems 1 and 2 have majority correct; Problems 3 and 4 do not)
```
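
Because the current implementation forces `n = k`, the per-problem pass@k estimator reduces to the indicator $\mathbb{I}(c > 0)$ (for $c > 0$, $\binom{n-c}{k} = 0$ whenever $k > n - c$), so the example is easy to verify with a standalone check:

```python
n = k = 3
cs = [2, 2, 1, 0]                            # correct counts for Problems 1-4

print(sum(c > 0 for c in cs) / len(cs))      # pass@3 = 0.75
print(sum(c / n for c in cs) / len(cs))      # avg@3 ≈ 0.4167
print(sum(c > k / 2 for c in cs) / len(cs))  # cons@3 = 0.5
```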

### 3. `cons@k` vs `avg@n`

#### 3.1 Analysis of Magnitude Relationships

> By their statistical definitions, **`pass@k` is always greater than or equal to both `avg@n` and `cons@k`** (with `n = k`, per problem $\mathbb{I}(c > 0) \ge \frac{c}{n}$ and $\mathbb{I}(c > 0) \ge \mathbb{I}(c > k/2)$), so `pass@k` is not compared against the other two metrics here.

The relationship between `cons@k` and `avg@n` is uncertain and depends on the model's prediction patterns. Below are common scenarios:

##### Scenario 1: `cons@k` > `avg@n`

- **Context**: Model predictions tend to be highly consistent but not perfectly correct (i.e., most problems have a strict majority of correct votes, but accuracy is not 100%).
- **Example**: Let `k=3` with 2 problems:
  - Problem 1: Predictions `[A, A, B]`, true answer `A` → Correct count 2, accuracy 2/3 ≈ 0.667; majority correct (`A` appears 2 times > 1.5), so `cons` contributes 1.
  - Problem 2: Predictions `[B, B, C]`, true answer `B` → Accuracy 2/3 ≈ 0.667; majority correct, `cons` contributes 1.
  - `avg@n` = (0.667 + 0.667) / 2 = 0.667
  - `cons@k` = (1 + 1) / 2 = 1.0
  - Thus, `cons@k` > `avg@n`.

##### Scenario 2: `cons@k` < `avg@n`

- **Context**: Model predictions are scattered with no majority, but average accuracy is high (i.e., correct predictions are evenly distributed but lack consistency).
- **Example**: Let `k=3` with 2 problems:
  - Problem 1: Predictions `[A, B, C]`, true answer `A` → Correct count 1, accuracy 1/3 ≈ 0.333; no strict majority (all counts ≤ 1.5), so `cons` contributes 0.
  - Problem 2: Predictions `[A, B, C]`, true answer `B` → Accuracy 1/3 ≈ 0.333; no strict majority, `cons` contributes 0.
  - `avg@n` = (0.333 + 0.333) / 2 = 0.333
  - `cons@k` = (0 + 0) / 2 = 0
  - Thus, `cons@k` < `avg@n`.

##### Scenario 3: `cons@k` ≈ `avg@n`

- **Context**: Model predictions are nearly perfect or completely wrong, or the distribution makes majority accuracy close to average accuracy.
- **Example**: Let `k=3` with 2 problems:
  - Problem 1: Predictions `[A, A, A]`, true answer `A` → Accuracy 1.0; majority correct, `cons` contributes 1.
  - Problem 2: Predictions `[B, B, B]`, true answer `C` → Accuracy 0.0; majority wrong, `cons` contributes 0.
  - `avg@n` = (1.0 + 0.0) / 2 = 0.5
  - `cons@k` = (1 + 0) / 2 = 0.5
  - Thus, `cons@k` = `avg@n`.
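
All three scenarios reduce to the same per-problem arithmetic; here is a standalone check (with `n = k = 3`, `cons_and_avg` being an illustrative helper):

```python
def cons_and_avg(correct_counts: list[int], n: int, k: int) -> tuple[float, float]:
    """Return (cons@k, avg@n) computed from per-problem correct counts."""
    N = len(correct_counts)
    return (sum(c > k / 2 for c in correct_counts) / N,
            sum(c / n for c in correct_counts) / N)

print(cons_and_avg([2, 2], n=3, k=3))  # Scenario 1: (1.0, ~0.667) -> cons@k > avg@n
print(cons_and_avg([1, 1], n=3, k=3))  # Scenario 2: (0.0, ~0.333) -> cons@k < avg@n
print(cons_and_avg([3, 0], n=3, k=3))  # Scenario 3: (0.5, 0.5)    -> cons@k = avg@n
```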

#### 3.2 General Trends

- When model predictions are highly consistent (i.e., most problems have a strict majority of correct votes), `cons@k` may exceed `avg@n` because `cons@k` only requires a majority to be correct, while `avg@n` is dragged down by incorrect predictions.
- When predictions are scattered (i.e., most problems have no strict majority) but average accuracy is high, `avg@n` may exceed `cons@k` because `avg@n` rewards partial correctness, while `cons@k` requires a majority.
- In ideal cases (all predictions correct or all wrong), the two metrics are similar.
- In practical applications (e.g., multi-turn inference in reinforcement learning), `cons@k` is typically used to evaluate stability, while `avg@n` evaluates overall accuracy. They are complementary with no fixed magnitude relationship.

### 4. Summary and Recommendations

1. **Metric Selection Principles**
   - Prioritize `pass@k` to evaluate model potential.
   - Use `cons@k` to verify stability.
   - Use `avg@n` to measure overall performance.

2. **Common Interpretation Pitfalls**
   - Focusing only on `pass@1`: Ignores the model's potential with multiple attempts.
   - Neglecting `cons@k`: May lead to instability in production environments.
   - Using `avg@n` alone: Fails to distinguish consistency and fault tolerance.

3. **Metric Applications and Decision Guidance**
   > **The following thresholds are hypothetical**: High (>0.8), Medium (0.5-0.8), Low (<0.5)
   > **The following decision-related content is for reference only**

   - Recommended Application Scenarios

     | **Scenario Type**                          | **Core Metric** | **Auxiliary Metrics**       | **Target Value**             |
     |--------------------------------------------|-----------------|------------------------------|------------------------------|
     | **Reliability Priority** (medical diagnosis, financial analysis) | cons@k          | pass@k                       | cons@k > 0.8, pass@k > 0.9   |
     | **Fault Tolerance Priority** (code generation, exploration tasks) | pass@k          | avg@n                        | pass@k > 0.8, avg@n > 0.7    |
     | **Balanced Evaluation** (general AI assistants) | avg@n           | cons@k + pass@k              | avg@n > 0.75                 |

   - Decision Guidance Matrix

     | **Metric Combination**                | **Model Status**               | **Improvement Direction**                |
     |---------------------------------------|--------------------------------|-------------------------------------------|
     | **High pass@k, Medium avg@n, Low cons@k** | High potential but poor stability | Enhance consistency (temperature penalty, voting mechanisms) |
     | **Medium pass@k, Medium avg@n, Medium cons@k** | Balanced but improvable        | Comprehensive optimization (data augmentation, prompt engineering) |
     | **Low pass@k, Low avg@n, High cons@k** | Systematic bias                | Inspect data/prompt engineering/model bias |
     | **Low pass@k, Low avg@n, Low cons@k** | Nearly ineffective             | Retrain or replace model architecture      |

By combining these three metrics, one can comprehensively evaluate the performance characteristics of large language models, providing statistically meaningful scientific bases for model optimization and application deployment.

### 5. Notes

> Currently, **not all** dataset configuration files use evaluators that support calculating these three metrics. If the `Evaluator` specified in `eval_cfg` of a dataset configuration file does not return the required metrics, the results fall back to the evaluator's original accuracy metrics.

---

## III. Difference Analysis between `accuracy (n runs average)` and `avg@n`

### Dataset Example: textvqa

> The mathematical formulas below reflect the evaluation logic implemented in the `TEXTEvaluator` class for the `textvqa` dataset.

#### 1. Metric Definitions

- **`accuracy (n runs average)`**

  Average of replica-level soft accuracy (similarity)

  **Formula**:
  $\frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{D} \sum_{i=1}^{D} \text{avg_acc}_{ij} \right)$

  - n: Number of replicas
  - D: Number of data points
  - $\text{avg_acc}_{ij}$: Soft correctness value (continuous 0-1) for data point i in replica j

- **`avg@n`**

  Average of data point-level hard accuracy (whether similarity exceeds threshold `0.5`)

  **Formula**:

  $\frac{1}{D} \sum_{i=1}^{D} \left( \frac{1}{n} \sum_{j=1}^{n} H_{ij} \right)$

  - $H_{ij}$: Hard correctness flag (binary: 0 or 1)
  - $H_{ij} = \begin{cases} 1 & \text{if } \text{avg_acc}_{ij} > 0.5 \\ 0 & \text{otherwise} \end{cases}$
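
To make the two-level averaging concrete, here is a standalone sketch over a hypothetical `D×n` matrix of `avg_acc` values (illustrative data, not the framework's code):

```python
import numpy as np

# avg_acc[i][j]: soft score for data point i in replica j (D=2 data points, n=3 replicas)
avg_acc = np.array([[0.6, 0.4, 0.6],
                    [1.0, 0.0, 1.0]])

accuracy_n_runs = avg_acc.mean(axis=0).mean()   # soft scores: per-replica mean, then mean -> 0.6
avg_at_n = (avg_acc > 0.5).mean(axis=1).mean()  # hard flags: per-data-point mean, then mean -> ~0.667
print(accuracy_n_runs, avg_at_n)
```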

#### 2. Fundamental Reason for Value Differences

The two metrics use different correctness measures:

- **Soft Correctness** (`accuracy`):

  Continuous value (0-1) reflecting partial matching between predictions and reference answers

  $\text{avg_acc}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \text{match_score}_k$ (K is the number of reference answers; $\text{match_score}_k$ is the match score against the k-th reference)

- **Hard Correctness** (`avg@n`):

  Binary (0/1) determined by threshold: $H_{ij} = \mathbb{I}(\text{avg_acc}_{ij} > 0.5)$
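
The soft score itself is an average over the `K` reference answers. A minimal sketch of the soft-to-hard conversion (the real `match_score` logic is dataset-specific, so the scores below are placeholders):

```python
def soft_and_hard(match_scores: list[float]) -> tuple[float, int]:
    """Given match scores against K reference answers for one prediction,
    return (avg_acc, H): soft correctness and its 0.5-thresholded hard flag."""
    avg_acc = sum(match_scores) / len(match_scores)
    return avg_acc, int(avg_acc > 0.5)

print(soft_and_hard([1.0, 0.5, 0.3]))  # (0.6, 1): partial matches still cross the threshold
```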

#### 3. Mathematical Difference Mechanism

For a data point with $\text{avg_acc}$ values across n replicas:

$[a_1,a_2,...,a_n]$

Then:

- $accuracy=\frac{1}{n} \sum_{j=1}^{n}a_j$
- $avg@n=\frac{1}{n}\sum_{j=1}^{n} \mathbb{I}(a_j > 0.5)$

##### Conditions for Differences

When any $a_j \in (0, 0.5) \cup (0.5, 1)$ (i.e., takes an intermediate value other than exactly 0 or 1), the two metrics generally differ.

##### Example Calculation

Assume predictions for a data point across n=3 replicas:

```text
- Replica 1: avg_acc = 0.6 → correct = True
- Replica 2: avg_acc = 0.4 → correct = False
- Replica 3: avg_acc = 0.6 → correct = True
```

Calculations:

- accuracy (n runs average):
  - Replica-level accuracy = (0.6 + 0.4 + 0.6)/3 ≈ 0.533
  - Global value = 0.533 (averaged across multiple data points if applicable)

- avg@n:
  - Accuracy for this data point = 2/3 ≈ 0.667 (2 of the 3 replicas have correct = True)
  - Global value = 0.667 (averaged across multiple data points if applicable)

⇒ The metrics differ (0.533 vs 0.667)
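
This difference is easy to reproduce with a standalone check:

```python
a = [0.6, 0.4, 0.6]                          # avg_acc per replica for one data point

accuracy = sum(a) / len(a)                   # soft: ~0.533
avg_at_n = sum(x > 0.5 for x in a) / len(a)  # hard: ~0.667
print(accuracy, avg_at_n)
```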

#### 4. Reasonableness of Differences

1. **Different Evaluation Goals**:
   - `accuracy`: Measures average matching quality of predictions (fine-grained evaluation)
   - `avg@n`: Measures the proportion of predictions exceeding the threshold (consistency evaluation)
2. **Task Adaptability**:
   - The Evaluator for the textvqa dataset (used in the example) requires soft correctness (other datasets may have reasonable variants)
   - Hard correctness is used for model robustness analysis
3. **Mathematical Validity**:
   - Both metrics are correctly calculated under their definitions
   - Differences stem from input data properties (soft vs. hard), not calculation errors

#### 5. Special Cases of Equal Values

When all $\text{avg_acc}_{ij} \in \{0, 1\}$ (i.e., perfect matches or complete mismatches):

accuracy = avg@n

#### 6. Example Summary

| Feature         | `accuracy (n runs average)`         | `avg@n`                             |
|-----------------|-------------------------------------|-------------------------------------|
| **Measure Type**| Soft correctness (continuous)       | Hard correctness (binary)           |
| **Calculation Level** | Data point average first → then replica average | Replica average first → then data point average |
| **Core Formula**| $\frac{1}{nD} \sum_{i,j} \text{avg_acc}_{ij}$ | $\frac{1}{nD} \sum_{i,j} H_{ij}$ |
| **Application Scenario** | Fine-grained quality evaluation     | Consistency/robustness evaluation   |

### Difference Attribution

From the above analysis of the `textvqa` metrics, differences between `avg@n` and `accuracy (n runs average)` arise whenever the calculation logic for `accuracy` (or another precision metric, such as `pass@1` in the `livecodebench` dataset) in the dataset's evaluation method (i.e., the `score` function implemented by the `Evaluator` class specified in `eval_cfg`) differs from the judgment logic in the `details` field. **Such differences are reasonable, statistically meaningful, and not caused by code errors.**