# Accuracy Evaluation Scenarios: Analysis of Evaluation Metrics

## I. Relationship between `n`, `k` in Formulas and `num_return_sequences` in API Configuration Files

### 1. Calculation Logic of `pass@k`

> Only the `pass@k` metric is briefly described here for reference. For the formulas of the other metrics, see [Definitions and Relationships of pass@k, cons@k, avg@n](#II-Definitions-and-Relationships-of-passk-consk-avgn).

`pass@k` is a core evaluation metric for code generation tasks. It measures the probability that at least one of `k` candidate solutions generated by the model passes all test cases. It is computed with an **unbiased estimator** to avoid the variance issues of direct sampling. The logic is as follows:

1. **Sample Generation and Correctness Statistics**:
   - Generate `n` candidate solutions for each problem (`n ≥ k`), of which `c` pass the tests (i.e., are functionally correct).
   - Example: Generate `n=100` samples, with `c=20` correct ones. The single-sample pass rate is $P_{pass} = \frac{c}{n} = 0.2$.
2. **Combinatorial Formula**:
   - The probability that **all fail** when randomly selecting `k` samples from `n` is $\frac{\binom{n-c}{k}}{\binom{n}{k}}$
   - `pass@k` is the probability of at least one success: $pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$
   - **Optimized Calculation**: To avoid factorial overflow, the code uses an equivalent product form:
     ```text
     pass@k = 1 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
     ```
3. **Significance of Unbiased Estimation**:
   > In the current implementation, `k` and `n` can only be set to the same value via `num_return_sequences`, so this item only outlines the advantages of unbiased estimation.
   - Directly sampling `k` times leads to high variance (especially for large `k`). Generating `n` samples (`n >> k`) and estimating via the combinatorial formula significantly improves statistical stability.
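The estimator described above can be sketched in a few lines of Python. This is a minimal sketch, not the project's actual code: the function names `pass_at_k` and `pass_at_k_exact`, the input validation, and the early return for `n - c < k` are assumptions added for illustration. Only the `np.prod` expression is taken from the text.

```python
import math

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (n samples, c of them correct).

    Hypothetical helper for illustration; only the product expression
    comes from the document.
    """
    if not (1 <= k <= n and 0 <= c <= n):
        raise ValueError("require 1 <= k <= n and 0 <= c <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k); avoids large binomial coefficients.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


def pass_at_k_exact(n: int, c: int, k: int) -> float:
    """Direct combinatorial form, used here only as a cross-check."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    for n, c, k in [(100, 20, 1), (100, 20, 10), (10, 3, 5)]:
        # The two columns should agree up to floating-point error.
        print(n, c, k, pass_at_k(n, c, k), pass_at_k_exact(n, c, k))
```

For `n=100`, `c=20`, `k=1` both forms give approximately 0.2, which is the single-sample pass rate from step 1.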
> **Example**: If `n=5`, `c=3`, `k=2`:
> - Probability of all failures = `C(2,2)/C(5,2) = 1/10 = 0.1`
> - `pass@2 = 1 - 0.1 = 0.9` (i.e., a 90% probability of at least one success in 2 attempts).

### 2. Relationship between `n`, `k`, and `num_return_sequences`

> **All three must be positive integers**

| **Configurable?** | **Parameter** | **Explanation** | **Definition Location** | **Constraints** |
|-------------------|---------------|-----------------|-------------------------|-----------------|
| **No** | **`n`** | Number of replicas generated per problem (total samples per problem) | Derived from `num_return_sequences` (not configurable separately) | Must satisfy `n ≥ k`; this always holds because separate configuration is unsupported |
| **No** | **`k`** | Number of samples randomly selected for evaluation, which sets the scale of `pass@k` | Derived from `num_return_sequences` (not configurable separately) | Must satisfy `n ≥ k`; this always holds because separate configuration is unsupported |
| **Yes** | **`num_return_sequences`** | **Number of independent repeated inferences per request** | API model configuration file, default: `1` | - |

### 3. Summary

- **pass@k Logic**: Unbiased estimation based on combinatorial mathematics addresses the high variance of direct sampling.
- **Parameter Relationships and Constraints** in the current implementation:
  - `n` and `k` are not configurable separately; only `num_return_sequences` can be specified in the **API configuration file**.
  - `n` = `k` = **`num_return_sequences`**
  - The `k` or `n` in `pass@k`, `cons@k`, and `avg@n` all refer to `num_return_sequences`.
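One consequence of `n` = `k` is worth spelling out: $\binom{n-c}{n}$ is 0 whenever $c \ge 1$, so the per-problem `pass@k` collapses to 1 if any sample is correct and 0 otherwise. The sketch below computes all three metrics from per-problem correct counts under that assumption, using the definitions given in Section II. The function name `summarize_metrics` and its input format are hypothetical and chosen only for illustration; they are not part of the tool's API.

```python
from typing import Dict, List


def summarize_metrics(correct_counts: List[int], num_return_sequences: int) -> Dict[str, float]:
    """Compute pass@k, cons@k and avg@n when n = k = num_return_sequences.

    correct_counts[i] is the number of correct samples for problem i.
    Hypothetical helper; not taken from the evaluated code base.
    """
    n = k = num_return_sequences
    if n < 1:
        raise ValueError("num_return_sequences must be a positive integer")
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    if any(c < 0 or c > n for c in correct_counts):
        raise ValueError("each correct count must lie in [0, n]")

    num_problems = len(correct_counts)
    # With n = k, a problem passes iff at least one sample is correct.
    pass_at_k = sum(1 for c in correct_counts if c > 0) / num_problems
    # Majority criterion: strictly more than half of the k samples are correct.
    cons_at_k = sum(1 for c in correct_counts if c > k / 2) / num_problems
    # Mean per-problem sample accuracy.
    avg_at_n = sum(c / n for c in correct_counts) / num_problems
    return {"pass@k": pass_at_k, "cons@k": cons_at_k, "avg@n": avg_at_n}


if __name__ == "__main__":
    # Correct counts for four problems, three samples each.
    print(summarize_metrics([2, 2, 1, 0], num_return_sequences=3))
```

Because each problem's contribution to `pass@k` is at least as large as its contribution to the other two metrics, `pass@k` is never the smallest of the three in this setting.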
> **`n` and `k` are used only in metric calculation during evaluation, while `num_return_sequences` is used during inference, but all three values come from `num_return_sequences` in the API configuration file. When running the evaluation phase (`--mode eval`) on reused inference results, make sure the `num_return_sequences` used to produce those results matches the current `num_return_sequences` value.**

---

## II. Definitions and Relationships of pass@k, cons@k, avg@n

### 1. Background Introduction

In reinforcement learning evaluation for large language models and multimodal understanding, `pass@k`, `cons@k`, and `avg@n` are core metrics that measure, from different angles, how a model performs across multiple inferences. They apply to tasks requiring **multiple independent inferences** (e.g., code generation, mathematical reasoning, reinforcement learning) and provide a statistically meaningful, multidimensional view of model performance.

### 2. Metric Definitions and Calculations

#### 2.1 Metric Definition Table

| **Metric** | **Mathematical Definition** | **Calculation Logic** | **Evaluation Goal** | **Value Range** |
|------------|-----------------------------|-----------------------|---------------------|-----------------|
| **pass@k** | $\frac{1}{N} \sum_{i=1}^{N} \left[ 1 - \prod_{j=n-c_i+1}^{n} \left(1 - \frac{k}{j}\right) \right]$ | Probability of at least one correct result (unbiased estimation), averaged over problems | Reliability of problem-solving ability | [0, 1] |
| **cons@k** | $\frac{1}{N} \sum_{i=1}^{N} I(c_i > k/2)$ | Estimated probability of majority correctness | Stability of output results | [0, 1] |
| **avg@n** | $\frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{n}$ | Average sample accuracy | Overall accuracy of predictions | [0, 1] |

> Where:
> - N: Total number of problems (i.e., number of questions in the dataset).
> - n: Number of repeated inferences per problem (total generated samples), corresponding to the `n` parameter in code.
> - k: Number of samples for evaluation, used in `pass@k` and `cons@k`, corresponding to the `k` parameter in code.
> - $c_i$: Number of correct results for problem i (i.e., samples passing tests for that problem).
> - $I(\cdot)$: Indicator function (1 if the condition is met, 0 otherwise).
> - The product in the `pass@k` formula runs over $j$ from $n-c+1$ to $n$, where $c$ is the correct count of the problem in question. This form avoids large binomial coefficients and keeps the computation numerically stable.

#### 2.2 Detailed Calculation Logic

- **pass@k**: Uses an unbiased estimation method to avoid the variance of direct sampling. Calculated in code via `compute_pass_at_k(n, c, k)`, where `n` is the total number of samples per problem, `c` is the number of correct samples, and `k` is the number of samples drawn. The formula is equivalent to the combinatorial form $1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ but is evaluated as a product. The per-problem values are then averaged across all problems.
- **cons@k**: Represents the "consistency" or "stability" of model outputs, i.e., the proportion of problems where most samples are correct. A problem counts as 1 if its number of correct samples $c$ exceeds $k/2$, and as 0 otherwise. The average across all problems reflects majority-voting accuracy.
- **avg@n**: Represents the average sample-level accuracy across all problems. For each problem, calculate `c / n` (accuracy), then average across all problems to reflect overall prediction accuracy.

#### 2.3 Calculation Example (`num_return_sequences=3`, i.e., `n` = `k` = `3`)

```plaintext
Problem 1: Predictions [A, A, X] → Correct count = 2
Problem 2: Predictions [B, C, B] → Correct count = 2
Problem 3: Predictions [X, X, C] → Correct count = 1
Problem 4: Predictions [X, X, X] → Correct count = 0

pass@3 = (1.0 + 1.0 + 1.0 + 0.0)/4 = 0.75
         (Problems 1, 2, 3 have at least one correct; Problem 4 has none)
avg@3  = (2/3 + 2/3 + 1/3 + 0/3)/4 = (0.6667 + 0.6667 + 0.3333 + 0.0)/4 ≈ 0.4167
cons@3 = (1 + 1 + 0 + 0)/4 = 0.5
         (Problems 1 and 2 have majority correct; Problems 3 and 4 do not)
```

### 3. `cons@k` vs `avg@n`

#### 3.1 Analysis of Magnitude Relationships

> With `n` = `k`, as in the current implementation, **`pass@k` is always greater than or equal to `avg@n` and `cons@k`**: a problem with at least one correct sample contributes 1 to `pass@k` and at most 1 to the other two, while a problem with no correct sample contributes 0 to all three. `pass@k` therefore cannot be smaller than the other two metrics, so we **do not compare `pass@k` with the others** here.

The relationship between `cons@k` and `avg@n` is not fixed and depends on the model's prediction patterns. Below are common scenarios:

##### Scenario 1: `cons@k` > `avg@n`

- **Context**: Model predictions are highly consistent but not perfectly correct (i.e., most problems have a strict majority of correct votes, but accuracy is not 100%).
- **Example**: Let `k=3` with 2 problems:
  - Problem 1: Predictions `[A, A, B]`, true answer `A` → Correct count 2, accuracy 2/3 ≈ 0.667; majority correct (`A` appears 2 times > 1.5), so `cons` contributes 1.
  - Problem 2: Predictions `[B, B, C]`, true answer `B` → Accuracy 2/3 ≈ 0.667; majority correct, `cons` contributes 1.
  - `avg@n` = (0.667 + 0.667) / 2 = 0.667
  - `cons@k` = (1 + 1) / 2 = 1.0
  - Thus, `cons@k` > `avg@n`.

##### Scenario 2: `cons@k` < `avg@n`

- **Context**: Model predictions are scattered with no majority, yet some samples are still correct (i.e., correct predictions occur on many problems but never reach a majority).
- **Example**: Let `k=3` with 2 problems:
  - Problem 1: Predictions `[A, B, C]`, true answer `A` → Correct count 1, accuracy 1/3 ≈ 0.333; no strict majority (all counts ≤ 1.5), so `cons` contributes 0.
  - Problem 2: Predictions `[A, B, C]`, true answer `B` → Accuracy 1/3 ≈ 0.333; no strict majority, `cons` contributes 0.
  - `avg@n` = (0.333 + 0.333) / 2 = 0.333
  - `cons@k` = (0 + 0) / 2 = 0
  - Thus, `cons@k` < `avg@n`.

##### Scenario 3: `cons@k` ≈ `avg@n`

- **Context**: Model predictions are nearly perfect or completely wrong, or the distribution makes majority accuracy close to average accuracy.
- **Example**: Let `k=3` with 2 problems:
  - Problem 1: Predictions `[A, A, A]`, true answer `A` → Accuracy 1.0; majority correct, `cons` contributes 1.
  - Problem 2: Predictions `[B, B, B]`, true answer `C` → Accuracy 0.0; majority wrong, `cons` contributes 0.
  - `avg@n` = (1.0 + 0.0) / 2 = 0.5
  - `cons@k` = (1 + 0) / 2 = 0.5
  - Thus, `cons@k` = `avg@n`.

#### 3.2 General Trends

- When model predictions are highly consistent (i.e., most problems have a strict majority of correct votes), `cons@k` may exceed `avg@n` because `cons@k` only requires a majority to be correct, while `avg@n` is dragged down by every incorrect prediction.
- When predictions are scattered (i.e., most problems have no strict majority) but correct samples still occur, `avg@n` may exceed `cons@k` because `avg@n` rewards partial correctness, while `cons@k` requires a majority.
- In extreme cases (for every problem, all samples are correct or all are wrong), the two metrics coincide.
- In practical applications (e.g., multi-turn inference in reinforcement learning), `cons@k` is typically used to evaluate stability, while `avg@n` evaluates overall accuracy. They are complementary, with no fixed magnitude relationship.

### 4. Summary and Recommendations

1. **Metric Selection Principles**
   - Prioritize `pass@k` to evaluate model potential.
   - Use `cons@k` to verify stability.
   - Use `avg@n` to measure overall performance.
2. **Common Interpretation Pitfalls**
   - Focusing only on `pass@1`: Ignores the model's potential with multiple attempts.
   - Neglecting `cons@k`: May lead to instability in production environments.
   - Using `avg@n` alone: Fails to distinguish consistency from fault tolerance.
3. **Metric Applications and Decision Guidance**
   > **The following thresholds are hypothetical**: High (>0.8), Medium (0.5-0.8), Low (<0.5)
   >
   > **The following decision-related content is for reference only**

   - Recommended Application Scenarios

   | **Scenario Type** | **Core Metric** | **Auxiliary Metrics** | **Target Value** |
   |-------------------|-----------------|-----------------------|------------------|
   | **Reliability Priority** (medical diagnosis, financial analysis) | cons@k | pass@k | cons@k > 0.8, pass@k > 0.9 |
   | **Fault Tolerance Priority** (code generation, exploration tasks) | pass@k | avg@n | pass@k > 0.8, avg@n > 0.7 |
   | **Balanced Evaluation** (general AI assistants) | avg@n | cons@k + pass@k | avg@n > 0.75 |

   - Decision Guidance Matrix

   | **Metric Combination** | **Model Status** | **Improvement Direction** |
   |------------------------|------------------|---------------------------|
   | **High pass@k, Medium avg@n, Low cons@k** | High potential but poor stability | Enhance consistency (temperature penalty, voting mechanisms) |
   | **Medium pass@k, Medium avg@n, Medium cons@k** | Balanced but improvable | Comprehensive optimization (data augmentation, prompt engineering) |
   | **Low pass@k, Low avg@n, High cons@k** | Systematic bias | Inspect data/prompt engineering/model bias |
   | **Low pass@k, Low avg@n, Low cons@k** | Nearly ineffective | Retrain or replace model architecture |

   > Under the definition in Section 2.1, `cons@k` counts problems whose samples are mostly *correct*, so with `n` = `k` it cannot exceed `pass@k`. The "Low pass@k, Low avg@n, High cons@k" row therefore only applies if consistency is measured as agreement among outputs regardless of correctness.

Combining these three metrics gives a comprehensive picture of a large language model's performance characteristics and a statistically meaningful basis for model optimization and deployment decisions.

### 5. Notes

> Currently, **not all** dataset configuration files use evaluators that support calculating these three metrics.
> If the `Evaluator` specified in `eval_cfg` of a dataset configuration file does not return the required metrics, the evaluation falls back to reporting only the evaluator's original accuracy-type metrics.

---

## III. Difference Analysis between `accuracy (n runs average)` and `avg@n`

### Dataset Example: textvqa

> The mathematical formulas below reflect the evaluation logic implemented in the TEXTEvaluator class for the textvqa dataset.

#### 1. Metric Definitions

- **`accuracy (n runs average)`**: Average of replica-level soft accuracy (similarity)

  **Formula**: $\frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{D} \sum_{i=1}^{D} \text{avg_acc}_{ij} \right)$
  - n: Number of replicas
  - D: Number of data points
  - $\text{avg_acc}_{ij}$: Soft correctness value (continuous, 0-1) for data point i in replica j
- **`avg@n`**: Average of data point-level hard accuracy (whether similarity exceeds the threshold `0.5`)

  **Formula**: $\frac{1}{D} \sum_{i=1}^{D} \left( \frac{1}{n} \sum_{j=1}^{n} H_{ij} \right)$
  - $H_{ij}$: Hard correctness flag (binary: 0 or 1)
  - $H_{ij} = \begin{cases} 1 & \text{if } \text{avg_acc}_{ij} > 0.5 \\ 0 & \text{otherwise} \end{cases}$

#### 2. Fundamental Reason for Value Differences

The two metrics use different correctness measures:

- **Soft Correctness** (`accuracy`): Continuous value (0-1) reflecting partial matching between predictions and reference answers

  $\text{avg_acc}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \text{match_score}_{k}$ (K is the number of reference answers)
- **Hard Correctness** (`avg@n`): Binary (0/1), determined by a threshold:

  $H_{ij} = \mathbb{I}(\text{avg_acc}_{ij} > 0.5)$

#### 3. Mathematical Difference Mechanism

For a data point with $\text{avg_acc}$ values across n replicas: $[a_1, a_2, ..., a_n]$

Then:

- $accuracy = \frac{1}{n} \sum_{j=1}^{n} a_j$
- $avg@n = \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}(a_j > 0.5)$

##### Conditions for Differences

When any $a_j \in (0, 1)$ (i.e., an intermediate value other than 0 or 1), the soft and hard values for that replica differ, so the two metrics generally differ. They can still coincide when the deviations cancel out; for example, $[0.6, 0.4]$ gives 0.5 for both.
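The two formulas can be compared directly on a small score matrix. The following is a minimal sketch assuming the soft scores are already available as an array of shape `(D, n)`; the function name `accuracy_and_avg_at_n` and the sample values are hypothetical, and this is not the TEXTEvaluator implementation.

```python
from typing import Tuple

import numpy as np


def accuracy_and_avg_at_n(avg_acc, threshold: float = 0.5) -> Tuple[float, float]:
    """Return (accuracy (n runs average), avg@n) for soft scores of shape (D, n).

    Rows are data points, columns are replicas; values lie in [0, 1].
    Hypothetical helper for illustration only.
    """
    scores = np.asarray(avg_acc, dtype=float)
    if scores.ndim != 2 or scores.size == 0:
        raise ValueError("expected a non-empty 2-D array of shape (D, n)")
    # Soft metric: average over data points within each replica, then over replicas.
    accuracy = scores.mean(axis=0).mean()
    # Hard metric: threshold each score, average over replicas, then over data points.
    hard = (scores > threshold).astype(float)
    avg_at_n = hard.mean(axis=1).mean()
    return float(accuracy), float(avg_at_n)


if __name__ == "__main__":
    soft_scores = [[0.9, 0.3, 0.55], [1.0, 0.0, 1.0]]  # intermediate values present
    binary_scores = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # only 0 or 1
    print(accuracy_and_avg_at_n(soft_scores))    # the two values differ
    print(accuracy_and_avg_at_n(binary_scores))  # the two values coincide
```

Because every row has the same number of replicas, the order of averaging does not change either result; the gap comes entirely from thresholding.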
##### Example Calculation

Assume predictions for a data point across n=3 replicas:

```text
- Replica 1: avg_acc = 0.6 → correct = True
- Replica 2: avg_acc = 0.4 → correct = False
- Replica 3: avg_acc = 0.6 → correct = True
```

Calculations:

- accuracy (n runs average):
  - Replica-level accuracy = (0.6 + 0.4 + 0.6)/3 ≈ 0.533
  - Global value = 0.533 (averaged across multiple data points if applicable)
- avg@n:
  - Accuracy for this data point = 2/3 ≈ 0.667 (since 2 are correct=True)
  - Global value = 0.667 (averaged across multiple data points if applicable)

⇒ The metrics differ (0.533 vs 0.667)

#### 4. Reasonableness of Differences

1. **Different Evaluation Goals**:
   - `accuracy`: Measures the average matching quality of predictions (fine-grained evaluation)
   - `avg@n`: Measures the proportion of predictions exceeding the threshold (consistency evaluation)
2. **Task Adaptability**:
   - The Evaluator for the textvqa dataset (used in the example) requires soft correctness (other datasets may have reasonable variants)
   - Hard correctness is used for model robustness analysis
3. **Mathematical Validity**:
   - Both metrics are correctly calculated under their definitions
   - Differences stem from input data properties (soft vs. hard), not calculation errors

#### 5. Special Cases of Equal Values

When all $\text{avg_acc}_{ij} \in \{0, 1\}$ (i.e., perfect matches or complete mismatches):
accuracy = avg@n

#### 6. Example Summary

| Feature | `accuracy (n runs average)` | `avg@n` |
|---------|-----------------------------|---------|
| **Measure Type** | Soft correctness (continuous) | Hard correctness (binary) |
| **Calculation Level** | Data point average first → then replica average | Replica average first → then data point average |
| **Core Formula** | $\frac{1}{nD} \sum \text{avg_acc}$ | $\frac{1}{D} \sum \frac{\text{correct}}{n}$ |
| **Application Scenario** | Fine-grained quality evaluation | Consistency/robustness evaluation |

### Difference Attribution

The analysis above for the `textvqa` dataset shows when `avg@n` and `accuracy (n runs average)` diverge. The difference arises when the dataset's evaluation method (i.e., the `score` function of the `Evaluator` class specified in `eval_cfg`) computes `accuracy` (or another precision metric, such as `pass@1` in the `livecodebench` dataset) with logic that differs from the correctness judgment recorded in the `details` field. **Such differences are reasonable, statistically meaningful, and not caused by code errors.**