Reproducing Dataset Evaluation Results from Large Language Model (LLM) Papers (Technical Reports) — Taking the GPQA Dataset Used by DeepSeek R1 as an Example
Preface - Methodology
To reproduce the accuracy results reported in papers using the AISBench evaluation tool, it is essential to align with the testing methodology for the dataset as described in the model’s technical report or paper. The following configurations in the evaluation tool need to be aligned accordingly:
Example: Reproducing the Evaluation Results of the DeepSeek R1 Model on the GPQA Dataset
Select the Appropriate Model Configuration File Corresponding to the Endpoint
For execution efficiency, inference services are generally used as the subjects under test when reproducing model accuracy. Inference services can be accessed via various endpoints, and the industry standard mainly adopts OpenAI - style endpoints. There are two primary OpenAI endpoints: v1/completions and v1/chat/completions.
v1/completions: The model generates text based on a “prefix continuation” logic and does not inherently distinguish between “instructions” and “content”. Strong guidance through prompt engineering (e.g., adding “Please answer:”) is required; otherwise, it may produce imitative outputs rather than executing instructions. For instance, inputting “Translate the following English to Chinese: Hello” might result in the continuation “Translate the following Chinese to English: Nihao” instead of a direct translation. Therefore, it is suitable for single - turn text generation tasks (such as code completion, short - text writing, text continuation, and simple text classification) or scenarios that need to be compatible with legacy base models.
v1/chat/completions: The model natively understands the semantic roles of system/user/assistant, prioritizes executing user instructions, and ensures more stable dialogue consistency and intent alignment. It can complete tasks like translation and summarization without complex prompt wrapping. Hence, it is ideal for modern LLM application scenarios such as multi - turn dialogues (customer service, chatbots), instruction - driven tasks (translation, summarization, data analysis), tool integration (function calling, retrieval - augmented generation), and multimodal interactions.
💡 As of January 2025, nearly all newly released LLM models support the v1/chat/completions endpoint, and the v1/completions endpoint has been largely deprecated. Consequently, model configuration files typically only use the model tasks for accessing the v1/chat/completions endpoint: vllm_api_general_chat (accessing the service via a non - streaming interface) and vllm_api_stream_chat (accessing the service via a streaming interface).
Taking the model task vllm_api_general_chat as an example, the absolute path to its corresponding model configuration file can be obtained by running the following command:
ais_bench --models vllm_api_general_chat --search
⚠️ All subsequent model - related configurations will be modified in this configuration file.
Fully Align the Maximum Output Length
The following description can be found on the [DeepSeek R1 Hugging Face Model Card](https://huggingface.co/deepseek - ai/DeepSeek - R1):
4. Evaluation Results
DeepSeek - R1 - Evaluation
For all our models, the maximum generation length is set to 32,768 tokens…
This indicates that the maximum output length of the DeepSeek R1 model is set to 32,768 tokens.
Taking the model task vllm_api_general_chat as an example, the configuration for the maximum output length is as follows:
from ais_bench.benchmark.models import VLLMCustomAPIChat
models = [
dict(
attr="service",
type=VLLMCustomAPIChat,
abbr='vllm - api - general - chat',
# ......
max_out_len=32768, # Maximum number of tokens output by the inference service
# ......
)
]
Fully Align the Post - processing Parameters
The following description is available on the [DeepSeek R1 Hugging Face Model Card](https://huggingface.co/deepseek - ai/DeepSeek - R1):
4. Evaluation Results
DeepSeek - R1 - Evaluation
…, For benchmarks requiring sampling, we use a temperature of \(0.6\), a top - p value of \(0.95\), …
It can be seen from this that the post - processing parameters of the DeepSeek R1 model include a temperature of 0.6 and a top - p value of 0.95.
Taking the model task vllm_api_general_chat as an example, the configuration for the post - processing parameters is as follows:
from ais_bench.benchmark.models import VLLMCustomAPIChat
models = [
dict(
attr="service",
type=VLLMCustomAPIChat,
abbr='vllm - api - general - chat',
# ......
temperature=0.6, # Sampling temperature for text generation
top_p=0.95, # Top - p sampling parameter
# ......
)
]
Fully Align Prompt Engineering
In the [DeepSeek R1 technical report](https://github.com/deepseek - ai/DeepSeek - R1/blob/main/DeepSeek_R1.pdf), the prompt format for the GPQA dataset is specified as follows:
For GPQA, we use the 0 - shot chain - of - thought (CoT) prompt from the original GPQA paper. The prompt template is as follows: Q: [question] A: Let’s think step by step.
In the AISBench dataset configuration file, the prompt engineering can be aligned by modifying the reader configuration, as shown below:
# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/configs/datasets/gpqa/gpqa_gen_0_shot_cot_chat_prompt.py
gpqa_reader_cfg = dict(
# ......
prompt_template='Q: {question}\nA: Let\'s think step by step.',
# ......
)
Fully Align the Answer Extraction Method
The answer format in the GPQA dataset is option - based (options A, B, C, D). In the DeepSeek R1 paper, the answer extraction method is to extract the final answer option (A/B/C/D) from the model - generated reasoning process.
Therefore, in AISBench, a custom post - processing function for answer extraction needs to be implemented in the dataset configuration file, as shown below:
# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/configs/datasets/gpqa/gpqa_gen_0_shot_cot_chat_prompt.py
import re
def gpqa_extract_answer(text):
"""
Extract the final answer option (A/B/C/D) from the model - generated reasoning text
"""
ANSWER_PATTERN = r"Answer[ \t]*:[ \t]*\$?([A - D])\$?"
match = re.search(ANSWER_PATTERN, text)
if match:
return match.group(1)
return None
from ais_bench.benchmark.datasets import GPQADataset, GPQA_Simple_Eval_postprocess, GPQAEvaluator
gpqa_eval_cfg = dict(evaluator=dict(type=GPQAEvaluator),
pred_postprocessor=dict(type=GPQA_Simple_Eval_postprocess, func=gpqa_extract_answer)) # Pass in the custom answer extraction function, which can also be directly defined in the dataset configuration file
Align the Accuracy Evaluation Metrics
Typically, model evaluation results are presented in a table. Take the results from DeepSeek as an example:
Model |
AIME 2024 pass@1 |
AIME 2024 cons@64 |
MATH - 500 pass@1 |
GPQA Diamond pass@1 |
LiveCodeBench pass@1 |
CodeForces rating |
|---|---|---|---|---|---|---|
GPT - 4o - 0513 |
9.3 |
13.4 |
74.6 |
49.9 |
32.9 |
759 |
Claude - 3.5 - Sonnet - 1022 |
16.0 |
26.7 |
78.3 |
65.0 |
38.9 |
717 |
o1 - mini |
63.6 |
80.0 |
90.0 |
60.0 |
53.8 |
1820 |
Here, cons@64 and pass@1 represent accuracy evaluation metrics. For detailed explanations of these metrics, refer to [Accuracy Metric Description](…/base_tutorials/results_intro/accuracy_metric.md#ii - definition - and - relationship - between - passk - consk - and - avgn).
Taking GPQA as an example, the table shows that pass@1 is used as the accuracy evaluation metric. The description of pass@1 in the DeepSeek R1 paper is as follows:
…, and report pass@1 using a non - zero temperature. Specifically, we use a sampling temperature of 0.6 and a top - 𝑝 value of 0.95 to generate 𝑘 responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as \({\text{pass@1}} = \frac{1}{n} \sum_{i = 1}^{n} p_i\)
Then in AISBench, configure the model configuration file as follows:
# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py
models = [
dict(
... # Other parameters
generation_kwargs = dict(
num_return_sequences = 4, # n=4~64
... # Other parameters
),
...
)
]
Under normal circumstances, n == k or k=1. In scenarios where n == k, the inferred metric is path@k; in scenarios where k=1 (i.e., pass@1 in the DeepSeek formula), it is essentially avg@n. Configuring n alone is sufficient, so the 20251219 version of the AISBench evaluation tool does not yet support configuring k independently.
After the precision evaluation phase, the results will be recorded in the logs and printed to the running window, following the format in the example below (data is for reference only):
| dataset | version | metric | mode | vllm-api-stream-chat |
| --------- | --------- | ------------------------- | ---- | -------------------- |
| GPQA_diamond | 604a78 | accuracy (4 runs average) | gen | 18.00 |
| GPQA_diamond | 604a78 | avg@4 | gen | 18.00 |
| GPQA_diamond | 604a78 | pass@4 | gen | 53.33 |
| GPQA_diamond | 604a78 | cons@4 | gen | 13.33 |
Among them, avg@4 has the same meaning as pass@1 (average over 4 runs) in DeepSeek.
⚠️ While
nonly affects the fluctuation range of the evaluation results and not the mathematical expectation, a largernmeans more repeated runs for each test case, leading to higher resource consumption. When reproducing accuracy, adjustments should be made based on the actual resource availability.
💡 If a paper does not specify the accuracy evaluation metric for a dataset,
pass@1is generally used by default. Thus, omitting the configuration ofnandkin the AISBench dataset configuration file defaults topass@1.
References
DeepSeek R1 Hugging Face Model Card: https://huggingface.co/deepseek - ai/DeepSeek - R1
DeepSeek R1 Paper: https://github.com/deepseek - ai/DeepSeek - R1/blob/main/DeepSeek_R1.pdf