Dataset Preparation Guide
Supported Dataset Types
The dataset types currently supported by AISBench Benchmark are as follows:
Open-Source Datasets: Cover multiple domains, including general language understanding (e.g., ARC, SuperGLUE_BoolQ, MMLU), mathematical reasoning (e.g., GSM8K, AIME2024, Math), code generation (e.g., HumanEval, MBPP, LiveCodeBench), text summarization (e.g., XSum, LCSTS), and multimodal tasks (e.g., TextVQA, VideoBench, VocalSound). Together they support comprehensive evaluation of a language model's multi-task, multimodal, and multilingual capabilities.
Randomly Synthesized Datasets: Let you specify the input/output sequence lengths and the number of requests. They suit performance testing scenarios that place requirements on sequence distribution and data scale.
Custom Datasets: Convert user-defined data into a fixed format for evaluation. They suit customized accuracy and performance testing scenarios.
Open-Source Datasets
Open-source datasets refer to widely used, publicly accessible datasets in the community. They are typically used for model training, validation, and comparing the performance of different algorithms. AISBench Benchmark supports multiple mainstream open-source datasets, enabling users to quickly conduct standardized tests. Detailed introductions and acquisition methods are as follows:
LLM Datasets

| Dataset Name | Category | Detailed Introduction & Acquisition Method |
|---|---|---|
| DEMO | Mathematical Reasoning | |
| ARC_c | Reasoning (Common Sense + Science) | |
| ARC_e | Reasoning (Common Sense + Science) | |
| SuperGLUE_BoolQ | Natural Language Understanding (Q&A) | |
| agieval | Comprehensive Exams / Reasoning | |
| aime2024 | Mathematical Reasoning | |
| aime2025 | Mathematical Reasoning | |
| aime2026 | Mathematical Reasoning | |
| bbh | Multi-Task (Big-Bench Hard) | |
| cmmlu | Chinese Understanding / Knowledge Q&A | |
| ceval | Chinese Professional Exams | |
| drop | Reading Comprehension + Reasoning | |
| gsm8k | Mathematical Reasoning | |
| gpqa | Knowledge Q&A | |
| hellaswag | Common Sense Reasoning | |
| humaneval | Programming (Code Generation + Testing) | |
| humanevalx | Programming (Multilingual) | |
| ifeval | Instruction Following | |
| lambada | Long Text Cloze | |
| lcsts | Chinese Text Summarization | |
| livecodebench | Programming (Real-Time Code) | |
| longbench | Long Sequences | |
| longbenchv2 | Long Sequences | |
| math | Advanced Mathematical Reasoning | |
| mbpp | Programming (Python) | |
| mgsm | Multilingual Mathematical Reasoning | |
| mmlu | Multidisciplinary Understanding (English) | |
| mmlu_pro | Multidisciplinary Understanding (Professional Version) | |
| needlebench_v2 | Long Sequences | |
| piqa | Physical Common Sense Reasoning | |
| siqa | Social Common Sense Reasoning | |
| triviaqa | Knowledge Q&A | |
| winogrande | Common Sense Reasoning (Pronoun Resolution) | |
| Xsum | Text Generation (Summarization) | |
| BFCL | Function Calling Capability Evaluation | |
| FewCLUE_bustm | Short Text Semantic Matching | |
| FewCLUE_chid | Reading Comprehension Cloze | |
| FewCLUE_cluewsc | Pronoun Disambiguation | |
| FewCLUE_csl | Keyword Recognition | |
| FewCLUE_eprstmt | Sentiment Analysis | |
| FewCLUE_tnews | News Classification | |
| dapo-math-17k | Mathematical Reasoning (RL Evaluation) | |

Multimodal Datasets
| Dataset Name | Category | Detailed Introduction & Acquisition Method |
|---|---|---|
| textvqa | Multimodal Understanding (Image + Text) | |
| videobench | Multimodal Understanding (Video) | |
| vocalsound | Multimodal Understanding (Audio) | |
| Omnidocbench | Image OCR (Image + Text) | |
| MMMU | Multimodal Understanding (Image + Text) | |
| MMMU_Pro | Multimodal Understanding (Image + Text) | |
| InfoVQA | Multimodal Understanding (Image + Text) | |
| DocVQA | Multimodal Understanding (Image + Text) | |
| MMStar | Multimodal Understanding (Image + Text) | |
| Video-MME | Multimodal Understanding (Video + Text) | |
| OCRBench_v2 | Multimodal Understanding (Image + Text, OCR Evaluation) | |
| RealWorldQA | Multimodal Understanding (Image + Text) | |
| MathVision | Multimodal Understanding (Image + Text) | |
| RefCOCO | Visual Grounding (Image + Text) | |
| RefCOCO+ | Visual Grounding (Image + Text) | |
| RefCOCOg | Visual Grounding (Image + Text) | |
| HLE | Multimodal Understanding (Image + Text) | |

Multi-Turn Dialogue Datasets
| Dataset Name | Category | Detailed Introduction & Acquisition Method |
|---|---|---|
| sharegpt | Multi-Turn Dialogue | |
| mtbench | Multi-Turn Dialogue | |

Tip: Place all acquired dataset folders in the ais_bench/datasets/ directory. AISBench Benchmark locates the dataset files in this directory automatically, based on the dataset configuration file used for the test.
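Before launching a run, it can help to confirm that the expected dataset folders are actually in place. The following is a minimal sketch, not part of AISBench Benchmark: the helper name `missing_datasets` is hypothetical, and it assumes each dataset lives in a subfolder named after the dataset (as the gsm8k path ais_bench/datasets/gsm8k suggests), which may not hold for every dataset.

```python
from pathlib import Path
from typing import Iterable, List


def missing_datasets(root: str, names: Iterable[str]) -> List[str]:
    """Return the dataset names that have no folder directly under `root`.

    Assumption: each dataset is stored in `<root>/<dataset name>/`.
    """
    root_dir = Path(root)
    return [name for name in names if not (root_dir / name).is_dir()]


if __name__ == "__main__":
    # Run from the source code root directory so the relative path resolves.
    missing = missing_datasets("ais_bench/datasets", ["gsm8k", "mmlu"])
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All requested dataset folders are present.")
```

The check only looks for the folder itself; it does not validate the files inside it.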
Configuring Open-Source Datasets
The configurations of AISBench Benchmark's open-source datasets are stored in the configs/datasets directory by dataset name. Each dataset's folder contains multiple dataset configurations, with the file structure shown below:
ais_bench/benchmark/configs/datasets
├── agieval
├── aime2024
├── ARC_c
├── ...
├── gsm8k  # Dataset
│   ├── gsm8k_gen.py  # Configuration files for different versions of the dataset
│   ├── gsm8k_gen_0_shot_cot_str_perf.py
│   ├── gsm8k_gen_0_shot_cot_chat_prompt.py
│   ├── gsm8k_gen_0_shot_cot_str.py
│   ├── gsm8k_gen_4_shot_cot_str.py
│   ├── gsm8k_gen_4_shot_cot_chat_prompt.py
│   └── README_en.md
├── ...
├── vocalsound
├── winogrande
└── Xsum
Open-source dataset configuration files are named in the format {dataset_name}_{evaluation_method}_{number_of_shots}_shot_{chain_of_thought_rule}_{request_type}_{task_category}.py. Take gsm8k/gsm8k_gen_0_shot_cot_chat_prompt.py as an example: the dataset is gsm8k; the evaluation method is gen (generative evaluation, currently the only supported method); the number of shots is 0; the chain-of-thought rule is cot, meaning the request includes a chain-of-thought prompt (if omitted, no chain-of-thought prompt is used); the request type is chat_prompt (dialogue); and the task category is omitted, so it defaults to accuracy testing. Similarly, in gsm8k_gen_0_shot_cot_str_perf.py the request type is str (string) and the task category is perf, meaning the template is used for performance evaluation tasks.
💡 Tip: When specifying the dataset configuration name, the .py suffix can be omitted.
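The naming rule can be made concrete with a small parser. This is an illustrative sketch only: the regular expression is inferred from the naming format described here rather than taken from the AISBench source, `parse_config_name` and `DatasetConfigName` are hypothetical names, and real configuration files may deviate from the pattern.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the documented naming format (an assumption, not the
# tool's own parser). Every component after the evaluation method is optional.
CONFIG_NAME = re.compile(
    r"^(?P<dataset>.+?)_(?P<method>gen)"
    r"(?:_(?P<shots>\d+)_shot)?"
    r"(?:_(?P<cot>cot))?"
    r"(?:_(?P<request>chat_prompt|str))?"
    r"(?:_(?P<task>perf))?$"
)


class DatasetConfigName(NamedTuple):
    dataset: str
    method: str
    shots: Optional[int]         # None when the name gives no shot count
    cot: bool                    # True when the name contains "cot"
    request_type: Optional[str]  # "chat_prompt", "str", or None
    task: str                    # "perf", or "accuracy" when omitted


def parse_config_name(name: str) -> DatasetConfigName:
    """Split a dataset configuration name into its components."""
    # The .py suffix is optional, as noted in the tip above.
    stem = name[:-3] if name.endswith(".py") else name
    match = CONFIG_NAME.match(stem)
    if match is None:
        raise ValueError(f"unrecognized configuration name: {name!r}")
    shots = match["shots"]
    return DatasetConfigName(
        dataset=match["dataset"],
        method=match["method"],
        shots=int(shots) if shots is not None else None,
        cot=match["cot"] is not None,
        request_type=match["request"],
        task=match["task"] or "accuracy",
    )


if __name__ == "__main__":
    print(parse_config_name("gsm8k_gen_0_shot_cot_chat_prompt.py"))
    print(parse_config_name("gsm8k_gen_0_shot_cot_str_perf"))
```

A name such as gsm8k_gen, which carries only the dataset and evaluation method, parses with no shot count, no chain-of-thought flag, and the default accuracy task.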
The configuration parameters of open-source datasets are also described using Python syntax. Taking gsm8k as an example, the parameter content is as follows:
gsm8k_datasets = [
    dict(
        abbr='gsm8k',                     # Unique identifier of the dataset in the evaluation task
        type=GSM8KDataset,                # Dataset class bound to this dataset; modification is not currently supported
        path='ais_bench/datasets/gsm8k',  # Dataset path; relative paths are resolved against the source code root directory, and absolute paths are supported
        reader_cfg=gsm8k_reader_cfg,      # Data reading configuration; modification is not currently supported
        infer_cfg=gsm8k_infer_cfg,        # Inference evaluation configuration; modification is not currently supported
        eval_cfg=gsm8k_eval_cfg,          # Accuracy evaluation configuration; modification is not currently supported
    )
]
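Since path accepts absolute paths, one way to point a configuration at a dataset stored elsewhere is to copy the entry and replace only that field. The sketch below is self-contained for illustration: `GSM8KDataset` and the three `*_cfg` objects are empty stand-ins for the real ones, `with_dataset_path` is a hypothetical helper, and /data/datasets/gsm8k is an example location.

```python
from typing import Any, Dict, List


class GSM8KDataset:
    """Stand-in for the real dataset class, so the sketch runs on its own."""


# Stand-ins for the real reader/inference/evaluation configurations.
gsm8k_reader_cfg: Dict[str, Any] = {}
gsm8k_infer_cfg: Dict[str, Any] = {}
gsm8k_eval_cfg: Dict[str, Any] = {}

gsm8k_datasets = [
    dict(
        abbr='gsm8k',
        type=GSM8KDataset,
        path='ais_bench/datasets/gsm8k',
        reader_cfg=gsm8k_reader_cfg,
        infer_cfg=gsm8k_infer_cfg,
        eval_cfg=gsm8k_eval_cfg,
    )
]


def with_dataset_path(datasets: List[Dict[str, Any]], new_path: str) -> List[Dict[str, Any]]:
    """Return copies of the dataset entries with only `path` replaced."""
    return [{**cfg, "path": new_path} for cfg in datasets]


if __name__ == "__main__":
    relocated = with_dataset_path(gsm8k_datasets, "/data/datasets/gsm8k")
    print(relocated[0]["path"])       # the new absolute path
    print(gsm8k_datasets[0]["path"])  # the original entry is left unchanged
```

Only path changes here; the type, reader_cfg, infer_cfg, and eval_cfg fields are carried over untouched, in line with the note that they should not be modified.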
Randomly Synthesized Datasets
Synthesized datasets are generated automatically by the program and are suitable for testing the generalization ability of models under different input lengths, distributions, and modes. AISBench Benchmark provides two types of synthesized datasets: random character sequences and random token sequences. No additional download is required; users only need to set parameters in the configuration file. For details, see: Guide to Using Synthesized Random Dataset Configuration Files
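To give a feel for what a random character-sequence dataset amounts to, here is a minimal sketch. It is not AISBench Benchmark's implementation or configuration format: the function name, the parameter names (`num_requests`, `input_len`, `max_out_len`, `seed`), and the record fields are all assumptions chosen for illustration.

```python
import random
import string
from typing import Dict, List, Union


def make_random_requests(
    num_requests: int,
    input_len: int,
    max_out_len: int,
    seed: int = 0,
) -> List[Dict[str, Union[str, int]]]:
    """Build `num_requests` prompts of `input_len` random characters each.

    A fixed seed makes the generated data reproducible across runs.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return [
        {
            "prompt": "".join(rng.choices(alphabet, k=input_len)),
            "max_out_len": max_out_len,
        }
        for _ in range(num_requests)
    ]


if __name__ == "__main__":
    for request in make_random_requests(num_requests=3, input_len=16, max_out_len=32):
        print(request)
```

A random token-sequence variant would sample token IDs from a tokenizer's vocabulary instead of characters, so that the input length is controlled in tokens rather than characters.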
Usage Method
Synthesized datasets are used in the same way as open-source datasets: select the required configuration file in the ais_bench/benchmark/configs/datasets/synthetic/ directory. Currently, synthetic_gen.py is available. An example command is as follows:
ais_bench --models vllm_api_stream_chat --datasets synthetic_gen
Custom Datasets
AISBench Benchmark lets users integrate custom datasets to meet specific business needs. Users can organize private data into a standard format and feed it into the evaluation process through built-in interfaces. For details, see: Guide to Using Custom Datasets