Dataset Preparation Guide

Supported Dataset Types

The dataset types currently supported by AISBench Benchmark are as follows:

  1. Open-Source Datasets: Cover multiple domains, including general language understanding (e.g., ARC, SuperGLUE_BoolQ, MMLU), mathematical reasoning (e.g., GSM8K, AIME2024, Math), code generation (e.g., HumanEval, MBPP, LiveCodeBench), text summarization (e.g., XSum, LCSTS), and multimodal tasks (e.g., TextVQA, VideoBench, VocalSound). They support comprehensive evaluation of a language model's multi-task, multimodal, and multilingual capabilities.

  2. Randomly Synthesized Datasets: Let you specify the input/output sequence lengths and the number of requests. They suit performance testing scenarios with specific requirements for sequence distribution and data scale.

  3. Custom Datasets: Convert user-defined data into a fixed format for evaluation. They suit customized accuracy and performance testing scenarios.

Open-Source Datasets

Open-source datasets refer to widely used, publicly accessible datasets in the community. They are typically used for model training, validation, and comparing the performance of different algorithms. AISBench Benchmark supports multiple mainstream open-source datasets, enabling users to quickly conduct standardized tests. Detailed introductions and acquisition methods are as follows:

LLM Datasets

Dataset Name | Category | Detailed Introduction & Acquisition Method
DEMO | Mathematical Reasoning | Detailed Introduction
ARC_c | Reasoning (Common Sense + Science) | Detailed Introduction
ARC_e | Reasoning (Common Sense + Science) | Detailed Introduction
SuperGLUE_BoolQ | Natural Language Understanding (Q&A) | Detailed Introduction
agieval | Comprehensive Exams / Reasoning | Detailed Introduction
aime2024 | Mathematical Reasoning | Detailed Introduction
aime2025 | Mathematical Reasoning | Detailed Introduction
aime2026 | Mathematical Reasoning | Detailed Introduction
bbh | Multi-Task (Big-Bench Hard) | Detailed Introduction
cmmlu | Chinese Understanding / Knowledge Q&A | Detailed Introduction
ceval | Chinese Professional Exams | Detailed Introduction
drop | Reading Comprehension + Reasoning | Detailed Introduction
gsm8k | Mathematical Reasoning | Detailed Introduction
gpqa | Knowledge Q&A | Detailed Introduction
hellaswag | Common Sense Reasoning | Detailed Introduction
humaneval | Programming (Code Generation + Testing) | Detailed Introduction
humanevalx | Programming (Multilingual) | Detailed Introduction
ifeval | Instruction Following | Detailed Introduction
lambada | Long Text Cloze | Detailed Introduction
lcsts | Chinese Text Summarization | Detailed Introduction
livecodebench | Programming (Real-Time Code) | Detailed Introduction
longbench | Long Sequences | Detailed Introduction
longbenchv2 | Long Sequences | Detailed Introduction
math | Advanced Mathematical Reasoning | Detailed Introduction
mbpp | Programming (Python) | Detailed Introduction
mgsm | Multilingual Mathematical Reasoning | Detailed Introduction
mmlu | Multidisciplinary Understanding (English) | Detailed Introduction
mmlu_pro | Multidisciplinary Understanding (Professional Version) | Detailed Introduction
needlebench_v2 | Long Sequences | Detailed Introduction
piqa | Physical Common Sense Reasoning | Detailed Introduction
siqa | Social Common Sense Reasoning | Detailed Introduction
triviaqa | Knowledge Q&A | Detailed Introduction
winogrande | Common Sense Reasoning (Pronoun Resolution) | Detailed Introduction
Xsum | Text Generation (Summarization) | Detailed Introduction
BFCL | Function Calling Capability Evaluation | Detailed Introduction
FewCLUE_bustm | Short Text Semantic Matching | Detailed Introduction
FewCLUE_chid | Reading Comprehension Cloze | Detailed Introduction
FewCLUE_cluewsc | Pronoun Disambiguation | Detailed Introduction
FewCLUE_csl | Keyword Recognition | Detailed Introduction
FewCLUE_eprstmt | Sentiment Analysis | Detailed Introduction
FewCLUE_tnews | News Classification | Detailed Introduction
dapo-math-17k | Mathematical Reasoning (RL Evaluation) | Detailed Introduction

Multimodal Datasets

Dataset Name | Category | Detailed Introduction & Acquisition Method
textvqa | Multimodal Understanding (Image + Text) | Detailed Introduction
videobench | Multimodal Understanding (Video) | Detailed Introduction
vocalsound | Multimodal Understanding (Audio) | Detailed Introduction
Omnidocbench | Image OCR (Image + Text) | Detailed Introduction
MMMU | Multimodal Understanding (Image + Text) | Detailed Introduction
MMMU_Pro | Multimodal Understanding (Image + Text) | Detailed Introduction
InfoVQA | Multimodal Understanding (Image + Text) | Detailed Introduction
DocVQA | Multimodal Understanding (Image + Text) | Detailed Introduction
MMStar | Multimodal Understanding (Image + Text) | Detailed Introduction
Video-MME | Multimodal Understanding (Video + Text) | Detailed Introduction
OCRBench_v2 | Multimodal Understanding (Image + Text, OCR Evaluation) | Detailed Introduction
RealWorldQA | Multimodal Understanding (Image + Text) | Detailed Introduction
MathVision | Multimodal Understanding (Image + Text) | Detailed Introduction
RefCOCO | Visual Grounding (Image + Text) | Detailed Introduction
RefCOCO+ | Visual Grounding (Image + Text) | Detailed Introduction
RefCOCOg | Visual Grounding (Image + Text) | Detailed Introduction
HLE | Multimodal Understanding (Image + Text) | Detailed Introduction

Multi-Turn Dialogue Datasets

Dataset Name | Category | Detailed Introduction & Acquisition Method
sharegpt | Multi-Turn Dialogue | Detailed Introduction
mtbench | Multi-Turn Dialogue | Detailed Introduction

Tip: Place all acquired dataset folders in the ais_bench/datasets/ directory. During testing, AISBench Benchmark automatically locates the dataset files in this directory according to the dataset configuration file.

Configuring Open-Source Datasets

The configurations of AISBench Benchmark's open-source datasets are stored in the ais_bench/benchmark/configs/datasets directory, one folder per dataset. Each dataset's folder contains multiple configuration files, with the file structure shown below:

ais_bench/benchmark/configs/datasets
├── agieval
├── aime2024
├── ARC_c
├── ...
├── gsm8k  # Dataset
│   ├── gsm8k_gen.py  # Configuration files for different versions of the dataset
│   ├── gsm8k_gen_0_shot_cot_str_perf.py
│   ├── gsm8k_gen_0_shot_cot_chat_prompt.py
│   ├── gsm8k_gen_0_shot_cot_str.py
│   ├── gsm8k_gen_4_shot_cot_str.py
│   ├── gsm8k_gen_4_shot_cot_chat_prompt.py
│   └── README_en.md
├── ...
├── vocalsound
├── winogrande
└── Xsum

Open-source dataset configuration files are named in the format {dataset_name}_{evaluation_method}_{number_of_shots}_shot_{chain_of_thought_rule}_{request_type}_{task_category}.py. Take gsm8k/gsm8k_gen_0_shot_cot_chat_prompt.py as an example: the dataset is gsm8k; the evaluation method is gen (generative evaluation, currently the only supported method); the number of shots is 0; the chain-of-thought rule is cot, meaning the request includes a chain-of-thought prompt (if cot is omitted, it does not); the request type is chat_prompt (dialogue); and the task category is omitted, which defaults to accuracy testing. Similarly, in gsm8k_gen_0_shot_cot_str_perf.py the request type is str (string) and the task category is perf, meaning the template is used for performance evaluation tasks.
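The naming rule can be illustrated with a small parser. This is a minimal sketch and not part of AISBench Benchmark: the function parse_config_name and the DatasetConfigName tuple are hypothetical names introduced here, and the pattern assumes that gen is the only evaluation method and that chat_prompt and str are the only request types, since those are the only values this guide mentions.

```python
import re
from typing import NamedTuple, Optional


class DatasetConfigName(NamedTuple):
    """Hypothetical container for the parts of a dataset configuration name."""
    dataset: str
    eval_method: str
    shots: Optional[int]         # None when the name does not state a shot count
    cot: bool                    # True when the name contains the cot fragment
    request_type: Optional[str]  # "chat_prompt", "str", or None when omitted
    task: str                    # "performance" for a perf suffix, else "accuracy"


# Assumption: only the fragments described in this guide are recognized.
_PATTERN = re.compile(
    r"^(?P<dataset>.+?)_(?P<method>gen)"
    r"(?:_(?P<shots>\d+)_shot)?"
    r"(?P<cot>_cot)?"
    r"(?:_(?P<request>chat_prompt|str))?"
    r"(?P<perf>_perf)?$"
)


def parse_config_name(name: str) -> DatasetConfigName:
    """Split a configuration name (with or without .py) into its components."""
    stem = name[:-3] if name.endswith(".py") else name
    match = _PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unrecognized dataset configuration name: {name!r}")
    shots = match["shots"]
    return DatasetConfigName(
        dataset=match["dataset"],
        eval_method=match["method"],
        shots=int(shots) if shots is not None else None,
        cot=match["cot"] is not None,
        request_type=match["request"],
        task="performance" if match["perf"] else "accuracy",
    )


print(parse_config_name("gsm8k_gen_0_shot_cot_chat_prompt.py"))
print(parse_config_name("gsm8k_gen_0_shot_cot_str_perf"))
```

The dataset part is matched lazily so that names containing underscores are kept whole up to the first _gen fragment.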

💡 Tip: When specifying the dataset configuration name, the .py suffix can be omitted.

The configuration parameters of open-source datasets are also described using Python syntax. Taking gsm8k as an example, the parameter content is as follows:

gsm8k_datasets = [
    dict(
        abbr='gsm8k',                     # Unique identifier of the dataset in the evaluation task
        type=GSM8KDataset,                # Dataset class bound to this dataset; modification is not currently supported
        path='ais_bench/datasets/gsm8k',  # Dataset path; relative paths are resolved against the source code root directory, and absolute paths are also supported
        reader_cfg=gsm8k_reader_cfg,      # Data reading configuration; modification is not currently supported
        infer_cfg=gsm8k_infer_cfg,        # Inference evaluation configuration; modification is not currently supported
        eval_cfg=gsm8k_eval_cfg,          # Accuracy evaluation configuration; modification is not currently supported
    )
]
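
The path rule in the comment above (relative paths resolve against the source code root directory, absolute paths are used as given) can be sketched as follows. This is an illustration only: resolve_dataset_path and the root directory /opt/benchmark are hypothetical and do not describe how AISBench Benchmark resolves paths internally.

```python
from pathlib import Path


def resolve_dataset_path(path: str, source_root: str) -> Path:
    """Return the dataset directory that a `path` value would refer to.

    Hypothetical helper: absolute paths are returned unchanged, and relative
    paths are joined onto the source code root directory.
    """
    candidate = Path(path)
    if candidate.is_absolute():
        return candidate
    return Path(source_root) / candidate


# Relative path: resolved against the (assumed) source code root.
print(resolve_dataset_path("ais_bench/datasets/gsm8k", "/opt/benchmark"))
# Absolute path: used as given.
print(resolve_dataset_path("/data/gsm8k", "/opt/benchmark"))
```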

Randomly Synthesized Datasets

Synthesized datasets are generated automatically by a program and are suitable for testing the generalization ability of models under different input lengths, distributions, and modes. AISBench Benchmark provides two types of synthesized datasets: random character sequences and random token sequences. No additional download is required; users only need to set parameters in the configuration file to use them. For details, see: 📚 Guide to Using Synthesized Random Dataset Configuration Files
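
To make the two dataset types concrete, the sketch below builds random character requests and random token requests from a request count and sequence lengths. It is a conceptual illustration and not the AISBench Benchmark generator: the function names, the prompt, input_ids, and max_out_len field names, and the vocab_size parameter are all assumptions made for this example.

```python
import random
import string
from typing import Dict, List


def make_random_string_requests(
    num_requests: int, input_len: int, output_len: int, seed: int = 0
) -> List[Dict[str, object]]:
    """Build requests whose prompts are random character sequences."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    alphabet = string.ascii_letters + string.digits
    return [
        {
            "prompt": "".join(rng.choices(alphabet, k=input_len)),
            "max_out_len": output_len,  # hypothetical field: requested output length
        }
        for _ in range(num_requests)
    ]


def make_random_token_requests(
    num_requests: int, input_len: int, output_len: int, vocab_size: int, seed: int = 0
) -> List[Dict[str, object]]:
    """Build requests whose inputs are random token IDs below vocab_size."""
    rng = random.Random(seed)
    return [
        {
            "input_ids": [rng.randrange(vocab_size) for _ in range(input_len)],
            "max_out_len": output_len,
        }
        for _ in range(num_requests)
    ]


string_requests = make_random_string_requests(num_requests=4, input_len=32, output_len=16)
token_requests = make_random_token_requests(
    num_requests=4, input_len=32, output_len=16, vocab_size=32000
)
print(len(string_requests), len(token_requests))
```

Fixing the seed keeps the generated workload reproducible across runs, which is useful when comparing performance results.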

Usage Method

The usage method is the same as that of open-source datasets. Simply select the required configuration file in the ais_bench/benchmark/configs/datasets/synthetic/ directory. Currently, synthetic_gen.py is available. An example command is as follows:

ais_bench --models vllm_api_stream_chat --datasets synthetic_gen

Custom Datasets

AISBench Benchmark supports users in integrating custom datasets to meet specific business needs. Users can organize private data into a standard format and seamlessly integrate it into the evaluation process through built-in interfaces. For details, see: 📚 Guide to Using Custom Datasets