Supporting New Datasets and Accuracy Evaluators
Currently, AISBench supports the following data types: open-source datasets, custom datasets, and synthetic datasets. Before adapting a new dataset, it is recommended to first refer to the usage instructions for custom datasets and synthetic datasets to confirm whether they can meet actual needs.
For datasets that cannot meet requirements (for example, dataset loading methods or accuracy calculation rules differ significantly from other datasets), adaptation is needed. Before starting, it is recommended to first refer to the definition methods of prompt_template and meta_template to understand how AISBench constructs prompts, how to convert raw data into actual model inputs, and the functions of components involved.
The implementation steps are as follows:
Add a new dataset script `mydataset.py` in the `ais_bench/benchmark/datasets` folder. This script needs to include:
Dataset and its loading method: Define a `MyDataset` class and implement the dataset loading method `load`. This method is a static method and must return data of type `datasets.Dataset`. HuggingFace Dataset is used here as a unified interface for datasets to avoid introducing additional logic. Reference format is as follows:
    import datasets

    from .base import BaseDataset


    class MyDataset(BaseDataset):

        @staticmethod
        def load(**kwargs) -> datasets.Dataset:
            ...  # Implement dataset loading logic
            data_list = ...  # Dataset list
            return datasets.Dataset.from_list(data_list)  # Convert dataset list to HuggingFace Dataset object
It is recommended to add the new dataset class to `__init__.py` so that it can be imported automatically later.
For specific examples, refer to Aime2024Dataset
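As a minimal sketch of what the loading logic inside `load` might look like, the helper below reads a JSON Lines file into a list of dictionaries. The file layout (one JSON object per line with `question` and `answer` fields) and the helper name `read_jsonl_records` are assumptions made for illustration; they are not part of AISBench.

```python
import json
from pathlib import Path
from typing import Dict, List


def read_jsonl_records(path: str) -> List[Dict[str, str]]:
    """Read one JSON object per line and keep only the fields used downstream.

    Hypothetical helper: assumes every row has "question" and "answer" keys.
    """
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        # Store both fields as strings so every record has the same schema
        records.append({"question": str(row["question"]), "answer": str(row["answer"])})
    return records
```

Inside `MyDataset.load`, the returned list would take the place of `data_list` and be passed to `datasets.Dataset.from_list`, as in the reference format.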
For multimodal data, the `load` function must use formatted concatenation to combine text, image, video, and audio data into one data item. During subsequent parsing, the data is restored and assembled into model input according to the marker for each data type.
Concatenation format example:

    <AIS_TEXT_START>{text}<AIS_CONTENT_TAG><AIS_IMAGE_START>{image}<AIS_CONTENT_TAG><AIS_VIDEO_START>{video}<AIS_CONTENT_TAG><AIS_AUDIO_START>{audio}<AIS_CONTENT_TAG>

Where `{text}`, `{image}`, `{video}`, `{audio}` are the text, image, video, and audio content in the dataset.
For specific examples, refer to MMCustomDataset
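The concatenation format can be sketched as a small helper. The marker strings come from the format shown above; the function name `build_mm_item`, and the choice to skip modalities that are absent, are assumptions for illustration. How image, video, and audio content is represented (for example as a path or an encoded string) is not specified here, so the example simply treats each as a string.

```python
from typing import Optional

CONTENT_TAG = "<AIS_CONTENT_TAG>"
START_MARKERS = {
    "text": "<AIS_TEXT_START>",
    "image": "<AIS_IMAGE_START>",
    "video": "<AIS_VIDEO_START>",
    "audio": "<AIS_AUDIO_START>",
}


def build_mm_item(
    text: Optional[str] = None,
    image: Optional[str] = None,
    video: Optional[str] = None,
    audio: Optional[str] = None,
) -> str:
    """Concatenate the provided modalities in text, image, video, audio order.

    Hypothetical helper: modalities passed as None are omitted (an assumption).
    """
    parts = {"text": text, "image": image, "video": video, "audio": audio}
    return "".join(
        f"{START_MARKERS[kind]}{value}{CONTENT_TAG}"
        for kind, value in parts.items()
        if value is not None
    )


item = build_mm_item(text="Describe the picture.", image="images/cat.png")
```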
(Optional) Custom accuracy evaluator: If the existing accuracy evaluators in AISBench cannot meet your needs, define a `MyDatasetEvaluator` class and implement the scoring method `score`. Based on the input `predictions` and `references` lists, this method must return a dictionary containing metrics and their corresponding scores. Since a dataset may have multiple metrics, the returned dictionary should include all relevant evaluation metrics. A specific example is as follows:
    from typing import List

    from ais_bench.benchmark.openicl.icl_evaluator import BaseEvaluator


    class MyDatasetEvaluator(BaseEvaluator):

        def score(self, predictions: List, references: List) -> dict:
            # Implement evaluation logic
            # Return format: {"metric_name": score_value, ...}
            pass
For specific implementation, refer to MATHEvaluator
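To make the return format concrete, here is a minimal sketch of scoring logic that could form the body of `score`. It computes exact-match accuracy as a standalone function so it runs without AISBench installed. The function name, the whitespace stripping, the percentage scale, and the handling of mismatched or empty inputs are all assumptions for illustration, not AISBench conventions.

```python
from typing import Dict, List


def exact_match_score(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Return {"accuracy": percentage of predictions equal to their reference}."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return {"accuracy": 0.0}  # assumption: an empty input scores zero
    correct = sum(
        str(pred).strip() == str(ref).strip()
        for pred, ref in zip(predictions, references)
    )
    return {"accuracy": 100.0 * correct / len(predictions)}


result = exact_match_score(["4", " 5 ", "7"], ["4", "5", "8"])
```

A dataset with several metrics would add more keys to the returned dictionary alongside `accuracy`.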
(Optional) Custom post-processing method: If the existing post-processing methods in AISBench cannot meet your needs, define a `mydataset_postprocess` method that produces the post-processed result from the input string. This method is typically used for cleaning model output and extracting answers. A specific example is as follows:
    def mydataset_postprocess(text: str) -> str:
        # Implement post-processing logic, such as extracting answers, cleaning format, etc.
        # Return processed string
        pass
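As one possible post-processing rule, the sketch below extracts the content of the last `\boxed{...}` in the model output, which suits prompts that ask the model to put its final answer in `\boxed{}`. The function name `extract_boxed_answer` and the fallback of returning the stripped text when no boxed answer is found are assumptions for illustration.

```python
def extract_boxed_answer(text: str) -> str:
    """Return the content of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return text.strip()  # assumption: fall back to the cleaned raw output
    begin = start + len(marker)
    depth = 1  # the opening brace of \boxed{ is already consumed
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i].strip()
    return text.strip()  # unbalanced braces: fall back to the cleaned raw output


answer = extract_boxed_answer("So the result is \\boxed{\\frac{1}{2}}.")
```

Counting brace depth rather than using a simple regular expression keeps nested LaTeX such as `\frac{1}{2}` intact.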
After defining the dataset loading, evaluation, and data post-processing methods, add the following configuration file `my_dataset.py` in the configuration directory `…/ais_bench/benchmark/configs/datasets`:

    from ais_bench.benchmark.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess
    # The three import paths below are assumed to sit alongside the
    # icl_evaluator module; adjust them to match your AISBench version.
    from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate
    from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever
    from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer

    # Accuracy evaluation configuration
    mydataset_eval_cfg = dict(
        evaluator=dict(type=MyDatasetEvaluator),  # Custom accuracy evaluator class name
        pred_postprocessor=dict(type=mydataset_postprocess)  # Custom data post-processing method
    )

    # Dataset reading configuration: Configure according to fields of each sample in the dataset, used to fill prompt_template
    mydataset_reader_cfg = dict(
        input_columns=["question"],  # Input field list
        output_column="answer"  # Output field (ground truth)
    )

    # Inference configuration
    mydataset_infer_cfg = dict(
        prompt_template=dict(
            # Prompt template class name, configure according to data type:
            # - PromptTemplate: Pure text input
            # - MultiTurnPromptTemplate: Multi-turn dialogue input
            # - MMPromptTemplate: Multimodal input
            type=PromptTemplate,
            template=dict(
                round=[
                    dict(
                        role="HUMAN",
                        prompt="{question}\nRemember to put your final answer within \\boxed{}.",
                    ),
                ],
            ),
        ),
        retriever=dict(type=ZeroRetriever),  # Retriever configuration
        inferencer=dict(type=GenInferencer),  # Inferencer configuration
    )

    # Dataset configuration list
    mydataset_datasets = [
        dict(
            type=MyDataset,  # Custom dataset class name
            abbr='mydataset',  # Unique dataset identifier
            # ... Other dataset initialization parameters ...
            reader_cfg=mydataset_reader_cfg,  # Dataset reading configuration
            infer_cfg=mydataset_infer_cfg,  # Inference configuration
            eval_cfg=mydataset_eval_cfg  # Accuracy evaluation configuration
        )
    ]
Then run the following command to start a local evaluation task:
ais_bench --models vllm_api_stream_chat --datasets my_dataset
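To illustrate how the fields named in `input_columns` relate to the placeholders in the prompt template, the sketch below fills a template from one sample. It is a hypothetical stand-in, not AISBench's actual substitution code: the function `fill_prompt` and the sample values are invented for illustration.

```python
from typing import Dict, List


def fill_prompt(template: str, sample: Dict[str, str], input_columns: List[str]) -> str:
    """Replace each {column} placeholder with the matching field of the sample."""
    prompt = template
    for column in input_columns:
        prompt = prompt.replace("{" + column + "}", str(sample[column]))
    return prompt


template = "{question}\nRemember to put your final answer within \\boxed{}."
sample = {"question": "What is 2 + 3?", "answer": "5"}  # made-up sample
prompt = fill_prompt(template, sample, input_columns=["question"])
```

This sketch uses `str.replace` instead of `str.format` so that the literal `{}` after `\boxed` is left untouched; `str.format` would treat it as a positional placeholder and fail. The `answer` field is not inserted into the prompt, since it is the `output_column` used as ground truth.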
Add README documentation
Create a `README.md` file in the configuration directory `ais_bench/benchmark/configs/datasets/my_dataset/` to explain how to deploy and use the dataset. The README should include the following content:
Dataset Introduction: Briefly introduce the basic information, characteristics, and uses of the dataset, and attach a link to the dataset homepage (if it exists). Example format:
    # MyDataset
    中文 | [English](README_en.md)
    ## Dataset Introduction
    MyDataset is a benchmark dataset for evaluating model performance on XXX tasks. The dataset contains XXX samples covering XXX different categories.

    > 🔗 Dataset homepage link [https://example.com/mydataset](https://example.com/mydataset)
Dataset Deployment: Detailed steps for downloading and deploying the dataset, including:
Download link or method for obtaining the dataset
Deployment path and directory structure requirements
Deployment steps (recommended to provide executable command examples)
Directory structure verification method (recommended to use the `tree` command to show the expected directory structure)
Example format:
    ## Dataset Deployment
    - The dataset package can be downloaded from the link provided by XXX 🔗 [https://example.com/mydataset.zip](https://example.com/mydataset.zip).
    - It is recommended to deploy in the `{tool root path}/ais_bench/datasets` directory (default path set in dataset tasks). Taking deployment on Linux as an example, specific execution steps are as follows:
    ```bash
    # On Linux server, in tool root path
    cd ais_bench/datasets
    wget https://example.com/mydataset.zip
    unzip mydataset.zip
    rm mydataset.zip
    ```
    Execute `tree mydataset/` in the `{tool root path}/ais_bench/datasets` directory to view the directory structure. If the directory structure is as shown below, the dataset deployment is successful.
    ```
    mydataset
    ├── data
    │   ├── ...
    └── ...
    ```
If the dataset is integrated through dependency packages (such as Python packages), explain the installation steps and environment requirements:
Example format:
    ## Dataset Deployment
    MyDataset dataset is integrated through Python dependency packages. Data files are included in the `mydataset-eval` dependency package and can be used directly after installing dependencies.

    ### Environment Requirements
    - **mydataset-eval** dependency package (contains complete dataset)

    ### Installation Steps
    ```bash
    pip3 install mydataset-eval
    ```
(Optional) Usage Examples: If the dataset has special usage requirements or configuration methods, provide detailed usage examples, including:
Model configuration examples (if the dataset requires specific model types or configurations)
Command examples for executing evaluation
Result display examples
Available Dataset Tasks: List all available dataset task configurations in table format. The table should include the following columns:
Task Name: Identifier for the dataset configuration (used for the `--datasets` parameter)
Introduction: Brief description of the task
Evaluation Metrics: Evaluation metrics used (such as accuracy, score, etc.)
few-shot: Number of few-shot examples (such as 0-shot, 3-shot, 5-shot, etc.)
Prompt Format: Prompt format type (such as string format, dialogue format, etc.)
Corresponding Source Configuration File Path: Relative path of the configuration file
Example format:
    ## Available Dataset Tasks
    |Task Name|Introduction|Evaluation Metrics|few-shot|Prompt Format|Corresponding Source Configuration File Path|
    | --- | --- | --- | --- | --- | --- |
    |mydataset_gen_0_shot_str|MyDataset dataset generative task|accuracy|0-shot|String format|[mydataset_gen_0_shot_str.py](mydataset_gen_0_shot_str.py)|
    |mydataset_gen_5_shot_str|MyDataset dataset generative task|accuracy|5-shot|String format|[mydataset_gen_5_shot_str.py](mydataset_gen_5_shot_str.py)|

(Optional) Dataset Classification: If the dataset contains multiple subcategories or test scenarios, they can be classified and explained according to different dimensions, for example:
Individual test categories: List configurations for each subcategory
Test groups: Explain how to batch test multiple related categories
Precise test configuration: Explain how to specify specific test cases for precise testing
(Optional) Usage Recommendations: Provide recommendations and notes for dataset usage to help users better use the dataset.
For specific examples, refer to:
C-Eval README
BFCL README
BBH README