Supporting New Datasets and Accuracy Evaluators

Currently, AISBench supports the following data types: open-source datasets, custom datasets, and synthetic datasets. Before adapting a new dataset, it is recommended to first refer to the usage instructions for custom datasets and synthetic datasets to confirm whether they can meet actual needs.

For datasets that cannot meet requirements (for example, dataset loading methods or accuracy calculation rules differ significantly from other datasets), adaptation is needed. Before starting, it is recommended to first refer to the definition methods of prompt_template and meta_template to understand how AISBench constructs prompts, how to convert raw data into actual model inputs, and the functions of components involved.

Specific implementation reference is as follows:

Add a new dataset script mydataset.py in the ais_bench/benchmark/datasets folder. This script needs to include:
- Dataset and its loading method: Need to define a MyDataset class and implement the dataset loading method load. This method is a static method and needs to return data of type datasets.Dataset. Here we use HuggingFace Dataset as a unified interface for datasets to avoid introducing additional logic. Reference format is as follows:
```
import datasets
from .base import BaseDataset

class MyDataset(BaseDataset):

    @staticmethod
    def load(**kwargs) -> datasets.Dataset:
        ... # Implement dataset loading logic
        data_list = ... # Dataset list
        return datasets.Dataset.from_list(data_list)  # Convert dataset list to HuggingFace Dataset object
```
It is recommended to add the new dataset class to __init__.py for convenient automatic import later.

For specific examples, refer to Aime2024Dataset

For multimodal data, it is necessary to use formatted concatenation in the load function to concatenate text, image, video, and audio data into one data item. In subsequent parsing, data will be restored and concatenated into model input according to markers for each data type.

Concatenation format example:
```
<AIS_TEXT_START>{text}<AIS_CONTENT_TAG><AIS_IMAGE_START>{image}<AIS_CONTENT_TAG><AIS_VIDEO_START>{video}<AIS_CONTENT_TAG><AIS_AUDIO_START>{audio}<AIS_CONTENT_TAG>
```
Where {text}, {image}, {video}, {audio} are the text, image, video, and audio content in the dataset.

For specific examples, refer to MMCustomDataset
- (Optional) Custom accuracy evaluator: If the existing accuracy evaluators in AISBench cannot meet needs, users need to define a MyDatasetEvaluator class and implement the scoring method score. This method needs to return a dictionary containing metrics and their corresponding scores based on the input predictions and references lists. Since a dataset may have multiple metrics, the returned dictionary should include all relevant evaluation metrics. Specific example is as follows:
```
from typing import List
from ais_bench.benchmark.openicl.icl_evaluator import BaseEvaluator

class MyDatasetEvaluator(BaseEvaluator):

    def score(self, predictions: List, references: List) -> dict:
        # Implement evaluation logic
        # Return format: {"metric_name": score_value, ...}
        pass
```
For specific implementation, refer to MATHEvaluator
- (Optional) Custom post-processing method: If the existing post-processing methods in AISBench cannot meet needs, users need to define a mydataset_postprocess method to get corresponding post-processed results based on the input string. This method is usually used for scenarios such as cleaning model output and extracting answers. Specific example is as follows:
```
def mydataset_postprocess(text: str) -> str:
    # Implement post-processing logic, such as extracting answers, cleaning format, etc.
    # Return processed string
    pass
```

After defining dataset loading, evaluation, and data post-processing methods, add the following configuration my_dataset.py in the configuration directory …/ais_bench/benchmark/configs/datasets:

from ais_bench.benchmark.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess

# Accuracy evaluation configuration
mydataset_eval_cfg = dict(
    evaluator=dict(type=MyDatasetEvaluator),  # Custom accuracy evaluator class name
    pred_postprocessor=dict(type=mydataset_postprocess)  # Custom data post-processing method
)

# Dataset reading configuration: Configure according to fields of each sample in the dataset, used to fill prompt_template
mydataset_reader_cfg = dict(
    input_columns=["question"],  # Input field list
    output_column="answer"       # Output field (ground truth)
)

# Inference configuration
mydataset_infer_cfg = dict(
    prompt_template=dict(
        # Prompt template class name, configure according to data type:
        # - PromptTemplate: Pure text input
        # - MultiTurnPromptTemplate: Multi-turn dialogue input
        # - MMPromptTemplate: Multimodal input
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role="HUMAN",
                    prompt="{question}\nRemember to put your final answer within \\boxed{}.",
                ),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),      # Retriever configuration
    inferencer=dict(type=GenInferencer),     # Inferencer configuration
)

# Dataset configuration list
mydataset_datasets = [
    dict(
        type=MyDataset,                    # Custom dataset class name
        abbr='mydataset',                  # Unique dataset identifier
        # ... Other dataset initialization parameters ...
        reader_cfg=mydataset_reader_cfg,   # Dataset reading configuration
        infer_cfg=mydataset_infer_cfg,     # Inference configuration
        eval_cfg=mydataset_eval_cfg        # Accuracy evaluation configuration
    )
]

Then execute the command to start local evaluation task:

ais_bench --models vllm_api_stream_chat --datasets my_dataset

Add README documentation

Create a README.md file in the configuration directory ais_bench/benchmark/configs/datasets/my_dataset/ to explain the dataset deployment and usage methods. The README should include the following content:
- Dataset Introduction: Briefly introduce the basic information, characteristics, and uses of the dataset, and attach a link to the dataset homepage (if it exists). Example format:
```
# MyDataset
中文 | [English](README_en.md)
## Dataset Introduction
MyDataset is a benchmark dataset for evaluating model performance on XXX tasks. The dataset contains XXX samples covering XXX different categories.

> 🔗 Dataset homepage link [https://example.com/mydataset](https://example.com/mydataset)
```
- Dataset Deployment: Detailed steps for downloading and deploying the dataset, including:
  - Download link or method for obtaining the dataset
  - Deployment path and directory structure requirements
  - Deployment steps (recommended to provide executable command examples)
  - Directory structure verification method (recommended to use tree command to show expected directory structure)
  Example format:
```
## Dataset Deployment
- The dataset package can be downloaded from the link provided by XXX 🔗 [https://example.com/mydataset.zip](https://example.com/mydataset.zip).
- It is recommended to deploy in the `{tool root path}/ais_bench/datasets` directory (default path set in dataset tasks). Taking deployment on Linux as an example, specific execution steps are as follows:
```bash
  # On Linux server, in tool root path
  cd ais_bench/datasets
  wget https://example.com/mydataset.zip
  unzip mydataset.zip
  rm mydataset.zip
```
- Execute tree mydataset/ in the {tool root path}/ais_bench/datasets directory to view the directory structure. If the directory structure is as shown below, the dataset deployment is successful.
  mydataset ├── data │ └── ... └── ...
If the dataset is integrated through dependency packages (such as Python packages), explain the installation steps and environment requirements:

Example format:
```
  ## Dataset Deployment
  MyDataset dataset is integrated through Python dependency packages. Data files are included in the `mydataset-eval` dependency package and can be used directly after installing dependencies.
  
  ### Environment Requirements
  - **mydataset-eval** dependency package (contains complete dataset)
  
  ### Installation Steps
  \`\`\`bash
  pip3 install mydataset-eval
  \`\`\`
```
- (Optional) Usage Examples: If the dataset has special usage requirements or configuration methods, provide detailed usage examples, including:
  - Model configuration examples (if the dataset requires specific model types or configurations)
  - Command examples for executing evaluation
  - Result display examples
- Available Dataset Tasks: List all available dataset task configurations in table format. The table should include the following columns:
  - Task Name: Identifier for dataset configuration (used for --datasets parameter)
  - Introduction: Brief description of the task
  - Evaluation Metrics: Evaluation metrics used (such as accuracy, score, etc.)
  - few-shot: Number of few-shot examples (such as 0-shot, 3-shot, 5-shot, etc.)
  - Prompt Format: Prompt format type (such as string format, dialogue format, etc.)
  - Corresponding Source Configuration File Path: Relative path of the configuration file
  Example format:
```
## Available Dataset Tasks
|Task Name|Introduction|Evaluation Metrics|few-shot|Prompt Format|Corresponding Source Configuration File Path|
| --- | --- | --- | --- | --- | --- |
|mydataset_gen_0_shot_str|MyDataset dataset generative task|accuracy|0-shot|String format|[mydataset_gen_0_shot_str.py](mydataset_gen_0_shot_str.py)|
|mydataset_gen_5_shot_str|MyDataset dataset generative task|accuracy|5-shot|String format|[mydataset_gen_5_shot_str.py](mydataset_gen_5_shot_str.py)|
```
- (Optional) Dataset Classification: If the dataset contains multiple subcategories or test scenarios, they can be classified and explained according to different dimensions, for example:
  - Individual test categories: List configurations for each subcategory
  - Test groups: Explain how to batch test multiple related categories
  - Precise test configuration: Explain how to specify specific test cases for precise testing
- (Optional) Usage Recommendations: Provide recommendations and notes for dataset usage to help users better use the dataset.
For specific examples, refer to:
- C-Eval README
- BFCL README
- BBH README