Guide to Using Custom Datasets

This tutorial is only for temporary and informal use of datasets.

In this tutorial, we will explain how to test a new dataset without adding a dataset configuration file or modifying the source code of ais_bench. The supported task types include multiple-choice questions (mcq) and question-answering (qa). Currently, both mcq and qa only support gen (generation-based) inference.

Dataset Formats

We support two dataset formats: .jsonl and .csv.

Multiple-Choice Questions (mcq)

For mcq-type data, the default fields are as follows (for other fields, refer to the Special Fields section):

  • question: The main stem of the multiple-choice question.

  • A, B, C, … : Options represented by single uppercase letters (unlimited in number). By default, the system only parses consecutive letters starting from A as valid options.

  • answer: The correct answer to the multiple-choice question, which must be one of the above options (e.g., A, B, C).

  • If an exact match cannot be found during accuracy calculation, the Levenshtein Distance Algorithm will be used to select the closest answer. This may cause misjudgment and result in an artificially high accuracy score.

Example of .jsonl Format

{"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B"}
{"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A"}
{"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B"}
{"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C"}

Example of .csv Format

question,A,B,C,answer
127+545+588+620+556+199=,2632,2635,2645,B
735+603+102+335+605=,2376,2380,2410,B
506+346+920+451+910+142+659+850=,4766,4774,4784,C
504+811+870+445=,2615,2630,2750,B

Question-Answering (qa)

For qa-type data, the default fields are as follows (for other fields, refer to the Special Fields section):

  • question: The main stem of the question.

  • answer: The correct answer to the question (optional; if missing, the dataset is considered to have no correct answers).

Example of .jsonl Format

{"question": "752+361+181+933+235+986=", "answer": "3448"}
{"question": "712+165+223+711=", "answer": "1811"}
{"question": "921+975+888+539=", "answer": "3323"}
{"question": "752+321+388+643+568+982+468+397=", "answer": "4519"}

Example of .csv Format

question,answer
123+147+874+850+915+163+291+604=,3967
149+646+241+898+822+386=,3142
332+424+582+962+735+798+653+214=,4700
649+215+412+495+220+738+989+452=,4170

Specify Models and Custom Datasets via Command Line

Custom datasets can be directly invoked for evaluation via the command line.

Parameter

Description

Example

--models

Same as when using standard datasets. Specifies the name of the model inference backend task (corresponding to a pre-implemented default model config file under ais_bench/benchmark/configs/models). Multiple task names can be passed. For supported tasks, refer to the README in the parent directory (using standard datasets as examples).

--models vllm_api_general

--custom-dataset-path

Specifies the path to the custom dataset (absolute/relative paths supported). Invalid if --datasets is configured. Not configured by default.

--custom-dataset-path xxx/test_mcq.csv

--custom-dataset-meta-path

Specifies the path to the dataset supplementary info file (.meta.json; absolute/relative paths supported).

--custom-dataset-meta-path xxx/test_mcq.csv.meta.json

--custom-dataset-data-type

Specifies the task type of the custom dataset. Currently supports mcq (multiple-choice) and qa (question-answering). If not configured, the system automatically identifies the type based on the dataset format; if configured, parses the dataset according to the specified type.

--custom-dataset-data-type mcq

--custom-dataset-infer-method

Specifies the inference type for the custom dataset. Currently only supports gen. Defaults to gen if not configured.

--custom-dataset-infer-method gen

Other parameters are the same as those for standard datasets. The system also supports accessing inference services via two APIs: vllm and mindie.

Example Command 1 (mcq Dataset with vllm API)

ais_bench \
    --models vllm_api_general \
    --custom-dataset-path xxx/test_mcq.csv \
    --custom-dataset-data-type mcq \
    --mode all

Example Command 2 (qa Dataset with mindie API)

ais_bench \
    --models mindie_stream_api_general \
    --custom-dataset-path xxx/test_qa.jsonl \
    --custom-dataset-data-type qa \
    --custom-dataset-infer-method gen

In most cases, --custom-dataset-data-type and --custom-dataset-infer-method can be omitted. The ais_bench system will set them automatically based on the following logic:

  • If options (e.g., A, B, C) can be parsed from the dataset, the dataset is identified as mcq; otherwise, it is identified as qa.

  • The default value of --custom-dataset-infer-method is gen.

Specify Models and Custom Datasets via Config Files

This method currently only supports accuracy evaluation scenarios. Other parameters are the same as those for standard datasets, and the system also supports accessing inference services via vllm and mindie APIs.

Example Command

# Use vllm API
ais_bench ais_bench/configs/api_examples/infer_api_vllm_general.py

# Use mindie API
ais_bench ais_bench/configs/api_examples/infer_api_mindie_stream_general.py

Config File Modification

In the original config file, simply add a new entry to the datasets variable. Similar to standard datasets, this method supports mixing custom datasets with standard datasets.

datasets = [
    ...,  # Standard datasets
    {"path": "xxx/test_qa.jsonl", "data_type": "qa", "infer_method": "gen"},
    ...,  # Standard datasets
]

Guide to Using Dataset Supplementary Info (.meta.json)

This feature currently only supports performance evaluation scenarios. The ais_bench system will automatically attempt to parse the input dataset file, so in most cases, a .meta.json file is not required. However, if the original dataset does not specify max_tokens, or if you need to configure data sampling, you must define these settings in a .meta.json file.

File Structure

The .meta.json file is placed in the same directory as the dataset, named in the format [dataset_filename].meta.json. Example structure:

.
β”œβ”€β”€ test_mcq.csv
β”œβ”€β”€ test_mcq.csv.meta.json
β”œβ”€β”€ test_qa.jsonl
└── test_qa.jsonl.meta.json

Supported Fields

  • request_count (str/int): The final number of test cases generated from the dataset. If the original dataset has fewer cases, it will be cyclically padded; if more, only the first request_count cases will be used. Defaults to the length of the original dataset if not set.

  • sampling_mode (str): Dataset sampling mode. Optional values: shuffle (shuffle and sample), random (random sample), default (no sampling).

  • output_config: Controls model output settings for each request.

    • method (str): Type of data distribution. Optional values: uniform (uniform distribution), percentage (percentage distribution).

    • params (str): Parameters for data distribution settings.

      • min_value (str/int): Minimum length of generated data (valid only if method: uniform).

      • max_value (str/int): Maximum length of generated data (valid only if method: uniform).

      • percentage_distribute (list): Percentage distribution of output lengths (valid only if method: percentage). Format: 2D array, where the first element is the output length and the second is the percentage.

Example Configurations

1. Percentage Distribution
{
    "output_config": {
        "method": "percentage",
        "params": {
            "percentage_distribute": [
                [100, 0.5],
                [200, 0.3],
                [400, 0.2]
            ]
        }
    },
    "request_count": "10",
    "sampling_mode": "shuffle"
}
2. Uniform Distribution
{
    "output_config": {
        "method": "uniform",
        "params": {
            "min_value": 100,
            "max_value": 200
        }
    }
}

Special Fields

Maximum Output Length: max_tokens

In both .csv and .jsonl datasets, you can set the maximum output length per request by adding a max_tokens field (with a corresponding numeric value) to each object (in .jsonl) or each row (in .csv). This field is not yet applicable to performance stress testing scenarios.

Example 1: .jsonl Format
  • For mcq type:

    {"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B", "max_tokens": 512}
    {"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A", "max_tokens": 1024}
    {"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B", "max_tokens": 2048}
    {"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C", "max_tokens": 256}
    
  • For qa type:

    {"question": "165+833+650+615=", "answer": "2263", "max_tokens": 512}
    {"question": "368+959+918+653+978=", "answer": "3876", "max_tokens": 1024}
    {"question": "776+208+589+882+571+996+515+726=", "answer": "5263", "max_tokens": 2048}
    {"question": "803+862+815+100+409+758+262+169=", "answer": "4178", "max_tokens": 256}
    
Example 2: .csv Format
  • For mcq type:

    question,A,B,C,answer,max_tokens
    127+545+588+620+556+199=,2632,2635,2645,B,512
    735+603+102+335+605=,2376,2380,2410,B,1024
    506+346+920+451+910+142+659+850=,4766,4774,4784,C,2048
    504+811+870+445=,2615,2630,2750,B,256
    
  • For qa type:

    question,answer,max_tokens
    127+545+588+620+556+199=,2635,512
    735+603+102+335+605=,2380,1024
    506+346+920+451+910+142+659+850=,4784,2048
    504+811+870+445=,2630,256
    

Example of Dataset and Corresponding Performance Evaluation Results

  • Dataset Example: custom_dataset_example.img

  • Performance Evaluation Result Example: custom_results_example.img

2 Custom Multi-modal dataset

Dataset format

We support datasets in the .jsonl format. Currently, only custom data for multimodal understanding scenarios is supported, including input formats such as image + text , video + text, and audio + text, with each row in the dataset corresponding to one piece of data. The file formats of images, videos and audio are subject to the specific support of the service. Common formats are jpg for images, mp4 for videos and wav for audio.

The default fields of the custom multimodal dataset are as follows:

  • type: The types of data include "image","video" and "audio".

  • path: The path of multimodal data supports the input of multiple values of the same type.

  • question: Represent the text data corresponding to multimodal data.

  • answer: It indicates the corresponding answer. However, currently, it only supports performance evaluation and has not been used in practice for the time being.

.jsonl The format sample is as follows:

{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg"], "question": "what is the brand of this camera?", "answer": "dakota"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "what does the small white text spell?", "answer": "copenhagen"}

Scene 1: Image text input

{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg"], "question": "what is the brand of this camera?", "answer": "dakota"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "what does the small white text spell?", "answer": "copenhagen"}

Scene 2: Mixed input of pictures, videos and audio

When testing full-modal models such as Qwen-Omni, the input can be image + text, video + text, or audio + text

{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg"], "question": "what is the brand of this camera?", "answer": "dakota"}
{"type": "video", "path": ["/data/mm_custom/93.mp4"], "question": "describe this video", "answer": "xxx"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "what does the small white text spell?", "answer": "copenhagen"}
{"type": "video", "path": ["/data/mm_custom/83.mp4"], "question": "describe this video", "answer": "xxx"}
{"type": "audio", "path": ["/data/mm_custom/f1874_0_cough.wav"], "question": "describe this audio", "answer": "xxx"}
{"type": "audio", "path": ["/data/mm_custom/m1855_0_sniff.wav"], "question": "describe this audio", "answer": "xxx"}

Scene 3: Multi-image input

The input is multiple images + text, and the scenarios of multiple video input and multiple audio input are similar to this.

{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg", "/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "compare these images", "answer": "dakota"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg", "/data/mm_custom/59027d7563eba210.jpg", "/data/mm_custom/cd34jojof2334jo34.jpg"], "question": "describe these images", "answer": "copenhagen"}

Specify the model and custom dataset through the command line

Custom datasets can be directly invoked through the command line to start the evaluation.

Parameter

description

sample

–models

Same as when using standard datasets. Specifies the name of the model inference backend task (corresponding to a pre-implemented default model config file under ais_bench/benchmark/configs/models). Multiple task names can be passed. For supported tasks, refer to the README in the parent directory (using standard datasets as examples).

–models vllm_api_general

–datasets

Specified as mm_custom_gen, corresponding to mm_custom_gen.py,according to the need to modify the data set in the configuration file path prompt and the input data

–datasets mm_custom_gen

The remaining parameters are consistent with the ordinary dataset, and it also supports accessing the corresponding inference service through the vllm and mindie apis.

ais_bench \
    --models vllm_api_general \
    --datasets mm_custom_gen \
    --mode perf
ais_bench \
    --models mindie_stream_api_general \
    --datasets mm_custom_gen \
    --mode perf