Guide to Using Custom Datasetsο
This tutorial is only for temporary and informal use of datasets.
In this tutorial, we will explain how to test a new dataset without adding a dataset configuration file or modifying the source code of ais_bench. The supported task types include multiple-choice questions (mcq) and question-answering (qa). Currently, both mcq and qa only support gen (generation-based) inference.
Dataset Formatsο
We support two dataset formats: .jsonl and .csv.
Multiple-Choice Questions (mcq)ο
For mcq-type data, the default fields are as follows (for other fields, refer to the Special Fields section):
question: The main stem of the multiple-choice question.A,B,C, β¦ : Options represented by single uppercase letters (unlimited in number). By default, the system only parses consecutive letters starting fromAas valid options.answer: The correct answer to the multiple-choice question, which must be one of the above options (e.g.,A,B,C).If an exact match cannot be found during accuracy calculation, the Levenshtein Distance Algorithm will be used to select the closest answer. This may cause misjudgment and result in an artificially high accuracy score.
Example of .jsonl Formatο
{"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B"}
{"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A"}
{"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B"}
{"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C"}
Example of .csv Formatο
question,A,B,C,answer
127+545+588+620+556+199=,2632,2635,2645,B
735+603+102+335+605=,2376,2380,2410,B
506+346+920+451+910+142+659+850=,4766,4774,4784,C
504+811+870+445=,2615,2630,2750,B
Question-Answering (qa)ο
For qa-type data, the default fields are as follows (for other fields, refer to the Special Fields section):
question: The main stem of the question.answer: The correct answer to the question (optional; if missing, the dataset is considered to have no correct answers).
Example of .jsonl Formatο
{"question": "752+361+181+933+235+986=", "answer": "3448"}
{"question": "712+165+223+711=", "answer": "1811"}
{"question": "921+975+888+539=", "answer": "3323"}
{"question": "752+321+388+643+568+982+468+397=", "answer": "4519"}
Example of .csv Formatο
question,answer
123+147+874+850+915+163+291+604=,3967
149+646+241+898+822+386=,3142
332+424+582+962+735+798+653+214=,4700
649+215+412+495+220+738+989+452=,4170
Specify Models and Custom Datasets via Command Lineο
Custom datasets can be directly invoked for evaluation via the command line.
Parameter |
Description |
Example |
|---|---|---|
|
Same as when using standard datasets. Specifies the name of the model inference backend task (corresponding to a pre-implemented default model config file under |
|
|
Specifies the path to the custom dataset (absolute/relative paths supported). Invalid if |
|
|
Specifies the path to the dataset supplementary info file ( |
|
|
Specifies the task type of the custom dataset. Currently supports |
|
|
Specifies the inference type for the custom dataset. Currently only supports |
|
Other parameters are the same as those for standard datasets. The system also supports accessing inference services via two APIs: vllm and mindie.
Example Command 1 (mcq Dataset with vllm API)ο
ais_bench \
--models vllm_api_general \
--custom-dataset-path xxx/test_mcq.csv \
--custom-dataset-data-type mcq \
--mode all
Example Command 2 (qa Dataset with mindie API)ο
ais_bench \
--models mindie_stream_api_general \
--custom-dataset-path xxx/test_qa.jsonl \
--custom-dataset-data-type qa \
--custom-dataset-infer-method gen
In most cases, --custom-dataset-data-type and --custom-dataset-infer-method can be omitted. The ais_bench system will set them automatically based on the following logic:
If options (e.g.,
A,B,C) can be parsed from the dataset, the dataset is identified asmcq; otherwise, it is identified asqa.The default value of
--custom-dataset-infer-methodisgen.
Specify Models and Custom Datasets via Config Filesο
This method currently only supports accuracy evaluation scenarios. Other parameters are the same as those for standard datasets, and the system also supports accessing inference services via vllm and mindie APIs.
Example Commandο
# Use vllm API
ais_bench ais_bench/configs/api_examples/infer_api_vllm_general.py
# Use mindie API
ais_bench ais_bench/configs/api_examples/infer_api_mindie_stream_general.py
Config File Modificationο
In the original config file, simply add a new entry to the datasets variable. Similar to standard datasets, this method supports mixing custom datasets with standard datasets.
datasets = [
..., # Standard datasets
{"path": "xxx/test_qa.jsonl", "data_type": "qa", "infer_method": "gen"},
..., # Standard datasets
]
Guide to Using Dataset Supplementary Info (.meta.json)ο
This feature currently only supports performance evaluation scenarios. The ais_bench system will automatically attempt to parse the input dataset file, so in most cases, a .meta.json file is not required. However, if the original dataset does not specify max_tokens, or if you need to configure data sampling, you must define these settings in a .meta.json file.
File Structureο
The .meta.json file is placed in the same directory as the dataset, named in the format [dataset_filename].meta.json. Example structure:
.
βββ test_mcq.csv
βββ test_mcq.csv.meta.json
βββ test_qa.jsonl
βββ test_qa.jsonl.meta.json
Supported Fieldsο
request_count(str/int): The final number of test cases generated from the dataset. If the original dataset has fewer cases, it will be cyclically padded; if more, only the firstrequest_countcases will be used. Defaults to the length of the original dataset if not set.sampling_mode(str): Dataset sampling mode. Optional values:shuffle(shuffle and sample),random(random sample),default(no sampling).output_config: Controls model output settings for each request.method(str): Type of data distribution. Optional values:uniform(uniform distribution),percentage(percentage distribution).params(str): Parameters for data distribution settings.min_value(str/int): Minimum length of generated data (valid only ifmethod: uniform).max_value(str/int): Maximum length of generated data (valid only ifmethod: uniform).percentage_distribute(list): Percentage distribution of output lengths (valid only ifmethod: percentage). Format: 2D array, where the first element is the output length and the second is the percentage.
Example Configurationsο
1. Percentage Distributionο
{
"output_config": {
"method": "percentage",
"params": {
"percentage_distribute": [
[100, 0.5],
[200, 0.3],
[400, 0.2]
]
}
},
"request_count": "10",
"sampling_mode": "shuffle"
}
2. Uniform Distributionο
{
"output_config": {
"method": "uniform",
"params": {
"min_value": 100,
"max_value": 200
}
}
}
Special Fieldsο
Maximum Output Length: max_tokensο
In both .csv and .jsonl datasets, you can set the maximum output length per request by adding a max_tokens field (with a corresponding numeric value) to each object (in .jsonl) or each row (in .csv). This field is not yet applicable to performance stress testing scenarios.
Example 1: .jsonl Formatο
For
mcqtype:{"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B", "max_tokens": 512} {"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A", "max_tokens": 1024} {"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B", "max_tokens": 2048} {"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C", "max_tokens": 256}
For
qatype:{"question": "165+833+650+615=", "answer": "2263", "max_tokens": 512} {"question": "368+959+918+653+978=", "answer": "3876", "max_tokens": 1024} {"question": "776+208+589+882+571+996+515+726=", "answer": "5263", "max_tokens": 2048} {"question": "803+862+815+100+409+758+262+169=", "answer": "4178", "max_tokens": 256}
Example 2: .csv Formatο
For
mcqtype:question,A,B,C,answer,max_tokens 127+545+588+620+556+199=,2632,2635,2645,B,512 735+603+102+335+605=,2376,2380,2410,B,1024 506+346+920+451+910+142+659+850=,4766,4774,4784,C,2048 504+811+870+445=,2615,2630,2750,B,256
For
qatype:question,answer,max_tokens 127+545+588+620+556+199=,2635,512 735+603+102+335+605=,2380,1024 506+346+920+451+910+142+659+850=,4784,2048 504+811+870+445=,2630,256
Example of Dataset and Corresponding Performance Evaluation Resultsο
Dataset Example:

Performance Evaluation Result Example:

2 Custom Multi-modal datasetο
Dataset formatο
We support datasets in the .jsonl format. Currently, only custom data for multimodal understanding scenarios is supported, including input formats such as image + text , video + text, and audio + text, with each row in the dataset corresponding to one piece of data. The file formats of images, videos and audio are subject to the specific support of the service. Common formats are jpg for images, mp4 for videos and wav for audio.
The default fields of the custom multimodal dataset are as follows:
type: The types of data include"image","video"and"audio".path: The path of multimodal data supports the input of multiple values of the same type.question: Represent the text data corresponding to multimodal data.answer: It indicates the corresponding answer. However, currently, it only supports performance evaluation and has not been used in practice for the time being.
.jsonl The format sample is as follows:
{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg"], "question": "what is the brand of this camera?", "answer": "dakota"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "what does the small white text spell?", "answer": "copenhagen"}
Scene 1: Image text inputο
{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg"], "question": "what is the brand of this camera?", "answer": "dakota"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "what does the small white text spell?", "answer": "copenhagen"}
Scene 2: Mixed input of pictures, videos and audioο
When testing full-modal models such as Qwen-Omni, the input can be image + text, video + text, or audio + text
{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg"], "question": "what is the brand of this camera?", "answer": "dakota"}
{"type": "video", "path": ["/data/mm_custom/93.mp4"], "question": "describe this video", "answer": "xxx"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "what does the small white text spell?", "answer": "copenhagen"}
{"type": "video", "path": ["/data/mm_custom/83.mp4"], "question": "describe this video", "answer": "xxx"}
{"type": "audio", "path": ["/data/mm_custom/f1874_0_cough.wav"], "question": "describe this audio", "answer": "xxx"}
{"type": "audio", "path": ["/data/mm_custom/m1855_0_sniff.wav"], "question": "describe this audio", "answer": "xxx"}
Scene 3: Multi-image inputο
The input is multiple images + text, and the scenarios of multiple video input and multiple audio input are similar to this.
{"type": "image", "path": ["/data/mm_custom/59027d7563eba210.jpg", "/data/mm_custom/8abf34fc9c4016a6.jpg"], "question": "compare these images", "answer": "dakota"}
{"type": "image", "path": ["/data/mm_custom/8abf34fc9c4016a6.jpg", "/data/mm_custom/59027d7563eba210.jpg", "/data/mm_custom/cd34jojof2334jo34.jpg"], "question": "describe these images", "answer": "copenhagen"}
Specify the model and custom dataset through the command lineο
Custom datasets can be directly invoked through the command line to start the evaluation.
Parameter |
description |
sample |
|---|---|---|
βmodels |
Same as when using standard datasets. Specifies the name of the model inference backend task (corresponding to a pre-implemented default model config file under |
βmodels vllm_api_general |
βdatasets |
Specified as mm_custom_gen, corresponding to mm_custom_gen.pyοΌaccording to the need to modify the data set in the configuration file path prompt and the input data |
βdatasets mm_custom_gen |
The remaining parameters are consistent with the ordinary dataset, and it also supports accessing the corresponding inference service through the vllm and mindie apis.
ais_bench \
--models vllm_api_general \
--datasets mm_custom_gen \
--mode perf
ais_bench \
--models mindie_stream_api_general \
--datasets mm_custom_gen \
--mode perf