# GEdit-Bench

## Introduction to GEdit-Bench

[**GEdit-Bench (Genuine Edit-Bench)**](https://github.com/stepfun-ai/Step1X-Edit/blob/main/GEdit-Bench/) is a benchmark for **real-world instruction-based image editing** released by StepFun in April 2025. It tests the practical capabilities of models on editing requests written by real users.

### Core Positioning and Background

- **Full Name**: Genuine Edit-Bench
- **Developer**: StepFun AI, released together with its image editing model **Step1X-Edit**
- **Core Objective**: To address the limitations of existing benchmarks, which rely on synthetic instructions and are detached from real-world scenarios, by providing **evaluation standards closer to actual user usage**

### Core Dataset Information

- **Data Source**: **Over 1000 real user editing requests** collected from communities such as Reddit, then deduplicated, stripped of private information, and manually annotated
- **Final Scale**: **606 test samples** in each language version (English GEdit-Bench-EN and Chinese GEdit-Bench-CN), 1212 samples in the entire dataset
- **Task Coverage**: 11 categories of high-frequency real editing scenarios
  1. Background replacement/modification (background_change)
  2. Color adjustment (color_alter)
  3. Material/texture transformation (material_alter)
  4. Action/pose editing (motion_change)
  5. Portrait beautification/retouching (ps_human)
  6. Style transfer (style_change)
  7. Subject addition (subject-add)
  8. Text editing (text_change)
  9. Subject removal (subject-remove)
  10. Subject replacement (subject-replace)
  11. Tone transfer (tone_transfer)

### Evaluation Metrics (MLLM Automatic Scoring, Full Score 10 Points)

- **G_SC, Q_SC (Semantic Consistency)**: How well the editing result matches the instruction
- **G_PQ, Q_PQ (Image Quality)**: Clarity, detail preservation, and absence of artifacts
- **G_O, Q_O (Overall Score)**: Combination of the SC and PQ scores from the same judge

> Note: `G_` indicates that the GPT-4o API is used as the judge model for scoring; `Q_` indicates that Qwen-2.5-VL-72B-Instruct is used as the judge model for scoring.

## AISBench GEdit-Bench Evaluation Practice

### Evaluating Qwen-Image-Edit Model Based on MindIE Framework

For the inference implementation of the Qwen-Image-Edit model, refer to https://modelers.cn/models/MindIE/Qwen-Image-Edit-2509.

#### Hardware Requirements

Ascend Server:

- 800I A2 (64 GB of memory per chip)
- 800I A3

#### Environment Preparation (Taking 800I A2 Hardware as Example)

Complete the evaluation using the image provided by MindIE.

1. **Pull MindIE Image**

   ```
   docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.3.0-800I-A2-py311-openeuler24.03-lts
   ```

2. **Run Container**

   ```
   docker run --name ${NAME} -it -d --net=host --shm-size=500g \
       --privileged=true \
       -w /home \
       --device=/dev/davinci_manager \
       --device=/dev/hisi_hdc \
       --device=/dev/devmm_svm \
       --entrypoint=bash \
       -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
       -v /usr/local/dcmi:/usr/local/dcmi \
       -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
       -v /etc/ascend_install.info:/etc/ascend_install.info \
       -v /usr/local/sbin:/usr/local/sbin \
       -v ${PATH_TO_WORKSPACE}:${PATH_TO_WORKSPACE} \
       -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
       ${IMAGES_ID}
   ```

   > Where:
   >
   > - `${NAME}`: Container name
   > - `${PATH_TO_WORKSPACE}`: Local workspace directory path
   > - `${IMAGES_ID}`: MindIE image ID

3. **Install Latest Version of AISBench**

   Clone the latest AISBench code into the `${PATH_TO_WORKSPACE}` directory that is mounted into the container:

   ```bash
   git clone https://github.com/AISBench/benchmark.git
   ```

   Enter the container:

   ```bash
   docker exec -it ${NAME} bash
   ```

   In the container, follow AISBench's [Installation Instructions](../../get_started/install.md) to install the latest AISBench tool.

4. **Install Additional Dependencies for Qwen-Image-Edit**

   ```shell
   pip install diffusers==0.35.1
   pip install transformers==4.52.4
   pip install yunchang==0.6.0
   ```

5. **Prepare Model Weights and Dataset**

   Refer to [Qwen-Image-Edit-2509](https://huggingface.co/Qwen/Qwen-Image-Edit-2509) to obtain the model weights.

   Refer to [GEdit-Bench Dataset](https://huggingface.co/datasets/stepfun-ai/GEdit-Bench) to obtain the dataset. Place the dataset in the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/datasets` directory (a symbolic link is also acceptable).

#### Evaluation Configuration Preparation

In the container, navigate to the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/configs/lmm_example` directory, open the `multi_device_run_qwen_image_edit.py` file, and edit the following content to set the model configuration:

```python
# ......
# ====== User configuration parameters =========
qwen_image_edit_models[0]["path"] = "/path/to/Qwen-Image-Edit-2509/"  # Modify to the actual model weight path
qwen_image_edit_models[0]["infer_kwargs"]["num_inference_steps"] = 50  # Modify to the required number of inference steps
device_list = [0]  # e.g. [0, 1, 2, 3]; the NPU device IDs that are actually available, not necessarily in order. Each device loads its own copy of the weights
# ====== User configuration parameters =========
# ......
```

Note: This configuration file supports splitting the GEdit-Bench dataset evenly into multiple parts and distributing them to multiple model instances, one per device, to speed up inference.
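The even split mentioned in the note can be pictured with a minimal sketch. This is not AISBench's actual implementation; `split_evenly` is a hypothetical helper, and the integer list stands in for the 1212 dataset samples. It only illustrates how a dataset can be divided into one shard per entry of `device_list` so that shard sizes differ by at most one.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(samples: Sequence[T], num_parts: int) -> List[List[T]]:
    """Split samples into num_parts contiguous shards whose sizes differ by at most one.

    Hypothetical helper for illustration only; not part of AISBench.
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(len(samples), num_parts)
    shards: List[List[T]] = []
    start = 0
    for i in range(num_parts):
        # The first `extra` shards take one additional sample each.
        size = base + (1 if i < extra else 0)
        shards.append(list(samples[start:start + size]))
        start += size
    return shards


device_list = [0, 1, 2, 3]       # same meaning as in the configuration file
samples = list(range(1212))      # stand-in for the 1212 GEdit-Bench samples
shards = split_evenly(samples, len(device_list))
shard_sizes = [len(shard) for shard in shards]
print(shard_sizes)               # [303, 303, 303, 303]
```

With four devices each model instance would handle 303 samples under this scheme, which matches the `gedit-0` to `gedit-3` naming seen in the per-device results.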
Execute the following command to find the path of the `gedit_gen_0_shot_llmjudge.py` dataset configuration:

```bash
ais_bench --datasets gedit_gen_0_shot_llmjudge --search
```

Edit the judge model configuration in the `gedit_gen_0_shot_llmjudge.py` file. The judge model is configured in the same way as a regular API model (see the Quick Start [Model Configuration Introduction](../../get_started/quick_start.md#task-corresponding-configuration-file-modification)), except that the settings go in the `judge_model` field:

```python
# ......
judge_model=dict(
    attr="service",
    type=VLLMCustomAPIChat,
    abbr=f"{metric}_judge",  # Appended to the dataset abbr
    path="",
    model="",
    stream=True,
    request_rate=0,
    use_timestamp=False,
    retry=2,
    api_key="",
    host_ip="localhost",
    host_port=8080,
    url="",
    max_out_len=512,
    batch_size=16,
    trust_remote_code=False,
    generation_kwargs=dict(
        temperature=0.01,
        ignore_eos=False,
    ),
    pred_postprocessor=dict(type=extract_non_reasoning_content),
),
# ......
```

#### Start Evaluation

In the container, navigate to the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/configs/lmm_example` directory and execute the following command to start the evaluation:

```bash
ais_bench multi_device_run_qwen_image_edit.py --max-num-workers {MAX_NUM_WORKERS}
```

Here `{MAX_NUM_WORKERS}` is the maximum number of concurrent workers. It is recommended to set it to twice the number of devices used. For example, if `device_list = [0, 1, 2, 3]`, use `--max-num-workers 8`.
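The worker-count recommendation is simple arithmetic, sketched below. `build_eval_command` is a hypothetical helper, not an AISBench API; it only assembles the command line shown above from a device list, using the documented `--max-num-workers` flag.

```python
import shlex
from typing import List, Sequence


def build_eval_command(config_file: str,
                       device_list: Sequence[int],
                       workers_per_device: int = 2) -> List[str]:
    """Assemble the ais_bench command line for a given device list.

    Hypothetical helper; workers_per_device=2 follows the
    "twice the number of devices" recommendation.
    """
    max_num_workers = workers_per_device * len(device_list)
    return ["ais_bench", config_file, "--max-num-workers", str(max_num_workers)]


cmd = build_eval_command("multi_device_run_qwen_image_edit.py", [0, 1, 2, 3])
command_line = shlex.join(cmd)
print(command_line)
# ais_bench multi_device_run_qwen_image_edit.py --max-num-workers 8
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` from the `lmm_example` directory, or the printed line can be pasted into the shell.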
After the evaluation command completes (taking 4 devices as an example), logs similar to the following will be printed:

```shell
The markdown format results is as below:

| dataset | version | metric | mode | qwen-image-edit-0 | qwen-image-edit-1 | qwen-image-edit-2 | qwen-image-edit-3 |
|----- | ----- | ----- | ----- | ----- | ----- | ----- | -----|
| gedit-0-SC_judge | 16dd59 | SC | gen | 7.20 | - | - | - |
| gedit-0-PQ_judge | 16dd59 | PQ | gen | 7.08 | - | - | - |
| gedit-1-SC_judge | 16dd59 | SC | gen | - | 6.63 | - | - |
| gedit-1-PQ_judge | 16dd59 | PQ | gen | - | 6.73 | - | - |
| gedit-2-SC_judge | 16dd59 | SC | gen | - | - | 7.37 | - |
| gedit-2-PQ_judge | 16dd59 | PQ | gen | - | - | 7.22 | - |
| gedit-3-SC_judge | 16dd59 | SC | gen | - | - | - | 7.31 |
| gedit-3-PQ_judge | 16dd59 | PQ | gen | - | - | - | 7.24 |

[2026-03-04 15:40:45,583] [ais_bench] [INFO] write markdown summary to /workplace/benchmark/ais_bench/configs/lmm_example/outputs/default/20260213_150110/summary/summary_20260304_152835.md
```

This log shows only the intermediate results of the multi-device evaluation: each model instance reports SC and PQ scores for its own part of the dataset. To obtain the final metrics, run the following command-line tool from the `/workplace/benchmark/ais_bench/configs/lmm_example` path:

```bash
# python3 -m ais_bench.tools.dataset_processors.gedit.display_results --config_path {CONFIG_PATH} --timestamp_path {TIMESTAMP_PATH}
python3 -m ais_bench.tools.dataset_processors.gedit.display_results --config_path ./multi_device_run_qwen_image_edit.py --timestamp_path outputs/default/20260213_150110/
```

Here `{CONFIG_PATH}` is the path of the configuration file used to start the ais_bench command (i.e., the `multi_device_run_qwen_image_edit.py` file), and `{TIMESTAMP_PATH}` is the timestamped directory to which the ais_bench command wrote its results (i.e., `outputs/default/20260213_150110/`).
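If the timestamped directory is not at hand, it can be looked up rather than typed. The sketch below assumes that run directories sit directly under `outputs/default` and are named `YYYYMMDD_HHMMSS`, as in the example path; `latest_timestamp_path` is a hypothetical helper and not part of AISBench. The most recent directory is not necessarily the run you want to process, so check the result against the path printed in the evaluation log.

```python
import re
from pathlib import Path
from typing import Optional

# Directory names such as 20260213_150110 (assumed naming scheme).
TIMESTAMP_DIR = re.compile(r"^\d{8}_\d{6}$")


def latest_timestamp_path(work_dir: str) -> Optional[Path]:
    """Return the newest timestamp-named run directory under work_dir, or None.

    Hypothetical helper; relies on the names sorting chronologically.
    """
    root = Path(work_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir()
            if p.is_dir() and TIMESTAMP_DIR.match(p.name)]
    # YYYYMMDD_HHMMSS sorts lexicographically in chronological order.
    runs.sort(key=lambda p: p.name)
    return runs[-1] if runs else None


# Example: candidate value for --timestamp_path
# print(latest_timestamp_path("outputs/default"))
```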
After the `display_results` command executes, logs similar to the following will be printed, showing the final GEdit-Bench evaluation metrics:

```shell
[2026-03-04 15:57:52,522] [__main__] [INFO] Finish dumping csv to: outputs/default/20260213_150110/results/gedit_gathered_result.csv
language      SC_point    PQ_point    O_point
----------  ----------  ----------  ---------
zh              7.1230      7.0694     6.9896
en              7.1280      7.0623     6.9983
all case        7.1254      7.0660     6.9937
```

The `outputs/default/20260213_150110/results/gedit_gathered_result.csv` file stores the individual scores for each case.

#### (Optional Extension) Using AISBench Inference Results in GEdit-Bench Tool

Execute the following command:

```bash
# python3 -m ais_bench.tools.dataset_processors.gedit.convert_results --config_path {CONFIG_PATH} --timestamp_path {TIMESTAMP_PATH}
python3 -m ais_bench.tools.dataset_processors.gedit.convert_results --config_path ./multi_device_run_qwen_image_edit.py --timestamp_path outputs/default/20260213_150110/
```

After this command executes, a `fullset` folder will be generated in the `outputs/default/20260213_150110/results/` directory. This folder can be used directly for evaluation in the [GEdit-Bench Tool](https://github.com/stepfun-ai/Step1X-Edit/blob/main/GEdit-Bench/EVAL.md).
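A note on reading the `O_point` column in the gathered results: it is lower than both `SC_point` and `PQ_point`. One plausible explanation, stated here as an assumption rather than a description of the AISBench code, is that the overall score follows the VIEScore convention of taking the geometric mean of SC and PQ per case and then averaging over cases. The sketch below uses made-up scores to show that an average of per-case geometric means never exceeds the geometric mean of the averages, and drops sharply when a case scores 0 on one axis.

```python
import math
from statistics import mean


def overall_score(sc: float, pq: float) -> float:
    """Per-case overall score as the geometric mean of SC and PQ (assumed convention)."""
    return math.sqrt(sc * pq)


# Made-up (SC, PQ) pairs for three cases; not real benchmark data.
cases = [(8.0, 7.0), (0.0, 9.0), (9.0, 6.0)]

sc_mean = mean(sc for sc, _ in cases)
pq_mean = mean(pq for _, pq in cases)
o_mean = mean(overall_score(sc, pq) for sc, pq in cases)

# A case with SC = 0 contributes 0 to the overall mean regardless of its PQ,
# so o_mean ends up below sqrt(sc_mean * pq_mean).
print(f"SC mean={sc_mean:.4f}  PQ mean={pq_mean:.4f}  O mean={o_mean:.4f}")
```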