Introduction to Evaluation Scenarios

Accuracy Evaluation

Service-Oriented Accuracy Evaluation

Function Description: Evaluate the prediction accuracy of a service-deployed model on specific datasets. Two evaluation modes are currently supported: generative and PPL (perplexity-based).
Requirements: The model has been deployed, and its actual service capabilities need to be tested.
Supported Items:
- Model Tasks: 📚 Service-Oriented Inference Backend
- Dataset Tasks: 📚 Open-Source Datasets and 📚 Custom Datasets
Constraint: Currently, PPL mode accuracy evaluation tasks only support vllm_api_general and vllm_api_general_chat model configurations; other configurations are not supported.
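
To illustrate what PPL mode means in general terms, the sketch below scores each candidate answer by its perplexity and picks the lowest one. It is a minimal illustration of the common perplexity-based selection technique, not this tool's implementation. The function names and the per-token log-probabilities are hypothetical; in practice the log-probabilities would come from the inference service.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence given its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Perplexity is the exponential of the negative mean log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_by_ppl(candidates):
    """Return the label whose token sequence has the lowest perplexity."""
    return min(candidates, key=lambda label: perplexity(candidates[label]))


# Hypothetical log-probabilities for a two-option question (assumed values).
example = {
    "A": [-0.2, -0.1, -0.3],
    "B": [-1.5, -2.0, -0.9],
}
prediction = pick_by_ppl(example)
```

Generative mode, by contrast, asks the model to produce an answer as text and then compares that text with the reference.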

After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Service-Oriented Accuracy Evaluation Guide

Pure Model Accuracy Evaluation

Function Description: Evaluate the accuracy of locally loaded models (non-service-oriented) on different datasets.
Requirements: Offline model weights and a deployment environment.
Supported Items:
- Model Tasks: 📚 Local Model Backend
- Dataset Tasks: 📚 Open-Source Datasets and 📚 Custom Datasets
Constraint: PPL mode evaluation tasks are not supported.
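
As a rough picture of what an accuracy score over a dataset amounts to in generative mode, the sketch below computes exact-match accuracy between model outputs and reference answers. The function name and sample data are hypothetical, and real datasets often use other metrics or answer post-processing, so treat this as an assumption-laden illustration only.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one sample")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical model outputs and reference answers (assumed values).
score = exact_match_accuracy(["Paris", " 4", "blue"], ["Paris", "4", "red"])
```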

After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Pure Model Accuracy Evaluation Guide

Performance Evaluation

Service-Oriented Performance Evaluation

Function Description: Evaluate the operational efficiency (throughput and latency) of a service-deployed model in a real deployment environment.
Requirements: The model inference service must support access via a streaming interface.
Supported Items:
- Model Tasks: Streaming interface types in 📚 Service-Oriented Inference Backend
- Dataset Tasks: All data types in 📚 Supported Dataset Types
Note: The cache space consumed during performance evaluation is proportional to both the context length of the requests and the number of requests, so it typically grows as the evaluation runs longer.
Constraint: PPL mode evaluation tasks are not supported.
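
The sketch below shows one way throughput and latency figures can be derived from per-request timestamps, and why a streaming interface matters: time to first token can only be measured when tokens arrive incrementally. The `RequestTrace` class, the `summarize` function, and the sample values are hypothetical; the metrics this tool actually reports may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds, as recorded by a hypothetical streaming client."""
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def summarize(traces):
    """Aggregate latency and throughput over a batch of completed requests."""
    if not traces:
        raise ValueError("need at least one trace")
    ttft = [t.first_token_at - t.sent_at for t in traces]
    latency = [t.finished_at - t.sent_at for t in traces]
    # Wall-clock span from the first request sent to the last one finished.
    wall = max(t.finished_at for t in traces) - min(t.sent_at for t in traces)
    if wall <= 0:
        raise ValueError("traces must span a positive duration")
    total_tokens = sum(t.output_tokens for t in traces)
    return {
        "mean_ttft_s": sum(ttft) / len(ttft),
        "mean_latency_s": sum(latency) / len(latency),
        "output_tokens_per_s": total_tokens / wall,
        "requests_per_s": len(traces) / wall,
    }


# Two overlapping requests with assumed timings.
traces = [
    RequestTrace(sent_at=0.0, first_token_at=0.5, finished_at=2.0, output_tokens=30),
    RequestTrace(sent_at=1.0, first_token_at=1.5, finished_at=4.0, output_tokens=50),
]
summary = summarize(traces)
```

Throughput here is computed over the wall-clock span rather than the sum of per-request latencies, so concurrent requests are not double-counted.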

After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Service-Oriented Performance Evaluation Guide