Introduction to Evaluation Scenarios
Accuracy Evaluation
Service-Oriented Accuracy Evaluation
Function Description: Evaluate the prediction accuracy of a model deployed as a service on specific datasets. Two evaluation modes are currently supported: generative and PPL (perplexity-based).
Requirements: The model has been deployed, and its actual service capabilities need to be tested.
Model Tasks and Dataset Tasks Supported by This Scenario:
Model Tasks: 📚 Service-Oriented Inference Backend
Dataset Tasks: 📚 Open-Source Datasets and 📚 Custom Datasets
Constraint: Currently, PPL mode accuracy evaluation tasks support only the vllm_api_general and vllm_api_general_chat model configurations; other configurations are not supported.
After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Service-Oriented Accuracy Evaluation Guide
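To make the PPL mode in this scenario more concrete, here is a minimal Python sketch of perplexity-based answer selection. It assumes the service can return per-token natural-log probabilities for each candidate answer. The function names and the example values are hypothetical, and this is not necessarily how the evaluation tool implements PPL mode internally.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # PPL = exp(-(1/N) * sum(log p_i))
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_by_ppl(candidates):
    """Return the label of the candidate with the lowest perplexity.

    `candidates` maps an answer label to the log-probabilities the model
    assigned to that answer's tokens.
    """
    return min(candidates, key=lambda label: perplexity(candidates[label]))


# Hypothetical log-probabilities for two answer options (illustrative only).
example = {
    "A": [-0.2, -0.1, -0.3],
    "B": [-1.5, -2.0, -0.9],
}
prediction = pick_by_ppl(example)
```

In this sketch the predicted answer is the option the model finds least surprising. Accuracy would then be the fraction of samples where that prediction matches the reference label.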
Pure Model Accuracy Evaluation
Function Description: Evaluate the accuracy of locally loaded models (non-service-oriented) on different datasets.
Requirements: Offline model weights and a deployment environment.
Supported Items:
Model Tasks: 📚 Local Model Backend
Dataset Tasks: 📚 Open-Source Datasets and 📚 Custom Datasets
Constraint: PPL mode evaluation tasks are not supported.
After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Pure Model Accuracy Evaluation Guide
Performance Evaluation
Service-Oriented Performance Evaluation
Function Description: Evaluate the runtime performance (throughput and latency) of a served model in a real deployment environment.
Requirements: The model inference service must support access via a streaming interface.
Supported Items:
Model Tasks: Streaming interface types in 📚 Service-Oriented Inference Backend
Dataset Tasks: All data types in 📚 Supported Dataset Types
Note: The cache space used by a performance evaluation is proportional to both the context length of the requests and the number of requests, so it usually grows as the evaluation runs longer.
Constraint: PPL mode evaluation tasks are not supported.
After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Service-Oriented Performance Evaluation Guide
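The streaming requirement in this scenario matters because per-token arrival times are what make latency and throughput measurable. The Python sketch below shows one common way to derive such figures from recorded timestamps. The StreamedRequest type, the function names, and the exact metric definitions are assumptions for illustration, not the tool's actual data model or formulas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamedRequest:
    start: float              # time the request was sent, in seconds
    token_times: List[float]  # arrival time of each streamed token, in seconds


def request_latencies(req):
    """First-token and end-to-end latency for one streamed request."""
    if not req.token_times:
        raise ValueError("request produced no tokens")
    return {
        "first_token_latency": req.token_times[0] - req.start,
        "end_to_end_latency": req.token_times[-1] - req.start,
    }


def output_throughput(requests):
    """Output tokens per second over the whole run (first send to last token)."""
    total_tokens = sum(len(r.token_times) for r in requests)
    begin = min(r.start for r in requests)
    end = max(r.token_times[-1] for r in requests)
    return total_tokens / (end - begin)


# Hypothetical timestamps for two overlapping requests.
reqs = [
    StreamedRequest(start=0.0, token_times=[0.5, 1.0, 1.5]),
    StreamedRequest(start=1.0, token_times=[1.5, 2.0]),
]
```

Without a streaming interface only the final response time would be observable, so first-token latency could not be separated from total generation time.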