Introduction to Evaluation Scenarios

Accuracy Evaluation

Service-Oriented Accuracy Evaluation

  • Function Description: Evaluates the prediction accuracy of a model deployed as a service on specific datasets. Two evaluation modes are currently supported: generative and PPL (perplexity-based).

  • Requirements: The model is already deployed as a service, and you need to test the accuracy that the service actually delivers.

  • Model Tasks and Dataset Tasks Supported by This Scenario:

  • Constraint: PPL-mode accuracy evaluation currently supports only the vllm_api_general and vllm_api_general_chat model configurations.

After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Service-Oriented Accuracy Evaluation Guide
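To make the PPL mode mentioned above more concrete, here is a minimal sketch of how perplexity-based scoring is commonly done for multiple-choice data: each candidate answer is scored by the perplexity the model assigns to it, and the candidate with the lowest perplexity is taken as the prediction. This is an illustration under stated assumptions, not this tool's actual implementation; the function names and the log-probability values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_by_ppl(candidate_logprobs):
    """Return the option label whose text has the lowest perplexity."""
    return min(
        candidate_logprobs,
        key=lambda label: perplexity(candidate_logprobs[label]),
    )


# Hypothetical log-probabilities for three "question + option" texts.
# In a real run these would come from the deployed service (assumption).
example = {
    "A": [-2.1, -1.8, -2.4],
    "B": [-0.4, -0.6, -0.5],
    "C": [-1.2, -1.5, -1.1],
}

prediction = pick_by_ppl(example)
is_correct = prediction == "B"  # compare against the reference answer
```

Generative mode, by contrast, lets the model produce free text and then extracts the answer from that text. PPL scoring needs per-token log-probabilities from the backend, which may be why only certain model configurations support it.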

Pure Model Accuracy Evaluation

  • Function Description: Evaluate the accuracy of locally loaded models (non-service-oriented) on different datasets.

  • Requirements: Offline model weights and a deployment environment.

  • Supported Items:

  • Constraint: PPL mode evaluation tasks are not supported.

After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Pure Model Accuracy Evaluation Guide

Performance Evaluation

Service-Oriented Performance Evaluation

  • Function Description: Evaluates the runtime performance (throughput, latency) of a model service in a real deployment environment.

  • Requirements: The model inference service must support access via a streaming interface.

  • Supported Items:

  • Note: The cache occupied by performance evaluation grows in proportion to both the context length of the requests and the number of requests, so it usually increases the longer the evaluation runs.

  • Constraint: PPL mode evaluation tasks are not supported.

After selecting a model task and a dataset task that fit your needs, see the following guide for detailed usage of this scenario: 📚 Service-Oriented Performance Evaluation Guide
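The sketch below illustrates why a streaming interface matters for latency and throughput measurements: per-chunk arrival times are what make metrics such as time to first token measurable. It is a simplified illustration, not this tool's implementation. The `StreamedRequest` type and the metric helpers are hypothetical names, and the sketch assumes each streamed chunk carries exactly one output token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamedRequest:
    start: float              # time the request was sent, in seconds
    chunk_times: List[float]  # arrival time of each streamed chunk, in seconds


def ttft(req):
    """Time to first token: delay until the first chunk arrives."""
    return req.chunk_times[0] - req.start


def e2e_latency(req):
    """End-to-end latency: delay until the last chunk arrives."""
    return req.chunk_times[-1] - req.start


def tpot(req):
    """Average time per output token after the first one."""
    n = len(req.chunk_times)
    if n < 2:
        return 0.0
    return (req.chunk_times[-1] - req.chunk_times[0]) / (n - 1)


def output_throughput(reqs):
    """Output tokens per second over the whole measurement window."""
    total_tokens = sum(len(r.chunk_times) for r in reqs)
    window = max(r.chunk_times[-1] for r in reqs) - min(r.start for r in reqs)
    return total_tokens / window


# Two hypothetical requests with made-up timestamps.
reqs = [
    StreamedRequest(start=0.0, chunk_times=[0.2, 0.3, 0.4, 0.5]),
    StreamedRequest(start=0.1, chunk_times=[0.4, 0.6, 0.8, 1.0]),
]
```

With a non-streaming interface only `e2e_latency` could be observed, since the client sees a single response per request.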