---
comments: true
description: Discover how to enhance Ultralytics YOLO model performance using Intel's OpenVINO toolkit. Boost latency and throughput efficiently.
keywords: Ultralytics YOLO, OpenVINO optimization, deep learning, model inference, throughput optimization, latency optimization, AI deployment, Intel's OpenVINO, performance tuning
---
When deploying deep learning models, particularly those for object detection such as Ultralytics YOLO models, achieving optimal performance is crucial. This guide delves into leveraging Intel's OpenVINO toolkit to optimize inference, focusing on latency and throughput. Whether you're working on consumer-grade applications or large-scale deployments, understanding and applying these optimization strategies will ensure your models run efficiently on various devices.
Latency optimization is vital for applications requiring immediate response from a single model given a single input, typical in consumer scenarios. The goal is to minimize the delay between input and inference result. However, achieving low latency involves careful consideration, especially when running concurrent inferences or managing multiple models.
Setting `ov::hint::PerformanceMode::LATENCY` for the `ov::hint::performance_mode` property during model compilation simplifies performance tuning, offering a device-agnostic and future-proof approach. For first-inference latency, note that OpenVINO memory-maps model files by default; pass `ov::enable_mmap(false)` to switch back to reading the file, which can be faster when the model resides on a removable or network drive.

Throughput optimization is crucial for scenarios serving numerous inference requests simultaneously, maximizing resource utilization without significantly sacrificing individual request performance.
OpenVINO Performance Hints: A high-level, future-proof method to enhance throughput across devices using performance hints.
```python
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model.xml")  # path to an exported OpenVINO IR model

# THROUGHPUT hint: the device configures itself to maximize requests per second
config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
compiled_model = core.compile_model(model, "GPU", config)
```
Explicit Batching and Streams: A more granular approach involving explicit batching and the use of streams for advanced performance tuning.
To maximize throughput, applications should keep the target device saturated: submit inference requests in parallel through OpenVINO's asynchronous API rather than serial, blocking calls, and size the pool of in-flight requests to the device's reported optimal number.
OpenVINO's multi-device mode simplifies scaling throughput by automatically balancing inference requests across devices without requiring application-level device management.
Implementing OpenVINO optimizations with Ultralytics YOLO models can yield significant performance improvements. As demonstrated in benchmarks, users can experience up to 3x faster inference speeds on Intel CPUs, with even greater accelerations possible across Intel's hardware spectrum including integrated GPUs, dedicated GPUs, and VPUs.
For example, when running YOLOv8 models on Intel Xeon CPUs, the OpenVINO-optimized versions consistently outperform their PyTorch counterparts in terms of inference time per image, without compromising on accuracy.
To export and optimize your Ultralytics YOLO model for OpenVINO, you can use the export functionality:
```python
from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.pt")

# Export the model to OpenVINO format
model.export(format="openvino", half=True)  # export with FP16 precision
```
After exporting, you can run inference with the optimized model:
```python
from ultralytics import YOLO

# Load the exported OpenVINO model
ov_model = YOLO("yolov8n_openvino_model/")

# Run inference
results = ov_model("path/to/image.jpg", verbose=True)
```
Optimizing Ultralytics YOLO models for latency and throughput with OpenVINO can significantly enhance your application's performance. By carefully applying the strategies outlined in this guide, developers can ensure their models run efficiently, meeting the demands of various deployment scenarios. Remember, the choice between optimizing for latency or throughput depends on your specific application needs and the characteristics of the deployment environment.
For more detailed technical information and the latest updates, refer to the OpenVINO documentation and Ultralytics YOLO repository. These resources provide in-depth guides, tutorials, and community support to help you get the most out of your deep learning models.
Ensuring your models achieve optimal performance is not just about tweaking configurations; it's about understanding your application's needs and making informed decisions. Whether you're optimizing for real-time responses or maximizing throughput for large-scale processing, the combination of Ultralytics YOLO models and OpenVINO offers a powerful toolkit for developers to deploy high-performance AI solutions.
Optimizing Ultralytics YOLO models for low latency involves several key strategies, including running a single inference per device at a time and setting `ov::hint::PerformanceMode::LATENCY` during model compilation for simplified, device-agnostic tuning. For more practical tips on optimizing latency, check out the Latency Optimization section of our guide.
OpenVINO enhances Ultralytics YOLO model throughput by maximizing device resource utilization without sacrificing performance. Key benefits include:
Example configuration:
```python
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model.xml")  # path to an exported OpenVINO IR model

config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
compiled_model = core.compile_model(model, "GPU", config)
```
Learn more about throughput optimization in the Throughput Optimization section of our detailed guide.
To reduce first-inference latency, consider these practices: rely on OpenVINO's default memory-mapping of model files (`ov::enable_mmap(true)`), but switch to reading (`ov::enable_mmap(false)`) if the model is on a removable or network drive. For detailed strategies on managing first-inference latency, refer to the Managing First-Inference Latency section.
Balancing latency and throughput optimization requires understanding your application needs:
Using OpenVINO's high-level performance hints and multi-device modes can help strike the right balance. Choose the appropriate OpenVINO performance hint based on your specific requirements.
Yes, Ultralytics YOLO models are highly versatile and can be integrated with various AI frameworks. Options include:
Explore more integrations on the Ultralytics Integrations page.