Question

I understand that a GPU can speed up training, since for each batch multiple data records are fed to the network and the computation can be parallelized. However, for inference the network typically processes only one record at a time; for instance, in text classification only a single text (e.g. a tweet) is fed to the network. In such a case, how can a GPU speed things up?


Solution

Although what you describe is correct, such online/real-time usage is far from being the only (or even the most frequent) use case for DL inference. The keyword here is "batch": in several applications, inference can also be run on batches of incoming data instead of on single samples.

Take the example mentioned by NVIDIA in their AI Inference Platform technical overview (p.3):

Inference can also batch hundreds of samples to achieve optimal throughput on jobs run overnight in data centers to process substantial amounts of stored data. These jobs tend to emphasize throughput over latency. However, for real-time usages, high batch sizes also carry a latency penalty. For these usages, lower batch sizes (as low as a single sample) are used, trading off throughput for lowest latency. A hybrid approach, sometimes referred to as “auto-batching,” sets a time threshold—say, 10 milliseconds (ms)—and batches as many samples as possible within those 10ms before sending them on for inference. This approach achieves better throughput while maintaining a set latency amount.
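As a rough illustration of that hybrid approach, here is a minimal Python sketch of auto-batching, assuming a PyTorch model that accepts batched tensors; the queue, callback mechanism, batch limit, and the 10 ms threshold are illustrative choices, not NVIDIA's actual implementation (serving frameworks such as NVIDIA Triton provide dynamic batching like this out of the box).

```python
import queue
import time

import torch

MAX_BATCH = 64       # illustrative maximum batch size
MAX_WAIT_S = 0.010   # 10 ms time threshold, as in the quoted example

# Incoming work: (input_tensor, result_callback) pairs pushed by request handlers.
request_queue = queue.Queue()

def auto_batching_loop(model, device="cuda"):
    """Collect requests for up to MAX_WAIT_S, then run them through the model as one batch."""
    model = model.to(device).eval()
    while True:
        batch, callbacks = [], []
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                x, cb = request_queue.get(timeout=remaining)
            except queue.Empty:
                break
            batch.append(x)       # callers are assumed to send equally shaped tensors
            callbacks.append(cb)
        if not batch:
            continue
        with torch.no_grad():
            inputs = torch.stack(batch).to(device)   # one GPU call for the whole batch
            outputs = model(inputs).cpu()
        for cb, out in zip(callbacks, outputs):
            cb(out)               # hand each result back to the caller that submitted it
```

In a real service this loop would run in a background thread or process, with request handlers enqueuing their inputs and waiting on the callback (or a future) for the result.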

Although, as correctly pointed out in another thread comment, it is in NVIDIA's best interest to convince you that you need GPUs for inference (which is indeed not always true), hopefully you can see the pattern: whenever we want to emphasize throughput over latency, GPUs will be useful for speeding up inference.

Practically, any application that runs on existing archives of data (videos, audio, music, text, documents) instead of waiting for incoming streams in real time can meaningfully rely on GPUs for inference. And here "archives" does not necessarily imply time spans of months or years (although it can, e.g. in astronomy applications); the archive consisting of the photos uploaded to Facebook in the last 3 minutes (or since I started writing this...) is huge, and it, too, can benefit from GPU-accelerated inference.
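As a rough sketch of such offline, throughput-oriented inference, the snippet below simply streams a stored archive through a network in large batches; the dataset, model, and batch size are placeholders invented for illustration, not specifics from the answer.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder "archive": 100k stored feature vectors. In practice this would be
# photos, documents, audio clips, etc. loaded from disk or object storage.
archive = TensorDataset(torch.randn(100_000, 512))
loader = DataLoader(archive, batch_size=1024, num_workers=4)

model = torch.nn.Linear(512, 10).to(device).eval()   # stand-in for a real network

predictions = []
with torch.no_grad():
    for (batch,) in loader:
        logits = model(batch.to(device, non_blocking=True))   # whole batch in one GPU call
        predictions.append(logits.argmax(dim=1).cpu())

predictions = torch.cat(predictions)   # one label per archived sample
```

The larger the batch (within GPU memory limits), the better the GPU's parallelism is utilized, which is exactly the throughput-over-latency trade-off described above.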

Videos, specifically, may benefit from GPU-accelerated batch inference even in near-real-time applications, since they are usually broken up into frames for processing.
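A hypothetical frame-classification pipeline along those lines might decode the video with OpenCV and push frames to the network in chunks rather than one frame at a time; the function name, input size, and batch size below are assumptions made for the sake of the example.

```python
import cv2
import numpy as np
import torch

BATCH = 32   # illustrative number of frames per inference call

def classify_video(path, model, device="cuda"):
    """Decode a video and run inference on its frames in batches of BATCH."""
    cap = cv2.VideoCapture(path)
    model = model.to(device).eval()
    frames, results = [], []
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.resize(frame, (224, 224)))
            # Run the model once a full batch is collected, or on the remainder
            # when the video ends.
            if frames and (len(frames) == BATCH or not ok):
                x = torch.from_numpy(np.stack(frames)).permute(0, 3, 1, 2).float() / 255.0
                results.append(model(x.to(device)).argmax(dim=1).cpu())
                frames = []
            if not ok:
                break
    cap.release()
    return torch.cat(results) if results else torch.empty(0, dtype=torch.long)
```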

But if you just want to set up a web app that processes low-traffic incoming photos or tweets and responds in real time, then indeed a GPU may not offer any substantial performance benefit.

Licensed under: CC-BY-SA with attribution