Question

I've been reviewing the performance of several NVIDIA GPUs, and I see that results are typically presented in terms of the number of "images per second" that can be processed. The experiments are usually performed on classical network architectures such as AlexNet or GoogLeNet.

I'm wondering whether a given number of images per second, say 15,000, means that 15,000 images can be processed per iteration, or that the network can be fully trained with that number of images. I suppose that if I have 15,000 images and want to calculate how fast a given GPU will train the network, I would have to multiply by certain values from my specific configuration (for example, the number of iterations). If that is not the case, is there a default configuration being used for these tests?

Here is an example of such a benchmark: Deep Learning Inference on P40 GPUs (mirror)


Solution

I'm wondering whether a given number of images per second, say 15,000, means that 15,000 images can be processed per iteration, or that the network can be fully trained with that number of images.

Benchmarks typically specify somewhere whether they are reporting the forward (a.k.a. inference, a.k.a. test) time or the training time. For example, from the page you mentioned in your question:

[screenshot of the benchmark table from the linked P40 inference report]

Another example from https://github.com/soumith/convnet-benchmarks (mirror):

[screenshot of the convnet-benchmarks timing table, which lists forward and backward pass times separately]
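
If you want to reproduce that kind of "forward time" number on your own hardware, a minimal sketch along the following lines is enough. This assumes a recent PyTorch/torchvision install; the batch size and iteration counts are arbitrary choices, not the configuration used in the reports above.

```python
# Minimal sketch: measure forward-pass (inference) throughput in images/second
# for AlexNet. Assumes PyTorch and a recent torchvision are installed; falls
# back to CPU if no CUDA GPU is available. All counts below are arbitrary.
import time
import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.alexnet(weights=None).to(device).eval()  # random weights are fine for timing

batch_size = 128                                        # throughput depends strongly on this
images = torch.randn(batch_size, 3, 224, 224, device=device)

with torch.no_grad():
    for _ in range(10):                                 # warm-up iterations
        model(images)
    if device.type == "cuda":
        torch.cuda.synchronize()                        # wait for queued GPU work to finish

    iterations = 50
    start = time.perf_counter()
    for _ in range(iterations):
        model(images)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"forward pass: {iterations * batch_size / elapsed:,.0f} images/second")
```

A training benchmark would additionally time the backward pass and the optimizer step, so the training throughput in images/second is typically several times lower than the inference figure for the same network.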

OTHER TIPS

I'm wondering whether a given number of images per second, say 15,000, means that 15,000 images can be processed per iteration, or that the network can be fully trained with that number of images. I suppose that if I have 15,000 images and want to calculate how fast a given GPU will train the network, I would have to multiply by certain values from my specific configuration (for example, the number of iterations). If that is not the case, is there a default configuration being used for these tests?

The report you refer to, "Deep Learning Inference on P40 GPUs", uses a set of images described in the paper "ImageNet Large Scale Visual Recognition Challenge"; the dataset is available from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) website.

Here is Figure 7 from the paper referenced above, and the associated text:

[Figure 7: accuracy of the winning ImageNet Challenge entries over the years]

An excellent example of the successes in deep learning can be illustrated with the ImageNet Challenge [14]. This challenge is a contest involving several different components. One of the components is an image classification task where algorithms are given an image and they must identify what is in the image, as shown in Fig. 6. The training set consists of 1.2 million images, each of which is labeled with one of 1000 object categories that the image contains. For the evaluation phase, the algorithm must accurately identify objects in a test set of images, which it hasn’t previously seen.

Fig. 7 shows the performance of the best entrants in the ImageNet contest over a number of years. One sees that the accuracy of the algorithms initially had an error rate of 25% or more. In 2012, a group from the University of Toronto used graphics processing units (GPUs) for their high compute capability and a deep neural network approach, named AlexNet, and dropped the error rate by approximately 10% [3].

Their accomplishment inspired an outpouring of deep learning style algorithms that have resulted in a steady stream of improvements.

In conjunction with the trend toward deep learning approaches for the ImageNet Challenge, there has been a corresponding increase in the number of entrants using GPUs: from 2012, when only 4 entrants used GPUs, to 2014, when almost all the entrants (110) were using them. This reflects the almost complete switch from traditional computer vision approaches to deep learning-based approaches for the competition.

In 2015, the ImageNet winning entry, ResNet [15], exceeded human-level accuracy with a top-5 error rate below 5%. Since then, the error rate has dropped below 3%, and more focus is now being placed on more challenging components of the competition, such as object detection and localization. These successes are clearly a contributing factor to the wide range of applications to which DNNs are being applied.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.

[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016.
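
As an aside, the "top-5 error rate" mentioned in the excerpt counts a prediction as wrong only if the true label is not among the model's five highest-scoring classes. Below is a minimal sketch of that computation in plain NumPy, with randomly generated scores and labels standing in for real model outputs; it is not the official ILSVRC evaluation code.

```python
# Minimal sketch of the top-5 error metric: a prediction counts as correct
# if the true label is among the five highest-scoring classes.
# Scores and labels below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
num_images, num_classes = 8, 1000
scores = rng.standard_normal((num_images, num_classes))   # model outputs (logits)
labels = rng.integers(0, num_classes, size=num_images)    # ground-truth class ids

top5 = np.argsort(scores, axis=1)[:, -5:]                 # indices of the 5 best classes
correct = (top5 == labels[:, None]).any(axis=1)
top5_error = 1.0 - correct.mean()

print(f"top-5 error: {top5_error:.1%}")
```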

The challenge changes from year to year; for example, ILSVRC 2017 included the following tasks (partial list):

Main Challenges

  1. Object localization
    The data for the classification and localization tasks will remain unchanged from ILSVRC 2012. The validation and test data will consist of 150,000 photographs, collected from flickr and other search engines, hand labeled with the presence or absence of 1000 object categories. The 1000 object categories contain both internal nodes and leaf nodes of ImageNet, but do not overlap with each other. A random subset of 50,000 of the images with labels will be released as validation data included in the development kit along with a list of the 1000 categories. The remaining images will be used for evaluation and will be released without labels at test time. The training data, the subset of ImageNet containing the 1000 categories and 1.2 million images, will be packaged for easy downloading. The validation and test data for this competition are not contained in the ImageNet training data.
    ...
    The winner of the object localization challenge will be the team which achieves the minimum average error across all test images.

  2. Object detection
    The training and validation data for the object detection task will remain unchanged from ILSVRC 2014. The test data will be partially refreshed with new images based upon last year's competition (ILSVRC 2016). There are 200 basic-level categories for this task which are fully annotated on the test data, i.e. bounding boxes for all categories in the image have been labeled. The categories were carefully chosen considering different factors such as object scale, level of image clutterness, average number of object instance, and several others. Some of the test images will contain none of the 200 categories.
    ...
    The winner of the detection challenge will be the team which achieves first place accuracy on the most object categories.

  3. Object detection from video
    This is similar in style to the object detection task. We will partially refresh the validation and test data for this year's competition. There are 30 basic-level categories for this task, which is a subset of the 200 basic-level categories of the object detection task. The categories were carefully chosen considering different factors such as movement type, level of video clutterness, average number of object instance, and several others. All classes are fully labeled for each clip.
    ...
    The winner of the detection from video challenge will be the team which achieves best accuracy on the most object categories.

See the link above for complete information.

In short: The benchmark results don't mean that you can take 15,000 of your own images and classify them in one second to achieve a rating of 15K/sec.

It means that there was a contest in which each entrant was given a training set of (most recently) 1.2 million images and a test set of 150,000 images; the winner is the team that identifies the contents of the test images most accurately.

You should expect that if you use the same hardware, the same network, and an arbitrary set of images of similar complexity, you will obtain a classification rate of approximately whatever the particular algorithm is rated at. In theory, repeating the test with the same hardware and images will produce a result very close to what was obtained in the contest.

How quickly the "15,000 images" mentioned in your question could be classified depends on their complexity compared to the test set. Some of the contest winners provide complete information about their methods, while other contestants keep proprietary details private; you can choose an open method and implement it on your hardware, presumably the fastest publicly available one.
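
Coming back to the arithmetic in the question: if the figure you have is a training throughput for your network and batch size (rather than an inference figure), a rough estimate of total training time is simply the dataset size divided by the throughput, multiplied by the number of epochs. A back-of-the-envelope sketch, with every number a placeholder for your own configuration:

```python
# Back-of-the-envelope estimate only; every number below is a placeholder for
# your own configuration, not a figure taken from the reports discussed above.
dataset_size = 15_000          # images in your training set
batch_size = 128               # images processed per iteration
epochs = 50                    # full passes over the training set
train_images_per_sec = 1_500   # measured *training* throughput for your network

iterations_per_epoch = dataset_size / batch_size
total_seconds = dataset_size * epochs / train_images_per_sec

print(f"iterations per epoch: {iterations_per_epoch:.0f}")
print(f"estimated training time: {total_seconds / 60:.0f} minutes")
```

If the quoted number is instead an inference throughput, it only tells you how fast the forward pass runs and says nothing directly about training time.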

Licensed under: CC-BY-SA with attribution