Question

When we have an image to be used as an input to a CNN and we want to classify only part of the image, we usually feed the classifier with a crop of the image.

Let's say my image is called frame and x, y, w and h are xmin, ymin, xmax and ymax, respectively:

frame = frame[y:y + h, x:x + w] #Crop a part of the image

What do y:y and x:x mean, and why do we add h and w to them, respectively?

I've seen some people perform the crop in the following way:

frame = frame[y:h, x:w] #Crop a part of the image without adding to `w` and `h`

I saw the second approach being used in some places like in the following line: https://github.com/balajisrinivas/Face-Mask-Detection/blob/master/detect_mask_video.py#L51

What's the difference?

Solution

Let's say my image is called frame and x, y, w and h are xmin, ymin, xmax and ymax

You're confusing $w$ with $xmax$ and $h$ with $ymax$. Usually $w$ is the width of the crop, whereas $xmax$ is the horizontal position of the end of the crop. Similarly, $h$ is the height of the crop and $ymax$ is the vertical position of its end.

Since $x$ is the (horizontal) start of the crop and $w$ is its width, we can obtain $xmax$ like this: $xmax=x+w$. In the same way, $ymax=y+h$.

Example: in a 100x100 image, let's say we want to crop a 20x20 square in the centre: $x=40, y=40, w=20, h=20, xmax=60, ymax=60$.

In the following code:

frame = frame[y:y + h, x:x + w]

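the : operator builds a slice start:stop, which covers the indices from start up to, but not including, stop (for instance, 3:7 means 3, 4, 5, 6). The expression y:y + h is parsed as y:(y + h), so there is no separate "y:y": it selects the $h$ rows from $y$ up to, but not including, $ymax=y+h$. The same applies to x:x + w for the columns. Because the first index of the array addresses rows (vertical) and the second addresses columns (horizontal), this line selects the part of the array corresponding to the crop.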
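The half-open behaviour of slices can be checked with plain Python lists. This is only a small sketch: the names values, rows and selected, and the numbers 40 and 20, are illustrative choices and do not come from the question's code.

```python
# Slices are half-open: the start index is included, the stop index is excluded.
values = list(range(10))      # [0, 1, ..., 9]
window = values[3:7]          # elements at indices 3, 4, 5 and 6

# Hypothetical crop parameters, chosen only for illustration.
y, h = 40, 20
rows = list(range(100))       # stand-in for the row indices of a 100-row image
selected = rows[y:y + h]      # parsed as rows[y:(y + h)]

print(window)                       # [3, 4, 5, 6]
print(selected[0], selected[-1])    # 40 59
print(len(selected))                # 20
```

The slice starts at row 40, ends at row 59, and contains exactly h = 20 rows.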

Your second example is wrong because of the same confusion. The actual code at that link is:

face = frame[startY:endY, startX:endX]

In this case the author is directly using the end coordinate endY (same as $ymax$) instead of calculating it as startY+h.
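To see that the two conventions give the same crop, here is a minimal sketch using NumPy and the 100x100 example above. The synthetic frame is an assumption standing in for a real image, and crop_by_size, crop_by_corners and wrong are illustrative names, not identifiers from the linked script.

```python
import numpy as np

# Synthetic stand-in for a 100x100 three-channel frame (a real one would come from a camera or file).
frame = np.arange(100 * 100 * 3, dtype=np.uint32).reshape(100, 100, 3)

# Size convention: top-left corner plus width and height.
x, y, w, h = 40, 40, 20, 20
crop_by_size = frame[y:y + h, x:x + w]

# Corner convention: top-left and bottom-right coordinates.
startX, startY = x, y
endX, endY = x + w, y + h
crop_by_corners = frame[startY:endY, startX:endX]

print(crop_by_size.shape)                              # (20, 20, 3)
print(np.array_equal(crop_by_size, crop_by_corners))   # True

# Mixing the conventions: using w and h as if they were end coordinates.
# Here the stop index (20) is smaller than the start index (40), so the crop is empty.
wrong = frame[y:h, x:w]
print(wrong.shape)                                     # (0, 0, 3)
```

So frame[y:h, x:w] is only correct when w and h actually hold the end coordinates, as endX and endY do in the linked code.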

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange