What do the parameters used in crop mean?
17-12-2020
Question
When we have an image to be used as an input to a CNN and we want to classify only part of the image, we usually feed the classifier with a crop of the image.
Let's say my image is called frame and x, y, w and h are xmin, ymin, xmax and ymax, respectively:
frame = frame[y:y + h, x:x + w] #Crop a part of the image
What does y:y or x:x mean, and why do we add h and w to them, respectively?
I've seen some people perform the crop in the following way:
frame = frame[y:h, x:w] #Crop a part of the image without adding to `w` and `h`
I saw the second approach being used in some places like in the following line: https://github.com/balajisrinivas/Face-Mask-Detection/blob/master/detect_mask_video.py#L51
What's the difference?
Solution
Let's say my image is called frame and x, y, w and h are xmin, ymin, xmax and ymax
You're confusing $w$ with $xmax$ and $h$ with $ymax$. Usually $w$ is the width of the crop, whereas $xmax$ is the horizontal position of the end of the crop. Similarly, $h$ is the height and $ymax$ is the vertical position of the end of the crop.
Since $x$ is the (horizontal) start of the crop and $w$ is its width, we can obtain $xmax$ like this: $xmax=x+w$. Likewise, $ymax=y+h$.
Example: in a 100x100 image, let's say we want to crop a 20x20 square in the centre: $x=40, y=40, w=20, h=20, xmax=60, ymax=60$.
In the following code:
frame = frame[y:y + h, x:x + w]
the operator : denotes a range whose start is included and whose end is excluded (for instance 3:7 means 3, 4, 5, 6), so y:y + h selects the rows from y up to, but not including, y+h, i.e. from $y$ to $ymax$. The same holds for the columns with x:x + w, so this line selects the part of the array corresponding to the crop.
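The half-open behaviour of the slice can be checked on a plain Python list, with no image library needed. This is only a minimal sketch; the names values, middle and rows are illustrative and do not come from the question:

```python
# A slice start:stop includes start and excludes stop.
values = list(range(10))   # [0, 1, ..., 9]
middle = values[3:7]       # [3, 4, 5, 6]

# With a start y and a height h, y:y + h selects exactly h items.
y, h = 4, 2
rows = values[y:y + h]     # [4, 5]
```

The image crop applies the same rule twice, once along the rows and once along the columns.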
Your second example is wrong because of the same confusion. The actual code in the linked file is:
face = frame[startY:endY, startX:endX]
In this case the author is directly using the end coordinate endY (same as $ymax$) instead of calculating it as startY+h.
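To make the difference between the two conventions concrete, here is a small sketch using the 100x100 example from above. It assumes a toy image stored as a list of rows, so it runs without NumPy or OpenCV; the helpers xywh_to_corners and crop_rows are hypothetical names introduced only for illustration:

```python
def xywh_to_corners(x, y, w, h):
    """Convert (x, y, width, height) to (start_x, start_y, end_x, end_y)."""
    return x, y, x + w, y + h


def crop_rows(frame, start_x, start_y, end_x, end_y):
    """Crop a list-of-rows image using end coordinates.

    With a NumPy array the equivalent would be
    frame[start_y:end_y, start_x:end_x].
    """
    return [row[start_x:end_x] for row in frame[start_y:end_y]]


# Toy 100x100 image in which each pixel stores its own (row, column).
frame = [[(r, c) for c in range(100)] for r in range(100)]

# Width/height convention: convert to end coordinates first.
start_x, start_y, end_x, end_y = xywh_to_corners(40, 40, 20, 20)
crop = crop_rows(frame, start_x, start_y, end_x, end_y)

# Mixing the conventions: passing w and h where end coordinates are
# expected asks for rows 40:20, which is an empty range.
wrong = crop_rows(frame, 40, 40, 20, 20)
```

Here crop has 20 rows of 20 pixels starting at (40, 40), while wrong is empty, because an end coordinate of 20 lies before the start coordinate of 40. In general, using a width or height as an end coordinate gives a crop of the wrong size, and an empty one whenever the value does not exceed the start.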