To answer your question: yes, it is possible. Do an inverse FFT and then crop the image normally. If that seems like a cop-out, it is because you're attempting a spatial-domain task (the image equivalent of time domain) in the frequency domain, where it isn't going to be very natural.
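As a minimal sketch of the straightforward route, assuming NumPy and a spectrum produced by `np.fft.fft2` (the helper name `crop_via_ifft` and the random test image are made up for illustration):

```python
import numpy as np

def crop_via_ifft(spectrum, x1, y1, x2, y2):
    """Invert a 2D FFT, then crop rows y1:y2 and columns x1:x2 as usual."""
    image = np.fft.ifft2(spectrum)
    return image[y1:y2, x1:x2]

# Hypothetical 8x8 test image and its spectrum.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
spectrum = np.fft.fft2(image)

# The result is complex; for a real input image the imaginary part
# is only rounding noise, so take the real part.
crop = crop_via_ifft(spectrum, 2, 1, 6, 5).real
```

Here `crop` should match `image[1:5, 2:6]` up to floating-point error.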
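If you insist that the calculation be done in the frequency domain, I think you should be able to apply a phase shift that moves the corner (x1, y1) to the origin, then inverse FFT and discard the samples outside the first (x2 - x1, y2 - y1). Note that the shift is circular, so whatever was left of or above the corner wraps around to the far edges, which is exactly the part you throw away.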
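A rough sketch of that phase-shift approach, again assuming NumPy, integer crop coordinates, and a crop rectangle that lies inside the image. It relies on the DFT shift theorem: multiplying bin k by exp(2*pi*i*k*n0/N) circularly shifts the signal by n0 samples toward the origin. The helper name `crop_via_phase_shift` is hypothetical.

```python
import numpy as np

def crop_via_phase_shift(spectrum, x1, y1, x2, y2):
    """Shift (x1, y1) to the origin with a phase ramp, invert, keep the corner."""
    h, w = spectrum.shape
    fy = np.fft.fftfreq(h)[:, None]  # row frequencies in cycles per sample
    fx = np.fft.fftfreq(w)[None, :]  # column frequencies in cycles per sample
    # Linear phase ramp: a circular shift by (y1, x1) toward the origin.
    ramp = np.exp(2j * np.pi * (fy * y1 + fx * x1))
    shifted = np.fft.ifft2(spectrum * ramp)
    # Keep only the first (y2 - y1) rows and (x2 - x1) columns.
    return shifted[: y2 - y1, : x2 - x1]

# Hypothetical 8x8 test image and its spectrum.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
spectrum = np.fft.fft2(image)

crop = crop_via_phase_shift(spectrum, 2, 1, 6, 5).real
```

Note that this still needs a full-size inverse FFT, so it saves nothing over cropping after the inverse transform; the only part done in the frequency domain is the shift.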
The fundamental problem is that in the frequency domain each bin (or pixel, for a 2D FFT) represents the amplitude and phase of one frequency across the entire image in the spatial domain. Discarding a single pixel in the frequency domain removes that frequency's contribution from the whole image; the loss cannot be localized to one region.
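To illustrate the point, here is a small NumPy experiment (the image size and the choice of bin are arbitrary): zero one frequency bin, invert, and count how many pixels changed.

```python
import numpy as np

# Hypothetical 8x8 test image and its spectrum.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
spectrum = np.fft.fft2(image)

# Discard a single frequency bin.
damaged = spectrum.copy()
damaged[1, 2] = 0

# The difference is complex because the conjugate-symmetric partner
# of the bin was left in place; only its magnitude matters here.
difference = np.fft.ifft2(damaged) - image
changed_pixels = np.count_nonzero(np.abs(difference) > 1e-12)
```

Every pixel changes, and by the same magnitude, `abs(spectrum[1, 2]) / image.size`, because the removed bin is a sinusoid that spans the whole image.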