If I remember correctly, cross-correlation is the same as convolution with one of the signals time-reversed (and complex-conjugated, which makes no difference for real-valued audio). Convolution, in turn, can be computed efficiently by multiplying the spectra of the two signals: zero-pad each signal to at least the sum of both lengths, take the FFT of each, multiply the two spectra, take the inverse FFT, and search the result for the peak.
For Java, you can use JTransforms to do the FFT/IFFT.
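To make the steps concrete, here is a minimal sketch in plain Java. It is only an illustration: the class `CrossCorrelation` and its methods `fft`, `crossCorrelate` and `bestLag` are names I made up, and it uses a small hand-written radix-2 FFT so that it runs without dependencies. In real code you would swap that FFT for a library one (JTransforms, for example, provides a `DoubleFFT_1D` class). It assumes both inputs are non-empty, real-valued arrays.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class CrossCorrelation {

    // In-place iterative radix-2 FFT. re.length must be a power of two.
    static void fft(double[] re, double[] im, boolean inverse) {
        int n = re.length;
        // Bit-reversal permutation.
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) {
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
        // Butterfly passes.
        for (int len = 2; len <= n; len <<= 1) {
            int half = len / 2;
            double ang = (inverse ? 2 : -2) * Math.PI / len;
            for (int i = 0; i < n; i += len) {
                for (int k = 0; k < half; k++) {
                    double wr = Math.cos(ang * k), wi = Math.sin(ang * k);
                    int u = i + k, v = i + k + half;
                    double tr = re[v] * wr - im[v] * wi;
                    double ti = re[v] * wi + im[v] * wr;
                    re[v] = re[u] - tr; im[v] = im[u] - ti;
                    re[u] += tr;        im[u] += ti;
                }
            }
        }
        if (inverse) {
            for (int i = 0; i < n; i++) { re[i] /= n; im[i] /= n; }
        }
    }

    // Cross-correlation of a and b, computed as the convolution of a with
    // the time-reversed b. Output index m corresponds to lag m - (b.length - 1).
    static double[] crossCorrelate(double[] a, double[] b) {
        int outLen = a.length + b.length - 1;
        int n = 1;
        while (n < outLen) n <<= 1;          // pad to a power of two >= outLen

        double[] ar = new double[n], ai = new double[n];
        double[] br = new double[n], bi = new double[n];
        System.arraycopy(a, 0, ar, 0, a.length);
        for (int i = 0; i < b.length; i++) br[i] = b[b.length - 1 - i]; // reverse b

        fft(ar, ai, false);
        fft(br, bi, false);
        for (int i = 0; i < n; i++) {        // multiply the two spectra
            double pr = ar[i] * br[i] - ai[i] * bi[i];
            double pi = ar[i] * bi[i] + ai[i] * br[i];
            ar[i] = pr; ai[i] = pi;
        }
        fft(ar, ai, true);                   // back to the time domain
        return Arrays.copyOf(ar, outLen);
    }

    // Lag (in samples) at which b best matches a; positive means b occurs later in a.
    static int bestLag(double[] a, double[] b) {
        double[] c = crossCorrelate(a, b);
        int best = 0;
        for (int i = 1; i < c.length; i++) if (c[i] > c[best]) best = i;
        return best - (b.length - 1);
    }

    public static void main(String[] args) {
        double[] a = {0, 0, 1, 2, 3, 0, 0, 0};
        double[] b = {1, 2, 3};
        System.out.println(bestLag(a, b));   // prints 2
    }
}
```

The peak search here just takes the largest value. For real recordings you may want to normalise the signals first, or look at the absolute value if the two signals could be inverted in polarity.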
If you want to play with this approach before actually implementing it, you can try my application FScape; it has a convolution module that takes two sound files (you tagged the question "audio-processing", so I assume you can generate sound files).