It would seem you have a good idea of what's required. This is a bit of a can of worms - so as ever I would recommend getting hold of the famous Hartley & Zisserman book for the canonical explanation. Here's a link to the relevant chapter.
But in brief...
I've not used the opencvstereovision wrapper class directly, but it sounds like it has taken the headache out of calibration (computing the camera intrinsics and extrinsics) and has worked out the rectification - via the Homography matrix (H) for planar scenes, or the Fundamental matrix (F) for general epipolar geometry.
Probably similar to this original post.
What this rectification means is that the mathematical mapping between corresponding points in the two images has been established - after rectification, a point in one image lies on the same horizontal scanline in the other, so matching reduces to a 1D search.
As described in a previous answer of mine, you can then do the maths: use the Fundamental matrix to perform triangulation - i.e. calculate distance.
However, note that this distance is only in the image coordinate frame (i.e. in pixels).
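To make that pixel-vs-metric distinction concrete, here is a minimal sketch. Every number in it is hypothetical - a made-up matched point, focal length, and baseline - and it assumes an already-rectified pair:

```python
# Hypothetical matched point in a rectified stereo pair:
# same row in both images, different columns.
xl, xr = 400.0, 365.0        # x-coordinates in left/right images (pixels)
d = xl - xr                  # disparity - a "distance", but only in pixels

# Turning that into a physical depth needs calibration information:
f_px = 700.0                 # focal length in pixels (hypothetical intrinsic)
baseline_m = 0.10            # separation of the cameras in metres (hypothetical)
Z = f_px * baseline_m / d    # depth of the point in metres

print(d)   # 35.0 (pixels)
print(Z)   # 2.0 (metres)
```

Note that the disparity `d` on its own tells you nothing physical - it is the focal length and baseline from calibration that anchor it to the real world.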
What is actually required to perform "real world" measurements (i.e. actual physical distances) is the calculation of the Essential matrix (E), which combines the Fundamental matrix with the camera intrinsics (K) - E = K'ᵀ F K - to, if you will, project the distances into the real world. Note that even then the reconstruction is only metric up to scale: you need a known baseline (or some other physical reference) to fix the overall scale.
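To illustrate that relationship between F, K and E, here is a self-contained sketch with made-up numbers: two identical pinhole cameras related by a pure translation (R = I), so the Essential matrix is just the cross-product matrix [t]ₓ. Everything here (focal length, baseline, test point) is hypothetical:

```python
# Minimal 3x3 helpers (plain Python, no dependencies)
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

# Hypothetical (identical) intrinsics K for both cameras
f, cx, cy = 700.0, 320.0, 240.0
K    = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
Kinv = [[1.0/f, 0.0, -cx/f], [0.0, 1.0/f, -cy/f], [0.0, 0.0, 1.0]]

# Second camera translated 0.1 m along x, no rotation, so E = [t]x
b = 0.1
t = [-b, 0.0, 0.0]
E = [[0.0, 0.0, 0.0],
     [0.0, 0.0,   b],
     [0.0,  -b, 0.0]]

# Fundamental matrix from E and the intrinsics: F = K^-T E K^-1
F = matmul(transpose(Kinv), matmul(E, Kinv))

# ...and recovering E again as described above: E = K^T F K
E_back = matmul(transpose(K), matmul(F, K))

# Sanity check: a 3D point projected into both cameras satisfies
# the epipolar constraint x2^T F x1 = 0 (in pixel coordinates).
X  = [0.3, -0.2, 2.0]                 # hypothetical world point (metres)
x1 = matvec(K, X)
x1 = [c / x1[2] for c in x1]
X2 = [X[0] + t[0], X[1], X[2]]        # same point in camera 2's frame
x2 = matvec(K, X2)
x2 = [c / x2[2] for c in x2]
residual = sum(x2[i] * matvec(F, x1)[i] for i in range(3))   # ~ 0
```

In practice you would then decompose E into a rotation and translation (OpenCV offers functions along the lines of `recoverPose` for this) and triangulate in those metric coordinates - remembering, as above, that the scale is only physical if the baseline is known.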