Based on my experience, the short answer is:
1) You cannot reliably estimate the 3D pose of the cameras independently from the 3D of the scene. Moreover, since your cameras are moving independently, I think SfM is the right way to approach your problem.
2) You need to estimate the cameras' intrinsics in order to estimate useful (i.e. Euclidian) poses and scene reconstruction. If you cannot use the standard calibration procedure, with chessboard and co, you can have a look at the autocalibration techniques (see also chapter 19 in Hartley's & Zisserman's book). This calibration procedure is done independently for each camera and only require several image samples at different positions, which seems appropriate in your case.