I know next to nothing about MATLAB's tracking utilities (not entirely true, but I avoid MATLAB wherever I can, which is almost always possible), but here's some general info.
The pixel dimensions on the sensor have nothing to do with the size of a pixel on screen or in model space. For all practical purposes, a camera produces a picture that has no meaningful units, and a tracking process is unaware of the scale of the scene (the perspective projection takes care of that). You can reintroduce a scale by taking two tracked points and measuring the distance between them in solver space. That solver-space distance is in pretty much arbitrary units. Now, if you know the real distance between these points, you can get a conversion factor by computing:
real distance / solver space distance.
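In case it helps, here is a minimal Python sketch of that conversion. The point coordinates and the measured 2.5 m distance are made-up values for illustration, and `scale_factor` is a hypothetical helper, not part of any tracking toolbox.

```python
import math

def scale_factor(p_a, p_b, real_distance):
    """Return real-world units per solver unit, from two tracked 3D points."""
    solver_distance = math.dist(p_a, p_b)  # Euclidean distance in solver space
    if solver_distance == 0:
        raise ValueError("tracked points coincide; cannot derive a scale")
    return real_distance / solver_distance

# Hypothetical solver output: two tracked points in arbitrary solver units.
point_a = (0.0, 0.0, 0.0)
point_b = (3.0, 4.0, 0.0)  # 5 solver units away from point_a

# Assumed tape-measure reading between the same two points: 2.5 metres.
factor = scale_factor(point_a, point_b, 2.5)

# Any other solver-space coordinate can now be converted to metres.
other_point = (6.0, 8.0, 0.0)
other_point_m = tuple(c * factor for c in other_point)
```

The same factor applies to every point and camera position in that solve, but only to that solve. A new tracking session needs a new reference measurement.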
There's really no way of knowing this distance from the camera's settings, as the camera is unable to differentiate between different scales of scenes. A perfect 1:100 replica is no different to the solver than the real deal. So you must always relate the result to something you can measure separately, for each measuring session. What the camera produces is always relative in nature.
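To make the replica argument concrete, here is a small sketch assuming an idealized pinhole camera with no rotation that looks down the +Z axis. The scene points, camera position and focal length are invented for the demonstration. Scaling the whole scene and the camera position by 1/100 gives the same image coordinates.

```python
import math

def project(point, camera_pos, focal_length=1.0):
    """Pinhole projection for an unrotated camera looking down +Z."""
    x = point[0] - camera_pos[0]
    y = point[1] - camera_pos[1]
    z = point[2] - camera_pos[2]
    return (focal_length * x / z, focal_length * y / z)

def scale_vec(v, s):
    """Scale a 3D vector uniformly by s."""
    return tuple(c * s for c in v)

# Invented full-size scene and camera position (arbitrary units).
scene = [(1.0, 2.0, 10.0), (-3.0, 0.5, 8.0), (0.0, -1.0, 12.0)]
camera = (0.5, 0.25, 0.0)

s = 0.01  # the 1:100 replica, filmed from a correspondingly scaled position

full_size = [project(p, camera) for p in scene]
replica = [project(scale_vec(p, s), scale_vec(camera, s)) for p in scene]

# Compare image coordinates up to floating-point rounding.
same_image = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for f, r in zip(full_size, replica)
    for a, b in zip(f, r)
)
```

The scale cancels in the division by depth, so nothing in the image (and nothing in the camera settings) can tell the two scenes apart.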