What you are looking for is a Structure from Motion (SfM) pipeline. Writing one yourself will take some time; it's a complex system. The steps are:
- Detect which points in the images show the same point of the scene (feature matching).
- Estimate the camera pose (position and orientation) of each image.
- Estimate scene geometry using multiview stereo (dense reconstruction).
- Turn your scene geometry into a triangle mesh.
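To give a feel for the first and third steps, here is a minimal sketch in plain Python. It is not how VisualSFM or Bundler do it; the function names (`match_features`, `triangulate_midpoint`) are made up for illustration, descriptors are assumed to be plain tuples of floats, and the camera centers and viewing rays are assumed to be already known. Matching uses a nearest-neighbour ratio test, and the 3D point is taken as the midpoint of the shortest segment between the two viewing rays, which is a simple stand-in for the more robust triangulation and bundle adjustment a real pipeline would use.

```python
import math


def match_features(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_a[i] matches desc_b[j] (ratio test)."""
    matches = []
    for i, da in enumerate(desc_a):
        # Distance from this descriptor to every descriptor in the other image.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if it is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2.

    c1, c2 are camera centers; d1, d2 are viewing directions towards the
    same scene point. Returns None if the rays are (nearly) parallel.
    """
    w = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = tuple(p + s * v for p, v in zip(c1, d1))  # closest point on ray 1
    q2 = tuple(p + t * v for p, v in zip(c2, d2))  # closest point on ray 2
    return tuple((x + y) / 2 for x, y in zip(q1, q2))


if __name__ == "__main__":
    # Toy descriptors: two features per image plus one distractor.
    a = [(0.0, 0.0), (10.0, 10.0)]
    b = [(10.1, 10.0), (0.0, 0.1), (5.0, 5.0)]
    print(match_features(a, b))

    # Two cameras one unit apart, both looking at the same scene point.
    print(triangulate_midpoint((0, 0, 0), (0.5, 0.2, 2.0),
                               (1, 0, 0), (-0.5, 0.2, 2.0)))
```

With noisy real data the two rays will not intersect exactly, which is why the midpoint (or a least-squares solution) is used rather than a strict intersection.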
Freely available tools such as VisualSFM do all of this: you put in images and get a 3D model out. Parts of VisualSFM are open source, and the Bundler project is another good resource. Still, piecing together your own system will require a bit of research.
If you want to look into the research behind it, "Visual modeling with a hand-held camera" by Pollefeys et al. is a good place to start.