Question

So I'm looking for ideas on how to best replicate the functionality seen on Digg. Essentially, you submit the URL of a page of interest; Digg then crawls the DOM to find all of the IMG tags (likely selecting only those above a certain height/width), creates a thumbnail from each, and asks you which one you would like to represent your submission.

While there's a lot going on there, I'm mainly interested in the best method to retrieve the images from the submitted page.


Solution

While you could try to fully parse the web page, HTML can be such a mess that you would be best served by something close but imperfect:

  1. Extract everything that looks like an image tag reference.
  2. Try to fetch the URL.
  3. Check whether you got an image back.
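Steps 2 and 3 can be sketched in Python (a hedged illustration, not Digg's actual implementation): fetch the candidate URL, then verify the payload by its "magic bytes" rather than trusting the server's headers. The `sniff_image_type` and `fetch_if_image` names are hypothetical helpers invented for this sketch.

```python
import urllib.request

# Common image file signatures ("magic bytes") used to verify step 3.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def sniff_image_type(data):
    """Return the image type if the bytes start with a known signature, else None."""
    for signature, kind in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return None

def fetch_if_image(url, timeout=5):
    """Steps 2 and 3: fetch the URL, keep the payload only if it is an image."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    return data if sniff_image_type(data) else None
```

Checking the first bytes of the payload is more reliable than the `Content-Type` header, which misconfigured servers frequently get wrong.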

Just looking for and capturing the content of src="..." attributes would get you most of the way there. Add some basic manipulation to handle relative vs. absolute image references, and you're done.
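As a sketch of that capture-and-resolve step (in Python; the function name is mine, and the regex is deliberately imperfect, as the answer suggests), `urllib.parse.urljoin` takes care of both relative and absolute references:

```python
import re
from urllib.parse import urljoin

def extract_image_urls(page_url, html):
    """Capture src="..." values from IMG tags and resolve them against the page URL."""
    # A rough regex: good enough for messy real-world HTML, not a full parser.
    srcs = re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html, re.IGNORECASE)
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return [urljoin(page_url, src) for src in srcs]

html = '<img src="/logo.png"><IMG src="http://cdn.example.com/photo.jpg">'
print(extract_image_urls("http://example.com/page", html))
# → ['http://example.com/logo.png', 'http://cdn.example.com/photo.jpg']
```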

Obviously, any time you fetch a web asset on demand from a third party, you need to take care that you aren't being abused.

OTHER TIPS

I suggest cURL + regexp.

You can also use PHP Simple HTML DOM Parser which will help you search all the image tags.
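The same DOM-parser idea (as opposed to regex matching) can be shown with Python's standard-library `html.parser` — a Python analogue of what the PHP Simple HTML DOM Parser does, not that library itself:

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

parser = ImgCollector()
parser.feed('<div><img src="a.png"><img alt="no src"><img src="b.jpg"></div>')
print(parser.srcs)  # → ['a.png', 'b.jpg']
```

A real parser tolerates attribute-order and quoting variations that trip up regexes, at the cost of a little more code.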

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow