Question

I'm looking to create a PHP script where a user provides a link to a webpage, and the script fetches that page's contents and parses them based on what they contain.

For example, if a user provides a YouTube link:

http://www.youtube.com/watch?v=xxxxxxxxxxx

Then it would grab the basic information about that video (thumbnail, embed code, etc.).

Or they might provide a vimeo link:

 http://www.vimeo.com/xxxxxx

Or even if they were to provide any link, without a video attached, such as:

 http://www.google.com/

And it would grab just the page title or some meta content.

I'm thinking I'd have to use file_get_contents(), but I'm not exactly sure how to use it in this context.

I'm not looking for someone to write the entire code, but perhaps provide me with some tools so that I can accomplish this.


Solution

You can use either cURL or PHP's HTTP extension. You send an HTTP request, and then use the library to read the information you need out of the HTTP response.
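As a minimal sketch of the cURL approach (the helper name fetchPage() is my own, not a built-in):

```php
<?php
// Fetch a page with cURL and only return the body on a 200 response.
// fetchPage() is a hypothetical helper name used for illustration.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't hang on slow hosts
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ($status === 200) ? $body : false;
}
```

From there you can hand the returned HTML to whatever parser you choose.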

OTHER TIPS

I know this question is quite old, but I'll answer just in case someone hits it looking for the same thing.

Use oEmbed (http://oembed.com/) for YouTube, Vimeo, Wordpress, Slideshare, Hulu, Flickr and many other services. If the service is not on the list, or you want more precise control, you can use this:

http://simplehtmldom.sourceforge.net/

It's a sort of jQuery for PHP, meaning you can use CSS-style selectors to get portions of the page (e.g. all the images, the contents of a div, the text-only contents of a node, etc.).

You could do something like this (could be done more elegantly but this is just an example):

require_once("simple_html_dom.php");

function getContent($item, $contentLength)
{
    $content = "";
    $images = "";

    if (isset($item->content) && $item->content != "") {
        $html = str_get_html($item->content);
        $content = str_replace("\n", "<br /><br />\n\n", trim($html->plaintext));

        try {
            foreach ($html->find('img') as $image) {
                // Skip tracking pixels and images shorter than 100px
                if ($image->width != "1") {
                    $include = false;
                    $height = $image->height;
                    if ($height != "" && $height >= 100) {
                        $include = true;
                    }

                    if ($include) {
                        $images .= '<div class="theImage"><a href="' . $image->src . '" title="' . $image->alt . '"><img src="' . $image->src . '" alt="' . $image->alt . '" class="postImage" border="0" /></a></div>';
                    }
                }
            }
        } catch (Exception $e) {
            // Ignore parse errors and return whatever was collected
        }

        $images = '<div id="images">' . $images . '</div>';
    } else {
        $content = str_get_html($item->summary)->plaintext;
    }

    return substr($content, 0, $contentLength) . (strlen($content) > $contentLength ? "..." : "") . $images;
}
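For the oEmbed services mentioned above, the flow is even simpler: build the provider's endpoint URL, fetch it, and decode the JSON. A sketch against YouTube's documented oEmbed endpoint (the helper names are my own):

```php
<?php
// Hypothetical helpers illustrating the oEmbed flow for YouTube.
function oembedEndpoint($videoUrl)
{
    // YouTube's JSON oEmbed endpoint
    return 'https://www.youtube.com/oembed?format=json&url=' . urlencode($videoUrl);
}

function getOembedData($videoUrl)
{
    $json = @file_get_contents(oembedEndpoint($videoUrl));
    return $json === false ? null : json_decode($json, true);
}

// A successful response typically includes 'title', 'thumbnail_url'
// and 'html' (a ready-made embed snippet).
```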

file_get_contents() would work in this case, assuming that you have allow_url_fopen set to true in your php.ini. What you would do is something like:

$pageContent = @file_get_contents($url);
if ($pageContent !== false) {
    // Non-greedy match, with the s modifier so . matches newlines too
    preg_match_all('#<embed.*?</embed>#s', $pageContent, $matches);
    $embedStrings = $matches[0];
}

That said, file_get_contents() won't give you much in the way of error handling, other than receiving the content on success or false on failure. If you would like richer control over the request and access to the HTTP response codes, use the cURL functions and, in particular, curl_getinfo() to look at the response code, MIME type, encoding, etc. Once you get the content, via either cURL or file_get_contents(), your code for parsing it to look for the HTML of interest will be the same.
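Once you have the HTML, the "grab just the page title" part of the question can be handled with PHP's built-in DOMDocument rather than a regex. A small sketch (extractTitle() is a made-up helper name):

```php
<?php
// Pull the <title> text out of an HTML string with DOMDocument.
function extractTitle($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed real-world HTML
    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

echo extractTitle('<html><head><title>Google</title></head><body></body></html>');
// prints "Google"
```

The same getElementsByTagName() approach works for meta tags, where you would read the name and content attributes of each node.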

Maybe Thumbshots or Snap already have some of the functionality you want?

I know that's not exactly what you are looking for, but at least for the embedded stuff that might be handy. Also, txwikinger already answered your other question. But maybe that helps you anyway.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow