Question

Ok, I'm using SimpleXML to parse RSS feeds, and as many feeds contain embedded html, I'd like to be able to isolate any image addresses contained in embedded html. Sounds like an easy enough task, but I'm running into an issue with parsing the data from the SimpleXMLElement objects. Here's the relevant code.

for($i = 0; $i < count($articles); $i++) {
    foreach($articles[$i] as $feedDeet) {
        $str = (string)$feedDeet;
        $result = strpos($str, '"');
        if($result === false) {
            echo 'There are apparently no quotes in this string: '.$str;
        }
        $explodedString = explode('"', $str);
        echo "<br>";
        if($explodedString[0] == $str) {
            echo 'ExplodedString is equal to str. Apparently, once again, the string contains no quotes.';
        }
        echo "<hr>";
    }
}

In this situation, $articles is an array of SimpleXMLElement objects each representing an RSS article, and containing many child SimpleXMLElement objects representing properties and details of that article. Basically, I'd like to iterate through those properties one by one, cast them as strings, and then explode the strings using any quotes as delimiters (because any image addresses would be contained inside of quotes). I would then parse through the exploded array and search for any strings that appear to be an image address. However, neither explode() nor strpos() is behaving as I would expect it to. To give an example of what I mean, one of the outputs of the above code is as follows:

There are apparently no quotes in this string: <p style="text-align: center;"><img class="alignnone size-full wp-image-243922" alt="gold iPhone Shop Le Monde" src="http://media.idownloadblog.com/wp-content/uploads/2013/08/gold-iPhone-Shop-Le-Monde.jpg" width="593" height="515" /></p> <p>Folks still holding out hope that the gold iPhone rumors aren’t true may want to brace themselves, the speculation has just been confirmed by the Wall Street Journal-owned blog AllThingsD. And given the site’s near perfect (perfect?) track record with predicting future Apple plans, and <a href="http://www.idownloadblog.com/2013/08/16/is-this-apples-gold-colored-iphone-5s/">corroborating evidence</a>, we’d say Apple is indeed going for the gold…(...)<br/>Read the rest of <a href="http://www.idownloadblog.com/2013/08/19/allthingsd-gold-iphone-yes/">AllThingsD confirms gold iPhone coming</a></p> <hr /> <p><small> "<a href="http://www.idownloadblog.com/2013/08/19/allthingsd-gold-iphone-yes/">AllThingsD confirms gold iPhone coming</a>" is an article by <a href="http://www.idownloadblog.com">iDownloadBlog.com</a>. <br/>Make sure to <a href="http://twitter.com/iDownloadBlog">follow us on Twitter</a>, <a href="http://www.facebook.com/iPhoneDownloadBlog">Facebook</a>, and <a href="https://plus.google.com/u/0/b/111910843959038324995/">Google+</a>. </small></p>
ExplodedString is equal to str. Apparently, once again, the string contains no quotes.

Sorry if that was a little hard to read, it's copied verbatim from the output.

As you can see, there are clearly quotes in the string in question, yet, strpos is returning false, meaning that the specified string could not be found, and explode is returning an array with the original string inside, signifying that the specified delimiter could not be found. What is going on here? I've been stumped by this for hours, and I feel like I'm losing my mind.

Thanks!

Solution

The mistake you've made here is that your debug output is an HTML page, so the messages you print are being interpreted as HTML by your browser. To see their actual contents, you either need to view the page source, or use <pre> tags to preserve whitespace and htmlspecialchars() to add a layer of HTML escaping:

echo '<pre>' . htmlspecialchars($str) . '</pre>';
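
As a rough illustration of what that extra escaping layer does, here is a minimal sketch. The value of $str is a made-up sample standing in for one feed field, not data taken from the real feed. Because htmlspecialchars() escapes the ampersands a second time, the browser ends up displaying the raw entities instead of rendering them.

```php
<?php
// Hypothetical sample value; in the question this would come from (string)$feedDeet.
$str = '&lt;p style=&quot;text-align: center;&quot;&gt;';

// htmlspecialchars() turns each & into &amp;, so the browser prints
// the entities literally instead of rendering them as < > and ".
$debug = '<pre>' . htmlspecialchars($str) . '</pre>';

echo $debug;
```

Viewed in a browser, this should show the text &lt;p style=&quot;text-align: center;&quot;&gt; exactly as it is stored in the string.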

If the output in the browser looks like <p style="text-align: center;">, then the input must already be escaped with HTML entities, and the raw string probably looks like &lt;p style=&quot;text-align: center;&quot;&gt;. Although the browser renders &quot; as ", the two are different strings, so strpos() and explode() looking for a literal " won't find anything.
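
A small sketch of that mismatch, assuming the feed field really is entity-escaped (the sample string and URL are invented for illustration):

```php
<?php
// Hypothetical entity-escaped value, similar in shape to what the feed may contain.
$str = '&lt;img src=&quot;http://example.com/a.jpg&quot; /&gt;';

// There is no literal double quote anywhere in the raw string...
$hasLiteralQuote = strpos($str, '"') !== false;

// ...but the entity that a browser renders as a double quote is present.
$hasQuoteEntity = strpos($str, '&quot;') !== false;

var_dump($hasLiteralQuote); // bool(false)
var_dump($hasQuoteEntity);  // bool(true)
```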

In order to undo this extra layer of escaping, you could run html_entity_decode() on the string before processing it.
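
For example, something along these lines might work. It is only a sketch: the sample string, the example.com URL, and the image-extension pattern are all assumptions made for illustration, and the splitting on quotes simply follows the approach described in the question.

```php
<?php
// Hypothetical entity-escaped snippet standing in for one feed field.
$str = '&lt;p&gt;&lt;img alt=&quot;gold iPhone&quot; '
     . 'src=&quot;http://example.com/gold-iphone.jpg&quot; /&gt;&lt;/p&gt;';

// Undo the extra escaping layer. ENT_QUOTES also decodes single-quote entities.
$decoded = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

// Split on double quotes, as in the question, and keep the pieces that
// look like image URLs. The extension list is illustrative, not exhaustive.
$images = array();
foreach (explode('"', $decoded) as $piece) {
    if (preg_match('~^https?://\S+\.(?:jpe?g|png|gif)$~i', $piece)) {
        $images[] = $piece;
    }
}

print_r($images);
```

With the sample input above, $images should contain the single URL http://example.com/gold-iphone.jpg. A real feed may need a looser pattern, for instance to allow query strings after the file extension.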

Licensed under: CC-BY-SA with attribution