Question

I am trying to do a simple extraction, but I keep ending up with unpredictable results.

I have this HTML code

<div class="thread" style="margin-bottom:25px;"> 

<div class="message"> 

<span class="profile">Suzy Creamcheese</span> 

<span class="time">December 22, 2010 at 11:10 pm</span> 

<div class="msgbody"> 

<div class="subject">New digs</div> 

Hello thank you for trying our soap. <BR>  Jim.

</div> 
</div> 


<div class="message reply"> 

<span class="profile">Lars Jörgenmeier</span> 

<span class="time">December 22, 2010 at 11:45 pm</span> 

<div class="msgbody"> 

I never sold you any soap.

</div> 

</div> 

</div> 

And I am trying to extract the outertext from "msgbody" but only when the "profile" is equal to something. Like so.

$contents  = $html->find('.msgbody');
$elements = $html->find('.profile'); 

           $length = sizeof($contents);

           while($x != sizeof($elements)) {

            $var = $elements[$x]->outertext;

                        //If profile = the right name
            if ($var = $name) {

                                    $text = $contents[$x]->outertext;
                echo $text;

            }



            $x++;
         }    

I get text from the wrong profiles, not the ones with the associations I need. Is there a way to just pull the desired info with one line of code?

Like if span-profile = "correct name" then pull its div-msgbody


Solution

Okay, I'm going to go with DOMXPath on this one. I'm not sure what 'outer text' is supposed to mean, so I'll work from this requirement:

Like if span-profile = "correct name" then pull its div-msgbody

First off, here's the minimal HTML test case I used:

<html>
<body>
<div class="thread" style="margin-bottom:25px;"> 

<div class="message"> 

<span class="profile">Suzy Creamcheese</span> 

<span class="time">December 22, 2010 at 11:10 pm</span> 

<div class="msgbody"> 

<div class="subject">New digs</div> 

Hello thank you for trying our soap. <BR>  Jim.

</div> 
</div> 


<div class="message reply"> 

<span class="profile">Lars Jörgenmeier</span> 

<span class="time">December 22, 2010 at 11:45 pm</span> 

<div class="msgbody"> 

I never sold you any soap.

</div> 

</div> 

</div>
</body>
</html>

So, we'll make an XPath query for this. Let's show the whole thing, then break it down:

$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");

The breakdown:

//span

Give me spans

//span[@class='profile']

Give me spans where the class is profile

//span[@class='profile' and contains(.,'$profile_name')]

Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after

//span[@class='profile' and contains(.,'$profile_name')]/..

Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after. Now go up a level, which gets us to <div class="message">.

//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']

Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after. Now go up a level, which gets us to <div class="message">. Finally, give me all divs directly under <div class="message"> where the class is msgbody.
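To see how each step narrows the result, you can run the partial queries one at a time and count the matches. This is only a sketch: it assumes a trimmed-down copy of the test HTML held in a string (the `$html`, `$steps` and `$counts` variables are my own names, not part of the original test), and it searches for Suzy Creamcheese.

```php
<?php
// Trimmed-down copy of the test HTML (an assumption; the real test used test.html).
$html = <<<'HTML'
<div class="thread">
  <div class="message">
    <span class="profile">Suzy Creamcheese</span>
    <span class="time">December 22, 2010 at 11:10 pm</span>
    <div class="msgbody">Hello thank you for trying our soap.</div>
  </div>
  <div class="message reply">
    <span class="profile">Lars Jörgenmeier</span>
    <span class="time">December 22, 2010 at 11:45 pm</span>
    <div class="msgbody">I never sold you any soap.</div>
  </div>
</div>
HTML;

// Parse the fragment, keeping libxml parser warnings out of the output.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$profile_name = 'Suzy Creamcheese';
$span = "//span[@class='profile' and contains(.,'$profile_name')]";

// The same steps as in the breakdown, from broadest to narrowest.
$steps = [
    "//span",
    "//span[@class='profile']",
    $span,
    "$span/..",
    "$span/../div[@class='msgbody']",
];

$counts = [];
foreach ($steps as $step) {
    $counts[] = $xpath->query($step)->length;
    echo $step, ' => ', end($counts), " node(s)\n";
}
```

With this fragment the counts should come out as 4, 2, 1, 1, 1: four spans in total, two of them profiles, one matching the name, one parent div, and one msgbody under it.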

Now then, here's a sample of the PHP code:

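// Parse the test file into a DOM tree
$doc = new DOMDocument();
$doc->loadHTMLFile("test.html");

$xpath = new DOMXPath($doc);
$profile_name = 'Lars Jörgenmeier';
// $profile_name is pasted straight into the query, so it must not contain a single quote
$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");
foreach ($messages as $message) {
  // nodeValue is the text inside the msgbody div, with the tags stripped
  echo trim($message->nodeValue) . "\n";
}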
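Two caveats with the query above: contains() is a substring test, so 'Suzy' would match every profile that has "Suzy" in it, and a name containing a single quote would break the query string. Here is a hedged variation, not the code I tested above, that uses an exact comparison via normalize-space() and quotes the name safely. The helpers xpath_literal() and messages_for_profile() are hypothetical names I made up for this sketch, and it uses following-sibling::div in place of /../div, which selects the same msgbody in this markup.

```php
<?php
// Hypothetical helper: turn any PHP string into a valid XPath 1.0 string literal.
function xpath_literal(string $s): string
{
    if (strpos($s, "'") === false) {
        return "'" . $s . "'";
    }
    if (strpos($s, '"') === false) {
        return '"' . $s . '"';
    }
    // Both quote types present: build concat('part', "'", 'part', ...)
    return "concat('" . implode("', \"'\", '", explode("'", $s)) . "')";
}

// Hypothetical helper: text of every msgbody that follows an exactly matching profile span.
function messages_for_profile(string $html, string $name): array
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = "//span[@class='profile' and normalize-space(.) = " . xpath_literal($name) . "]"
           . "/following-sibling::div[@class='msgbody']";

    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Small made-up sample with two similar names (an assumption for illustration).
$html = '<div class="thread">'
      . '<div class="message"><span class="profile">Suzy Creamcheese</span>'
      . '<div class="msgbody">Hello, thank you for trying our soap.</div></div>'
      . '<div class="message reply"><span class="profile">Suzy Yogurt</span>'
      . '<div class="msgbody">I never sold you any soap.</div></div>'
      . '</div>';

print_r(messages_for_profile($html, 'Suzy Creamcheese'));
```

With the exact comparison, asking for just 'Suzy' returns nothing, whereas contains() would have returned both messages. Which behavior you want depends on whether partial names should match.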

XPath is very powerful like this. I recommend looking over a basic tutorial, then you can check the XPath standard if you want to see more advanced usage.

OTHER TIPS

This is a Simple HTML DOM working example.

I changed your example HTML so that there is more than one profile for Suzy Creamcheese, as follows (file: test_class_class.htm):

 <div class="message"> 
   <span class="profile">Suzy Creamcheese</span> 
   <span class="time">December 22, 2010 at 11:10 pm</span> 
   <div class="msgbody"> 
     <div class="subject">New digs</div> 
       Hello thank you for trying our soap. <BR>  Jim.
     </div> 
   </div> 

   <div class="message reply"> 
     <span class="profile">Lars Jörgenmeier</span> 
     <span class="time">December 22, 2010 at 11:45 pm</span> 
     <div class="msgbody"> 
       I never sold you any soap.
     </div> 
   </div> 
 </div>

 <div class="message"> 
   <span class="profile">Suzy Yogurt</span> 
   <span class="time">December 22, 2010 at 11:10 pm</span> 
   <div class="msgbody"> 
     <div class="subject">No Creamcheese</div> 
       This is not Suzy Creamcheese <BR>  Jim.
     </div> 
   </div> 

   <div class="message reply"> 
     <span class="profile">Suzy Creamcheese</span> 
     <span class="time">December 22, 2010 at 11:45 pm</span> 
     <div class="msgbody"> 
       A reply from Suzy Creamcheese.
     </div> 
   </div> 
 </div>

</div>

Here is my test using Simple HTML DOM:

include('simple_html_dom.php');

function getMessage_for_profile($iUrl,$iProfile)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get text elements
    $aoProfile = $html->find('span[class=profile]'); 
    echo "Found ".count($aoProfile)." profiles.<br />";

    foreach ($aoProfile as $key=>$oProfile)
    {
      if ($oProfile->plaintext == $iProfile)
      {
        echo "<b>Profile ".$key.": ".$oProfile->plaintext."</b><br />";
        // Walk the following siblings with next_sibling() until there are none left
        $oCurrent = $oProfile;
        while ($oNext = $oCurrent->next_sibling())
        {
           if ( $oNext->class == "msgbody" )
           {
             echo "<hr />";
             echo $oNext->outertext;
             echo "<hr />";
           }
           $oCurrent = $oNext;
        }
      }         
    }

    // clean up memory
    $html->clear();
    unset($html);

    return;
}
// --------------------------------------------
// test it!
// user_agent header...
ini_set('user_agent', 'My-Application/2.5');

getMessage_for_profile('test_class_class.htm','Suzy Creamcheese');
echo "<br /><br /><br />";
getMessage_for_profile('test_class_class.htm','Suzy Yogurt');

My output was:

Found 4 profiles.
Profile 0: Suzy Creamcheese
--------------------------------
New digs
Hello thank you for trying our soap.
Jim.
---------------------------------
Profile 3: Suzy Creamcheese
---------------------------------
A reply from Suzy Creamcheese.
---------------------------------



Found 4 profiles.
Profile 2: Suzy Yogurt
---------------------------------
No Creamcheese
This is not Suzy Creamcheese
Jim.
---------------------------------

See, it can be done with Simple HTML DOM, and since I already know how the DOM works (or at least enough to get in trouble), I did not have to learn any new syntax!

Licensed under: CC-BY-SA with attribution