Question

I am trying to do a simple extraction, but I keep ending up with unpredictable results.

I have this HTML code

<div class="thread" style="margin-bottom:25px;"> 

<div class="message"> 

<span class="profile">Suzy Creamcheese</span> 

<span class="time">December 22, 2010 at 11:10 pm</span> 

<div class="msgbody"> 

<div class="subject">New digs</div> 

Hello thank you for trying our soap. <BR>  Jim.

</div> 
</div> 


<div class="message reply"> 

<span class="profile">Lars Jörgenmeier</span> 

<span class="time">December 22, 2010 at 11:45 pm</span> 

<div class="msgbody"> 

I never sold you any soap.

</div> 

</div> 

</div> 

And I am trying to extract the outertext from "msgbody" but only when the "profile" is equal to something. Like so.

$contents = $html->find('.msgbody');
$elements = $html->find('.profile');

$length = sizeof($contents);

while ($x != sizeof($elements)) {

    $var = $elements[$x]->outertext;

    // If profile = the right name
    if ($var = $name) {
        $text = $contents[$x]->outertext;
        echo $text;
    }

    $x++;
}

I get text from the wrong profiles, not the ones with the associations I need. Is there a way to just pull the desired info with one line of code?

Like if span-profile = "correct name" then pull its div-msgbody


Solution

Okay, I'm going to go with DOMXPath on this one. I'm not sure what 'outer text' is supposed to mean, but I'll go with this requirement:

Like if span-profile = "correct name" then pull its div-msgbody

First off, here's the minimal HTML test case I used:

<html>
<body>
<div class="thread" style="margin-bottom:25px;"> 

<div class="message"> 

<span class="profile">Suzy Creamcheese</span> 

<span class="time">December 22, 2010 at 11:10 pm</span> 

<div class="msgbody"> 

<div class="subject">New digs</div> 

Hello thank you for trying our soap. <BR>  Jim.

</div> 
</div> 


<div class="message reply"> 

<span class="profile">Lars Jörgenmeier</span> 

<span class="time">December 22, 2010 at 11:45 pm</span> 

<div class="msgbody"> 

I never sold you any soap.

</div> 

</div> 

</div>
</body>
</html>

So, we'll make an XPath query for this. Let's show the whole thing, then break it down:

$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");

The breakdown:

//span

Give me spans

//span[@class='profile']

Give me spans where the class is profile

//span[@class='profile' and contains(.,'$profile_name')]

Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after

//span[@class='profile' and contains(.,'$profile_name')]/..

Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after. Now go up a level, which gets us to <div class="message">

//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']

Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after. Now go up a level, which gets us to <div class="message">. Finally, give me all divs directly under <div class="message"> where the class is msgbody
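To see the breakdown in action, here is a minimal sketch that runs each step against an inline copy of the test case and prints how many nodes it matches. The inline HTML, the $steps array and the $counts array are assumptions made for illustration; the name is written with an &ouml; entity so the sketch does not depend on how the file is encoded.

```php
<?php
// Inline copy of the test case (illustrative, slightly condensed).
$html = <<<'HTML'
<html><body>
<div class="thread">
  <div class="message">
    <span class="profile">Suzy Creamcheese</span>
    <span class="time">December 22, 2010 at 11:10 pm</span>
    <div class="msgbody"><div class="subject">New digs</div>
      Hello thank you for trying our soap. <br> Jim.</div>
  </div>
  <div class="message reply">
    <span class="profile">Lars J&ouml;rgenmeier</span>
    <span class="time">December 22, 2010 at 11:45 pm</span>
    <div class="msgbody">I never sold you any soap.</div>
  </div>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$profile_name = 'Suzy Creamcheese';

// Each step of the breakdown, from broadest to narrowest.
$steps = [
    "//span",
    "//span[@class='profile']",
    "//span[@class='profile' and contains(.,'$profile_name')]",
    "//span[@class='profile' and contains(.,'$profile_name')]/..",
    "//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']",
];

$counts = [];
foreach ($steps as $step) {
    // DOMNodeList::length is the number of nodes the expression matched.
    $count = $xpath->query($step)->length;
    $counts[] = $count;
    echo $count . "  " . $step . "\n";
}
```

The counts shrink as the predicate narrows, and the nested subject div is not returned by the last step because its class is subject, not msgbody.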

Now then, here's a sample of the PHP code:

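$doc = new DOMDocument();
// If names contain non-ASCII characters, test.html should declare its charset
// (e.g. with a meta tag) so the parsed text matches the string below.
$doc->loadHTMLFile("test.html");

$xpath = new DOMXPath($doc);
$profile_name = 'Lars Jörgenmeier';
// msgbody divs that share a parent with the matching profile span
$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");
foreach ($messages as $message) {
  // nodeValue is the text content of the div, without tags
  echo trim($message->nodeValue) . "\n";
}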
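Two things are worth keeping in mind with the query above: contains() is a substring test, so a short name such as 'Suzy' would match every profile that includes it, and the name is pasted straight into the XPath string, so a name containing a single quote would break the expression. As a rough sketch of one way around both, the hypothetical helper message_bodies_for() below (not part of the answer's code) compares the name in PHP and uses a relative query for the sibling msgbody. The inline HTML is made up for the demonstration.

```php
<?php
/**
 * Hypothetical helper: returns the trimmed text of every msgbody div that
 * follows, as a sibling, a profile span whose text is exactly $name.
 */
function message_bodies_for(DOMDocument $doc, string $name): array
{
    $xpath = new DOMXPath($doc);
    $bodies = [];
    foreach ($xpath->query("//span[@class='profile']") as $span) {
        // Compare in PHP: exact match, and $name never enters the XPath string.
        if (trim($span->textContent) !== $name) {
            continue;
        }
        // The second argument makes the query relative to this span.
        foreach ($xpath->query("following-sibling::div[@class='msgbody']", $span) as $body) {
            $bodies[] = trim($body->textContent);
        }
    }
    return $bodies;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="message"><span class="profile">Suzy Creamcheese</span>'
    . '<div class="msgbody">Hello thank you for trying our soap.</div></div>'
    . '<div class="message"><span class="profile">Suzy Yogurt</span>'
    . '<div class="msgbody">This is not Suzy Creamcheese</div></div>'
);

// Exact comparison: only the full name matches.
$exact = message_bodies_for($doc, 'Suzy Creamcheese');
$partial = message_bodies_for($doc, 'Suzy');

// For contrast, contains() matches both profiles for the substring 'Suzy'.
$loose = (new DOMXPath($doc))
    ->query("//span[@class='profile' and contains(.,'Suzy')]")
    ->length;

print_r($exact);
echo "exact-match hits for 'Suzy': " . count($partial) . ", contains() hits: " . $loose . "\n";
```

Whether substring or exact matching is the right choice depends on how the profile names appear in the real pages, so treat this as one option rather than a replacement for the query above.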

XPath is very powerful like this. I recommend looking over a basic tutorial, then you can check the XPath standard if you want to see more advanced usage.

Other tips

This is a Simple HTML DOM working example.

I changed your example HTML so that Suzy Creamcheese appears in more than one profile (file: test_class_class.htm):

<div class="thread" style="margin-bottom:25px;">
  <div class="message">
    <span class="profile">Suzy Creamcheese</span>
    <span class="time">December 22, 2010 at 11:10 pm</span>
    <div class="msgbody">
      <div class="subject">New digs</div>
      Hello thank you for trying our soap. <BR>  Jim.
    </div>
  </div>

  <div class="message reply">
    <span class="profile">Lars Jörgenmeier</span>
    <span class="time">December 22, 2010 at 11:45 pm</span>
    <div class="msgbody">
      I never sold you any soap.
    </div>
  </div>
</div>

<div class="thread" style="margin-bottom:25px;">
  <div class="message">
    <span class="profile">Suzy Yogurt</span>
    <span class="time">December 22, 2010 at 11:10 pm</span>
    <div class="msgbody">
      <div class="subject">No Creamcheese</div>
      This is not Suzy Creamcheese <BR>  Jim.
    </div>
  </div>

  <div class="message reply">
    <span class="profile">Suzy Creamcheese</span>
    <span class="time">December 22, 2010 at 11:45 pm</span>
    <div class="msgbody">
      A reply from Suzy Creamcheese.
    </div>
  </div>
</div>

</div>

Here is my test using Simple HTML DOM:

include('simple_html_dom.php');

function getMessage_for_profile($iUrl,$iProfile)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get all profile spans
    $aoProfile = $html->find('span[class=profile]'); 
    echo "Found ".count($aoProfile)." profiles.<br />";

    foreach ($aoProfile as $key=>$oProfile)
    {
      if ($oProfile->plaintext == $iProfile)
      {
        echo "<b>Profile ".$key.": ".$oProfile->plaintext."</b><br />";
        // walk the following siblings with next_sibling() and print any msgbody
        $oCurrent = $oProfile;
        while ($oNext = $oCurrent->next_sibling())
        {
           if ( $oNext->class == "msgbody" )
           {
             echo "<hr />";
             echo $oNext->outertext;
             echo "<hr />";
           }
           $oCurrent = $oNext;
        }
      }         
    }

    // clean up memory
    $html->clear();
    unset($html);

    return;
}
// --------------------------------------------
// test it!
// user_agent header...
ini_set('user_agent', 'My-Application/2.5');

getMessage_for_profile('test_class_class.htm','Suzy Creamcheese');
echo "<br /><br /><br />";
getMessage_for_profile('test_class_class.htm','Suzy Yogurt');

My output was:

Found 4 profiles.
Profile 0: Suzy Creamcheese
--------------------------------
New digs
Hello thank you for trying our soap.
Jim.
---------------------------------
Profile 3: Suzy Creamcheese
---------------------------------
A reply from Suzy Creamcheese.
---------------------------------



Found 4 profiles.
Profile 2: Suzy Yogurt
---------------------------------
No Creamcheese
This is not Suzy Creamcheese
Jim.
---------------------------------

See, it can be done with Simple HTML DOM, and since I already know how the DOM works... or enough to get in trouble... I did not have to learn any new syntax!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow