Question

I am parsing a webpage and extracting the 'name' field from it. This is the code that does the parsing:

foreach($dom->getElementsByTagName('table') as $table) {
    if($table->getAttribute('class')=='dataTable'){
        foreach($table->getElementsByTagName('tr') as $tr){
            if(isset($tr->getElementsByTagName('td')->item(0)->nodeValue)){
                $out[$i]['name'] = $tr->getElementsByTagName('td')->item(0)->nodeValue;
            }
        }
    }
}

In the webpage I am parsing, the node value for the name has the form 'Mark&nbspSmith'. When I read the result, the 'name' value shows as 'Mark Smith' in the IDE and 'Mark Smith' in the command prompt.

Now I want to split the 'name' string so that I get the first name ('Mark') and the last name ('Smith') separately.

I tried:

explode(" ", $out[$i]['name']) as well as
explode(" " , $out[$i]['name'])

But neither works for me. How can I split the string into first name and last name?

Hope I am clear with my question.


Solution

@user1518659 Try this. To fix the issue, replace each &nbsp; with a regular space before passing the HTML to DOMDocument. I also added the split into first name and last name :) Hope it helps.

<?php 
header('Content-Type: text/html; charset=utf-8'); //Required if you're outputting, as the description contains utf-8 characters
//Load the source (input)
$html_source = file_get_contents('http://www.reuters.com/finance/stocks/companyOfficers?symbol=AOS');
$html_source = str_replace('&nbsp;',' ',$html_source);

//Dom document
$dom = new DOMDocument('1.0');
@$dom->loadHTML($html_source);

$out =array();
$i=0;
foreach($dom->getElementsByTagName('table') as $table) {
    if($table->getAttribute('class')=='dataTable'){

        foreach($table->getElementsByTagName('tr') as $tr){
            if(isset($tr->getElementsByTagName('td')->item(0)->nodeValue)){

                $out[$i]['fullname'] = $tr->getElementsByTagName('td')->item(0)->nodeValue;

                //Split on the regular space; guard against names that contain no space
                $name = explode(' ',$out[$i]['fullname']);
                $out[$i]['first_name'] = $name[0];
                $out[$i]['last_name'] = isset($name[1]) ? $name[1] : '';

                if(!isset($tr->getElementsByTagName('td')->item(2)->nodeValue)){

                    foreach ($out as $key=>$value){
                        if($value['fullname'] == $tr->getElementsByTagName('td')->item(0)->nodeValue &&
                        !is_numeric(substr($tr->getElementsByTagName('td')->item(1)->nodeValue,0,1)) && 
                        $tr->getElementsByTagName('td')->item(1)->nodeValue != "--" ){
                            $out[$key]['description']= $tr->getElementsByTagName('td')->item(1)->nodeValue;
                        }
                    }

                }else{
                    if(!isset($tr->getElementsByTagName('td')->item(2)->nodeValue)){continue;}
                    if(isset($tr->getElementsByTagName('td')->item(3)->nodeValue)){
                        $out[$i]['age']= $tr->getElementsByTagName('td')->item(1)->nodeValue;
                        $out[$i]['since']= $tr->getElementsByTagName('td')->item(2)->nodeValue;
                        $out[$i]['position']= $tr->getElementsByTagName('td')->item(3)->nodeValue;
                    }
                }
                $i++;
            }
        }
    }
}

//Clean up
$return = array();
foreach ($out as $key=>$row){
    if(isset($row['fullname']) && isset($row['age']) && isset($row['since']) && isset($row['position']) && isset($row['description'])){
        $return[$key] = $out[$key];
    }
}


print_r($return);

/*
Array
(
    [0] => Array
        (
            [fullname] => Paul Jones
            [first_name] => Paul
            [last_name] => Jones
            [age] => 63
            [since] => 2011
            [position] => Chairman of the Board, Chief Executive Officer
            [description] => Mr. Paul W. Jones serves as the Chairman of the Board, Chief Executive Officer of A. O. Smith Corp. He has been a director of company since 2004. He is a member of the Investment Policy Committee of the Board. He was elected chairman of the board, president and chief executive officer effective December 31, 2005. He was president and chief operating officer from 2004 to 2005. Prior to joining the company, he was chairman and chief executive officer of U.S. Can Company, Inc. from 1998 to 2002. He previously was president and chief executive officer of Greenfield Industries, Inc. from 1993 to 1998 and president from 1989 to 1992. Mr. Jones has been a director of Federal Signal Corporation since 1998, where he chairs the Nominating and Governance Committee and is a member of the Compensation and Benefits Committee and the Executive Committee, and Integrys Energy Group, Inc. since 2011, where he is a member of the Compensation and Financial Committees. He was also a director of Bucyrus International, Inc. from 2006 until its acquisition by Caterpillar, Inc. in 2011, and chaired the Compensation Committee.
        )

    [1] => Array
        (
            [fullname] => Ajita Rajendra
            [first_name] => Ajita
            [last_name] => Rajendra
            [age] => 60
            [since] => 2011
            [position] => President, Chief Operating Officer, Director
            [description] => Mr. Ajita G. Rajendra serves as the President, Chief Operating Officer and Director of A. O. Smith Corp. He was elected a director of company in December 2011, based on the recommendation of the Nominating and Governance Committee, following his election as President and Chief Operating Officer in September 2011. Mr. Rajendra joined the company as President of A. O. Smith Water Products Company in 2005, and was named Executive Vice President of the company in 2006. Prior to joining the company, Mr. Rajendra was Senior Vice President at Kennametal, Inc., a manufacturer of cutting tools, from 1998 to 2004. Mr. Rajendra also serves on the board of Donaldson Company, Inc., where he is a member of the Audit Committee and Human Resources Committee. Further, Mr. Rajendra was a director of Industrial Distribution Group, Inc. from 2007 until its acquisition by Eiger Holdco, LLC in 2008.
        )
        ...
        ...
*/
?>
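
As a variation on the approach above, the non-breaking space can also be normalized after parsing, on the extracted string, instead of in the raw HTML. The following is only a sketch: `split_name` is a hypothetical helper that is not part of the original answer, and it assumes the name arrives as UTF-8 (which is what DOMDocument returns from nodeValue). It splits at the first space only, so any middle names end up in the last name.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the answer above).
function split_name($fullname) {
    // Turn UTF-8 non-breaking spaces (bytes C2 A0) into ordinary spaces.
    $clean = trim(str_replace("\xC2\xA0", ' ', $fullname));
    // Split at the first space only; everything after it is kept as the last name.
    $parts = explode(' ', $clean, 2);
    return array(
        'first_name' => $parts[0],
        'last_name'  => isset($parts[1]) ? $parts[1] : '',
    );
}

print_r(split_name("Mark\xC2\xA0Smith")); // first_name: Mark, last_name: Smith
print_r(split_name("Cher"));              // last_name is an empty string
```

Inside the loop this could be called as split_name($out[$i]['fullname']), which also avoids an undefined-index notice for rows whose name has no space.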

OTHER TIPS

Try this to fix the &nbsp; issue:

explode(chr(0xC2).chr(0xA0), $str)

In UTF-8, the non-breaking space (U+00A0) is encoded as two bytes: 0xC2 followed by 0xA0.

Ref - PHP Parsing Problem - &nbsp; and Â
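
The two-byte encoding can be checked directly. The sketch below assumes the page is parsed with DOMDocument as in the question, and uses a hypothetical one-cell table (with a correctly terminated &nbsp; entity) in place of the real page. The hex dump of the node value should contain the byte pair c2a0 between the two names, and splitting on those two bytes separates them.

```php
<?php
// Hypothetical HTML fragment standing in for the real page.
$dom = new DOMDocument();
@$dom->loadHTML('<table><tr><td>Mark&nbsp;Smith</td></tr></table>');
$name = $dom->getElementsByTagName('td')->item(0)->nodeValue;

// DOMDocument returns UTF-8, so the entity is now the bytes C2 A0.
echo bin2hex($name), "\n";

// "\xC2\xA0" is the same two bytes as chr(0xC2).chr(0xA0).
$parts = explode("\xC2\xA0", $name);
print_r($parts); // Mark, Smith
```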

The non-breaking space character entity you have there is broken. It should be &nbsp;. Note the semicolon.

To answer your question you can simply do:

var_dump(explode('&nbsp', $out[$i]['name']));

Or if the entity is fixed:

var_dump(explode('&nbsp;', $out[$i]['name']));

This should do it

<?php

$a = preg_split('/\s|&nbsp;|\x{00A0}/u', $out[$i]['name']); // $a is an array with values: Mark, Smith

explode() splits on a single delimiter string, whereas preg_split() splits on a regular expression and can therefore match several different delimiters.
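
A small comparison may make the difference concrete. This is a sketch with a hypothetical sample string; it assumes the input is valid UTF-8, because the /u modifier makes preg_split() return false on invalid UTF-8.

```php
<?php
// Hypothetical sample: the two names are joined by a UTF-8 non-breaking space.
$name = "Mark\xC2\xA0Smith";

// explode() looks for one exact delimiter; a regular space never matches here.
$byExplode = explode(' ', $name);
echo count($byExplode), "\n"; // 1: the string was not split

// The character class matches ordinary whitespace or U+00A0, one or more times.
$byRegex = preg_split('/[\s\x{00A0}]+/u', $name, -1, PREG_SPLIT_NO_EMPTY);
print_r($byRegex); // Mark, Smith
```

The same pattern also splits a name written with a regular space, so it works whether or not the source page used the entity.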

Licensed under: CC-BY-SA with attribution