Question

I have extracted records from a database and stored them on a text-only HTML page. Each record sits in a <p> paragraph, its fields are separated by <br /> line breaks, and records are separated from each other by an <hr> line. For example:

Company Name<br/>
555-555-555<br />
Address Line 1<br />
Address Line 2<br />
Website: www.example.com<br />

I just need to place these records into a CSV file. I used fputcsv in combination with array() and file_get_contents(), but it read the entire source code of the webpage into the .csv file, and a lot of data was missing as well. There are multiple records stored in the same format; after each record block, as seen above, comes an <hr> tag that separates it from the next one. I want to read the company name into the Name column, the phone number into the Phone column, the address lines into the Address column and the website into the Website column, as shown below.

http://i.stack.imgur.com/00Gxw.png
How can I do this?

Snippet of the HTML:

            1 Stop Signs<br />
            480-961-7446<br />
500 N. 56th Street<br />
        Chandler, AZ  85226<br />

<br />
                Website: www.1stopsigns.com<br />
            <br />
            </p><br /><hr><br />

It's spaced like this in the source of the HTML.


Solution

Assuming the HTML shown above is well formed, I would approach this problem in two phases. First, clean up the HTML text so the information is easier to export or manage: keep the items you want to save and delete those you know you will not need.

$html = preg_replace('|[ \t]{2,}|', ' ', $html);    // collapse runs of spaces and tabs into one space
$html = preg_replace('|\s*\n\s*|', "\n", $html);    // collapse blank lines and strip indentation around line breaks
$html = preg_replace('|<br\s*/?>|i', '##', $html);  // replace <br>, <br/> and <br /> with a ## marker

That leaves you with cleaner HTML to work with, similar to this:

1 Stop Signs##
480-961-7446##
500 N. 56th Street##
Chandler, AZ  85226##
Website: www.1stopsigns.com##
##
</p>##<hr>##

Second, explode the text into fields, or implode those fields into a comma-separated line to form the CSV:

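// split on the marker; $csv_parts now holds the individual fields
$csv_parts = explode("##", $html);

// join the fields into one comma-separated line, e.g. 1 Stop Signs,480-961-7446,..
// note: implode() does not quote anything, so a field that itself contains a comma
// (such as "Chandler, AZ 85226") needs fputcsv() instead
$csv = implode(",", $csv_parts);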

Now you have two ways to work with the HTML: extracting the individual fields or exporting the CSV.
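
Because an address line such as "Chandler, AZ 85226" contains a comma, a plain implode() would split it across two columns. Below is a minimal sketch of passing the same ##-delimited text through fputcsv(), which adds quotes where they are needed. The helper name fieldsToCsvLine and the php://memory stream are illustrative assumptions, not part of the approach above.

```php
<?php
// Hypothetical helper: turn the ##-delimited text into one quoted CSV line.
function fieldsToCsvLine(string $marked): string
{
    // Drop leftover tags such as </p> and <hr>, then split on the marker.
    $parts = explode('##', strip_tags($marked));

    // Trim each field and discard the empty ones left by blank <br /> lines.
    $fields = array_values(array_filter(array_map('trim', $parts), function ($f) {
        return $f !== '';
    }));

    // Let fputcsv() do the quoting; an in-memory stream avoids a temp file.
    $fp = fopen('php://memory', 'w+');
    fputcsv($fp, $fields, ',', '"', '\\');
    rewind($fp);
    $line = stream_get_contents($fp);
    fclose($fp);

    return rtrim($line, "\n");
}

// Sample input modelled on the cleaned-up text shown earlier.
$marked = "1 Stop Signs##\n480-961-7446##\n500 N. 56th Street##\n"
        . "Chandler, AZ 85226##\nWebsite: www.1stopsigns.com##\n##\n</p>##<hr>##";

echo fieldsToCsvLine($marked), PHP_EOL;
```

Reading the line back with str_getcsv() should give five fields, with the city, state and ZIP kept together in one of them.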


Hope this helps, or at least gives you an idea for developing what you need.

OTHER TIPS

Assuming that your data follows a pattern where every record is separated by an <hr> tag and every field within a record is separated by a <br />, you should be able to split out the data.

There are loads of ways to do this, but a naive way that might work using explode() might be something like:

// open a file pointer to csv
$fp = fopen('records.csv', 'w');

// first, split each record into a separate array element
$records = explode('<hr>', $str);

// then iterate over this array
foreach ($records as $record) {

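    // strip tags and trim enclosing whitespace
    $stripped = trim(strip_tags($record));

    // skip empty chunks, such as whatever follows the last <hr>
    if ($stripped === '') {
        continue;
    }

    // split on any line ending (\n, \r\n or \r), not just the server's PHP_EOL
    $fields = preg_split('/\R/', $stripped);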

    // array walk over each field and trim whitespace
    array_walk($fields, function(&$field) {
        $field = trim($field);
    });

    // create row
    $row = array(
        $fields[0], // name
        $fields[1], // phone
        sprintf('%s, %s', $fields[2], $fields[3]), // address
        $fields[6], // web
    );

    // write cleaned array of fields to csv
    fputcsv($fp, $row);
}

// done
fclose($fp);

Where $str is the page data you are parsing. Hope this helps.
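
The fixed indexes above (in particular $fields[6]) only line up when every record has exactly the same blank lines as the snippet in the question. As a hedged alternative, the sketch below drops blank lines first and finds the website by its "Website:" prefix. The helper name recordToRow, the header row, and the rule that everything between the phone number and the website is the address are assumptions made for illustration.

```php
<?php
// Hypothetical helper: map one <hr>-delimited record to the four CSV columns.
function recordToRow(string $record): ?array
{
    // Remove tags, split on any line ending, trim, and drop blank lines.
    $lines = preg_split('/\R/', strip_tags($record));
    $lines = array_values(array_filter(array_map('trim', $lines), function ($l) {
        return $l !== '';
    }));

    // Assumption: a usable record has at least a name and a phone number.
    if (count($lines) < 2) {
        return null;
    }

    $name  = array_shift($lines);
    $phone = array_shift($lines);

    // Pull out the line starting with "Website:", wherever it appears.
    $website = '';
    foreach ($lines as $i => $line) {
        if (stripos($line, 'Website:') === 0) {
            $website = trim(substr($line, strlen('Website:')));
            unset($lines[$i]);
        }
    }

    // Whatever remains is treated as the address.
    return [$name, $phone, implode(', ', $lines), $website];
}

// Sample page data modelled on the snippet in the question.
$str = "<p>\n   1 Stop Signs<br />\n   480-961-7446<br />\n500 N. 56th Street<br />\n"
     . "  Chandler, AZ  85226<br />\n\n<br />\n   Website: www.1stopsigns.com<br />\n"
     . "  <br />\n  </p><br /><hr><br />\n";

// Write a header row followed by one row per non-empty record.
$fp = fopen('php://output', 'w');
fputcsv($fp, ['Name', 'Phone', 'Address', 'Website'], ',', '"', '\\');
foreach (explode('<hr>', $str) as $record) {
    $row = recordToRow($record);
    if ($row !== null) {
        fputcsv($fp, $row, ',', '"', '\\');
    }
}
fclose($fp);
```

Swapping 'php://output' for a file name such as 'records.csv' writes the same rows to disk.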

EDIT

Didn't notice the specific field requirements originally. Updated the example.

By far the easiest way would be to take each block, drop everything from the <hr> tag onward, then split the remaining string into an array on the <br /> tags.
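
That idea could be sketched roughly as follows. The helper name splitOnBreaks is hypothetical, and the regular expression assumes the breaks are written as <br>, <br/> or <br />.

```php
<?php
// Hypothetical helper: split one record block into fields on its <br> tags.
function splitOnBreaks(string $block): array
{
    // Keep only the text before the first <hr>, if there is one.
    $pos = stripos($block, '<hr');
    if ($pos !== false) {
        $block = substr($block, 0, $pos);
    }

    // Split on <br>, <br/> or <br />, ignoring case.
    $parts = preg_split('~<br\s*/?>~i', $block);

    // Strip leftover tags such as </p>, trim, and drop empty pieces.
    $fields = [];
    foreach ($parts as $part) {
        $text = trim(strip_tags($part));
        if ($text !== '') {
            $fields[] = $text;
        }
    }

    return $fields;
}

// Sample block modelled on the example record in the question.
$block = "Company Name<br/>\n555-555-555<br />\nAddress Line 1<br />\n"
       . "Address Line 2<br />\nWebsite: www.example.com<br />\n</p><br /><hr><br />";

print_r(splitOnBreaks($block));
```

The resulting array can be handed to fputcsv() directly, or rearranged into name, phone, address and website columns first.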

Licensed under: CC-BY-SA with attribution