scrape email addresses

https://stackoverflow.com/questions/3470332

28-09-2019

Question

fff.html is an email with email addresses in it. Some are href mailto links and some are not. I want to scrape them all and output them in the following format:

Lorem@ipsum.com,dolor@sit.com,amet@consectetur.com

I have a simple scraper to get the ones that are href linked, but something is weird:

  <?php
    $url = "fff.html";
    $raw = file_get_contents($url);

    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));

    $start = strpos($content,'<a href="mailto:');
    $end = strpos($content,'"',$start) + 8;
    $mail = substr($content,$start,$end-$start);

    print "$mail<br />";
    ?>

I should get extra points for the original use of lorem ipsum

Solution

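The problem is that the page can contain more than one email address. strpos finds only the first occurrence of the mailto link, so substr returns at most the first instance. In your code, strpos($content,'"',$start) also finds the opening quote of the href attribute rather than the closing one, so the substring ends before the address starts. Here is a script that uses preg_match_all to parse every mailto address. You may need to tweak it for your use. It outputs the results in the comma-separated form you requested.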

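<?php
$url = "fff.html";
$raw = file_get_contents($url);

// Strip tabs, line breaks, double spaces, NUL bytes, and vertical tabs.
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Restrict the search to the body (assumes literal <body> and </body> tags).
$start = strpos($content, '<body>');
$end = strpos($content, '</body>');
$data = substr($content, $start, $end-$start);

// Capture the address from every <a ... href="mailto:..."> tag.
$pattern = '#<a\s[^>]*href="mailto:([^"]+)"[^>]*>#is';
preg_match_all($pattern, $data, $matches);

// $matches[1] holds the captured addresses (an empty array if none matched).
$emails = $matches[1];
echo implode(',', $emails);
?>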
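
The script above only picks up addresses inside mailto links, while the question also mentions addresses that are not linked. As a rough sketch, one pattern can be matched against the whole document to catch both kinds. The function name extract_emails and the sample HTML are hypothetical, and the address pattern is a deliberately simplified assumption, not a full RFC 5322 validator.

```php
<?php
// Hypothetical helper: collect every address in an HTML string, linked or not.
function extract_emails(string $html): array
{
    // Decode entities so encoded addresses such as "a&#64;b.com" become matchable.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Simplified pattern (an assumption): local part, "@", domain, dot, 2+ letter TLD.
    $pattern = '/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i';
    preg_match_all($pattern, $text, $matches);

    // Drop exact duplicates, e.g. an address that appears in both href and link text.
    return array_values(array_unique($matches[0]));
}

// Sample input (made up for illustration): two linked addresses and one plain one.
$html = '<body><a href="mailto:Lorem@ipsum.com">Lorem@ipsum.com</a> '
      . 'or dolor@sit.com, <a href="mailto:amet@consectetur.com">write</a></body>';

echo implode(',', extract_emails($html));
// Lorem@ipsum.com,dolor@sit.com,amet@consectetur.com
```

Because the pattern runs over the raw text, it can also match address-like strings in places you may not want, such as script blocks or comments, so you may need to narrow the input first.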

Licensed under: CC-BY-SA with attribution
