Question

I have a text file that changes regularly. I need to scrape it so I can display the contents on screen and potentially insert them into a database. The text is formatted as follows:

"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products

I only need the song title (the text between the quotes), who it was written by, and who it was performed by. As you can see, the "Written by" value can span more than one line.

I've searched through the questions, and this one is similar: "Scraping a plain text file with no HTML?". I was able to modify the solution at https://stackoverflow.com/a/8432563/827449 so that it at least finds the text between the quotes and puts it in the array. However, I can't figure out where and how to put the next preg_match statements for "Written by" and "Performed by" so that they are added to the array with the correct information (assuming I have the right regex, of course). Here is the modified code:

<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die("Cannot open $in_name");

function dump_record($r) {
    print_r($r);
}

$current = array();
while (($line = fgets($in)) !== false) {

    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;

    /* Search for 'things between quotes' stanzas */
    if (preg_match('/(?<=\")(.*?)(?=\")/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);

        /* Let's start the new record */
        $current = array( 'id' => $start[1] );
    }
    else if (preg_match('/^(.*):\s+(.*)\s*/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}
/* Don't forget to dump the last parsed record, a situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);

fclose($in);

Any help would be great, as I am now in over my head with my limited PHP and regex knowledge.

Solution

How about:

$str =<<<EOD
"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products

EOD;

preg_match_all('/"([^"]+)".*?Written by (.*?)Performed by (.*?)Courtesy/s', $str, $m, PREG_SET_ORDER);
print_r($m);

Output (note that the captured names keep their embedded and trailing newlines):

Array
(
    [0] => Array
        (
            [0] => "Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy
            [1] => Stranglehold
            [2] => Ted Nugent

            [3] => Ted Nugent

        )

    [1] => Array
        (
            [0] => "Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy
            [1] => Chateau Lafltte '59 Boogie
            [2] => David Peverett
and Rod Price

            [3] => Foghat

        )

)
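
Since the captures above still contain newlines, a small post-processing step may be useful before displaying or storing them. The following is only a sketch: the `$songs` array and its key names (`title`, `written`, `performed`) are assumptions made for illustration, and the pattern still relies on every record containing "Written by", "Performed by" and "Courtesy" in that order.

```php
<?php
$str = <<<EOD
"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products
EOD;

preg_match_all('/"([^"]+)".*?Written by (.*?)Performed by (.*?)Courtesy/s', $str, $m, PREG_SET_ORDER);

// $songs and its keys are illustrative names, not part of the answer above
$songs = array();
foreach ($m as $set) {
    $songs[] = array(
        'title'     => $set[1],
        // collapse runs of whitespace (including newlines) into single spaces
        'written'   => trim(preg_replace('/\s+/', ' ', $set[2])),
        'performed' => trim(preg_replace('/\s+/', ' ', $set[3])),
    );
}

print_r($songs);
```

With the sample text, the second record's `written` value should come out as a single line, "David Peverett and Rod Price".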

OTHER TIPS

Here's a regex solution to the problem. Bear in mind that you don't really need regex here; see the second option below.

<?php

$string = '"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte \'59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products';

// Titles delimit a record
$title_pattern = '#"(?<title>[^\n]+)"\n(?<meta>.*?)(?=\n"|$)#s';
// From the meta section we need these tokens
$meta_keys = array(
    'Written by ' => 'written',
    'Performed by ' => 'performed',
    'Courtesy of ' => 'courtesy',
    "By Arrangement with\n" => 'arranged',
);
// implode() with the separator first; the reversed argument order is not supported in PHP 8
$meta_pattern = '#(?<key>' . implode('|', array_keys($meta_keys)) . ')(?<value>[^\n]+)(?:\n|$)#ims';


$songs = array();
if (preg_match_all($title_pattern, $string, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $t = array(
            'title' => $match['title'],
        );

        if (preg_match_all($meta_pattern, $match['meta'], $_matches, PREG_SET_ORDER)) {
            foreach ($_matches as $_match) {
                $k = $meta_keys[$_match['key']];
                $t[$k] = $_match['value'];
            }
        }

        $songs[] = $t;
    }
}

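will result in the following. Each value stops at the first newline, so continuation lines such as "and Rod Price" are dropped: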

$songs = array (
  array (
    'title'     => 'Stranglehold',
    'written'   => 'Ted Nugent',
    'performed' => 'Ted Nugent',
    'courtesy'  => 'Epic Records',
    'arranged'  => 'Sony Music Licensing',
  ),
  array (
    'title'     => 'Chateau Lafltte \'59 Boogie',
    'written'   => 'David Peverett',
    'performed' => 'Foghat',
    'courtesy'  => 'Rhino Entertainment',
    'arranged'  => 'Warner Special Products',
  ),
);
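
If the continuation lines matter (the question mentions that "Written by" can span several rows), one possible variation is to let each value run until the next known key instead of stopping at the first newline. This is a sketch under assumptions, not the answer's original method: the keys are written without trailing whitespace, the lookahead-based `$meta_pattern` is a replacement for the one above, and whitespace is collapsed afterwards.

```php
<?php
$string = <<<EOD
"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products
EOD;

$meta_keys = array(
    'Written by'          => 'written',
    'Performed by'        => 'performed',
    'Courtesy of'         => 'courtesy',
    'By Arrangement with' => 'arranged',
);
$alternation = implode('|', array_map('preg_quote', array_keys($meta_keys)));

// A quoted title line, followed by everything up to the next title or the end
$title_pattern = '#"(?<title>[^"\n]+)"\n(?<meta>.*?)(?=\n"|\z)#s';
// A key at the start of a line; its value runs until the next key line or the end
$meta_pattern = '#^(?<key>' . $alternation . ')\s+(?<value>.*?)(?=^(?:' . $alternation . ')|\z)#ms';

$songs = array();
if (preg_match_all($title_pattern, $string, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $t = array('title' => $match['title']);
        if (preg_match_all($meta_pattern, $match['meta'], $_matches, PREG_SET_ORDER)) {
            foreach ($_matches as $_match) {
                // join multi-line values into one line
                $t[$meta_keys[$_match['key']]] = trim(preg_replace('/\s+/', ' ', $_match['value']));
            }
        }
        $songs[] = $t;
    }
}

print_r($songs);
```

The trade-off is that any line not starting with a known key is treated as part of the previous value, so an unexpected new label in the file would be absorbed into the preceding field.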

A solution without regex is also possible, though slightly more verbose:

<?php

$string = '"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte \'59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products';

$songs = array();
$current = array();
$lines = explode("\n", $string);
// can't use foreach if we want to extract "By Arrangement"
// cause it spans two lines
for ($i = 0, $_length = count($lines); $i < $_length; $i++) {
    $line = $lines[$i];
    $length = strlen($line); // might want to use mb_strlen()

    // if line is enclosed in " it's a title
    // the length check avoids an offset warning on empty lines
    if ($length >= 2 && $line[0] == '"' && $line[$length - 1] == '"') {
        if ($current) {
            $songs[] = $current;
        }

        $current = array(
            'title' => substr($line, 1, $length - 2),
        );

        continue;
    }

    $meta_keys = array(
        'By Arrangement with' => 'arranged', 
    );

    foreach ($meta_keys as $key => $k) {
        if ($key == $line) {
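            $i++;
            // the value is on the following line, if there is one
            $current[$k] = isset($lines[$i]) ? $lines[$i] : '';
            // move on to the next line of input, not just the next key
            continue 2;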
        }
    }

    $meta_keys = array(
        'Written by ' => 'written', 
        'Performed by ' => 'performed', 
        'Courtesy of ' => 'courtesy',
    );

    foreach ($meta_keys as $key => $k) {
        if (strpos($line, $key) === 0) {
            $current[$k] = substr($line, strlen($key));
            continue 2;
        }
    }    
}

if ($current) {
    $songs[] = $current;
}
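
The loop above keeps only the first line of each value. A line-based variant that also keeps continuation lines could track which field was seen last and append any unrecognised line to it. The sketch below assumes exactly that rule; `parse_songs()` is a hypothetical helper name, and the key names simply mirror the ones used above.

```php
<?php
// parse_songs() is a hypothetical helper, introduced here for illustration only
function parse_songs(array $lines) {
    $prefixes = array(
        'Written by'          => 'written',
        'Performed by'        => 'performed',
        'Courtesy of'         => 'courtesy',
        'By Arrangement with' => 'arranged',
    );
    $songs = array();
    $current = null; // record being built
    $field = null;   // field that continuation lines are appended to

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        // a line wrapped in double quotes starts a new record
        if (strlen($line) >= 2 && $line[0] === '"' && substr($line, -1) === '"') {
            if ($current !== null) {
                $songs[] = $current;
            }
            $current = array('title' => substr($line, 1, -1));
            $field = null;
            continue;
        }
        if ($current === null) {
            continue; // ignore anything before the first title
        }
        // a known prefix starts a new field
        foreach ($prefixes as $prefix => $key) {
            if (strpos($line, $prefix) === 0) {
                $field = $key;
                $current[$field] = trim(substr($line, strlen($prefix)));
                continue 2;
            }
        }
        // anything else continues the previous field
        if ($field !== null) {
            $current[$field] = trim($current[$field] . ' ' . $line);
        }
    }
    if ($current !== null) {
        $songs[] = $current;
    }
    return $songs;
}

$text = <<<EOD
"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products
EOD;

$songs = parse_songs(explode("\n", $text));
print_r($songs);
```

Because the helper takes an array of lines and trims each one, it could presumably also be fed the result of `file('in.txt')` when the data lives in a file, as in the question.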
Licensed under: CC-BY-SA with attribution