Perl: why does this web scraper regex work inconsistently?

https://stackoverflow.com/questions/9193711

27-04-2021

Question

I have run into another problem in relation to a site I am trying to scrape.

Basically I have stripped most of what I don't want from the page content and, thanks to some help given here, have managed to isolate the dates I wanted. Most of it seems to be working fine, despite some initial problems matching a non-breaking space. However, I am now having difficulty with the final regex, which is intended to split each line of data into fields. Each line represents the value of one share price index. The fields on each line are:

A name of arbitrary length made from characters of the Latin alphabet and sometimes a comma or ampersand, no numerics.
A number with two digits after the decimal point (the absolute value of the index).
A number with two digits after the decimal point (the change in the value).
A number with two digits after the decimal point followed by a percent sign (the percentage change in value).

Here is an example string, before splitting: "Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% Pulp & Paper333.31-0.29-0.09% Chemicals729.406.010.83% "

The regex I am using to split this line is this:

$mystr =~ s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

It works sometimes but not other times and I cannot work out why this should be. (The doubled equal signs in the example output below are used to make the field split more easily visible.)

Fishery, Agriculture & Forestry == 243.45 == -1.91 == -0.78%
Mining360.74-4.15-1.14%
Construction == 465.36 == -1.01 == -0.22%
Foods783.2511.281.46%

I thought the minus sign was an issue for those indices that saw a negative change in the price of the index, but sometimes it works despite the minus sign.

Q. Why is the final regex shown below failing to split the fields consistently?

Example code follows.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";

my $content = get($url_full);
# get dates:
(my @dates) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
foreach my $date (@dates) { # convert to yyyy-mm-dd
    $date =~ s/\//-/ig;
}
my $tree = HTML::Tree->new();
$tree->parse($content);
my $mystr = $tree->as_text;

$mystr =~ s/\xA0//gi; # remove non-breaking spaces
# remove first chunk of text:
$mystr =~
  s/^(TSE.*?)IndustryIndexChange ?/IndustryIndexChange\n$dates[0]\n\n/gi;
$mystr =~ s/IndustryIndexChange ?/IndustryIndexChange/ig;
$mystr =~ s/IndustryIndexChange/Industry Index Change\n/ig;
$mystr =~ s/% /%\n/gi; # percent symbol is marker for end of line
# indicate breaks between days:
$mystr =~ s/Stock.*?IndustryIndexChange/\nDAY DELIMITER\n/gi;
$mystr =~ s/Exemption from Liability.*$//g; # remove boilerplate at bottom

# and here's the problem regex...
# try to split it:
$mystr =~
  s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

print $mystr;

Solution

It appears to be doing every other one.

My guess is that your records have a single \n between them, but your pattern starts and ends with a \n. So the final \n on the first match consumes the \n that the second match needed to find the second record. The net result is that it picks up every other record.

You might be better off wrapping your pattern in ^ and $ (instead of \n and \n), and using the m flag on the s///.
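A minimal sketch of that suggestion, assuming the records are already one per line as in the question. The sample string here is made up from the question's data, and the capture groups are the asker's own. With the m flag, ^ and $ match at line boundaries without consuming the newline, so adjacent records no longer compete for it.

```perl
use strict;
use warnings;

# Assumed sample: three records separated by single newlines.
my $mystr = "Mining360.74-4.15-1.14%\n"
          . "Construction465.36-1.01-0.22%\n"
          . "Foods783.2511.281.46%\n";

# /m lets ^ and $ anchor at each line; neither one consumes the "\n",
# so the next record can still match at the start of its own line.
$mystr =~ s/^(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)$/$1 == $2 == $3 == $4/mg;

print $mystr;
```

Since the newlines are no longer part of the match, they also no longer need to be put back in the replacement.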

OTHER TIPS

The problem is that you have \n both at the start and at the end of the regex.

Consider something like this:

$s = 'abababa';
$s =~ s/aba/axa/g;

that will set $s to axabaxa, not axaxaxa, because there are only two non-overlapping occurrences of aba.
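The same effect can be reproduced in a short script. The second substitution is an illustrative alternative, not part of the original answer: it only peeks at the trailing a with a lookahead, so the character stays available for the next match.

```perl
use strict;
use warnings;

my $s = 'abababa';

# Each match consumes its trailing "a", so the middle "aba" is skipped.
(my $consumed = $s) =~ s/aba/axa/g;

# The lookahead checks for the trailing "a" without consuming it.
(my $peeked = $s) =~ s/ab(?=a)/ax/g;

print "$consumed\n";   # axabaxa
print "$peeked\n";     # axaxaxa
```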

My interpretation of the fields (pseudocode):

one   = [a-zA-Z,& ]+
two   = -?\d{1,4}\.\d\d
three = <<two>>
four  = <<two>>%

regex = (<<one>>)(<<two>>)(<<three>>)(<<four>>)
      = ([a-zA-Z,& ]+)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d%)
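A runnable Perl sketch of this pattern, with the decimal point escaped and an optional leading minus sign allowed on the numeric fields. The variable names ($num, $rec, @rows) are hypothetical, and the input is a shortened copy of the example string from the question. Because the pattern has no newline anchors, it can be applied to the unsplit string directly.

```perl
use strict;
use warnings;

# Shortened copy of the example string from the question.
my $line = "Fishery, Agriculture & Forestry243.45-1.91-0.78% "
         . "Mining360.74-4.15-1.14% Foods783.2511.281.46% ";

my $num = qr/-?\d{1,4}\.\d\d/;    # optional sign, exactly two decimals
my $rec = qr/([a-zA-Z,& ]+)($num)($num)(${num}%)/;

my @rows;
while ($line =~ /$rec/g) {
    my ($name, $index, $change, $pct) = ($1, $2, $3, $4);
    $name =~ s/^\s+//;            # drop the space left after the previous "%"
    push @rows, [$name, $index, $change, $pct];
}

print join(' == ', @$_), "\n" for @rows;
```

The fixed two digits after each decimal point are what make run-together values such as 783.2511.281.46 splittable without any separator.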

However, you are already presented with 'structured' data in the form of HTML. Why not take advantage of this?

The question "HTML parsing in perl" references Mojo for DOM-based parsing in Perl, and unless there are serious performance reasons, I'd highly recommend such an approach.

Licensed under: CC-BY-SA with attribution
