Question

I would like to extract the profile information for every row listed in the following table, across all of its pages:

http://reports.finance.yahoo.com/z1?b=1&so=a&sf=m&tc=1&stt=-&pr=0&cpl=-1&cpu=-1&yl=-1&yu=-1&ytl=-1&ytu=-1&mtl=-1&mtu=-1&rl=5&ru=-1&cll=0

Here is a sample of one of the links from the table (they are all in the "Issue" column):

http://reports.finance.yahoo.com/z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000

I'd like to store all the information for each Issue, for every row and page, in a MySQL database. I assume Perl would be a good tool for this, but my experience with it is very limited.

I think I would need to gather all the links in the Issue column for all the pages of the table (2600+ pages at the time), and then extract the information from each of the linked pages.

Any help would be greatly appreciated.

Solution

This should get you started and show a general technique for doing this with regexes (which can be hard to follow if you are not very familiar with Perl and regex matching).

I did it for the first page only, and I put as many comments in the code as I could to help you understand it. If you cannot follow what the code does, I would suggest trying a different tool (or a module like Web::Scraper or Mojo::DOM). If you really want to get the job done in Perl, read the regex documentation first:

http://perldoc.perl.org/perlre.html

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use feature 'say';

my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $page_content = get($start_url);
die "Could not fetch $start_url\n" unless defined $page_content;

process_bond_results_page($page_content);

sub process_bond_results_page {
    my $content = shift;
    # Loop for as long as the /<tr class="yfnc_tabledata1">(.+?)<\/tr>/g regex keeps matching.
    # Each match puts the row content (whatever sits between <tr ...> and </tr>) into $1.
    while ($content =~ /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g) {
        # uncomment the line below to see what $1 contains
        # say $1;

        # strip the HTML tags we do not need
        my $tr_data = cleanup_html_tags($1);

        # collect the content between each <td> and </td> pair in the @tds list
        my (@tds) = $tr_data =~ /<td>(.*?)<\/td>/g;

        # skip rows that have no second cell
        next unless @tds > 1;

        # The 2nd element of @tds contains <a href="link_to_issue">ISSUE NAME</a>.
        # The line below extracts the link and the issue name into their own variables.
        my ($link_to_issue, $issue_name) = $tds[1] =~ /<a[^>]*?href=\"([^\"]*?)\"[^>]*?>(.+?)<\/a>/;

        # Replace the 2nd element, which holds <a href="link_to_issue">ISSUE NAME</a>,
        # with just ISSUE NAME
        $tds[1] = $issue_name;

        # Append $link_to_issue to the end of the @tds list
        push(@tds, $link_to_issue);

        # Print the @tds list with values separated by TABs
        say join("\t", @tds);
    }

    # Is there a Next link?
    my ($next_link) = $content =~ /<a[^>]*?href=\"([^\"]+?)\">Next<\/a><\/b>/;
    say 'NEXT: ' . $next_link if $next_link;

    return;
}

sub cleanup_html_tags {
    my $html = shift;
    $html =~ s/<\/?(font|div)[^>]*?>//g; # remove <font...>, <div...>, </font>, </div>
    $html =~ s/<td[^>]*?>/<td>/g;        # replace all <td...> with just <td>
    $html =~ s/<\/?nobr>//g;             # remove <nobr> and </nobr>
    return $html;
}

The script above prints:

Corp    MERRILL LYNCH CO INC MTN BE 100.63  5.000    3-Feb-2014 -19.649 4.969   A   No  /z2?ce=5314754150501796218050&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CME GROUP INC   100.84  5.750   15-Feb-2014 -8.334  5.702   AA  No  /z2?ce=5715449144561716016149&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CAPITAL ONE BK MTN BE   100.80  5.125   15-Feb-2014 -8.334  5.084   A   No  /z2?ce=5715254147581635317455&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HESS CORP   100.92  7.000   15-Feb-2014 -8.351  6.937   BBB No  /z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    PACCAR INC  100.90  6.875   15-Feb-2014 -8.295  6.813   A   No  /z2?ce=5214751144551836016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    WACHOVIA CORP NEW   100.78  4.875   15-Feb-2014 -8.337  4.837   A   No  /z2?ce=4915445142581546016054&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CATERPILLAR FINL SVCS MTNS BE   100.89  6.125   17-Feb-2014 -7.597  6.071   A   No  /z2?ce=5715245150561764615951&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    KRAFT FOODS INC 100.97  6.750   19-Feb-2014 -6.921  6.685   BBB No  /z2?ce=5315654144531746017754&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    WESTERN UN CO   101.05  6.500   26-Feb-2014 -5.154  6.432   BBB No  /z2?ce=4915145143581556015548&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    AMERICA MOVIL SAB DE CV 101.06  5.500    1-Mar-2014 -4.615  5.443   A   No  /z2?ce=5815451145541816015954&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HARTFORD FINL SVCS GROUP INC    100.96  4.750    1-Mar-2014 -4.454  4.705   BBB No  /z2?ce=5415548146571526017250&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HEWLETT PACKARD CO  101.12  6.125    1-Mar-2014 -4.599  6.057   BBB No  /z2?ce=5415446149551516016556&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    RYDER SYS MTN BE    101.08  5.850    1-Mar-2014 -4.495  5.788   BBB No  /z2?ce=5114851146531605117352&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HSBC FIN CORP HSBC FIN  100.72  2.000   15-Mar-2014 -3.011  1.986   A   No  /z2?ce=5415650149491807117451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    SYSCO CORP  101.06  4.600   15-Mar-2014 -2.772  4.552   A   No  /z2?ce=5014953143561486015756&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
NEXT: z1?b=2&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000
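
The script stops after the first page, but the NEXT line is what makes the remaining pages reachable. The following is only a rough sketch of how the link could be followed, assuming every results page uses the same "Next" anchor markup. The crawl_pages helper, its callbacks and the page limit are hypothetical names introduced for illustration. The relative link (z1?b=2&...) is resolved against the current page with the URI module, which is installed as a dependency of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: walks the result pages by following "Next" links.
#   $start_url   - absolute URL of the first results page
#   $fetch       - code ref: takes a URL, returns the page HTML or undef
#   $handle_page - code ref: called with the HTML of every fetched page
#   $max_pages   - safety limit, so a markup change cannot cause an endless loop
# Returns the list of URLs that were visited.
sub crawl_pages {
    my ($start_url, $fetch, $handle_page, $max_pages) = @_;
    my @visited;
    my $url = $start_url;

    while (defined $url && @visited < $max_pages) {
        my $content = $fetch->($url);
        last unless defined $content;    # fetch failed, stop here
        push @visited, $url;
        $handle_page->($content);

        # Same "Next" regex as in the script above; the href is relative
        my ($next_link) = $content =~ /<a[^>]*?href="([^"]+?)">Next<\/a><\/b>/;
        last unless defined $next_link;  # no Next link: last page reached

        # Turn the relative link into an absolute URL based on the current page
        $url = URI->new_abs($next_link, $url)->as_string;
    }
    return @visited;
}
```

With LWP::Simple loaded, this could be called as crawl_pages($start_url, sub { get($_[0]) }, \&process_bond_results_page, 3000). With 2600+ pages to fetch, pausing inside the fetch callback (for example with sleep 1) would be the polite thing to do.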

OTHER TIPS

Since user3195726 suggested it, here is the same thing using Mojo::UserAgent and Mojo::DOM:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';

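my $dom = Mojo::UserAgent->new->get($start_url)->res->dom;
$dom->find('tr.yfnc_tabledata1')->each(sub{
  my $tds = $_->find('td');
  my $anchor = $tds->[1]->at('a');
  my $link = $anchor->{href};
  my $name = $anchor->all_text;
  # call all_text on every cell (older Mojolicious releases also accepted $tds->all_text)
  $tds = $tds->map('all_text');
  $tds->[1] = $name;
  push @$tds, $link;
  say $tds->join("\t");
});

# first() returns undef when there is no Next link (the last page), so check before using it
my $next = $dom->find('a')->first(sub{ $_->all_text eq 'Next'});
say 'Next: ' . $next->{href} if $next;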

The find calls all use CSS3 selectors; the rest is just transformation of the results.
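
Both scripts only print the rows, while the question asks for them to end up in MySQL. Below is a minimal sketch of that last step, not a finished loader. The column names are guesses based on the printed output, and the table name bonds and the helpers to_iso_date, row_to_record and store_records are all hypothetical. store_records expects an already connected DBI handle and uses placeholders so that DBI does the quoting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed column names, guessed from the printed rows; adjust them to your schema.
my @COLUMNS = qw(bond_type issue price coupon maturity ytm current_yield
                 rating is_callable link);

my %MONTH = (Jan => 1, Feb => 2, Mar => 3, Apr => 4,  May => 5,  Jun => 6,
             Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12);

# Convert "3-Feb-2014" into "2014-02-03", the format of a MySQL DATE column.
# Returns undef when the text does not look like such a date.
sub to_iso_date {
    my ($text) = @_;
    return unless defined $text;
    if ($text =~ /^\s*(\d{1,2})-([A-Za-z]{3})-(\d{4})\s*$/) {
        my ($day, $month, $year) = ($1, $MONTH{ ucfirst lc $2 }, $3);
        return sprintf('%04d-%02d-%02d', $year, $month, $day) if $month;
    }
    return;
}

# Turn one row (the @tds list built by either script) into a hash ref
# keyed by the assumed column names.
sub row_to_record {
    my @values = @_;
    for (@values) {
        next unless defined;
        s/^\s+//;    # trim leading whitespace
        s/\s+$//;    # trim trailing whitespace
    }
    my %record;
    @record{@COLUMNS} = @values;
    my $iso = to_iso_date($record{maturity});
    $record{maturity} = $iso;
    return \%record;
}

# Insert the records through an already connected DBI handle.
# "bonds" is a hypothetical table name. Returns the number of records passed in.
sub store_records {
    my ($dbh, @records) = @_;
    my $sql = sprintf 'INSERT INTO bonds (%s) VALUES (%s)',
        join(', ', @COLUMNS), join(', ', ('?') x @COLUMNS);
    my $sth = $dbh->prepare($sql);
    $sth->execute(@{$_}{@COLUMNS}) for @records;
    return scalar @records;
}
```

The handle would come from something like DBI->connect('DBI:mysql:database=finance;host=localhost', $user, $password, { RaiseError => 1 }), where the database name and credentials are placeholders. Inside either script, say join("\t", @tds) could then be replaced by collecting row_to_record(@tds) into a list that is handed to store_records once per page.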

Licensed under: CC-BY-SA with attribution