Question

I would like to extract the profile information for each of the lines listed in the following table, across all of its pages:

http://reports.finance.yahoo.com/z1?b=1&so=a&sf=m&tc=1&stt=-&pr=0&cpl=-1&cpu=-1&yl=-1&yu=-1&ytl=-1&ytu=-1&mtl=-1&mtu=-1&rl=5&ru=-1&cll=0

Here is a sample of one of the links for a line in the table (the links are all in the "Issue" column):

http://reports.finance.yahoo.com/z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000

I'd like to store all the information for each issue, for all lines and pages, in a MySQL database. I assume Perl would be a good tool for this, but my experience with it is very limited.

I think I would need to gather all the links in the Issue column across all pages of the table (2600+ pages at the time) and then somehow extract the information from each of the linked pages.

Any help would be greatly appreciated.

Solution

This should get you started and shows a general technique for doing this with regexes (which can be hard to follow if you are not very familiar with Perl and regex matching).

I did it for the first page only and put as many comments in the code as I could to help you understand it. If you cannot follow what this code does, I would suggest trying a different tool (or a module such as Web::Scraper or Mojo::DOM). If you really want to get the job done in Perl, read the Perl regex documentation:

http://perldoc.perl.org/perlre.html

#!/usr/bin/perl                                                                                                                                                                                                                                                               
use strict;
use warnings;
use LWP::Simple;
use feature 'say';

my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $page_content = get($start_url);
die "Oops, something went wrong!" unless defined $page_content;

process_bond_results_page($page_content);

sub process_bond_results_page {
    my $content = shift;
    # loops for as long as the regex /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g keeps matching in $content
    # each match puts the row content (the part between <tr ...> and </tr>) into the special variable $1
    while($content =~ /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g) {
        # uncomment line below to see what $1 contains                                                                                                                                                                                                                        
        # say $1;                                                                                                                                                                                                                                                             

        # cleanup not needed HTML tags                                                                                                                                                                                                                                        
        my $tr_data = cleanup_html_tags($1);

        # match content in between <td> & </td> tags and put them on @tds list                                                                                                                                                                                                
        my (@tds) = $tr_data =~ /<td>(.*?)<\/td>/g;

        # 2nd element of @tds list contains <a href="link_to_issue">ISSUE NAME</a> text                                                                                                                                                                                       
        # Line below extracts link_to_issue and $issue_name and assigns them to respective variables                                                                                                                                                                          
        my ($link_to_issue, $issue_name) = $tds[1] =~ /<a[^>]*?href=\"([^\"]*?)\"[^>]*?>(.+?)<\/a>/g;

        # Replace 2nd element of list that contains data like <a href="link_to_issue">ISSUE NAME</a>                                                                                                                                                                          
        # with just ISSUE NAME                                                                                                                                                                                                                                                
        $tds[1] = $issue_name;

        # Append $link_to_issue at the end of @tds list                                                                                                                                                                                                                       
        push(@tds,$link_to_issue);

        # Print @tds array with values separated by TABs
        say join("\t", @tds);
    }

    # Does it have Next link?                                                                                                                                                                                                                                                 
    my ($next_link) = $content =~ /<a[^>]*?href=\"([^\"]+?)\">Next<\/a><\/b>/g;
    say 'NEXT: ' . $next_link if $next_link;

    return;
}

sub cleanup_html_tags {
    my $html = shift;
    $html =~ s/<\/?(font|div)[^>]*?>//g; # remove <font...>, <div...>, </font>, </div>                                                                                                                                                                                        
    $html =~ s/<td[^>]*?>/<td>/g;        # replace all <td...> with just <td>                                                                                                                                                                                                 
    $html =~ s/<\/?nobr>//g;             # remove <nobr> and </nobr>                                                                                                                                                                                                          
    return $html;
}

The script above prints:

Corp    MERRILL LYNCH CO INC MTN BE 100.63  5.000    3-Feb-2014 -19.649 4.969   A   No  /z2?ce=5314754150501796218050&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CME GROUP INC   100.84  5.750   15-Feb-2014 -8.334  5.702   AA  No  /z2?ce=5715449144561716016149&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CAPITAL ONE BK MTN BE   100.80  5.125   15-Feb-2014 -8.334  5.084   A   No  /z2?ce=5715254147581635317455&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HESS CORP   100.92  7.000   15-Feb-2014 -8.351  6.937   BBB No  /z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    PACCAR INC  100.90  6.875   15-Feb-2014 -8.295  6.813   A   No  /z2?ce=5214751144551836016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    WACHOVIA CORP NEW   100.78  4.875   15-Feb-2014 -8.337  4.837   A   No  /z2?ce=4915445142581546016054&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CATERPILLAR FINL SVCS MTNS BE   100.89  6.125   17-Feb-2014 -7.597  6.071   A   No  /z2?ce=5715245150561764615951&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    KRAFT FOODS INC 100.97  6.750   19-Feb-2014 -6.921  6.685   BBB No  /z2?ce=5315654144531746017754&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    WESTERN UN CO   101.05  6.500   26-Feb-2014 -5.154  6.432   BBB No  /z2?ce=4915145143581556015548&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    AMERICA MOVIL SAB DE CV 101.06  5.500    1-Mar-2014 -4.615  5.443   A   No  /z2?ce=5815451145541816015954&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HARTFORD FINL SVCS GROUP INC    100.96  4.750    1-Mar-2014 -4.454  4.705   BBB No  /z2?ce=5415548146571526017250&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HEWLETT PACKARD CO  101.12  6.125    1-Mar-2014 -4.599  6.057   BBB No  /z2?ce=5415446149551516016556&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    RYDER SYS MTN BE    101.08  5.850    1-Mar-2014 -4.495  5.788   BBB No  /z2?ce=5114851146531605117352&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HSBC FIN CORP HSBC FIN  100.72  2.000   15-Mar-2014 -3.011  1.986   A   No  /z2?ce=5415650149491807117451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    SYSCO CORP  101.06  4.600   15-Mar-2014 -2.772  4.552   A   No  /z2?ce=5014953143561486015756&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
NEXT: z1?b=2&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000
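
Note that both the NEXT link and the issue links come out relative, so they have to be resolved against the page they were found on before they can be fetched. A minimal sketch of following the Next links, assuming the URI module (installed alongside LWP) is available: the helper name collect_page_urls, the $fetch callback and the $max_pages cap are made up for illustration, and the demo uses two fake pages rather than real HTTP requests.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: walk the result pages by following each "Next" link.
# $fetch is a code ref that takes a URL and returns the page HTML (or undef);
# $max_pages is a safety cap so a broken page cannot cause an endless loop.
sub collect_page_urls {
    my ($start_url, $fetch, $max_pages) = @_;
    my @visited;
    my $url = $start_url;
    while (defined $url && @visited < $max_pages) {
        my $html = $fetch->($url);
        last unless defined $html;
        push @visited, $url;
        # Same idea as the "Next" regex in the script above
        my ($next) = $html =~ /<a[^>]*?href="([^"]+?)">Next<\/a>/;
        # hrefs on the page are relative, so resolve them against the current URL
        $url = defined $next ? URI->new_abs($next, $url)->as_string : undef;
    }
    return @visited;
}

# Offline demonstration: a hash of fake pages stands in for the real site
my %fake_site = (
    'http://reports.finance.yahoo.com/z1?b=1' => '<b><a href="z1?b=2">Next</a></b>',
    'http://reports.finance.yahoo.com/z1?b=2' => 'last page, no Next link',
);
my @pages = collect_page_urls(
    'http://reports.finance.yahoo.com/z1?b=1',
    sub { $fake_site{ $_[0] } },
    10,
);
print "$_\n" for @pages;
```

With the real site you would pass \&LWP::Simple::get as the callback and call process_bond_results_page on each page inside the loop. The issue links (which start with /z2?...) can be made absolute the same way with URI->new_abs($link_to_issue, $url).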

Other suggestions

Since user3195726 suggested it, here is the same thing using Mojo::UserAgent and Mojo::DOM:

#!/usr/bin/perl                                                                                                                                                                                                                                                               
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';

my $dom = Mojo::UserAgent->new->get($start_url)->res->dom;
$dom->find('tr.yfnc_tabledata1')->each(sub{
  my $tds = $_->find('td');
  my $anchor = $tds->[1]->at('a');
  my $link = $anchor->{href};
  my $name = $anchor->all_text;
  $tds = $tds->all_text;
  $tds->[1] = $name;
  push @$tds, $link;
  say $tds->join("\t");
});

say 'Next: ' . $dom->find('a')->first(sub{ $_->all_text eq 'Next'})->{href};

The find calls all use CSS3 selectors; the rest is just transforming the results.
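
Neither script covers the MySQL part of the question. Since both print the same tab-separated rows, one possible next step is to turn each row into a named record and insert it through DBI. This is only a rough sketch: the column names, the table name bonds, and the helpers line_to_record and store_record are all assumptions made for illustration, and $dbh is expected to be a DBI database handle that you create yourself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed column names, in the order the scripts print the fields;
# rename them to match your own table.
my @columns = qw(type issue price coupon maturity ytm current_yield rating callable link);

# Turn one tab-separated output line into a hash keyed by column name.
sub line_to_record {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line, -1;
    die 'Expected ' . @columns . ' fields, got ' . @values . "\n"
        unless @values == @columns;
    my %record;
    @record{@columns} = @values;
    return \%record;
}

# Insert one record through a DBI database handle. Placeholders (?) let the
# driver quote the values. "bonds" is a hypothetical table name.
sub store_record {
    my ($dbh, $record) = @_;
    my $sql = sprintf 'INSERT INTO bonds (%s) VALUES (%s)',
        join(', ', @columns), join(', ', ('?') x @columns);
    my $sth = $dbh->prepare($sql);
    return $sth->execute(@{$record}{@columns});
}

# Offline demonstration using one row from the output shown earlier
my $record = line_to_record(join "\t",
    'Corp', 'HESS CORP', '100.92', '7.000', '15-Feb-2014',
    '-8.351', '6.937', 'BBB', 'No', '/z2?ce=5415446151491606016451');
print "$record->{issue} matures $record->{maturity}\n";
```

To use it for real you would add use DBI, connect with something like DBI->connect('dbi:mysql:database=yourdb', $user, $password, { RaiseError => 1 }) (this needs DBD::mysql; the database name is a placeholder), and call store_record($dbh, $record) for every row. The maturity value is kept as text here, so it would need converting if the column is a DATE.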

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow