HTML parsing with grep and regex

Question 1

Using the first HTML sample:

grep '<strong>First ascent:</strong>' | sed 's/.*by \([^>]*\)<.*/\1/'

Output:

Sir Edmund Hillary and Tenzing Norgay
Achille Compagnoni and Lino Lacedelli
George Band and Joe Brown
Kurt Diemberger, Peter Diener, Nawang Dorje, Nima Dorje, Ernst Forrer and Albin Schelbert
Hermann Buhl
Maurice Herzog and Louis Lachenal
Andrew Kauffman and Peter Schoening
Hermann Buhl, Kurt Diemberger, Marcus Schmuck and Fritz Wintersteller

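It selects every line containing the 'First ascent' label, then keeps only the text between "by " and the following <br /> tag. The command reads the HTML on standard input, so pipe the downloaded page into it.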
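To see the pipeline in isolation, here is a minimal sketch run against a made-up two-entry sample. The markup in `sample` is only my assumption about what the page looks like (a caption line followed by the 'First ascent' line), so the real page may differ in detail.

```shell
#!/bin/sh
# Hypothetical sample: the real page's markup may differ from this.
sample='<p class="wp-caption-text">Everest</p>
<strong>Location:</strong> Nepal/Tibet<br />
<strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br />
<p class="wp-caption-text">K2</p>
<strong>Location:</strong> Pakistan/China<br />
<strong>First ascent:</strong> July 31, 1954 by Achille Compagnoni and Lino Lacedelli<br />'

# grep keeps only the labelled lines; sed drops everything up to "by "
# and everything from the next "<" onwards, leaving just the names.
climbers=$(printf '%s\n' "$sample" |
  grep '<strong>First ascent:</strong>' |
  sed 's/.*by \([^>]*\)<.*/\1/')

printf '%s\n' "$climbers"
```

Because the leading `.*` is greedy, sed anchors on the last "by " in the line, so a "by " earlier in the line would not confuse it, but one inside the list of names would.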

Edit:

The original answer doesn't filter by the name of the mountain. In addition, matching the full <strong>First ascent:</strong> tag is too strict for this page, because some entries have a space after the colon. The following should work:

grep -i "$1" -A3 | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'

Explanation: grep -i "$1" -A3 selects the line containing the mountain name. -i makes the search case insensitive. -A3 also prints the three lines after each match, which include the line listing the climbers. Quoting "$1" keeps a mountain name that contains spaces together as a single argument.
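As a rough sketch of how this behaves inside a script, the same pipeline is wrapped below in a function whose first argument plays the role of "$1". The function name `first_ascent` and the sample markup are my own assumptions for illustration, not taken from the real page.

```shell
#!/bin/sh
# Hypothetical sample: note the space after the colon in the first entry.
sample='<p class="wp-caption-text">Everest</p>
<strong>Elevation:</strong> 8,848 m<br />
<strong>First ascent: </strong>May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br />
<p class="wp-caption-text">Nanga Parbat</p>
<strong>Elevation:</strong> 8,126 m<br />
<strong>First ascent:</strong> July 3, 1953 by Hermann Buhl<br />'

# first_ascent NAME: reads HTML on stdin and prints the climbers for NAME.
first_ascent() {
  # -i: ignore case; -A3: also keep the 3 lines after each matching line.
  grep -i -A3 "$1" | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'
}

printf '%s\n' "$sample" | first_ascent 'everest'
printf '%s\n' "$sample" | first_ascent 'Nanga Parbat'
```

The -A3 window is a fixed guess: if the 'First ascent' line sits more than three lines below the name, nothing is printed, and if the name also matches elsewhere on the page, extra lines may come through.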

Question 2

You can use my tool Xidel, which does pattern matching on the HTML tree:

xidel http://www.alpineascents.com/8000m-peaks.asp -e "<tr><strong>Everest</strong><td/>{3}<td>{.}</td></tr>"

Just 109 characters...

(Replace Everest with $1 if the command runs inside a script that takes the mountain name as its first parameter.)

Or for the other site:

xidel http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/ -e "<p class=\"wp-caption-text\">Everest</p><strong>First ascent:</strong>{text()}"

Question 3

First, go with the first page in your question. Here is a Java scraper for the file downloaded with curl:

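import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

public class PageInfo {
    public static void main(String[] args) throws IOException {
        // args[0] is the HTML file you downloaded with curl.
        try (Scanner scan = new Scanner(new File(args[0]));
             PrintWriter output = new PrintWriter("climbers.txt")) {
            while (scan.hasNextLine()) {
                String s = scan.nextLine();
                if (s.contains("wp-caption-text\">")) {
                    // Caption line: print the mountain name that precedes </p>.
                    String[] parts = s.split("wp-caption-text\">");
                    if (parts.length > 1 && parts[1].length() > 1) {
                        output.println(parts[1].split("</p>", 2)[0]);
                    }
                } else if (s.contains("First ascent:")) {
                    // Climbers line: print the names between "by " and "<br".
                    String[] parts = s.split("by ");
                    if (parts.length > 1) {
                        output.println(parts[1].split("<br", 2)[0]);
                    }
                }
            }
        } // try-with-resources closes the Scanner and the PrintWriter
    }
}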