Question

I'm writing a shell script that takes the name of a mountain (8000 m peaks only) as a parameter and prints the name or names of the first people to climb it. I found a page with the information I need, and I can download it with curl, but I don't know my way around regex very well. Given the mountain's name, how can I extract the climbers from HTML like the sample below? Thanks in advance.

site: http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/

html sample

    <p class="wp-caption-text">Everest</p></div></div></div><p><strong>Other names: </strong>Sagamartha, Chomolangma or Qomolangma<br
/> <strong>Altitude:</strong> 8848 m<br
/> <strong>Location: </strong>Tibet / Nepal<br
/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br
/> <strong>Expedition</strong><strong>: </strong>New Zeeland/India</p><blockquote><p>&nbsp;</p><p><strong>Difficulty</strong> : <em>Mostly a non-technical climb regardless on which of the two normal routes you choose. On the south you have to deal with a dangerous ice fall and The Hillary Step, a short section of rock, on the north side there are some short technical passages. On both routes (permanent) fixed ropes are placed at the tricky sections. The altitude is main obstacle. Nowadays also crowding is mentioned as a factor of difficulty</em>.</p>

I found another site that may be easier to parse: http://www.alpineascents.com/8000m-peaks.asp

html sample

<tr>
         <td><strong>Everest</strong></td>
         <td>8,850m <br /></td>
         <td>29,035ft</td>
         <td><div align="center">Nepal/Tibet </div></td>
         <td>1953; Sir E. Hillary, T. Norgay</td>
       </tr>

Solution

Using the first HTML sample:

grep '<strong>First ascent:</strong>' | sed 's/.*by \([^>]*\)<.*/\1/'

Output:

Sir Edmund Hillary and Tenzing Norgay
Achille Compagnoni and Lino Lacedelli
George Band and Joe Brown
Kurt Diemberger, Peter Diener, Nawang Dorje, Nima Dorje, Ernst Forrer and Albin Schelbert
Hermann Buhl
Maurice Herzog and Louis Lachenal
Andrew Kauffman and Peter Schoening
Hermann Buhl, Kurt Diemberger, Marcus Schmuck and Fritz Wintersteller

It finds every line with the 'First ascent' label and captures the text between "by " and the opening < of the following <br /> tag.
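To see the sed capture on its own, here is a minimal sketch that runs the same expression on the single 'First ascent' line from the question's sample. The function name extract_climbers is only an illustrative helper, not part of the original answer.

```shell
# extract_climbers: hypothetical helper. Reads lines on stdin and replaces each
# line with the text captured between "by " and the next "<".
extract_climbers() {
    sed 's/.*by \([^>]*\)<.*/\1/'
}

# The "First ascent" line from the question's sample, fed in directly.
printf '%s\n' '/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br' | extract_climbers
```

The leading .* consumes everything up to "by ", the \( \) group captures the names, and the trailing <.* discards the rest of the line.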

Edit:

The original answer doesn't filter by the name of the mountain. In addition, the pattern <strong>First ascent:</strong> is too specific for this page (sometimes there is a space after the colon). The following should work.

grep -i "$1" -A3 | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'

Explanation: grep -i "$1" -A3 selects the line containing the mountain's name. -i makes the search case-insensitive, and -A3 also prints the 3 lines following each match, which include the line listing the climbers. The quotes around "$1" keep mountain names that contain spaces as a single argument.
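Put together, the pipeline could be wrapped like this. It is only a sketch: the function name first_ascent is made up for illustration, it expects the page's HTML on stdin, and it assumes the 'First ascent' line always sits within 3 lines of the mountain's name, as in the sample from the question.

```shell
#!/bin/sh
# first_ascent NAME: hypothetical wrapper around the pipeline above.
# Reads the downloaded HTML on stdin and prints the climbers for NAME.
first_ascent() {
    # 1. keep each line mentioning NAME (any case) plus the 3 lines after it
    # 2. keep only the "First ascent" line
    # 3. reduce that line to the text between "by " and the next "<"
    grep -i -A3 -- "$1" \
        | grep 'First ascent:' \
        | sed 's/.*by \([^>]*\)<.*/\1/'
}

# Possible use inside the script (needs network access):
#   curl -s 'http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/' | first_ascent "$1"
```

Reading from stdin keeps the download separate from the parsing, so the function can be tried on a saved copy of the page as well.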

OTHER TIPS

You can use my Xidel which does pattern matching on the html tree:

xidel http://www.alpineascents.com/8000m-peaks.asp -e "<tr><strong>Everest</strong><td/>{3}<td>{.}</td></tr>"

Just 109 characters...

(Replace Everest with $1 if it is inside a script with that as parameter)

Or for the other site:

xidel http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/ -e "<p class=\"wp-caption-text\">Everest</p><strong>First ascent:</strong>{text()}"

I'd go with the first page in your question. Here's a Java scraper for the file downloaded with curl; it writes each mountain name and its first climbers to climbers.txt:

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

public class PageInfo {
    public static void main(String[] args) throws IOException {
        // args[0] is the HTML file downloaded with curl
        try (Scanner scan = new Scanner(new File(args[0]));
             PrintWriter output = new PrintWriter("climbers.txt")) {
            while (scan.hasNextLine()) {
                String s = scan.nextLine();
                if (s.contains("wp-caption-text\">")) {
                    // Caption line: the mountain name runs from the class attribute to </p>
                    s = s.split("wp-caption-text\">", 2)[1];
                    if (s.length() > 1) output.println(s.split("</p>")[0]);
                } else if (s.contains("First ascent:") && s.contains("by ")) {
                    // Climbers follow "by " and run up to the next "<br"
                    s = s.split("by ")[1];
                    output.println(s.split("<br")[0]);
                }
            }
        }
    }
}
Licensed under: CC-BY-SA with attribution