Question

My first time posting!

I'm using XPath and TagSoup to parse a web page and read in its data. The pages are news articles, which sometimes have links embedded in the content, and those links are what is breaking my program.

The XPath I'm using is storyPath = "//html:article//html:p//text()"; where the page has this structure:

<article ...>
   <p>Some text from the story.</p>
   <p>More of the story, which proves <a href="">what a great story this is</a>!</p>
   <p>More of the story without links!</p>
</article>

My code for the XPath evaluation is this:

NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);

    String tmp = n.toString();
    tmp = tmp.replace("[#text:", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replaceAll("’", "'");
    tmp = tmp.replaceAll("‘", "'");
    tmp = tmp.replaceAll("–", "-");
    tmp = tmp.replaceAll("¬", "");
    tmp = tmp.trim();

    story.add(tmp);
}

this.setStory(story);
...

private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }

    this.story = tmp.trim();
}

The output this gives me is:

Some text from the story.

More of the story, which proves 

what a great story this is

!

More of the story without links!

Does anyone know how I can eliminate this error? Am I taking a wrong approach somewhere? (I understand I could well be with the setStory code, but I don't see another way.)

Also, without the tmp.replace() calls, every result looks like [#text: what a great story this is], etc.

EDIT:

I am still having trouble, though possibly of a different kind. What is killing me here is again a link: on the BBC website the link sits on a separate line in the HTML source, so the text still comes out split across lines as described before (note that the original problem was fixed for the example given). The relevant section of the BBC page is:

    <p>    Former Queens Park Rangers trainee Sterling, who 

    <a  href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a> 

    had not started a senior match for the Reds before this season.
    </p>

which appears in my output as:

    Former Queens Park Rangers trainee Sterling, who 

    moved to the Merseyside club in February 2010 aged 15, 

         had not started a senior match for the Reds before this season.

Solution

For the problem in your edit, where newlines in the HTML source carry over into your text, you'll want to collapse them before printing. Instead of System.out.print(text.trim()); use System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));
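As a minimal sketch of that whitespace cleanup, the snippet below applies the same replaceAll call to a shortened version of the BBC paragraph. The class name WhitespaceNormalizer and the method name normalize are made up for illustration; only the regular expression comes from the answer.

```java
// Hypothetical helper: collapses runs of spaces, tabs and line breaks
// into a single space, as suggested in the answer above.
class WhitespaceNormalizer {

    static String normalize(String text) {
        // trim() drops leading/trailing whitespace first; the regex then
        // replaces every internal run of whitespace with one space.
        return text.trim().replaceAll("[ \t\r\n]+", " ");
    }

    public static void main(String[] args) {
        String raw = "    Former Queens Park Rangers trainee Sterling, who \n\n"
                   + "    moved to the Merseyside club in February 2010 aged 15, \n\n"
                   + "    had not started a senior match for the Reds before this season.\n    ";
        // Prints the paragraph on a single line with single spaces.
        System.out.println(normalize(raw));
    }
}
```

The same call can be applied to each paragraph's text before it is added to the story list.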

OTHER TIPS

First find the paragraphs with storyPath = "//html:article//html:p", then for each paragraph pull out all of its text with a second XPath query, concatenate the pieces without newlines, and add two newlines only at the end of the paragraph.
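Here is a rough sketch of that two-step approach. To keep it self-contained it makes some assumptions that differ from the question: it uses the standard javax.xml.xpath API rather than XPathAPI, parses a small well-formed string with the JDK's DocumentBuilder instead of TagSoup, and drops the html: namespace prefix. The class name ParagraphJoiner and the method extractStory are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: select paragraphs first, then join each
// paragraph's text nodes so embedded links do not split the line.
class ParagraphJoiner {

    static String extractStory(String xml) {
        try {
            // Assumption: input is well-formed XML without namespaces.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: one node per paragraph.
            NodeList paragraphs = (NodeList) xpath.evaluate(
                    "//article//p", doc, XPathConstants.NODESET);

            StringBuilder story = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // Step 2: all text nodes below this paragraph, in document order.
                NodeList texts = (NodeList) xpath.evaluate(
                        ".//text()", paragraphs.item(i), XPathConstants.NODESET);
                StringBuilder paragraph = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    paragraph.append(texts.item(j).getNodeValue());
                }
                // Two newlines only at the end of each paragraph.
                story.append(paragraph.toString().trim()).append("\n\n");
            }
            return story.toString().trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article>"
                + "<p>Some text from the story.</p>"
                + "<p>More of the story, which proves "
                + "<a href=\"\">what a great story this is</a>!</p>"
                + "<p>More of the story without links!</p>"
                + "</article>";
        System.out.println(extractStory(xml));
    }
}
```

With the question's TagSoup document, the same loop should work with the namespaced paths and XPathAPI.selectNodeList(paragraphNode, ...) for the inner query, though that variant is untested here.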

On another note, you shouldn't have to call replaceAll("’", "'"). That is a sure sign that you are opening your file with the wrong encoding. When you open the file you need to pass a Reader into TagSoup, initialized like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"),"Cp1252")); where you specify the correct character set for the file. A list of character sets is here: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html My guess is that it is Windows Latin-1 (Cp1252).

The [#text: thing is simply the toString() representation of a DOM Text node. The toString() method is intended to be used when you want a string representation of the node for debugging purposes. Instead of toString() use getTextContent() which returns the actual text.

If you don't want the link content to appear on separate lines, you could remove the //text() from your XPath and take the text content of the element nodes directly (getTextContent() on an element returns the concatenation of all its descendant text nodes):

String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);
    story.add(n.getTextContent().trim());
}

The fact that you are having to manually fix up things like "’" suggests your HTML is actually encoded in UTF-8 but you're reading it using a single-byte character set such as Windows-1252. Rather than trying to fix it after the fact, you should work out how to read the data in the correct encoding in the first place.
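To illustrate the kind of mismatch described above, this small sketch decodes the same UTF-8 bytes twice, once with the wrong charset and once with the right one. The class name EncodingDemo and the method decode are invented for the example, and the in-memory byte array stands in for the real HTML file; whether your pages are actually UTF-8 is something you would need to check (for example from the HTTP headers or the page's meta tag).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the charset given to the Reader decides whether
// a curly apostrophe survives or turns into three junk characters.
class EncodingDemo {

    static String decode(byte[] bytes, Charset charset) {
        try (Reader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "It’s" with a right single quotation mark (U+2019), stored as UTF-8.
        byte[] utf8 = "It\u2019s".getBytes(StandardCharsets.UTF_8);

        String wrong = decode(utf8, Charset.forName("windows-1252"));
        String right = decode(utf8, StandardCharsets.UTF_8);

        // Wrong charset: the 3-byte sequence becomes 3 separate characters.
        System.out.println(wrong.length());              // 6
        // Right charset: the apostrophe is a single character again.
        System.out.println(right.length());              // 4
        System.out.println(right.equals("It\u2019s"));   // true
    }
}
```

If the correct charset is passed to the Reader that feeds the parser, the replaceAll fix-ups for quote characters should no longer be needed.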

Licensed under: CC-BY-SA with attribution