Troubles with XPath and Links

Question 1

For the problem in your edit, where line breaks from the HTML source end up in your text output, collapse the whitespace before you print. Instead of System.out.print(text.trim()); use System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));
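As a rough sketch of that replacement in isolation (the class and method names here are just for illustration, not from your code):

```java
final class WhitespaceNormalizer {
    // Trim the ends, then collapse every run of spaces, tabs,
    // and line breaks into a single space.
    static String normalize(String text) {
        return text.trim().replaceAll("[ \t\r\n]+", " ");
    }

    public static void main(String[] args) {
        String raw = "  The quick\n    brown fox\r\n\tjumps.  ";
        System.out.println(normalize(raw)); // prints "The quick brown fox jumps."
    }
}
```

Using println instead of print means each normalized chunk still ends with exactly one line break of its own.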

Question 2

First find the paragraphs with storyPath = "//html:article//html:p". Then, for each paragraph, pull out all of its text with a second XPath query, concatenate the pieces without line breaks, and put two newlines just at the end of the paragraph.
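Here is a minimal sketch of that two-step approach. To keep it self-contained it makes some assumptions that differ from your setup: it uses the JDK's javax.xml.xpath instead of XPathAPI, it parses a small well-formed string instead of TagSoup output, and it drops the html: prefix because the sample has no namespace. The class name is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParagraphExtractor {
    // Returns the article text with each paragraph on one line,
    // followed by two newlines.
    static String extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: find the paragraphs.
            NodeList paragraphs = (NodeList) xpath.evaluate(
                    "//article//p", doc, XPathConstants.NODESET);

            StringBuilder out = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // Step 2: query the text nodes relative to this paragraph.
                NodeList texts = (NodeList) xpath.evaluate(
                        ".//text()", paragraphs.item(i), XPathConstants.NODESET);
                StringBuilder para = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    para.append(texts.item(j).getNodeValue());
                }
                // Collapse source line breaks, then end the paragraph.
                out.append(para.toString().trim().replaceAll("[ \t\r\n]+", " "))
                   .append("\n\n");
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><article><p>See the\n  <a href=\"x\">link</a> here.</p>"
                + "<p>Second.</p></article></html>";
        System.out.print(extract(xml));
    }
}
```

The link text stays on the same line as the rest of its paragraph because the pieces are joined before any line break is added.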

On another note, you shouldn't have to call replaceAll("‚Äô", "'"). That is a sure sign that you are opening your file with the wrong character encoding. When you open your file you need to pass a Reader into TagSoup, and you should initialize the Reader like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"),"Cp1252")); The second argument is where you specify the correct character set for the file. A list of character sets is here: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html My guess is that it is Windows Latin 1.
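To show the effect of the charset argument without needing a file on disk, here is a small sketch that reads from an in-memory stream instead. The class and method names are invented, and UTF-8 is only used as an example; substitute whatever encoding your file really has.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetReaderDemo {
    // Read everything from the stream, decoding bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for new FileInputStream("myfilename.html"):
        // the bytes of a snippet that was saved as UTF-8.
        byte[] bytes = "<p>it\u2019s</p>".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        // The curly apostrophe survives because the charset matches the bytes.
        System.out.println(text.equals("<p>it\u2019s</p>")); // prints "true"
    }
}
```

The same Reader can be handed to the parser in place of one built from the platform default encoding.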

Question 3

The [#text: thing is simply the toString() representation of a DOM Text node. The toString() method is only meant to give you a string representation of the node for debugging purposes. Instead of toString(), use getTextContent(), which returns the actual text.

If you don't want the link content to appear on separate lines, you could remove the //text() from your XPath and just take the text content of the element nodes directly (getTextContent() on an element returns the concatenation of all its descendant text nodes):

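// Select every paragraph element inside the article.
String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc, storyPath);

// Collect one string per paragraph; link text stays inline.
LinkedList<String> story = new LinkedList<String>();
for (int i = 0; i < nL.getLength(); i++) {
    Node n = nL.item(i);
    // getTextContent() joins all descendant text nodes of the paragraph.
    story.add(n.getTextContent().trim());
}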
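If you want to convince yourself of the getTextContent() behaviour on its own, this is a tiny sketch using only the JDK's DOM parser on a hand-written snippet (no TagSoup, no namespaces; the class name is invented):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TextContentDemo {
    // Returns the text of the first <p>, including text inside child elements.
    static String paragraphText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node p = doc.getElementsByTagName("p").item(0);
            return p.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>Read <a href=\"more.html\">more</a> here.</p>";
        System.out.println(paragraphText(xml)); // prints "Read more here."
    }
}
```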

The fact that you are having to manually fix up things like "‚Äô" suggests your HTML is actually encoded in UTF-8 but you're reading it using a single-byte character set such as Windows-1252. Rather than trying to fix it after the fact, you should instead work out how to read the data in the correct encoding in the first place.
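A quick way to see where that kind of garbage comes from is to decode the same bytes two ways. This sketch uses ISO-8859-1 as the single-byte charset purely because every JVM is guaranteed to have it; the exact junk characters you get depend on which single-byte charset is actually in play, and the class name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Decode raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A right single quotation mark (U+2019) is three bytes in UTF-8.
        byte[] utf8 = "\u2019".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 1
        System.out.println(wrong.length()); // prints 3: one junk char per byte
    }
}
```

Three junk characters in place of one apostrophe is the typical signature of UTF-8 being read as a single-byte encoding.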