Java: How do I extract separated text from nested <div> in HTML?

https://stackoverflow.com//questions/24008961

21-12-2019

Question

For example:

<div>
    this is first
    <div>
        second
   </div>
</div>

I am working on natural language processing and have to translate a website (without using Google Translate). To do that, I need to extract the two sentences "this is first" and "second" separately, so that I can replace each with text in another language in its respective div. If I extract the text of the first div, I get "this is first second", and if I use recursion to dig deeper, I only get "second".

Help me out please!

EDIT

Using the ownText() method causes a problem with the following HTML:

<div style="top:+0.2em; font-size:95%;">
    the
    <a href="/wiki/Free_content" title="Free content">
        free
    </a>
    <a href="/wiki/Encyclopedia" title="Encyclopedia">
        encyclopedia
    </a>
    that
    <a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">              
        anyone can edit
    </a>
    .
</div>

It will print:

the that.

free

encyclopedia

anyone can edit

But it must be:

the

that

encyclopedia

anyone can edit

Solution

Regarding "If I extract text for first it will show 'this is first second'":

Use ownText() instead of text() and you'll get only the text that the element contains directly.

Here's an example:

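import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextExample {
    public static void main(String[] args) {
        final String html = "<div>\n"
                + "    this is first\n"
                + "    <div>\n"
                + "        second\n"
                + "   </div>\n"
                + "</div>";

        Document doc = Jsoup.parse(html); // Get your Document from somewhere

        // select("div") matches both divs in document order; first() is the outer one
        Element first = doc.select("div").first();
        String firstText = first.ownText(); // Text owned directly by the outer div

        // "div > div" matches a div that is a direct child of another div, i.e. the inner one
        Element second = doc.select("div > div").first();
        String secondText = second.ownText();

        System.out.println("1st: " + firstText);
        System.out.println("2nd: " + secondText);
    }
}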
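For the mixed content shown in the question's edit, ownText() joins all of an element's direct text into one string, so "the", "that" and "." come out merged. One possible way around this, sketched here under the assumption that jsoup is on the classpath, is to walk the child nodes and treat each TextNode separately. The class and method names (JsoupTextNodes, collect, texts) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper, not part of jsoup itself.
class JsoupTextNodes {

    // Collects the non-blank text nodes under `node` in document order.
    static void collect(Node node, List<TextNode> out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                TextNode text = (TextNode) child;
                if (!text.isBlank()) {
                    out.add(text);
                }
            } else {
                collect(child, out); // descend into elements such as <a>
            }
        }
    }

    // Parses the HTML and returns the trimmed text of each text node.
    static List<String> texts(String html) {
        Document doc = Jsoup.parse(html);
        List<TextNode> nodes = new ArrayList<>();
        collect(doc.body(), nodes);
        List<String> result = new ArrayList<>();
        for (TextNode n : nodes) {
            result.add(n.text().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div>the <a>free</a> <a>encyclopedia</a> that "
                + "<a>anyone can edit</a>.</div>";
        for (String s : texts(html)) {
            System.out.println(s);
        }
    }
}
```

Because collect() keeps the TextNode objects themselves, each one can later be overwritten with its translation via TextNode.text(String) without disturbing the surrounding tags.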

OTHER TIPS

You can use an XML parser in whatever language you are using. Here is an example for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/

It seems like you're using textContent on the divs to extract the content, which gets you the content of that element and all of its descendant elements. (Java: this would be the getTextContent method on the Element.)

Instead, examine the childNodes (Java: the getChildNodes method on the Element). Each node has a property "nodeType" (Java: getNodeType) which you can look at to work out whether the node is a text node (Java: Node.TEXT_NODE) or an element (Java: Node.ELEMENT_NODE). So, to take your example, you have a tree of nodes which looks like this...

div (Element)
    this is first (TextNode)
    div (Element)
        second (TextNode)

The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".
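As a rough illustration of that tree, the sketch below parses the snippet with the JDK's built-in W3C DOM parser and prints each node with its kind. It assumes the markup is well-formed XML, which real-world HTML often is not, and the class and method names (NodeTreeDump, dump, dumpXml) are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting the node tree.
class NodeTreeDump {

    // Appends one line per element or non-blank text node, indented by depth.
    static void dump(Node node, int depth, StringBuilder out) {
        String indent = "    ".repeat(depth);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append(node.getNodeName()).append(" (Element)\n");
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (text.isEmpty()) {
                return; // skip whitespace-only text between tags
            }
            out.append(indent).append(text).append(" (TextNode)\n");
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            dump(children.item(i), depth + 1, out);
        }
    }

    // Parses well-formed markup and returns the indented tree as a string.
    static String dumpXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            dump(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(dumpXml("<div> this is first <div> second </div> </div>"));
    }
}
```

Whitespace-only text nodes (the line breaks and indentation between tags) are skipped here, since the parser reports them as text nodes too.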

So loop over the nodes in the outer div: if the node is a text node, translate it; otherwise, recurse into the Element. Note that there are other kinds of nodes, such as comments, but for your purposes you can probably ignore those.

This assumes you're using the W3C DOM API: http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
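The loop described above might look roughly like the following sketch. It assumes the page can be parsed as well-formed XML, and it stands in a caller-supplied function for the actual translation step. TextNodeTranslator, parse and translateInPlace are hypothetical names, not part of the DOM API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: replaces each text node with its "translation".
class TextNodeTranslator {

    // Parses well-formed markup into a W3C DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    // Visits every non-blank text node under `node` in document order,
    // records its trimmed text in `seen`, and overwrites it with translate(text).
    static void translateInPlace(Node node, UnaryOperator<String> translate, List<String> seen) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    seen.add(text);
                    child.setNodeValue(translate.apply(text));
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                translateInPlace(child, translate, seen); // recurse into nested elements
            }
            // Comments and other node types are ignored.
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div> this is first <div> second </div> </div>");
        List<String> seen = new ArrayList<>();
        // Upper-casing is only a placeholder for a real translation function.
        translateInPlace(doc.getDocumentElement(), String::toUpperCase, seen);
        System.out.println(seen);
    }
}
```

Note that this sketch trims each text node before replacing it, so the whitespace around the original text is lost. A real implementation would probably want to keep the leading and trailing whitespace so that words next to inline elements do not run together.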

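Elements divs = doc.getElementsByTag("div");

for (Element element : divs) {
    System.out.println(element.text());
}

This works if you are using the jsoup HTML parser, but note that text() returns the combined text of the element and all of its descendants, so the outer div prints "this is first second". Use ownText() in the loop to get only each div's own text.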

Licensed under: CC-BY-SA with attribution
