Unable to retrieve web data in java using tidy and Xpath

https://stackoverflow.com/questions/11432121

20-06-2021
|

문제

What I am trying to do is scrape a simple inner HTML from a XHTML file. I have narrowed down my search to the element node, but I fail to retrieve the information.

PLEASE NOTE: the element node has no child node. I get a null pointer exception for doing that

here is the HTML SNIPPET

    <div id="dvTitle" class="titlebtmbrdr01" style="line-height: 22px;">BAJAJ AUTO LTD.       </div>

PLease also NOTE that this file has namespace as http://www.w3.org/1999/xhtml

You can see that I have the div element from which I want BAJAJ AUTO LTD.

Here is the code that i am using

    import java.io.IOException;
     import java.net.MalformedURLException; 
      import java.net.URL;
      import java.util.Vector;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
      import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;

    import jxl.read.biff.BiffException;
    import jxl.write.WriteException;
    import jxl.write.biff.RowsExceededException;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
      import org.w3c.dom.Node;
      import org.w3c.dom.NodeList;
    import org.w3c.dom.Text;

    import com.sun.org.apache.xml.internal.serialize.Serializer;


    public class BSEQuotesExtractor implements valueExtractor {

@Override
public Vector<String> getName(Document d) throws XPathExpressionException,            RowsExceededException, BiffException, WriteException, IOException {
    // TODO Auto-generated method stub
    XPathFactory factory = XPathFactory.newInstance();
    XPath xpath = factory.newXPath();
    xpath.setNamespaceContext(new MynamespaceContext());


    Object result = xpath.evaluate("//*[@id='dvTitle']",d, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;

    System.out.println(nodes.getLength());
    System.out.println(nodes.item(0).getNodeName());
    System.out.println(nodes.item(0).getAttributes().item(1).getNodeName());
    System.out.println(nodes.item(0).getAttributes().item(1).getNodeValue());
    System.out.println(nodes.item(0).getTextContent());

    return null;
}

public static void main(String[] args) throws MalformedURLException, IOException, XPathExpressionException, RowsExceededException, BiffException, WriteException{
    BSEQuotesExtractor q = new BSEQuotesExtractor();
    DOMParser parser = new DOMParser(new URL("http://www.bseindia.com/bseplus/StockReach/StockQuote/Equity/BAJAJ%20AUTO%20LTD/BAJAJAUT/532977/Scrips").openStream());
    Document d = parser.getDocument();
    q.getName(d);

}

        }

And this is the output I get

1
div
dvTitle
null

Now why do I get that null? I should get BAJAJ AUTO LTD.

해결책

When I open the page your code references, that div actually is empty for me:

<div class="titlebtmbrdr01" id="dvTitle" style="line-height: 22px;"></div>

So perhaps you should save the page content to some file to examine if it is the same for you. If it is, but your browser displays things differently, then figure out what combination of cookies and other headers makes a difference there.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow