Question

I am parsing a dirty HTML page with XmlSlurper, and I get the following error:

ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        ...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".

I have the HTML that I feed to the parser, and I print it before parsing. If I open that output and go to line 1157, the line mentioned in the error, there is no 'src' there (although that string appears hundreds of times in the file). So I guess something extra is inserted (maybe <script> or something like that) that shifts the line numbers.

Is there a good way to find exactly the offending line or piece of HTML?


Solution

Which SAX parser are you using? HTML is not strict XML, so using XmlSlurper with the default parser will probably keep producing errors like this one.

A cursory Google search for "Groovy html slurper" led me to the article "HTML Scraping With Groovy", which points to a SAX parser called TagSoup.

Give that a whirl and see if it parses the dirty page.
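
As a rough sketch of what that could look like: TagSoup's `org.ccil.cowan.tagsoup.Parser` is a SAX `XMLReader`, so it can be handed to the `XmlSlurper` constructor. The Maven coordinates in `@Grab`, the `groovy.xml` import (Groovy 3 or later) and the sample markup are assumptions made for illustration, not something taken from the linked article.

```groovy
// Assumption: TagSoup 1.2.1 from Maven Central, fetched via Grape.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import

// Made-up dirty HTML: unclosed <p>, <br> and <img>, unquoted attribute values.
def dirtyHtml = '''<html><body>
<p>first paragraph<br>
<img src=logo.png alt=logo>
</body></html>'''

// TagSoup repairs the markup and feeds well-formed SAX events to XmlSlurper.
def page = new XmlSlurper(new Parser()).parseText(dirtyHtml)

// Search depth-first, because TagSoup decides where the repaired <img> ends up.
def img = page.'**'.find { it.name() == 'img' }
println img.@src.text()
```

Because TagSoup rewrites the document as it goes, this avoids the parse error but does not by itself tell you which part of the input was malformed.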

Other tips

You could add an attribute named _srmLineNum to each element, recording the line it was parsed from, which can then be read back later.

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Attributes2Impl;
import javax.xml.parsers.ParserConfigurationException;

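// Note: Groovy 4 needs "import groovy.xml.XmlSlurper"; the groovy.util class
// that older versions import implicitly was removed.
class MySlurper extends XmlSlurper {
    public static final String LINE_NUM_ATTR = "_srmLineNum"
    // Supplied by the SAX parser; reports the current position in the input.
    Locator locator

    public MySlurper() throws ParserConfigurationException, SAXException {
        super();
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
        // Copy the parsed attributes, then append the current line number.
        Attributes2Impl newAttrs = new Attributes2Impl(attrs);
        // Empty namespace URI and type "CDATA", so the attribute can be read as
        // @_srmLineNum even when the element itself is namespaced.
        String lineNum = locator != null ? String.valueOf(locator.getLineNumber()) : "";
        newAttrs.addAttribute("", LINE_NUM_ATTR, LINE_NUM_ATTR, "CDATA", lineNum);
        super.startElement(uri, localName, qName, newAttrs);
    }
}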

def text = '''
<root>
  <a>one!</a>
  <a>two!</a>
</root>'''

def root = new MySlurper().parseText(text)

root.a.each { println it.@_srmLineNum }

The above adds the line-number attribute to every element that parses successfully. For the failing spot itself, you could try setting your own error handler, which can read the line number from the locator.
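
A minimal sketch of locating the failing spot, under a few assumptions: it catches the `SAXParseException` instead of installing a separate `ErrorHandler`, and it relies on the exception's own `getLineNumber()` and `getColumnNumber()`. The helper name `locateParseError` and the sample page are hypothetical. The sample reproduces the same message with inline JavaScript that splits a tag name (`"<scr" + "ipt"`), which is one plausible cause of this error but is not confirmed by the question.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import
import org.xml.sax.SAXParseException

// Made-up page: the split "<scr" inside the script is not well-formed XML.
def html = '''<html>
<body>
<script>document.write("<scr" + "ipt src='x.js'></scr" + "ipt>");</script>
</body>
</html>'''

// Hypothetical helper: returns null if the text parses, otherwise a map
// with the reported position and the matching line of the input.
def locateParseError(String text) {
    try {
        new XmlSlurper().parseText(text)
        return null
    } catch (SAXParseException e) {
        def lines = text.readLines()
        int line = e.lineNumber   // 1-based, counted in the text given to the parser
        def source = (line >= 1 && line <= lines.size()) ? lines[line - 1] : ''
        return [line: line, column: e.columnNumber, source: source, message: e.message]
    }
}

def problem = locateParseError(html)
if (problem) {
    println "Line ${problem.line}, column ${problem.column}: ${problem.message}"
    println problem.source
}
```

The line number refers to the exact text handed to the parser, so this only matches what you see in an editor if you look up the line in that same string rather than in a separately saved copy of the page.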

License: CC-BY-SA with attribution
Not affiliated with StackOverflow