Question

I inherited a data store that used plain text files to save documents.

Documents had some attributes (date, title, and text), and these were encoded in a filename: <date>-<title>.txt, with the body of the file being the text.

In reality, however, documents in the system have many more attributes, and still more have been proposed.

It seemed logical to switch to an XML format, and I have done so, with each document now encoded in its own XML file.

However, reading the files in from XML is now RIDICULOUSLY slow! (Where 2000 articles in the .txt format took seconds to read, 2000 articles in the .xml format now take more than 10 minutes.)

I WAS using a DOM parser, and after I discovered how slow the reading was, I switched to a SAX parser. However, it's STILL nearly as slow (somewhat faster, but still around 10 minutes).

Is XML JUST THAT slow, or am I doing something strange? Any thoughts would be appreciated.

The system is written in Java SE 1.6. The parser is created like this:


/*
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
*/
  SAXParserFactory factory = SAXParserFactory.newInstance();
  SAXParser saxParser;
  try {
    saxParser = factory.newSAXParser();
    ArticleSaxHandler handler = new ArticleSaxHandler();
    saxParser.parse(is, handler);
    return handler.getArticle();
  } catch (ParserConfigurationException e) {
    throw new IOException(e);
  } catch (SAXException e) {
    throw new IOException(e);
  } finally { 
    if (is != null) {
      try {
        is.close();
      } catch (IOException e) {
        logger.error(e);
      }
    }
  }
}

private class ArticleSaxHandler extends DefaultHandler {
        private URI uri = null;
        private String source = null;
        private String author = null;
        private DateTime articleDatetime = null;
        private DateTime processedDatetime = null;
        private String title = null;
        private String text = null;
        private ArticleElement currentElement;
        private final StringBuilder builder = new StringBuilder();

        public Article getArticle() {
            return new Article(uri, source, author, articleDatetime, processedDatetime, title, text);
        }

        /** Receive notification of the start of an element. */
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (builder.length() != 0) {
                throw new RuntimeException(new SAXParseException(currentElement + " was not finished before " + qName + " was started", null));
            }
            currentElement = ArticleElement.getElement(qName);
        }

        public void endElement(String uri, String localName, String qName) {
            final String elementText = builder.toString();
            builder.delete(0, builder.length());
            if (currentElement == null) {
                return;
            }
            switch (currentElement) {
                case ARTICLE:
                    break;
                case URI:
                    try {
                        this.uri = new URI(elementText);
                    } catch (URISyntaxException e) {
                        throw new RuntimeException(e);
                    }
                    break;
                case SOURCE:
                    source = elementText;
                    break;
                case AUTHOR:
                    author = elementText;
                    break;
                case ARTICLE_DATE_TIME:
                    articleDatetime = getDateTimeFormatter().parseDateTime(elementText);
                    break;
                case PROCESSED_DATE_TIME:
                    processedDatetime = getDateTimeFormatter().parseDateTime(elementText);
                    break;
                case TITLE:
                    title = elementText;
                    break;
                case TEXT:
                    this.text = elementText;
                    break;
                default:
                    throw new IllegalStateException("Unexpected ArticleElement: " + currentElement);
            }
            currentElement = null;
        }

        /** Receive notification of character data inside an element. */
        public void characters(char[] ch, int start, int length) {
            builder.append(ch, start, length);
        }

        public void error(SAXParseException e) {
            fatalError(e);
        }

        public void fatalError(SAXParseException e) {
            logger.error("currentElement: " + currentElement + " ||builder: " + builder.toString() + "\n\n" + e.getMessage(), e);
        }
    }

    private enum ArticleElement {
        ARTICLE(ARTICLE_ELEMENT_NAME), URI(URI_ELEMENT_NAME), SOURCE(SOURCE_ELEMENT_NAME), AUTHOR(AUTHOR_ELEMENT_NAME), ARTICLE_DATE_TIME(
                ARTICLE_DATETIME_ELEMENT_NAME), PROCESSED_DATE_TIME(PROCESSED_DATETIME_ELEMENT_NAME), TITLE(TITLE_ELEMENT_NAME), TEXT(TEXT_ELEMENT_NAME);
        private String name;

        private ArticleElement(String name) {
            this.name = name;
        }

        public static ArticleElement getElement(String qName) {
            for (ArticleElement element : ArticleElement.values()) {
                if (element.name.equals(qName)) {
                    return element;
                }
            }
            return null;
        }
    }


Solution

Reading data from an unbuffered stream could explain these performance problems. This is not directly related to the change from text to XML, but your new implementation may, by chance, no longer use a BufferedInputStream.


Following that path, check specifically whether the stream `is` passed to the parser here is buffered:

saxParser.parse(is, handler);
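
If it turns out the stream is not buffered, wrapping it before handing it to the parser is a small change. The following is only a minimal sketch of that idea, not the asker's actual code: `BufferedSaxExample`, `CountingHandler`, and `countElements` are hypothetical names, and the handler merely counts elements as a stand-in for `ArticleSaxHandler`. It avoids anything newer than Java SE 1.6.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class BufferedSaxExample {

    /** Hypothetical stand-in for ArticleSaxHandler: it only counts elements. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            elements++;
        }
    }

    /** Parses the stream through a BufferedInputStream and returns the number of elements seen. */
    static int countElements(InputStream raw) throws IOException {
        // Wrap only if the caller has not already supplied a buffered stream.
        InputStream is = (raw instanceof BufferedInputStream) ? raw : new BufferedInputStream(raw);
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            CountingHandler handler = new CountingHandler();
            saxParser.parse(is, handler);
            return handler.elements;
        } catch (ParserConfigurationException e) {
            throw new IOException(e);
        } catch (SAXException e) {
            throw new IOException(e);
        } finally {
            is.close(); // closing the wrapper also closes the underlying stream
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory document keeps the sketch self-contained; in the real
        // system the raw stream would more likely be a FileInputStream.
        InputStream in = new ByteArrayInputStream(
                "<article><title>t</title><text>x</text></article>".getBytes("UTF-8"));
        System.out.println(countElements(in)); // prints 3
    }
}
```

In the code from the question, the equivalent change would be to construct `is` as `new BufferedInputStream(...)` around whatever stream is currently opened, assuming that stream is a plain `FileInputStream` or similar.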
Licensed under: CC-BY-SA with attribution