Several fixes can be applied (without changing the modules you are currently using):

- use the `lxml` parser instead of `html5lib` - it is much, much faster
- parse only the relevant part of the document with `SoupStrainer` (note that `html5lib` doesn't support `SoupStrainer` - it will always parse the whole document, slowly)
Here's how the code would look after the changes. A brief performance test shows at least a 3x improvement (see the timing sketch further down):
```
import os
import urllib2
import xml.etree.cElementTree as ET
from datetime import date

from bs4 import SoupStrainer, BeautifulSoup
import nltk
from dateutil.rrule import rrule, DAILY
from nltk.corpus import stopwords


def main_parser():
    a = b = date(2014, 3, 27)
    articles = ET.Element("articles")
    for dt in rrule(DAILY, dtstart=a, until=b):
        url = "http://www.reuters.com/resources/archive/us/" + dt.strftime("%Y%m%d") + ".html"

        # parse only the "headlineMed" divs - everything else is skipped
        links = SoupStrainer("div", "headlineMed")
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=links)

        article_date = ET.SubElement(articles, "article_date")
        article_date.text = str(dt)
        for link in soup.find_all('a'):
            if 'video' not in link['href']:
                try:
                    # ElementTree accepts unicode text directly
                    article_time = ET.SubElement(article_date, "article_time")
                    article_time.text = link.text[-11:]

                    article_header = ET.SubElement(article_time, "article_name")
                    article_header.text = link.text

                    article_link = ET.SubElement(article_time, "article_link")
                    article_link.text = link['href']

                    try:
                        article_text = ET.SubElement(article_time, "article_text")
                        article_text.text = remove_stop_words(extract_article(link['href'])).encode('ascii', 'ignore')
                    except Exception:
                        pass
                except Exception:
                    pass

    tree = ET.ElementTree(articles)
    # expand "~" explicitly - ElementTree.write() doesn't do it for you
    tree.write(os.path.expanduser("~/Documents/test.xml"), "utf-8")


def extract_article(url):
    # parse only the paragraph tags of the article page
    paragraphs = SoupStrainer('p')
    soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=paragraphs)
    return soup.text


def remove_stop_words(text):
    words = nltk.word_tokenize(text)
    filtered_words = [w for w in words if w not in stopwords.words('english')]
    return ' '.join(filtered_words)
```
Note that I've removed the regular expression processing from `extract_article()` - it looks like you can just get the whole text from the `p` tags. I might have introduced some problems - please check that everything is correct.
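If you want to reproduce the performance comparison, here's a rough sketch of such a timing test (a sketch only - the exact numbers will vary with the page size and your machine; the archive URL is the same kind the code above builds):

```
import timeit
import urllib2

from bs4 import BeautifulSoup, SoupStrainer

# download one archive page once, so that only the parsing is measured
url = "http://www.reuters.com/resources/archive/us/20140327.html"
html = urllib2.urlopen(url).read()

def parse_html5lib():
    BeautifulSoup(html, "html5lib")

def parse_lxml_strained():
    BeautifulSoup(html, "lxml", parse_only=SoupStrainer("div", "headlineMed"))

print timeit.timeit(parse_html5lib, number=10)
print timeit.timeit(parse_lxml_strained, number=10)
```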
Another solution would be to use `lxml` for everything, from parsing (replacing `BeautifulSoup`) to creating the XML (replacing `xml.etree.ElementTree`).
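A sketch of what that could look like (the "headlineMed" class and archive URL are taken from the code above; the rest is illustrative and only collects the links):

```
import lxml.html
from lxml import etree

# lxml.html can fetch and parse the URL in one step
url = "http://www.reuters.com/resources/archive/us/20140327.html"
doc = lxml.html.parse(url)

articles = etree.Element("articles")
for link in doc.xpath('//div[@class="headlineMed"]/a'):
    href = link.get('href')
    if href and 'video' not in href:
        article_link = etree.SubElement(articles, "article_link")
        article_link.text = href

etree.ElementTree(articles).write("test.xml", encoding="utf-8", xml_declaration=True)
```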
Another solution (definitely the fastest) would be to switch to the Scrapy web-scraping framework. It is simple and very fast, and it comes with all the batteries you can imagine: link extractors, XML exporters, database pipelines and so on. Worth a look.
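For a taste, a bare-bones spider could look roughly like this (the spider name, field names and selectors are my assumptions, not something from your code):

```
import scrapy

class ReutersArchiveSpider(scrapy.Spider):
    name = "reuters_archive"
    start_urls = ["http://www.reuters.com/resources/archive/us/20140327.html"]

    def parse(self, response):
        # same "headlineMed" divs as in the BeautifulSoup version
        for link in response.css('div.headlineMed a'):
            href = link.css('::attr(href)').extract_first()
            if href and 'video' not in href:
                yield {
                    'article_name': link.css('::text').extract_first(),
                    'article_link': href,
                }
```

Running it with `scrapy runspider spider.py -o test.xml` should give you the XML export for free.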
Hope that helps.