Question

I'm getting a BadStatusLine: '' error when calling tldextract.extract(url):

subdomain, domain, tld = tldextract.extract(url)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 194, in extract
    return TLD_EXTRACTOR(url)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 128, in __call__
    return self._extract(netloc)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 132, in _extract
    registered_domain, tld = self._get_tld_extractor().extract(netloc)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in _get_tld_extractor
    tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in <genexpr>
    tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 204, in _PublicSuffixListSource
    page = _fetch_page('http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1')
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 198, in _fetch_page
    return unicode(urllib2.urlopen(url).read(), 'utf-8')
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''

Solution

This happens because the mozilla.org URL in your stack trace (http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1) is unavailable, and tldextract tries to update from that URL on first install. This live update can be disabled (see below), but the uncaught exception is a tldextract bug: it should only log the exception and seamlessly fall back to the package's bundled PSL (Public Suffix List).

This is fixed in tldextract 1.2.1, just published to PyPI, which switches to the GitHub mirror of the PSL. Upgrading should therefore work around the uncaught exception.

Another release, coming soon, will avoid future uncaught exceptions when a source such as the GitHub PSL mirror is unavailable.
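
The "log it and fall back" behaviour described above could look roughly like the following. This is only a sketch under assumptions, not tldextract's actual code: load_suffix_text, fetch_remote, bundled_text and broken_fetch are hypothetical names introduced for illustration.

```python
import logging

LOG = logging.getLogger("psl_fallback_sketch")


def load_suffix_text(fetch_remote, bundled_text):
    """Return the live suffix list if it can be fetched, else the bundled copy.

    fetch_remote: hypothetical zero-argument callable that downloads the list
                  and may raise (for example httplib.BadStatusLine).
    bundled_text: the snapshot shipped inside the package.
    """
    try:
        return fetch_remote()
    except Exception:
        # Log the failure instead of letting it propagate to the caller.
        LOG.exception("Live suffix list fetch failed; using bundled snapshot")
        return bundled_text


def broken_fetch():
    # Stand-in for a server that closes the connection without answering.
    raise IOError("simulated empty status line")


# The fetch fails, so the caller still gets usable data.
result = load_suffix_text(broken_fetch, "com\norg\n")
```

With this shape, a dead URL costs one log entry rather than a crash in every caller of extract().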

Turning off the default fetch

You can avoid this problem in the previous version by turning off the default on-first-install fetch. Construct your own TLDExtract callable with fetch=False. From the docs:

import tldextract
no_fetch_extract = tldextract.TLDExtract(fetch=False)
no_fetch_extract('http://www.google.com')

OTHER TIPS

The package is trying to download a public suffix list from a URL that currently does not work:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

Mozilla has blocked the URL for now because of a DDoS attack on it.

This has already been reported to the project, and a fix has been proposed, although that fix only works if you already have a cached copy of the public suffix list.

In the meantime, use the publicsuffix package instead; it bundles the data in the package itself and does not require a URL request.

Update: Mozilla now hosts the file at https://publicsuffix.org/list/effective_tld_names.dat, and any request to the MXR source repository without a mxr.mozilla.org Referer header is redirected to that new location.
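
For reference, that file is plain text: one rule per line, with // comment lines and blank lines in between. A minimal sketch of reading it, assuming you already have the text in hand, might look like this. The parse_suffix_rules helper and the SAMPLE string are made up for illustration, and this is not necessarily how tldextract or publicsuffix parse the list.

```python
def parse_suffix_rules(text):
    """Collect the rules from Public Suffix List text, skipping comments."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # blank line or comment
        # A rule ends at the first whitespace on its line.
        rules.add(line.split()[0])
    return rules


# Illustrative sample in the list's format, not a verbatim excerpt.
SAMPLE = """\
// ===BEGIN ICANN DOMAINS===
com

// uk : illustrative comment
uk
co.uk
"""

rules = parse_suffix_rules(SAMPLE)
```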

This is due to http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 not being served.

If you want to keep using tldextract to obtain the subdomain, domain, and tld, a temporary solution is to use a local cache, e.g. in project/tldextractor/__init__.py:

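import os
import tldextract

# Keep the cache file next to this module so no network fetch is needed.
TLD_CACHE_PATH = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), 'tldextract_cache')
# fetch=False stops tldextract from trying to download the list.
tldextractor = tldextract.TLDExtract(cache_file=TLD_CACHE_PATH, fetch=False)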

In project/tldextractor/tldextract_cache, place the contents of https://gist.github.com/AJamesPhillips/6899560

then:

from .tldextractor import tldextractor
tldextractor('http://subdomain.domain.tld')
Licensed under: CC-BY-SA with attribution