lxml não funciona com Django, scraperwiki

https://stackoverflow.com//questions/24005988

20-12-2019
|

Pergunta

Estou trabalhando em um aplicativo Django que passa pelo site da Assembleia Geral de Illinois para extrair alguns PDFs.Embora implantado em minha área de trabalho, ele funciona bem até o tempo limite do urllib2.Quando tento implantar em meu servidor Bluehost, a parte lxml do código gera um erro.Qualquer ajuda seria apreciada.

import scraperwiki
from bs4 import BeautifulSoup
import urllib2
import lxml.etree
import re
from django.core.management.base import BaseCommand
from legi.models import Votes

class Command(BaseCommand):
    def handle(self, *args, **options):
        chmbrs =['http://www.ilga.gov/house/', 'http://www.ilga.gov/senate/']
        for chmbr in chmbrs:
            site = chmbr    
            url = urllib2.urlopen(site)
            content = url.read()
            soup = BeautifulSoup(content)
            links = []
            linkStats = []
            x=0
            y=0
            table = soup.find('table', cellpadding=3)
            for a in soup.findAll('a',href=True):
                if re.findall('Bills', a['href']):
                    l = (site + a['href']+'&Primary=True')
                    links.append(str(l))
                    x+=1
                    print x
            for link in links:
                url = urllib2.urlopen(link)
                content = url.read()
                soup = BeautifulSoup(content)
                table = soup.find('table', cellpadding=3)
                for a in table.findAll('a',href=True):
                    if re.findall('BillStatus', a['href']):
                        linkStats.append(str('http://ilga.gov'+a['href']))
            for linkStat in linkStats:
                url = urllib2.urlopen(linkStat)
                content = url.read()
                soup = BeautifulSoup(content)
                for a in soup.findAll('a',href=True):
                    if re.findall('votehistory', a['href']):
                        vl = 'http://ilga.gov/legislation/'+a['href']
                        url = urllib2.urlopen(vl)
                        content = url.read()
                        soup = BeautifulSoup(content)
                        for b in soup.findAll('a',href=True):
                            if re.findall('votehistory', b['href']):
                                llink = 'http://ilga.gov'+b['href']
                                try:
                                    u = urllib2.urlopen(llink)
                                    x = scraperwiki.pdftoxml(u.read())
                                    root = lxml.etree.fromstring(x)
                                    pages = list(root)
                                    chamber = str()
                                    for page in pages:
                                        print "working_1"
                                        for el in page:
                                            print "working_2"
                                            if el.tag == 'text':
                                                if int(el.attrib['top']) == 168:
                                                    chamber = el.text
                                                if re.findall("Senate Vote", chamber):
                                                    if int(el.attrib['top']) >= 203 and int(el.attrib['top']) < 231:
                                                        title = el.text
                                                        if (re.findall('House', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('Senate', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >350 and int(el.attrib['top']) <650:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        vs = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in vs:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()    
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'                                                       
                                                elif int(el.attrib['top']) == 189:
                                                    chamber = el.text
                                                if re.findall("HOUSE ROLL CALL", chamber):
                                                    if int(el.attrib['top']) > 200 and int(el.attrib['top']) <215:
                                                        title = el.text
                                                        if (re.findall('HOUSE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('SENATE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >385 and int(el.attrib['top']) <1000:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        votes = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in votes:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'

                                except:
                                    pass

EDITAR 1Aqui está o rastreamento de erro

    Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home7/maythirt/GAB/legi/management/commands/vote.py", line 51, in handle
    root = lxml.etree.fromstring(x)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
  File "parser.pxi", line 1674, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101299)
  File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:96481)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91939)
lxml.etree.XMLSyntaxError: None

Solução

Como Jonathan mencionou, pode ser o resultado de scraperwiki.pdftoxml() isso está causando um problema.Você pode exibir ou registrar o valor de x para confirmá-lo.

Especificamente, pdftoxml() executa um programa externo pdftohtml e usa arquivos temporários para armazenar PDF e XML.

O que eu também verificaria é:

É pdftohtml configurado corretamente no seu servidor?
Em caso afirmativo, a conversão para XML funciona se você executá-la diretamente em um shell no servidor com o PDF em que o código está falhando?O comando que está executando é pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "input.pdf" "output.xml"

Se houver um problema ao executar o comando diretamente, então é aí que reside o seu problema.Com o caminho pdftohtml corre no scraperwiki código, não há uma maneira fácil de saber se o comando falha.

Outras dicas

A maneira como eu faria isso seria adicionar uma tentativa:exceto:cláusula e quando você receber o erro, basta salvar o arquivo xml, bem como o link, em seu disco rígido.Dessa forma você pode inspecionar o arquivo xml separadamente.

Pode ser que scraperwiki.pdftoxml crie um arquivo xml ilegal por algum motivo.Isso aconteceu comigo ao usar outra ferramenta pdftoxml.

E, por favor, refatore seu código em mais funções, será muito mais fácil de ler e manter :).

Outra maneira seria, claro, baixar todos os PDFs primeiro e depois analisá-los todos.Dessa forma, você evita acessar o site várias vezes sempre que falhar por algum motivo.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow