lxml이 django, scraperwiki에서 작동하지 않습니다.

https://stackoverflow.com//questions/24005988

20-12-2019
|

문제

저는 일리노이주의 총회 웹사이트를 통과하여 일부 PDF를 긁어내는 django 앱을 개발 중입니다.데스크탑에 배포하는 동안 urllib2가 시간 초과될 때까지 제대로 작동합니다.Bluehost 서버에 배포하려고 하면 코드의 lxml 부분에서 오류가 발생합니다.어떤 도움이라도 주시면 감사하겠습니다.

import scraperwiki
from bs4 import BeautifulSoup
import urllib2
import lxml.etree
import re
from django.core.management.base import BaseCommand
from legi.models import Votes

class Command(BaseCommand):
    def handle(self, *args, **options):
        chmbrs =['http://www.ilga.gov/house/', 'http://www.ilga.gov/senate/']
        for chmbr in chmbrs:
            site = chmbr    
            url = urllib2.urlopen(site)
            content = url.read()
            soup = BeautifulSoup(content)
            links = []
            linkStats = []
            x=0
            y=0
            table = soup.find('table', cellpadding=3)
            for a in soup.findAll('a',href=True):
                if re.findall('Bills', a['href']):
                    l = (site + a['href']+'&Primary=True')
                    links.append(str(l))
                    x+=1
                    print x
            for link in links:
                url = urllib2.urlopen(link)
                content = url.read()
                soup = BeautifulSoup(content)
                table = soup.find('table', cellpadding=3)
                for a in table.findAll('a',href=True):
                    if re.findall('BillStatus', a['href']):
                        linkStats.append(str('http://ilga.gov'+a['href']))
            for linkStat in linkStats:
                url = urllib2.urlopen(linkStat)
                content = url.read()
                soup = BeautifulSoup(content)
                for a in soup.findAll('a',href=True):
                    if re.findall('votehistory', a['href']):
                        vl = 'http://ilga.gov/legislation/'+a['href']
                        url = urllib2.urlopen(vl)
                        content = url.read()
                        soup = BeautifulSoup(content)
                        for b in soup.findAll('a',href=True):
                            if re.findall('votehistory', b['href']):
                                llink = 'http://ilga.gov'+b['href']
                                try:
                                    u = urllib2.urlopen(llink)
                                    x = scraperwiki.pdftoxml(u.read())
                                    root = lxml.etree.fromstring(x)
                                    pages = list(root)
                                    chamber = str()
                                    for page in pages:
                                        print "working_1"
                                        for el in page:
                                            print "working_2"
                                            if el.tag == 'text':
                                                if int(el.attrib['top']) == 168:
                                                    chamber = el.text
                                                if re.findall("Senate Vote", chamber):
                                                    if int(el.attrib['top']) >= 203 and int(el.attrib['top']) < 231:
                                                        title = el.text
                                                        if (re.findall('House', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('Senate', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >350 and int(el.attrib['top']) <650:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        vs = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in vs:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()    
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'                                                       
                                                elif int(el.attrib['top']) == 189:
                                                    chamber = el.text
                                                if re.findall("HOUSE ROLL CALL", chamber):
                                                    if int(el.attrib['top']) > 200 and int(el.attrib['top']) <215:
                                                        title = el.text
                                                        if (re.findall('HOUSE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('SENATE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >385 and int(el.attrib['top']) <1000:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        votes = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in votes:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'

                                except:
                                    pass

편집 1오류 추적은 다음과 같습니다.

    Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home7/maythirt/GAB/legi/management/commands/vote.py", line 51, in handle
    root = lxml.etree.fromstring(x)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
  File "parser.pxi", line 1674, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101299)
  File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:96481)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91939)
lxml.etree.XMLSyntaxError: None

해결책

Jonathan이 언급했듯이, 이는 다음의 결과일 수 있습니다. scraperwiki.pdftoxml() 그게 문제를 일으키는 거죠.다음 값을 표시하거나 기록할 수 있습니다. x 그것을 확인하기 위해.

구체적으로, pdftoxml() 외부 프로그램을 실행 pdftohtml 임시 파일을 사용하여 PDF 및 XML을 저장합니다.

내가 확인하고 싶은 것은 다음과 같습니다.

~이다 pdftohtml 올바르게 설정 귀하의 서버에서?
그렇다면 코드가 실패한 PDF가 있는 서버의 셸에서 직접 실행하면 XML로의 변환이 작동합니까?실행하는 명령은 다음과 같습니다. pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "input.pdf" "output.xml"

명령을 직접 실행할 때 문제가 있다면 거기에 문제가 있는 것입니다.방법으로 pdftohtml 에서 실행 scraperwiki 코드에서는 명령이 실패했는지 알 수 있는 쉬운 방법이 없습니다.

다른 팁

ko try를 추가하는 것입니다 : 제외 : 조항 및 오류가 발생하면 XML 파일을 단순히 저장할 것뿐만 아니라 하드 드라이브에 링크를 저장하십시오.그렇게하면 XML 파일을 별도로 검사 할 수 있습니다.

ScraperWiki.pdftoxml은 어떤 이유로 불법 XML 파일을 만듭니다.다른 PDFTOXML 도구를 사용할 때 저에게 일어났습니다.

코드를 읽고 유지 관리하는 것이 훨씬 쉽게 될 수있는 더 많은 기능에 코드를 리팩토링하십시오. :).

다른 방법은 물론 모든 PDF를 먼저 다운로드 한 다음 모두 구문 분석합니다.그렇게하면 어떤 이유로든지 실패 할 때마다 웹 사이트를 여러 번 때리지 않도록하십시오.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow