BeautifulSoup: get a value in a table

https://stackoverflow.com/questions/1817184

08-07-2019

Question

I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and I am not confident in it, so I am looking for a better way. Here is what I have:

soup = BeautifulSoup(url_opener.open(url))            
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is:

<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>

    <tr>

    <td>PILCHER DONALD L                         </td>
    </tr>

    </tbody></table>
</td>

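Wow, there are a lot of questions about BeautifulSoup. I looked through them but did not find an answer that helped me, so this should not be a duplicate question.

Solution

(Edit: apparently the HTML the OP posted is misleading. There is in fact no tbody tag to look for, even though he included it in the HTML. So the code is changed to use table instead of tbody.)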

Since there may be several table rows you want (for example, see the sibling URL of the one you gave, with the last digit 4 changed to 5), I suggest a loop such as the following:

# locate the table containing a cell with the given text
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table': cell = cell.parent
# print all non-empty strings in the table (except for the given text)
for x in cell.findAll(text=lambda x: x.strip() and not owner.match(x)):
  print x

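This is reasonably robust against minor changes in page structure. Having located the cell of interest, it walks up the parents until it finds the table tag, then iterates over all navigable strings within that table that are non-empty (that is, not just whitespace), excluding the owner header.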
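
The loop above uses the BeautifulSoup 3 API under Python 2. As a rough sketch of the same idea, assuming the newer bs4 package and Python 3 (neither is used in the original answer), the walk up to the enclosing table can be written with find_parent, and the whitespace filtering with stripped_strings. The function name owner_names and the HTML sample constant are hypothetical, chosen only for illustration; the markup is copied from the question.

```python
import re
from bs4 import BeautifulSoup  # assumption: bs4 is installed (pip install beautifulsoup4)

# Sample markup copied from the HTML excerpt in the question.
HTML = """
<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>
    <tr>
    <td>PILCHER DONALD L                         </td>
    </tr>
    </tbody></table>
</td>
"""

OWNER = re.compile(r"Owner Name")


def owner_names(html, header_pattern=OWNER):
    """Return the non-empty strings of the table that contains the header cell.

    Hypothetical helper: the header text itself is excluded from the result.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Locate the text node that matches the header pattern.
    header = soup.find(string=header_pattern)
    if header is None:
        return []
    # Walk up to the nearest enclosing <table>, however deep the cell is nested.
    table = header.find_parent("table")
    if table is None:
        return []
    # stripped_strings skips whitespace-only strings and strips the rest.
    return [s for s in table.stripped_strings if not header_pattern.match(s)]


names = owner_names(HTML)
print(names)
```

Returning an empty list when the header or table is missing is a design choice of this sketch; the original loop would instead raise an error on a page without the header.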

Other tips

This is Aaron Devore's answer from the BeautifulSoup discussion group:

soup = BeautifulSoup(...)
label = soup.find(text="Owner Name(s)")

To get to the actual name string you need tag.string:

name = label.findNext('td').string

If you are doing a lot of these, you can even go for a list comprehension:

names = [unicode(label.findNext('td').string) for label in
soup.findAll(text="Owner Name(s)")]
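
For readers on current versions, the same lookup can be sketched with bs4 and Python 3, which is an assumption since the answer above targets BeautifulSoup 3. Here find_next and find_all are the bs4 spellings of findNext and findAll, and get_text(strip=True) is used to drop the trailing padding in the cell. The shortened HTML sample and the explicit None check are additions for illustration.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed (pip install beautifulsoup4)

# Shortened sample based on the HTML in the question.
HTML = """
<table>
  <tr class="tableheaders"><td>Owner Name(s)</td></tr>
  <tr><td>PILCHER DONALD L                         </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

names = []
# string="..." matches text nodes whose content is exactly the label.
for label in soup.find_all(string="Owner Name(s)"):
    # The first <td> after the label in document order holds the name.
    cell = label.find_next("td")
    if cell is not None:  # guard: a label may be the last cell on the page
        names.append(cell.get_text(strip=True))

print(names)
```

Note that an exact string match only works when the label cell contains nothing but the label text; a compiled regular expression can be passed instead if the cell has surrounding whitespace.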

This is a slight improvement on the code in the question, but I could not figure out how to get rid of the three parents:

x[0].parent.parent.parent.findAll('td')[1].string

License: CC-BY-SA with attribution
