BeautifulSoup: get a value from a table

https://stackoverflow.com/questions/1817184

08-07-2019

Question

I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "owner name(s)". What I have works, but it is really ugly and I am sure it is not the best approach, so I am looking for a better way. Here is what I have:

soup = BeautifulSoup(url_opener.open(url))            
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is:

<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>

    <tr>

    <td>PILCHER DONALD L                         </td>
    </tr>

    </tbody></table>
</td>

Wow, there are a lot of questions about BeautifulSoup. I looked through them but did not find an answer that helped me, so hopefully this is not a duplicate question.

Solution

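(Edit: apparently the HTML the OP posted is misleading. There is in fact no tbody tag to look for on the real page, even though the posted HTML makes a point of including one. So the code now walks up to table instead of tbody.)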

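As there may be several table rows you want (for example, see the sibling URL of the one you give, with the last digit 4 changed to 5), I suggest a loop such as the following: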

import re  # needed for re.compile below

# locate the table containing a cell with the given text
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table':
    cell = cell.parent
# print all non-blank strings in the table (except for the given text)
for s in cell.findAll(text=lambda t: t.strip() and not owner.match(t)):
    print(s)

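This is pretty robust to minor changes in the page structure: having located the cell of interest, it walks up through the cell's parents until it finds the table tag, then loops over all navigable strings within that table that are non-empty (that is, not just whitespace), excluding the owner header.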
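For readers on Python 3, the same idea can be sketched against the newer `bs4` package. This is only an illustration under stated assumptions: the sample markup is the snippet from the question rather than the live page, `html.parser` is chosen so no extra parser is needed, and `owner_names` is a hypothetical helper name that does not appear in the thread.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
HTML = """
<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>
    <tr>
    <td>PILCHER DONALD L                         </td>
    </tr>
    </tbody></table>
</td>
"""

OWNER = re.compile(r"Owner Name")


def owner_names(html):
    """Hypothetical helper: non-blank strings in the table holding the header."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find(string=OWNER)
    if header is None:
        return []
    # Walk up the ancestors until a <table> is reached (tbody or not).
    table = next((p for p in header.parents if p.name == "table"), None)
    if table is None:
        return []
    # Keep every string that is not blank and is not the header itself.
    return [s.strip() for s in table.find_all(string=True)
            if s.strip() and not OWNER.match(s.strip())]


print(owner_names(HTML))
```

Returning a list rather than printing inside the loop keeps the case of several owner rows easy to handle by the caller.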

Other tips

This is Aaron DeVore's answer from the BeautifulSoup discussion group. It works well for me.

soup = BeautifulSoup(...)
label = soup.find(text="Owner Name(s)")

You need Tag.string to get the actual name string:

name = label.findNext('td').string

If you are doing a bunch of them, you can even go for a list comprehension:

names = [unicode(label.findNext('td').string)
         for label in soup.findAll(text="Owner Name(s)")]
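A rough Python 3 equivalent of this answer, assuming the newer `bs4` package (where `findNext` is spelled `find_next` and `unicode` is simply `str`), might look like the sketch below. The trimmed-down markup is an assumption based on the HTML in the question, and the `None` checks are an addition for pages where the label has no following cell.

```python
from bs4 import BeautifulSoup

# Reduced version of the markup from the question (an assumption, not the live page).
HTML = """
<table>
<tr class="tableheaders"><td>Owner Name(s)</td></tr>
<tr><td>PILCHER DONALD L                         </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

names = []
for label in soup.find_all(string="Owner Name(s)"):
    # find_next scans forward in document order for the next <td>.
    cell = label.find_next("td")
    if cell is not None and cell.string is not None:
        names.append(cell.string.strip())

print(names)
```

Stripping the string removes the trailing padding that the page puts after the name.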

This is a slight improvement, but I could not figure out how to get rid of the three parent lookups.

x[0].parent.parent.parent.findAll('td')[1].string
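One possible way to avoid chaining `.parent` three times, assuming the newer `bs4` package is available, is its `find_parent` method, which climbs to the nearest enclosing tag of a given name. The sketch below reuses the markup from the question and is not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Markup from the question; whether the live page has <tbody> does not matter here.
HTML = """
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tbody><tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L                         </td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

label = soup.find(string=re.compile("Owner Name"))
# Climb straight to the enclosing <table>, however many levels up it is.
table = label.find_parent("table")
# The second <td> in that table holds the owner's name.
owner = table.find_all("td")[1].string.strip()

print(owner)
```

Because the lookup is by tag name rather than by depth, it keeps working if a tbody level is added or removed.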

License: CC-BY-SA with attribution

Not affiliated with StackOverflow