Ruby regex: extracting matches from an HTML doc

https://stackoverflow.com/questions/708350

22-08-2019

Question

I have an HTML document in this format:

<tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
    <tr>
        <td class="sectionHeader">Contact</td>
        <td class="sectionHeader">Phone</td>
        <td class="sectionHeader">Home</td>
        <td class="sectionHeader">Work</td>
    </tr>
    <tr valign="top">
        <td class="sectionContent"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio@erols.com</span></td>
        <td class="sectionContent"><span>Mobile: </span><span>2404173223</span></td>
        <td class="sectionContent"><span>NY</span><br><span>New York</span><br><span>78642</span></td>
        <td class="sectionContent"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>
    </tr>

    <tr><td colspan="4"><hr class="contactSeparator"></td></tr>

    <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
    <tr>
        <td class="sectionHeader">Contact</td>
        <td class="sectionHeader">Phone</td>
        <td class="sectionHeader">Home</td>
        <td class="sectionHeader">Work</td>
    </tr>
    <tr valign="top">
        <td class="sectionContent"><span>Screen Name:</span> <span>eddieOS</span><br><span>Email 1:</span> <span>osefo@wam.umd.edu</span></td>
        <td class="sectionContent"></td>
        <td class="sectionContent"><span></span></td>
        <td class="sectionContent"><span></span></td>
    </tr>

    <tr><td colspan="4"><hr class="contactSeparator"></td></tr>

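So the document alternates between a chunk of contact details and a "contact separator" row. I want to grab the contact details, so my first hurdle is to grab the chunks between the contact separators. I worked out a regular expression using Rubular. Here it is: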

/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/

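You can check on Rubular to verify that this isolates the chunks.

However, my big problem is with the Ruby code. I use the built-in match function and print the results, but I am not getting what I expect. Here is the code: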

page = agent.get uri.to_s    
chunks = page.body.match(/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/).captures

chunks.each do |chunk|
   puts "new chunk: " + chunk.inspect
end

Note that page.body is the body of the HTML document fetched by Mechanize. The HTML document is much larger but follows this format. The unexpected output is below:

new chunk: "Bill Gussio</span></td></tr>\r\n\t<tr>\r\n\t\t<td class=\"sectionHeader\">Contact</td>\r\n\t\t<td class=\"sectionHeader\">Phone</td>\r\n\t\t<td class=\"sectionHeader\">Home</td>\r\n\t\t<td class=\"sectionHeader\">Work</td>\r\n\t</tr>\r\n\t<tr valign=\"top\">\r\n\t\t<td class=\"sectionContent\"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio@erols.com</span></td>\r\n\t\t<td class=\"sectionContent\"><span>Mobile: </span><span>2404173223</span></td>\r\n\t\t<td class=\"sectionContent\"><span>NY</span><br><span>New York</span><br><span>78642</span></td>\r\n\t\t<td class=\"sectionContent\"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>\r\n\t</tr>\r\n\t\r\n\t<tr><td colspan=\"4\">"
new chunk: ">"

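There are two surprises here:

1) There are not 2 matches containing the chunks of contact info, even though I verified on Rubular that these chunks should be extracted.

2) All of the whitespace characters (line feeds, tabs, etc.) show up in the matches.

Can anyone see the problem?

Also, if it is not too much to ask, does anyone know of a good free AOL contacts importer? I have been using blackbook, but it keeps failing for me on AOL, so I am trying to write my own. Unfortunately, AOL does not have a contacts API yet.

Thanks.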

Solution 4

Here is the code that parses the HTML. Feel free to suggest something better:

contacts = []
email, mobile = "", ""

names = page.search("//span[@class='fullName']")

# Every contact has a fullName node, so for each fullName node, we grab the chunk of contact info
names.each do |n|

  # next_sibling.next_sibling skips:
  # <tr>
  #   <td class="sectionHeader">Contact</td>
  #   <td class="sectionHeader">Phone</td>
  #   <td class="sectionHeader">Home</td>
  #   <td class="sectionHeader">Work</td>
  # </tr>
  # to give us the actual chunk of contact information
  # then taking the children of that chunk gives us rows of contact info
  contact_info_rows = n.parent.parent.next_sibling.next_sibling.children

  # Iterate through the rows of contact info
  contact_info_rows.each do |row|

    # Iterate through the contact info in each row
    row.children.each do |info|
      # Get Email. There are two ".next_siblings" because space after "Email 1" element is processed as a sibling
      if info.content.strip == "Email 1:" then email = info.next_sibling.next_sibling.content.strip end

      # If the contact info has a screen name but no email, use screenname@aol.com
      if (info.content.strip == "Screen Name:" && email == "") then email = info.next_sibling.next_sibling.content.strip + "@aol.com" end

      # Get Mobile #'s
      if info.content.strip == "Mobile:" then mobile = info.next_sibling.content.strip end

      # Maybe we can try and get zips later.  Right now the zip field can look like the street address field
      # so we can not tell the difference.  There is no label node
      #zip_match = /\A\D*(\d{5})-?\d{4}\D*\z/i.match(info.content.strip)
      #zip_match = /\A\D*(\d{5})[^\d-]*\z/i.match(info.content.strip)
    end

  end

  contacts << { :name => n.content, :email => email, :mobile => mobile }

  # clear variables
  email, mobile = "", ""
end
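
The email rule used above (prefer "Email 1", otherwise fall back to the screen name at aol.com) can be isolated and checked on its own. This is only a minimal sketch: the helper name pick_email and the idea of first collecting the labels into a hash are assumptions made for illustration, not part of the code above.

```ruby
# Hypothetical helper: choose an email from a hash of label => value pairs.
# The hash layout is an assumption made for this sketch.
def pick_email(fields)
  # Prefer the explicit email address when one is present.
  email = fields["Email 1:"].to_s.strip
  return email unless email.empty?

  # Otherwise fall back to <screen name>@aol.com, or "" if there is no screen name.
  screen_name = fields["Screen Name:"].to_s.strip
  screen_name.empty? ? "" : "#{screen_name}@aol.com"
end

with_email    = pick_email({ "Screen Name:" => "bhjiggy", "Email 1:" => "wmgussio@erols.com" })
without_email = pick_email({ "Screen Name:" => "eddieOS" })

puts with_email
puts without_email
```

Keeping the rule in one place means the result does not depend on the order in which the label nodes are visited.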

Other tips

See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" (https://stackoverflow.com/questions/701166) for reasons why this is a bad idea. Use an HTML parser instead.
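
For reference, both surprises in the question follow from standard Ruby behavior. String#match returns only the first match, and .captures returns that match's capture groups. The pattern has two groups (the outer one and the inner (.|\s)), which is why the second "chunk" printed is just ">". String#scan returns every match instead. The whitespace is part of the page source, and inspect simply shows it escaped. The sketch below uses a shortened, made-up fragment of the document and a simplified pattern with the /m flag; both are assumptions for illustration.

```ruby
# Shortened stand-in for page.body (an assumption for this sketch).
html = <<~HTML
  <tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
  <tr><td class="sectionContent"><span>Mobile: </span><span>2404173223</span></td></tr>
  <tr><td colspan="4"><hr class="contactSeparator"></td></tr>
  <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
  <tr><td class="sectionContent"></td></tr>
  <tr><td colspan="4"><hr class="contactSeparator"></td></tr>
HTML

# With /m, "." also matches newlines, so the inner (.|\s) group is not needed
# and each match yields exactly one capture.
pattern = /<span class="fullName">(.*?)<hr class="contactSeparator">/m

# String#match stops at the first match.
first_only = html.match(pattern).captures

# String#scan collects every match; with one group it returns [[chunk], [chunk]].
chunks = html.scan(pattern).flatten

# Collapse runs of whitespace (\r, \n, \t) that come from the source text.
chunks = chunks.map { |chunk| chunk.gsub(/\s+/, " ").strip }

chunks.each { |chunk| puts "new chunk: #{chunk.inspect}" }
```

This only explains the observed output; it does not make regular expressions a robust way to parse HTML.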

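If you only intend to extract information from XML, there are easier tools than regular expressions. XPath is a good tool for extracting information from XML-like formats. I believe there are libraries for Ruby that support XPath, such as REXML.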
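
A minimal sketch of the XPath approach with REXML. REXML expects well-formed XML, so the fragment below is a hand-made, well-formed stand-in (note the self-closed hr and the wrapping table element); the real page, with unclosed br and hr tags, would likely need an HTML parser instead.

```ruby
require "rexml/document"

# Well-formed stand-in for the contact table (an assumption for this sketch).
xml = <<~XML
  <table>
    <tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
    <tr><td colspan="4"><hr class="contactSeparator"/></td></tr>
    <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
    <tr><td colspan="4"><hr class="contactSeparator"/></td></tr>
  </table>
XML

doc = REXML::Document.new(xml)

# Select every span whose class attribute is "fullName" and read its text.
names = REXML::XPath.match(doc, "//span[@class='fullName']").map(&:text)

puts names.inspect
```

The XPath expression is the same one used with page.search in the solution above.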

Use an HTML parser like hpricot; it will save you a lot of headaches.

sudo gem install hpricot

It is written mostly in C, so it is also fast.

Here is how to use it:

http://wiki.github.com/why/hpricot/hpricot-basics

License: CC-BY-SA with attribution

Not affiliated with StackOverflow