Ruby Mechanize table scraping does not capture the entire row

https://stackoverflow.com/questions/5023740

14-11-2019

Question

I am trying to scrape a table from a website with Mechanize. I want to scrape the second row.

When I run:

agent.page.search('table.ea').search('tr')[-2].search('td').map{ |n| n.text }

I would expect it to scrape the entire row. Instead, it scrapes only: ["2011-02-17", "0,00"]

Why does it return only the first and last columns instead of every column in the row?

XPath: /html/body/center/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td/table/tbody/tr[2]/td/table/tbody/tr[2]

CSS path: html body center table tbody tr td table tbody tr td table tbody tr td table.ea tbody tr td.total

The page looks similar to this:

<table><table><table>
<table width="100%" border="0" cellpadding="0" cellspacing="1" class="ea">
<tr>
    <th><a href="#">Date</a></th>
    <th><a href="#">One</a></th>    
    <th><a href="#">Two</a></th>    
    <th><a href="#">Three</a></th>     
    <th><a href="#">Four</a></th>    
    <th><a href="#">Five</a></th>        
    <th><a href="#">Six</a></th>        
    <th><a href="#">Seven</a></th>      
    <th><a href="#">Eight</a></th>
</tr>
<tr>
    <td><a href="#">2011-02-17</a></td>
    <td align="right">0</td>    
    <td align="right">0</td>    
    <td align="right">0,00</td>     
    <td align="right">0</td>    
    <td align="right">0</td>        
    <td align="right">0</td>    
    <td align="right">0</td>        
    <td align="right">387</td>      
    <td align="right">0,00</td>     <!-- FOV -->
    <td align="right">0,00</td>
</tr>
<tr>
    <td class="total">Ialt</td>
    <td class="total" align="right">0</td>  
    <td class="total" align="right">40</td>     
    <td class="total" align="right">0,46</td>   
    <td class="total" align="right">2</td>      
    <td class="total" align="right">0</td>        
    <td class="total" align="right">0</td>      
    <td class="total" align="right">0</td>        
    <td class="total" align="right">3.060</td>      
    <td class="total" align="right">0,00</td>       
    <td class="total" align="right">18,58</td>
</tr>
</table>
</table></table></table>

Solution

Using the following Ruby code (https://gist.github.com/835603):

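require 'mechanize'
require 'pp'

# Create an agent that identifies itself with the 'Mac Safari' user agent.
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

# Fetch the test page and print the text of every cell in the second-to-last row.
a.get('http://binarymuse.net/table.html') do |page|
  pp page.search('table.ea').search('tr')[-2].search('td').map{ |n| n.text }
end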

I get the following output:

["2011-02-17", "0", "0", "0,00", "0", "0", "0", "0", "387", "0,00", "0,00"]
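
One detail worth keeping in mind: the index [-2] counts from the end, so it selects the second-to-last row, which is the second row only when the table has exactly three rows, as the sample does. A minimal sketch in plain Ruby, where the ROWS array and its labels are made up for illustration and merely stand in for the three tr elements:

```ruby
# Illustrative labels only; they stand in for the three <tr> elements of the sample table.
ROWS = ['header row', '2011-02-17 row', 'Ialt (totals) row'].freeze

puts ROWS[-2] # second-to-last element
puts ROWS[1]  # second element; the same row here only because there are three rows
```

If the real table has more rows than the sample, [1] and [-2] will point at different rows.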

Other tips

I would recommend saving Mechanize for harder jobs than scraping a single page. Nokogiri is much simpler for this (though of course you can do it with Mechanize too), since you can just query the page directly.

Try it out!

Here is a link to an answer regarding Nokogiri.

Personally, I have used Mechanize when I needed to submit forms and things like that, although it has plenty of other uses!

License: CC-BY-SA with attribution

Not affiliated with StackOverflow