How do I read someone else's forum?

https://stackoverflow.com/questions/2060247

20-09-2019

Question

My friend has a forum that is full of posts containing information. Sometimes my friend wants to review the posts in the forum and draw conclusions. At the moment they do this by clicking through the forum, building a not necessarily accurate picture of the data (in their head), and drawing conclusions from that. My current thinking is that I could probably knock up a quick Ruby script that parses the necessary HTML to give a real idea of what the data is saying.

I am using Ruby's net/http library for the first time today, and I have run into a problem. While my browser has no trouble viewing my friend's forum, it seems that Net::HTTP.new("forumname.net") produces the following error:

No connection could be made because the target machine actively refused it. - connect(2)

Googling that error, I learned that it has to do with MySQL (or something like it) not wanting nosy guys like me poking around in there remotely, for security reasons. This makes sense to me, but it makes me wonder: how is it that my browser gets to poke around in my friend's forum, but my little Ruby script gets no poking rights? Is there a way for my script to tell the server that it is not a threat? That I only want read rights and not write rights?

Thanks guys,

Solution

Scraping a website? Use mechanize:

#!/usr/bin/ruby1.8

require 'rubygems'
require 'mechanize'

agent = Mechanize.new  # the WWW:: namespace was dropped in mechanize 1.0
page = agent.get("http://xkcd.com")
page = page.link_with(:text=>'Forums').click
page = page.link_with(:text=>'Mathematics').click
page = page.link_with(:text=>'Math Books').click
#puts page.parser.to_html    # If you want to see the html you just got
posts = page.parser.xpath("//div[@class='postbody']")
for post in posts
  title = post.at_xpath('h3//text()').to_s
  author = post.at_xpath("p[@class='author']//a//text()").to_s
  body = post.xpath("div[@class='content']//text()").collect do |div|
    div.to_s
  end.join("\n")
  puts '-' * 40
  puts "title: #{title}"
  puts "author: #{author}"
  puts "body:", body
end

The first part of the output:

----------------------------------------
title: Math Books
author: Cleverbeans
body:
This is now the official thread for questions about math books at any level, fr\
om high school through advanced college courses.
I'm looking for a good vector calculus text to brush up on what I've forgotten.\
 We used Stewart's Multivariable Calculus as a baseline but I was unable to pur\
chase the text for financial reasons at the time. I figured some things may hav\
e changed in the last 12 years, so if anyone can suggest some good texts on thi\
s subject I'd appreciate it.
----------------------------------------
title: Re: Multivariable Calculus Text?
author: ThomasS
body:
The textbooks go up in price and new pretty pictures appear. However, Calculus \
really hasn't changed all that much.
If you don't mind a certain lack of pretty pictures, you might try something li\
ke Widder's Advanced Calculus from Dover. it is much easier to carry around tha\
n Stewart. It is also written in a style that a mathematician might consider no\
rmal. If you think that you might want to move on to real math at some point, i\
t might serve as an introduction to the associated style of writing.
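Once the posts are extracted, the "real idea of what the data is saying" that the question asks for usually comes down to simple aggregation. The following is a minimal sketch, assuming each post has been collected into a hash with `:title` and `:author` keys instead of being printed. The helper name `posts_per_author` and the third sample post are hypothetical and only there for illustration.

```ruby
# Sample data shaped like the fields the scraper prints.
# The first two entries mirror the output above; the third is made up.
sample_posts = [
  { title: 'Math Books', author: 'Cleverbeans' },
  { title: 'Re: Multivariable Calculus Text?', author: 'ThomasS' },
  { title: 'Re: Math Books', author: 'Cleverbeans' } # hypothetical post
]

# Count how many posts each author wrote.
# Hash.new(0) makes missing authors start at zero.
def posts_per_author(posts)
  posts.each_with_object(Hash.new(0)) do |post, counts|
    counts[post[:author]] += 1
  end
end

# Print authors, most active first.
posts_per_author(sample_posts).sort_by { |_author, n| -n }.each do |author, n|
  puts "#{author}: #{n}"
end
```

In the scraper itself, the loop body would push a hash onto an array rather than calling puts, and the array would then be passed to the helper.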

Other tips

Some sites can only be accessed with the "www" subdomain, so that may be causing the problem.
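One way to check the "www" theory is to try both forms of the host name and keep the first one that accepts a connection. This is only a sketch: the helper names `candidate_hosts` and `fetch_first_reachable` are made up here, and the `fetcher` parameter is an assumption added so the fallback logic can be exercised without touching the network.

```ruby
require 'net/http'
require 'uri'

# Return the host as given, followed by its "www." counterpart
# (or the bare name if "www." was already present).
def candidate_hosts(host)
  alt = host.start_with?('www.') ? host.sub(/\Awww\./, '') : "www.#{host}"
  [host, alt]
end

# Try each candidate in turn and return [host, response] for the first
# one that answers. Hosts that refuse the connection or fail to resolve
# are skipped. Returns nil if none of them work.
def fetch_first_reachable(host, fetcher: ->(h) { Net::HTTP.get_response(URI("http://#{h}/")) })
  candidate_hosts(host).each do |h|
    begin
      return [h, fetcher.call(h)]
    rescue Errno::ECONNREFUSED, SocketError
      next
    end
  end
  nil
end

# Real use would look like this (performs network requests):
#   host, response = fetch_first_reachable('forumname.net')
```

Errno::ECONNREFUSED is the exception class behind the "actively refused" message quoted in the question, so rescuing it is what lets the second candidate be tried.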

To create a GET request, you would want to use Net::HTTP::Get:

require 'net/http'

url = URI.parse('http://www.forum.site/')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http|
  http.request(req)
}
puts res.body

You might also need to set the user agent at some point, as an option:

req = Net::HTTP::Get.new(url.path, {
  'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; ' \
                  'rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'
})
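The two tips can be combined into one small helper that builds the request and lets you inspect it before anything is sent. This is a sketch rather than the answer author's code: `build_get` is a hypothetical name, the user agent string in the example is made up, and `request_uri` is used instead of `path` on the assumption that forum URLs often carry a query string and may be given without a trailing slash.

```ruby
require 'net/http'
require 'uri'

# Build a GET request that carries the given User-Agent header.
# Nothing is sent here; the request object is just returned.
def build_get(url_string, user_agent)
  uri = URI.parse(url_string)
  # request_uri keeps the query string and yields "/" for an empty path
  req = Net::HTTP::Get.new(uri.request_uri)
  req['User-Agent'] = user_agent
  req
end

# Hypothetical user agent string, for illustration only.
req = build_get('http://www.forum.site/', 'Mozilla/5.0 (compatible; forum-reader)')
puts req.path
puts req['User-Agent']
```

The resulting object can be passed to `http.request(req)` inside a `Net::HTTP.start` block, exactly as in the GET example above.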

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow