How to get real source code of html page? [duplicate]

https://stackoverflow.com/questions/23657849

22-07-2023
|

Pergunta

Every time when I'm using standart librarys like urllib2, requests, pycurl I am not getting full source code. How can I get full source code like I am looking on it from chrome, firefox, etc. I am trying to do it like this:

def go_to(link):
    headers = {'User-Agent': USER_AGENT,
               'Accept': ACCEPT,
               'Accept-Encoding': ACCEPT_ENCODING,
               'Accept-Language': ACCEPT_LANGUAGE,
               'Cache-Control': CACHE_CONTROL,
               'Connection': CONNECTION,
               'Host': HOST}
    req = urllib2.Request(link, None, headers)
    response = urllib2.urlopen(req)
    return response.read()

Thank You!

Sorry for my bad english.

UPD: This is full code from browser:

 <td colspan="1"><font class="spy1">1</font> <font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix))</script><font class="spy2">:</font>8088</font></td>

This is not full code from my script:

<font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(Eight7FiveSix^Seven1One)+(FiveZeroTwoOne^Two3Zero)+(Eight7FiveSix^Seven1One)+(Eight7FiveSix^Seven1One))</script></font>

Solução 2

The best solution is :

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://webscraping.com'  
r = Render(url)  
html = r.frame.toHtml()

Source: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

UPD: Type of output is QString. If you want to convert it to string use

html = r.frame.toHtml().toUtf8().data()

Outras dicas

Since there can be javascript, AJAX calls involved in forming the web page, to be sure you are getting the same source code as you see in the browser, you need to use tools that actually use real browsers, like selenium:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get(link)

print browser.page_source

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow