Why does text retrieved from pages sometimes look like gibberish?

https://stackoverflow.com/questions/8271484

09-03-2021
|

Question

I'm using urllib and urllib2 in Python to open and read webpages but sometimes, the text I get is unreadable. For example, if I run this:

import urllib

text = urllib.urlopen('http://tagger.steve.museum/steve/object/141913').read()
print text

I get some unreadable text. I've read these posts:

Gibberish from urlopen

Does python urllib2 automatically uncompress gzip data fetched from webpage?

but can't seem to find my answer.

Thank you in advance for your help!

UPDATE: I fixed the problem by 'convincing' the server that my user-agent is a brower and not a crawler.

import urllib

class NewOpener(urllib.FancyURLopener):
  version = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.120 Chrome/15.0.874.120 Safari/535.2'

nop = NewOpener()
html_text = nop.open('http://tagger.steve.museum/steve/object/141913').read()

Thank you all for your replies.

Solution

You can use Selenium to get the content. Download the server and client drivers, run server and run this:

from selenium import selenium
s = selenium("localhost", 4444, "*chrome", "http://tagger.steve.museum")
s.start()

s.open("/steve/object/141913")

text = s.get_html_source()
print text

OTHER TIPS

This gibberish is a real server response for the request to 'http://tagger.steve.museum/steve/object/141913'. Actually, it looks like obfuscated JavaScript, which, if executed by a browser, loads page content.

To get this content, you need to execute this JavaScript, and this can be a really difficult task within Python. If you still want to do this, take a look at pywebkitgtk.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow