Question

I'm trying to scrape "game tag" data (not the same as HTML tags) from games listed on the digital game distribution site, Steam (store.steampowered.com). This information isn't available via the Steam API, as far as I can tell.

Once I have the raw source for a page, I want to pass it to BeautifulSoup for further parsing, but I have a problem: urllib2 doesn't seem to read the information I want (requests doesn't work either), even though it's clearly in the page source when viewed in the browser. For example, I might download the page for the game "7 Days to Die" (http://store.steampowered.com/app/251570/). When viewing the page source in Chrome, I can see the following relevant information about the game's "tags" near the end, starting at line 1615:

<script type="text/javascript">
      $J( function() {
          InitAppTagModal( 251570,    
          {"tagid":1662,"name":"Survival","count":283,"browseable":true},
          {"tagid":1659,"name":"Zombies","count":274,"browseable":true},
          {"tagid":1702,"name":"Crafting","count":248,"browseable":true},...

In InitAppTagModal, the tags "Survival", "Zombies", "Crafting", etc. contain the information I'd like to collect.

But when I use urllib2 to get the page source:

import urllib2  
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page  
page = urllib2.urlopen(url).read()

The part of the page source that I'm interested in is not saved in my "page" variable. Instead, everything below line 1555 is blank until the closing body and html tags, resulting in this (carriage returns included):

</div><!-- End Footer -->





</body>  
</html>

The source code I need (along with other code) should be where that blank space is.
I've tried this on several different computers with different installs of Python 2.7 (Windows machines and a Mac), and I get the same result on all of them.

How can I get the data that I'm looking for?

Thank you for your consideration.

Solution

Well, I don't know if I'm missing something, but it's working for me using requests:

import json
import requests

# Getting html code
url = "http://store.steampowered.com/app/251570/"
html = requests.get(url).text

Better still, the requested data is in JSON format, so it is easy to extract like this:

# Extracting the JavaScript object (a JSON-like object)
start_tag = 'InitAppTagModal( 251570,'
end_tag = '],'
startIndex = html.find(start_tag) + len(start_tag)
endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
raw_data = html[startIndex:endIndex]

# Load raw data as python json object
data = json.loads(raw_data)

You will see a beautiful JSON object like this (this is the info that you need, right?):

[
  {
    "count": 283,
    "browseable": true,
    "tagid": 1662,
    "name": "Survival"
 },
 {
    "count": 274,
    "browseable": true,
    "tagid": 1659,
    "name": "Zombies"
 },
 {
   "count": 248,
   "browseable": true,
   "tagid": 1702,
   "name": "Crafting"
 }......
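
The slicing above depends on the first '],' marking the end of the tag list. As a hedged alternative, here is a minimal sketch that lets the json module find the end of the array itself through JSONDecoder.raw_decode. The helper name extract_tags and the sample_html string are made up for illustration, and the sketch assumes the tag list is the first JSON array after the InitAppTagModal marker:

```python
import json


def extract_tags(html, app_id):
    """Return the first JSON array after the InitAppTagModal marker, or None."""
    marker = "InitAppTagModal( %s," % app_id
    marker_pos = html.find(marker)
    if marker_pos == -1:
        return None  # marker missing, e.g. a different page came back
    array_start = html.find("[", marker_pos + len(marker))
    if array_start == -1:
        return None
    # raw_decode parses one JSON value starting at array_start
    # and ignores whatever text follows it.
    tags, _end = json.JSONDecoder().raw_decode(html, array_start)
    return tags


# Made-up sample that mimics the structure shown in the question
sample_html = '''<script type="text/javascript">
$J( function() {
    InitAppTagModal( 251570,
    [{"tagid":1662,"name":"Survival","count":283,"browseable":true},
     {"tagid":1659,"name":"Zombies","count":274,"browseable":true}],
    [],
    "extra"
    );
});
</script>'''

tags = extract_tags(sample_html, 251570)
print([tag["name"] for tag in tags])
```

Returning None when the marker is absent makes the failure visible, whereas str.find returns -1 and the slice quietly produces the wrong text.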

I hope it helps....

UPDATED:

OK, I see your problem now: it seems to be specific to page 224600. In this case the website requires you to confirm your age before it shows the game's info. This is easy to solve by posting the form that confirms the age. Here is the updated code (wrapped in a function):

import json
import requests

def extract_info_games(page_id):
    # Create a session so cookies persist between the requests
    session = requests.Session()

    # Get initial html
    html = session.get("http://store.steampowered.com/app/%s/" % page_id).text

    # Check whether we landed on the age-check page (its form is in the html)
    if ('<form action="http://store.steampowered.com/agecheck/app/%s/"' % page_id) in html:
        # Redirected to the age-check page, so confirm the age with a POST
        post_data = {
            'snr': '1_agecheck_agecheck__age-gate',
            'ageDay': 1,
            'ageMonth': 'January',
            'ageYear': '1960'
        }
        html = session.post('http://store.steampowered.com/agecheck/app/%s/' % page_id, post_data).text

    # Extract the JavaScript object (a JSON-like object)
    start_tag = 'InitAppTagModal( %s,' % page_id
    end_tag = '],'
    startIndex = html.find(start_tag) + len(start_tag)
    endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
    raw_data = html[startIndex:endIndex]

    # Load raw data as a Python object
    data = json.loads(raw_data)
    return data

And to use it:

extract_info_games(224600)
extract_info_games(251570)
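
Once the list is returned, it can be handled like any other list of dictionaries. A small sketch of pulling out the tag names ordered by their "count" field follows. The helper top_tag_names is hypothetical, and the sample entries are copied from the output shown earlier rather than fetched live:

```python
def top_tag_names(tags, limit=3):
    """Return tag names ordered by their 'count' field, highest first."""
    ranked = sorted(tags, key=lambda tag: tag["count"], reverse=True)
    return [tag["name"] for tag in ranked[:limit]]


# Sample entries copied from the output shown earlier (not fetched live)
sample_tags = [
    {"tagid": 1659, "name": "Zombies", "count": 274, "browseable": True},
    {"tagid": 1662, "name": "Survival", "count": 283, "browseable": True},
    {"tagid": 1702, "name": "Crafting", "count": 248, "browseable": True},
]

print(top_tag_names(sample_tags, limit=2))  # ['Survival', 'Zombies']
```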

Enjoy!

Other tips

When using urllib2 and read(), you will have to read repeatedly in chunks till you hit EOF, in order to read the entire HTML source.

import urllib2  
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
url_handle = urllib2.urlopen(url)
data = ""
while True:
    chunk = url_handle.read()
    if not chunk:
        break
    data += chunk
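
For reference, here is a sketch of the same read-until-EOF loop with an explicit chunk size, written for Python 3 and run against an in-memory stream so it works without network access. The helper name read_all and the 8192-byte chunk size are assumptions added for illustration. The same function should accept the object returned by urllib.request.urlopen in Python 3, since that object also exposes read(size):

```python
import io


def read_all(handle, chunk_size=8192):
    """Read a binary file-like object in fixed-size chunks until EOF."""
    chunks = []
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # an empty bytes object signals EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)


# In-memory stand-in for a response body (made-up content)
fake_response = io.BytesIO(b"<html>" + b"x" * 20000 + b"</html>")
body = read_all(fake_response)
print(len(body))
```

Collecting the chunks in a list and joining them once avoids repeated string concatenation on large pages.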

An alternative would be to use the requests module as:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/app/251570/')
soup = BeautifulSoup(r.text, 'html.parser')
License: CC-BY-SA with attribution
Not affiliated with StackOverflow