Question

I am having a problem accessing the next page of each sub category: I need to soup the information page by page, but my code is only able to soup the first page of each sub category. Can anyone tell me how to access the next pages in the subcategories? Thank you in advance.

import urllib2
from BeautifulSoup import BeautifulSoup

# lists used while crawling
categories = []
tools = []
subcategory = []
links = []

url = 'http://www.share-hit.de/'  # website
pageHTML = urllib2.urlopen(url).read()
soup = BeautifulSoup(pageHTML)

#find the main category and append to a list
for category in soup.find('td',{'class':'linkmenu'}).findAll('a'):
    categories.append('http://www.share-hit.de/' + category['href']) # print all the categories
print categories
try:
    for i in categories:
        if subcategory:
            del subcategory[:]

        try:    
            pageHTML = urllib2.urlopen(i).read()
            soup2 = BeautifulSoup(pageHTML)
            table = soup2.find('table', attrs={'id':'kategoriemenu'})
            division = table.findAll('div',attrs={'align':'left'})
            # find the  sub category of each main category
            for sub_cate in division:
                try:
                    sub_url = 'http://www.share-hit.de/' + sub_cate.find("a")["href"]
                    subcategory.append(sub_url)
                    print subcategory

                    # Inside each sub category, fetch the first listing page only.
                    # I need to know how to find the next page of each sub category.
                    pageHTML = urllib2.urlopen(sub_url).read()
                    soup2 = BeautifulSoup(pageHTML)
                    tools = soup2.findAll('span', attrs={'class':'Stil2'})

                    if links:
                        del links[:]
                    # append the application links found on this page
                    for tool in tools:
                        try:
                            links.append('http://www.share-hit.de/' + tool.find("a")["href"])
                            print links
                        except Exception:
                            print 'No Apps'

                    # Details: from each application link I manage to soup the
                    # details of each application.

                except Exception:   
                    print 'No Sub Categories'

        except Exception:
            print 'No Categories'

except Exception:
    print 'Finish'

Solution

You can use a visual scraper such as IRobotSoft to handle this kind of problem. It includes many options that make it easy to navigate through next pages. The query for your next link is:

<a (tx like 'Vorw%')>
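If you stay with your Python script instead, the same idea works there: each listing page contains a "Vorwärts" (forward) link, and you can follow it page by page until it no longer appears. Below is a minimal Python 3 sketch of that loop. The names `find_next_page` and `crawl` are hypothetical helpers, the regex is for illustration only (with BeautifulSoup you would match on the anchor text instead), and in a real crawl you would join relative hrefs onto the base URL with `urlparse.urljoin`:

```python
import re

def find_next_page(html):
    # Find the first <a> whose link text starts with "Vorw"
    # (German "Vorwaerts" = forward), mirroring the IRobotSoft
    # query above. Returns the href, or None if there is no
    # next-page link on this page.
    match = re.search(r'<a\s+href="([^"]+)"[^>]*>\s*Vorw', html)
    return match.group(1) if match else None

def crawl(fetch, start_url):
    # Follow next-page links until none is found.
    # `fetch` is any callable that returns the HTML of a URL,
    # e.g. lambda u: urllib.request.urlopen(u).read().decode()
    url = start_url
    pages = []
    while url:
        html = fetch(url)
        pages.append(html)   # ...soup each page here instead...
        url = find_next_page(html)
    return pages
```

You would call `crawl` once per subcategory URL, souping the `span.Stil2` application links from every page it returns rather than only the first.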
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow