Downloading a sequence of webpages using Python

https://stackoverflow.com/questions/23689805

23-07-2023

Question

I am very new to Python [running 2.7.x] and I am trying to download content from a webpage with thousands of links. Here's my code:

import urllib2
i = 1
limit = 1441

for i in limit: 
    url = 'http://pmindia.gov.in/content_print.php?nodeid='+i+'&nodetype=2'
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open('speech'+i+'.html', 'w')
    f.write(webContent)
    f.close

Fairly elementary, but I get one or both of these errors: 'int object is not iterable' or 'cannot concatenate str and int'. These are the printable versions of the links on this page: http://pmindia.gov.in/all-speeches.php (1400 links). But the node ids go from 1 to 1441, which means 41 numbers are missing (which is a separate problem). One final question: in the long run, when downloading thousands of link objects, is there a way to run the downloads in parallel to increase processing speed?

Solution

There are a couple of mistakes in your code.

You got the syntax of for wrong. A for loop needs an object it can iterate over, such as a list or a generator. An int like limit is not iterable, which is where 'int object is not iterable' comes from.
Adding a number to a string with + won't work. You need to convert the number to a string first, for example with repr (str works too).

With those fixes, your code looks like this:

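import urllib2

limit = 1441

# xrange(1, limit + 1) yields the node ids 1..1441 one at a time
for i in xrange(1, limit + 1):
    # repr(i) turns the int into a string so it can be concatenated
    url = 'http://pmindia.gov.in/content_print.php?nodeid=' + repr(i) + '&nodetype=2'
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open('speech' + repr(i) + '.html', 'w')
    f.write(webContent)
    f.close()  # the parentheses are needed to actually call close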
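
On the 41 missing node ids mentioned in the question: the loop above stops at the first request that raises. Here is a minimal sketch of one way around that, assuming the server answers a missing node id with an HTTP error (I have not checked how this site actually responds). The names URL_TEMPLATE, fetch and download_all are made up for illustration, and the try/except import is only there so the same code also runs on Python 3.

```python
try:
    import urllib2 as request_lib            # Python 2
    from urllib2 import URLError
except ImportError:
    import urllib.request as request_lib     # Python 3
    from urllib.error import URLError

URL_TEMPLATE = 'http://pmindia.gov.in/content_print.php?nodeid=%d&nodetype=2'


def fetch(node_id):
    """Return the page body for node_id, or None if the request fails.

    Assumption: a missing node id shows up as an HTTPError, which is a
    subclass of URLError.
    """
    try:
        response = request_lib.urlopen(URL_TEMPLATE % node_id)
        try:
            return response.read()
        finally:
            response.close()
    except URLError:
        return None


def download_all(first, last, fetch_page=fetch):
    """Save every page that could be fetched; return the ids that were skipped."""
    skipped = []
    for node_id in range(first, last + 1):
        content = fetch_page(node_id)
        if content is None:
            skipped.append(node_id)
            continue
        # 'wb' because read() returns raw bytes
        with open('speech%d.html' % node_id, 'wb') as f:
            f.write(content)
    return skipped
```

Calling download_all(1, 1441) would write one file per page that was fetched and return the list of ids that failed, so you can check whether it matches the 41 gaps. If the site returns an empty or placeholder page instead of an error, you would need to test the content rather than catch an exception.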

Now, if you want to get into web scraping for real, I suggest you have a look at packages such as lxml and requests.

Other tips

Try this:

for i in range(1, limit + 1):
...

range(M, N) returns a list of numbers from M (inclusive) to N (exclusive). See https://docs.python.org/release/1.5.1p1/tut/range.html
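
A quick check of those bounds, using the limit from the question. The list() call is only there so the snippet behaves the same on Python 2 and Python 3; the variable name node_ids is made up for illustration.

```python
limit = 1441

# range(1, limit + 1) starts at 1 and stops just before limit + 1
node_ids = list(range(1, limit + 1))

print(node_ids[0])    # 1
print(node_ids[-1])   # 1441
print(len(node_ids))  # 1441
```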

You might want to look into using Scrapy or some other web crawling framework to help you with this.
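
As for the parallel part of the question, a framework is not strictly required. Here is a rough sketch using the standard library's thread pool, which is one possible approach and not the only one. The names download_one and download_parallel are hypothetical, the pool size of 8 is an arbitrary guess, and the try/except import is only there so the code also runs on Python 3.

```python
from multiprocessing.pool import ThreadPool

try:
    from urllib2 import urlopen            # Python 2
except ImportError:
    from urllib.request import urlopen     # Python 3

URL_TEMPLATE = 'http://pmindia.gov.in/content_print.php?nodeid=%d&nodetype=2'


def download_one(node_id):
    """Hypothetical helper: fetch one page, save it, return the file name."""
    content = urlopen(URL_TEMPLATE % node_id).read()
    filename = 'speech%d.html' % node_id
    with open(filename, 'wb') as f:
        f.write(content)
    return filename


def download_parallel(node_ids, worker=download_one, pool_size=8):
    """Run worker over node_ids on pool_size threads.

    pool.map returns the results in the same order as node_ids.
    """
    pool = ThreadPool(pool_size)
    try:
        return pool.map(worker, node_ids)
    finally:
        pool.close()
        pool.join()

# Example call (this hits the network):
# download_parallel(range(1, 1442))
```

Threads fit here because the work is mostly waiting on the network. Note that an exception raised in any worker is re-raised by pool.map, so for the missing node ids you would want download_one to catch the error itself. Keep the pool small so you don't hammer the server.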

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow