Question

I am trying to read a list of URLs that I keep in a Google Docs spreadsheet. I want to read the URLs from the spreadsheet and then scrape each one.

import gdata.docs.data
import gdata.docs.client
import gdata.docs.service
import gdata.spreadsheet.service
import re, os

username        = 'myemail.nuigalway@gmail.com'
password         = 'mypassword'
doc_name        = 'My document'

gd_client = gdata.spreadsheet.service.SpreadsheetsService()
gd_client.email = username 
gd_client.password = password  
gd_client.source = 'https://docs.google.com/spreadsheet/ccc?key=0AkGb10ekJtfQdG9EOHN0VzRDdVhWaG1kNVEtdVpyRlE#gid=0'
gd_client.ProgrammaticLogin()

q = gdata.spreadsheet.service.DocumentQuery()
q['title'] = doc_name
q['title-exact'] = 'true'
feed = gd_client.GetSpreadsheetsFeed(query=q)
spreadsheet_id = feed.entry[0].id.text.rsplit('/',1)[1]
feed = gd_client.GetWorksheetsFeed(spreadsheet_id)
worksheet_id = feed.entry[0].id.text.rsplit('/',1)[1]

rows = gd_client.GetListFeed(spreadsheet_id, worksheet_id).entry


for row in rows:
    for key in row.custom:
        urls = row.custom[key].text 
    newlist = urls
print 'this is a list',  newlist 

elec_urls = newlist.strip()

# After this, each URL in the list is scraped using scraperwiki

This works fine if I have only one URL in the spreadsheet. When I have more than one URL in the document, the program scrapes only the last URL.

I thought a loop cycling from newlist[0] to newlist[i] would solve this, but I found that newlist[0] is the h of the http://(URL) of the last-entered URL, newlist[1] is t, and so on.

Any help would be appreciated, thanks.


Solution

As you said, newlist holds only the last URL. It is a single string, so indexing it gives you individual characters. Create a list before the loop and append each URL to it, instead of overwriting urls with the text of each cell:

urls = []
for row in rows:
    for key in row.custom:
        urls.append(row.custom[key].text)

Now urls is a list where each element is one URL.
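
To see the difference without logging in to Google, here is a minimal sketch that uses plain dictionaries as a stand-in for the list feed. The helper name collect_urls, the sample data, and the example.com addresses are hypothetical and not part of gdata. Stripping whitespace and skipping empty cells are assumptions about what the spreadsheet might contain.

```python
def collect_urls(rows):
    """Return one stripped URL string per non-empty cell."""
    urls = []                       # create the list once, before the loops
    for row in rows:
        for key in row:
            text = row[key]         # stand-in for row.custom[key].text
            if text and text.strip():
                urls.append(text.strip())
    return urls


# Hypothetical stand-in for the rows returned by GetListFeed(...).entry
sample_rows = [
    {"url": "http://example.com/a"},
    {"url": " http://example.com/b "},
    {"url": None},                  # assumed: an empty cell has no text
]

elec_urls = collect_urls(sample_rows)

# Indexing the list gives whole URLs; indexing one URL gives characters.
first_url = elec_urls[0]
first_char = first_url[0]

for url in elec_urls:
    print(url)                      # scrape each URL here instead of printing
```

With the real feed, the same pattern applies: replace row[key] with row.custom[key].text and loop over elec_urls when scraping.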

Licensed under: CC-BY-SA with attribution