Question

Apologies in advance for the long block of code that follows. I'm new to BeautifulSoup, but I found some useful tutorials on using it to scrape RSS feeds from blogs. Full disclosure: this code is adapted from this video tutorial, which has been immensely helpful in getting things off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.

Here's my problem: the video does a great job of showing how to print the relevant content to the console, but I need to write each article's text out to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've commented it so people can find it quickly; it's the last comment, beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.

Currently the program takes the text from the last article it reads in and writes it out to as many .txt files as there are entries in the variable listIterator. So in this case I believe 20 .txt files get written out, but they all contain the text of the last article that was looped over. What I want is for the program to loop over each article and write each article's text to its own .txt file. Sorry for the verbosity, but any insight would be really appreciated.

from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and 
# tags for article links to be downloaded.

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working. 
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the 
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to 
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
        newlist = range(len(findPatTitle))
        for i in newlist:
            ofile = open(str(listIterator[i])+'.txt', 'w')
            ofile.write(para_string)
            ofile.close()

Solution

The reason it seems that only the last article is written out is that every article is written to the same 20 files over and over again. Let's have a look at the following:

for i in paragList:
    para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()

You are writing para_string to the same 20 files on every iteration, and because opening a file in 'w' mode truncates it, each pass wipes out whatever the previous pass wrote; only the last article's text survives. What you need to do instead is append each article's para_string to a separate list, say paraStringList, and then write its contents out to separate files, like so:

for i, var in enumerate(paraStringList):  # enumerate yields (index, item) pairs
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
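
The with statement takes care of closing each file for you, and the files come out numbered 0.txt, 1.txt, and so on, one per article.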

Note that this needs to be outside of your main loop, i.e. for i in listIterator: (...). Here is a working version of the program:

from urllib import urlopen
from bs4 import BeautifulSoup
import re


webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]  # only keep the first four matches
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []

for i in listIterator:

    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)

    paragList = soup.findAll('p')

    para_string = ''

    for p in paragList:
        para_string += str(p)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
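
To see the fix in isolation, here is a minimal, self-contained sketch of the same collect-then-write pattern. The articles list below is made-up sample data, so it runs without any network access or parsing:

articles = ['First article text.', 'Second article text.', 'Third article text.']

# Build the list in one pass: one entry per article.
collected = []
for text in articles:
    collected.append(text)

# Write in a second pass, after all the collecting has finished.
for i, text in enumerate(collected):
    with open('{0}.txt'.format(i), 'w') as writer:
        writer.write(text)  # 0.txt, 1.txt and 2.txt each get their own article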