Unable to scrape certain values of a website using regex

https://stackoverflow.com/questions/23671560

23-07-2023

Question

I've been trying to scrape the information inside of a particular set of p tags on a website and running into a lot of trouble.

My code looks like:

import urllib   
import re

def scrape():
        url = "https://www.theWebsite.com"

        statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
        htmlfile = urllib.urlopen(url)
        htmltext = htmlfile.read()

        status = re.findall(statusText,htmltext)

        print("Status: " + str(status))
scrape()

Which unfortunately returns only: "Status: []"

I have no idea what I am doing wrong, because on the same website I can use this pattern instead:

statusText = re.compile('<a href="/about">(.+?)</a>')

and get what I expect: "Status: ['About', 'About']"

Does anyone know what I can do to get the information within the div tags? Or more specifically the single set of p tags the div tags contain? I've tried plugging in just about any values I could think of and have gotten nowhere. After Google, YouTube, and SO searching I'm running out of ideas now.

Solution

I use BeautifulSoup to extract information from between HTML tags. Suppose you want to extract a div like this: <div class='article_body' itemprop='articleBody'>...</div>. You can do that with BeautifulSoup as follows:

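from bs4 import BeautifulSoup

soup = BeautifulSoup(htmltext, 'html.parser')  # htmltext is the page source; create the BeautifulSoup object
ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})  # first matching div, or None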

Also see the official documentation of bs4.
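
To try the find call without fetching a page, here is a minimal sketch that runs on an inline HTML string. The markup in sample_html is made up for illustration, and the built-in 'html.parser' is only an assumption about which parser to use (lxml or html5lib should also work if installed).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example; the div spans several lines.
sample_html = """
<html><body>
<div class="article_body" itemprop="articleBody">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching Tag, or None if nothing matches.
ans = soup.find("div", {"class": "article_body", "itemprop": "articleBody"})

# Guard against None before drilling into the <p> tags inside the div.
if ans is None:
    paragraphs = []
else:
    paragraphs = [p.get_text() for p in ans.find_all("p")]

print(paragraphs)
```

Checking for None first matters because find does not raise when the div is missing; calling find_all on None would fail with an AttributeError instead.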

As an example, I have edited your code to extract a div from a Bloomberg article; you can adapt it to your own page:

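import urllib  # Python 2 API, as in the question; on Python 3 use urllib.request.urlopen
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    # Download the raw HTML of the page
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # Parse it; naming the parser avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(htmltext, 'html.parser')
    # find() returns the first matching tag, or None if there is no match
    ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
    print(ans)

scrape()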
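
Applied to the question's own case, the same idea might look like the sketch below. The HTML string is hypothetical, since the real page is not shown in the question; only the id holdsThePtagsIwant comes from it. The sketch assumes the target div spans several lines, which is one common reason a pattern like (.+?) finds nothing: "." does not match a newline unless re.DOTALL is set. Whether that is what happens on the actual site is a guess.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page source; only the div id is taken from the question.
html_text = """
<div id="holdsThePtagsIwant">
    <p>All systems operational</p>
</div>
"""

pattern = '<div id="holdsThePtagsIwant">(.+?)</div>'

# Without re.DOTALL, "." stops at newlines, so a multi-line div never matches.
regex_plain = re.findall(pattern, html_text)

# With re.DOTALL the regex matches, but returns raw markup and whitespace.
regex_dotall = re.findall(pattern, html_text, re.DOTALL)

# BeautifulSoup looks the div up by id, then reads the <p> tags inside it.
soup = BeautifulSoup(html_text, "html.parser")
div = soup.find("div", id="holdsThePtagsIwant")
status = [p.get_text(strip=True) for p in div.find_all("p")] if div else []

print("Regex, default flags:", regex_plain)
print("Regex, re.DOTALL:", regex_dotall)
print("BeautifulSoup:", status)
```

The re.DOTALL variant can work for simple pages, but it still breaks on nested div tags or changed attribute order, which is why a parser tends to be the more robust choice here.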

You can get BeautifulSoup from here.

P.S.: I use Scrapy and BeautifulSoup for web scraping and I am satisfied with them.

Licensed under: CC-BY-SA with attribution
