How can I best isolate 2 different unlabeled pieces of html using beautiful soup to be printed to a CSV?

https://stackoverflow.com/questions/20533970

31-08-2022

Question

To preface, I'm a Python beginner, and this is my first time using BeautifulSoup. Any input is greatly appreciated.

I'm attempting to scrape all the company names and email addresses from this site. There are 3 layers of links to crawl through (alphabetized pagination list -> company list by letter -> company detail page), and I'd then write the results to a CSV.

So far, I've been able to isolate the alphabetized list of links with the code below, but I'm stuck on isolating the individual company pages and then extracting the name/email from unlabeled HTML.

import re
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.indiainfoline.com/Markets/Company/A.aspx').read()
soup = BeautifulSoup(page)
soup.prettify()

pattern = re.compile(r'^\/Markets\/Company\/\D\.aspx$')

all_links = []
navigation_links = []
root = "http://www.indiainfoline.com/"

# Finding all links
for anchor in soup.findAll('a', href=True):
    all_links.append(anchor['href'])
# Isolate links matching regex
for link in all_links:
    if re.match(pattern, link):
        navigation_links.append(root + re.match(pattern, link).group(0))
navigation_links = list(set(navigation_links))

company_pages = []
for page in navigation_links:
    for anchor in soup.findAll('table', id='AlphaQuotes1_Rep_quote')[0].findAll('a', href=True):
        company_pages.append(root + anchor['href'])

Solution

Let's take it piece by piece. Getting the links to each individual company is easy:

from bs4 import BeautifulSoup
import requests

html = requests.get('http://www.indiainfoline.com/Markets/Company/A.aspx').text
# name the parser explicitly so the result does not depend on which parsers are installed
bs = BeautifulSoup(html, 'html.parser')

# find the container that holds the links to companies
company_menu = bs.find("div", {'style': 'padding-left:5px'})
# print every company link
companies = company_menu.find_all('a')
for company in companies:
    print(company['href'])
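
The printed href values may be relative paths, and gluing them onto a root string by hand can produce a double slash. A minimal sketch of one way to resolve them, assuming Python 3 and assuming the base URL below is the right one for the site; `absolute_links`, `BASE_URL` and the sample hrefs are illustrative names and values, not part of the original answer:

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the example page above.
BASE_URL = "http://www.indiainfoline.com/"


def absolute_links(hrefs, base=BASE_URL):
    """Resolve each href against base, keeping order and dropping duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base, href)  # leaves already-absolute URLs unchanged
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical href values, similar in shape to what company['href'] might hold.
sample_hrefs = [
    "/Markets/Company/Adani-Power-Ltd/533096",
    "/Markets/Company/Adani-Power-Ltd/533096",
    "http://www.indiainfoline.com/Markets/Company/A.aspx",
]
print(absolute_links(sample_hrefs))
```

In the loop above, this would be called as absolute_links(company['href'] for company in companies).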

Second, get the company names:

for company in companies:
    print(company.getText().strip())

Third, emails are a little more complicated, but you can use a regex here. On an individual company page, do the following:

import re
# example company page
html = requests.get('http://www.indiainfoline.com/Markets/Company/Adani-Power-Ltd/533096').text
# raw string so the backslashes reach the regex engine unchanged
EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")
emails = EMAIL_REGEX.findall(html)
# emails is now a list of the addresses found on the page
...
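
To get from the list of emails to the CSV the question asks for, something like the following might work. It is only a sketch under assumptions: Python 3, a made-up HTML snippet in place of a real company page, and hypothetical helper names (`extract_emails`, `write_rows`) and column headers. It writes to an in-memory buffer so it runs without touching the network or the disk.

```python
import csv
import io
import re

# Same pattern as in the answer, written as a raw string.
EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")


def extract_emails(html):
    """Return the unique mailto: addresses in html, in order of first appearance."""
    emails = []
    for email in EMAIL_REGEX.findall(html):
        if email not in emails:
            emails.append(email)
    return emails


def write_rows(rows, fileobj):
    """Write (company name, email) pairs as CSV, preceded by a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["company", "email"])
    writer.writerows(rows)


# Made-up markup standing in for a company detail page.
sample_html = (
    '<a href="mailto:info@example.com">Mail us</a> '
    '<a href="mailto:info@example.com">Mail us again</a>'
)
rows = [("Example Company Ltd", email) for email in extract_emails(sample_html)]

# Swap the buffer for open("companies.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

One thing to watch with this pattern: the part after the dot matches only 2 to 4 letters once, so an address on a multi-part domain such as example.co.in would be captured as ending in example.co. If the site uses such domains, the pattern likely needs loosening.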

Cheers,

Licensed under: CC-BY-SA with attribution
