Extracting the name of the author from an Amazon page using BeautifulSoup

https://stackoverflow.com/questions/23300899

09-07-2023

Question

I'm trying to use BeautifulSoup to extract information from an HTML file.

<a href="/s?_encoding=UTF8&amp;field-author=Reza%20Aslan&amp;search-alias=books&amp;sort=relevancerank">Reza Aslan</a> <span class="byLinePipe">(Author)</span>

I'm using BeautifulSoup's findAll function to extract the author, Reza Aslan, from the markup above with this code:

import urllib2
from bs4 import BeautifulSoup
import re


ecj_data = open("book1.html",'r').read()

soup = BeautifulSoup(ecj_data)

for definition in soup.findAll('span', {"class":'byLinePipe'}):
    definition = definition.renderContents()

Printing definition gives me: "Release date:"

This means there is another element with the class "byLinePipe":

<div class="buying"><span class="byLinePipe">Release date: </span><span style="font-weight: bold;">July 16, 2013</span> </div>

Does anybody know how I can differentiate between these two snippets so that only the author's name is printed?

Solution

It's better to find a unique marker near the author's name than to loop over every element that shares a class. For example, we can locate the book's title by its unique id, then use the find_next function to get the first link that follows it, which contains the author's name. See the code below.

Code:

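from bs4 import BeautifulSoup as bsoup
import requests as rq

url = "http://www.amazon.com/Zealot-Times-Jesus-Nazareth-ebook/dp/B00BRUQ7ZY"
r = rq.get(url)
# Name the parser explicitly so bs4 does not have to guess one.
soup = bsoup(r.content, "html.parser")

# The title span has a unique id; the author link is the first <a> after it.
title = soup.find("span", id="btAsinTitle")
author = title.find_next("a", href=True)

print(title.get_text())
print(author.get_text())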

Result:

Zealot: The Life and Times of Jesus of Nazareth [Kindle Edition]
Reza Aslan
[Finished in 2.4s]
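
Since the live page may change or block scripted requests, here is a minimal offline sketch of the same idea. The two byLinePipe snippets are taken from the question; the wrapping div and the btAsinTitle span around them are assumed for illustration, and the helper names author_after_title and author_before_byline are hypothetical. The second helper is a different technique from the one above: it filters the byLinePipe spans by their text and steps back to the preceding link, which may be useful if the title id is not available.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample markup. The byLinePipe snippets come from the question;
# the surrounding structure and the title span are assumptions for this sketch.
SAMPLE_HTML = """
<div class="buying">
  <span id="btAsinTitle">Zealot: The Life and Times of Jesus of Nazareth</span>
  <a href="/s?_encoding=UTF8&amp;field-author=Reza%20Aslan&amp;search-alias=books&amp;sort=relevancerank">Reza Aslan</a> <span class="byLinePipe">(Author)</span>
</div>
<div class="buying"><span class="byLinePipe">Release date: </span><span style="font-weight: bold;">July 16, 2013</span> </div>
"""


def author_after_title(soup):
    """Return the text of the first link that follows the title span, or None."""
    title = soup.find("span", id="btAsinTitle")
    if title is None:
        return None
    link = title.find_next("a", href=True)
    return link.get_text(strip=True) if link else None


def author_before_byline(soup):
    """Return the text of the link just before the '(Author)' span, or None."""
    for span in soup.find_all("span", class_="byLinePipe"):
        # Skip spans such as "Release date:" that share the same class.
        if span.get_text(strip=True) == "(Author)":
            link = span.find_previous_sibling("a")
            if link is not None:
                return link.get_text(strip=True)
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(author_after_title(soup))
print(author_before_byline(soup))
```

Both helpers return None instead of raising when the expected element is missing, which is worth having on a page whose layout you don't control.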

Hope this helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow