If you are not tied to lxml, you can give BeautifulSoup a try. I find it easier to use. I looked at that page but couldn't parse it properly, because it has an XML header just before the HTML header, like:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="ES" xml:lang="ES" >
...
I had to remove the first line (the XML header) to test it. That said, here is the example with BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from itertools import dropwhile
import re

html = urlopen('http://www.datosempresa.com/Categoria/peluqueria?pagina=4').read()
soup = BeautifulSoup(html, 'html')

for div in soup.find_all('div', attrs={'class':'resultados'}):
    title = div.find_next('h3').string
    # Drop every string before the one that starts with "email:" (case-insensitive).
    email = list(dropwhile(lambda x: not re.match(r'(?i)email:', x), div.strings))
    # email[0] is the label itself, so the address is the second element.
    print('{} - {}'.format(title, email[1] if email else 'Not found'))
It searches for all <div> elements whose class attribute has the value resultados, extracts all the strings from their children, and drops every string that comes before the one matching email: (ignoring case). If the resulting list is empty, it just prints Not found; otherwise the email is the second element of the list, so it extracts that.
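The filtering step can be tried on its own, without fetching the page. This is only a minimal sketch: the helper name email_after_label and the sample lists are hypothetical, standing in for whatever div.strings yields on the real page.

```python
from itertools import dropwhile
import re

def email_after_label(strings):
    """Return the string that follows the first 'email:' label, or None."""
    # Skip everything before the first string that starts with "email:" (any case).
    tail = list(dropwhile(lambda s: not re.match(r'(?i)email:', s), strings))
    # tail[0] is the label itself; the address is expected right after it.
    return tail[1].strip() if len(tail) > 1 else None

# Hypothetical strings, standing in for what div.strings might yield.
with_email = ['SOME SALON', 'Tel:', '000 000 000', 'Email:', ' someone@example.com ', 'Web:']
without_email = ['OTHER SALON', 'Tel:', '000 000 000']

print(email_after_label(with_email))
print(email_after_label(without_email))
```

Unlike the one-liner in the script, this variant also checks that something actually follows the label, so a trailing email: with no address returns None instead of raising an IndexError.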
Run it like:
python3 script.py
That yields:
MANUELA RIVERO - oscarvp30@hotmail.com
SALON DE BELLEZA LIDIA - Not found
TRUKO & HAIR DESIGN - Not found
PACO PERFUMERIAS - pacoperfumerias@gmail.com
ESTHER CENDAGORTAGALARZA ESTILISTA - peluqueriaesthercendagortagalarza@hotmail.es
ADARIS - adaris@hotmail.es
N&K NAILS - info@nknails.com
PELUQUERIA NELA - wrunela@hotmail.es
PELUQUERIA NELA - wrunela@hotmail.es
PELUQUERIA HUMBERTO STAR - humbertostar@yahoo.es
COLLADOS PELUQUEROS - contacta@colladospeluqueros.com
ZEN NATURE ESTéTICA - contacta@colladospeluqueros.com
LA CASA DE MAR - Not found
DELGADO PERRUQUERS - Not found
(...output cut to save space...)
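As for the XML header I removed by hand, it could probably be stripped in code before handing the markup to BeautifulSoup. A rough sketch, assuming the page has already been decoded to a str; the function name strip_xml_declaration and the sample markup are made up for illustration:

```python
import re

def strip_xml_declaration(markup):
    """Remove a leading <?xml ...?> declaration, if there is one."""
    # Only the first match at the very start of the text is removed.
    return re.sub(r'^\s*<\?xml[^>]*\?>\s*', '', markup, count=1)

# Made-up sample that mimics the start of the page.
sample = ('<?xml version="1.0" encoding="iso-8859-1"?>\n'
          '<!DOCTYPE html>\n'
          '<html><body></body></html>')

print(strip_xml_declaration(sample))
```

Markup without the declaration is returned unchanged, so the call should be safe to apply to every page.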