lxml and xpath in python: get pairs of h3 and email from html document in a list with possible missing e-mail

StackOverflow https://stackoverflow.com/questions/21805546

Question

I'm quite new to this, so I don't really know if this is possible:

This webpage has titles under h3, easy to get with lxml:

titles=doc.xpath("//div/h3/a/text()")

Under those, I have the emails:

emails=doc.xpath("//div/p[text()='Email: ']/a/text()")

And I can merge them into a list with '|':

both=doc.xpath("//div/h3/a/text()|//div/p[text()='Email: ']/a/text()")

The problem is that some results don't have an e-mail, so I get a bad list: some titles are followed not by an email but by another title, without even an empty list item in between. I can work around this with some post-processing, but I wonder if it's possible to return a 'not-found' when the email is missing, so I get workable pairs: title-email, title-not-found, and so on.

I tried a recipe I found here using:

emails=doc.xpath("concat(//div/p[text()='Email: ']/a/text(),substring('not-found',1 div not(//div/p[text()='Email: ']/a/text())))")

But this works only as a standalone query for the emails; if I mix it with '|' I get an XPathEvalError: Invalid type error.

For the record, this is what I tried:

emails=doc.xpath("//div/h3/a/text()|concat(//div/p[text()='Email: ']/a/text(),substring('not-found',1 div not(//div/p[text()='Email: ']/a/text())))")

I'm quite new to lxml and XPath, so maybe I'm missing an easy way to do this.


Solution

If you are not stuck with lxml, you can give BeautifulSoup a try; I find it easier to use. I looked into that page but couldn't parse it cleanly because it has an XML declaration just before the HTML header, like:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="ES" xml:lang="ES" >
...
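If you want to strip that declaration programmatically rather than by hand, a minimal sketch (assuming the page is fetched as bytes, as in the script below) could be:

import re

# Drop a leading XML declaration such as <?xml version="1.0" encoding="iso-8859-1"?>
html = re.sub(rb'^\s*<\?xml[^>]*\?>\s*', b'', html)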

I had to remove that first line (the XML declaration) to test it. That said, here is the example with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from itertools import dropwhile
import re

# Fetch the page and parse it as HTML
html = urlopen('http://www.datosempresa.com/Categoria/peluqueria?pagina=4').read()
soup = BeautifulSoup(html, 'html')

# Each result lives in a <div class="resultados">
for div in soup.find_all('div', attrs={'class': 'resultados'}):
    # The title is the string of the next <h3> tag
    title = div.find_next('h3').string
    # Drop every string that appears before one matching "email:" (case-insensitive)
    email = list(dropwhile(lambda x: not re.match(r'(?i)email:', x), div.strings))
    # If nothing matched, the list is empty; otherwise the address is the second element
    print('{} - {}'.format(title, email[1] if email else 'Not found'))

It searches for all <div> elements whose class attribute is resultados, extracts all strings from each div's children, and drops every string found before one that matches email: (ignoring case). If the resulting list is empty, it just prints Not found; otherwise the email is the second element in the list, so it extracts that.
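For completeness, roughly the same pairing can be done with lxml by iterating per result block and running the queries relative to each block. This is only a sketch: it assumes the class attribute is exactly resultados (taken from the snippet above) and that the h3 and the Email: paragraph from the question sit inside those divs.

import lxml.html

# The HTML parser ignores the XHTML namespace, so plain XPath works
doc = lxml.html.parse('http://www.datosempresa.com/Categoria/peluqueria?pagina=4').getroot()

for div in doc.xpath("//div[@class='resultados']"):
    # Querying relative to each block keeps every title paired with its own e-mail
    title = div.xpath(".//h3/a/text()")
    email = div.xpath(".//p[text()='Email: ']/a/text()")
    print('{} - {}'.format(title[0] if title else 'not-found',
                           email[0] if email else 'not-found'))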

Run it like:

python3 script.py

That yields:

MANUELA RIVERO - oscarvp30@hotmail.com
SALON DE BELLEZA LIDIA - Not found
TRUKO & HAIR DESIGN - Not found
PACO PERFUMERIAS - pacoperfumerias@gmail.com
ESTHER CENDAGORTAGALARZA ESTILISTA - peluqueriaesthercendagortagalarza@hotmail.es
ADARIS - adaris@hotmail.es
N&K NAILS - info@nknails.com
PELUQUERIA NELA - wrunela@hotmail.es
PELUQUERIA NELA - wrunela@hotmail.es
PELUQUERIA HUMBERTO STAR - humbertostar@yahoo.es
COLLADOS PELUQUEROS - contacta@colladospeluqueros.com
ZEN NATURE ESTéTICA - contacta@colladospeluqueros.com
LA CASA DE MAR - Not found
DELGADO PERRUQUERS - Not found
(...output cut to save space...)