How do I get the link and title from this (part of) html string in python

https://stackoverflow.com/questions/7223498

15-01-2021

Question

I'm writing a plugin for xbmc in python. I have got a list of strings in the format:
<a href="/www.link.to/something">name of link</a>

I get them using BeautifulStoneSoup (the relevant part of the code):

    soup = BeautifulStoneSoup(link, convertEntities=BeautifulStoneSoup.XML_ENTITIES)
    programs = soup('ul')
    i = 0
    for prog in programs:
        i = i+1
        if i==(5+getLetterValue(name)):
            j = 0
            while j < len(prog('li')):
                li = prog('li')[j]
                link = li('a')[0]

getLetterValue is a function that returns an index indicating where this specific 'ul' tag is placed (according to the desired letter).

Now I want to split link into the URL and the text. I tried using re.compile:
match=re.compile('<a href="(.+?)">(.+?)</a>').findall(link.string)
but all I get is match=[]

What have I done wrong?

Note: I know I shouldn't use regexps on HTML code, but I'm not sure this "rule" applies to small strings. Also, for some reason this is almost a standard in xbmc plugin writing, and I assume there is some reason for that.

Solution

Why not let BeautifulSoup give you the href attribute and the element contents?
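
For what it's worth, link.string holds only the text inside the tag ("name of link"), not the markup, which is why the regex finds nothing to match. A minimal sketch of asking the parser for both pieces directly is below. It assumes the modern bs4 package and its built-in "html.parser" backend rather than the BeautifulSoup 3 API used in the question, and the sample string is the one from the question:

```python
# Assumption: the modern bs4 package, not the BeautifulSoup 3 used in the question.
from bs4 import BeautifulSoup

# One of the strings from the question.
html = '<a href="/www.link.to/something">name of link</a>'

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]      # attributes are read like dictionary keys
title = link.get_text()  # the text between the opening and closing tags

print(href)
print(title)
```

Inside the loop from the question, link is already a tag object, so the same two lookups should apply to it without any re-parsing or regex.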

Other suggestions

The easiest way is to use lxml:

from lxml.html import fromstring

# link.string holds only the tag's text, so parse the full markup instead
elem = fromstring(str(link))
print(elem.attrib["href"])
print(elem.text)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow