How to prevent infinite looping in Python while parsing with lxml?

https://stackoverflow.com/questions/21708393

10-10-2022

Question

I have an HTML file of this kind:

<html>
  <head></head>
   <body>
    <p>
      <dfn>A</dfn>sometext / ''
       (<i>othertext</i>)someothertext / ''
       (<i>...</i>)
       (<i>...</i>)
    </p>
    <p>
      <dfn>B</dfn>sometext / ''
      (<i>othertext</i>)someothertext / ''
      <i>blabla</i>
      <i>bubu</i>
    </p>
  </body>
</html>

Here, sometext / '' means that text may or may not follow the dfn tag, and the same holds for the i tags. The i tags and the text inside them are also not always present. Only the text inside the dfn tag is always present.

I need to get all textual information from every p-tag:

A, sometext, othertext, someothertext.

B, sometext, othertext, someothertext.

C, sometext, othertext, someothertext.

...

Z, sometext, othertext, someothertext.

The following code almost works, except that it seems to loop endlessly when producing output.

for p in tree.xpath("//p"):
    dfn = p.xpath('./dfn/text()')
    after_dfn = p.xpath("./dfn/following::text()")
    print '\n'.join(dfn), ''.join(after_dfn)

So, supposing I have an entry for every letter of the alphabet, I get this kind of output:

> A, sometext, othertext, someothertext.
> 
> B, sometext, othertext, someothertext.
> 
> C, sometext, othertext, someothertext.
> 
> ...
> 
> Z, sometext, othertext, someothertext.
> (2nd unnecessary loop):
> 
> B, sometext, othertext, someothertext.
> 
> C, sometext, othertext, someothertext.
> 
> D, sometext, othertext, someothertext.
> 
> ...
> 
> Z, sometext, othertext, someothertext.
> (3rd unnecessary loop):
> 
> C, sometext, othertext, someothertext.
> 
> D, sometext, othertext, someothertext.
> 
> E, sometext, othertext, someothertext.
> 
> ...
> 
> Z, sometext, othertext, someothertext...etc

Strangely, it goes from the 1st p to the last one, then from the 2nd to the last one, then from the 3rd to the last one, and so on. From an initial XML file of 107 KB I end up with an enormous 26 MB of output! Please help me stop this looping.

Solution

To get all the text below every p, just do:

tree.xpath("//p//text()")

If you need the text aggregated per p, do:

[[y.strip() for y in x.xpath('.//text()') if y.strip()] for x in tree.xpath('//p')]
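
The repeated output in the question is not a real loop. The following:: axis selects every node that comes after the context node in document order, anywhere in the document, so each p also picks up the text of all later paragraphs. The sketch below compares the two expressions. It assumes Python 3 with lxml installed; the SAMPLE markup and the paragraph_texts helper are made up for illustration and are not part of the original code.

```python
from lxml import html

# Hypothetical sample, modelled on the markup in the question.
SAMPLE = """
<html><body>
  <p><dfn>A</dfn>sometext <i>othertext</i> someothertext</p>
  <p><dfn>B</dfn>sometext <i>othertext</i> someothertext <i>blabla</i> <i>bubu</i></p>
</body></html>
"""

tree = html.fromstring(SAMPLE)


def paragraph_texts(p):
    """Return the non-empty, stripped text nodes inside one <p> element."""
    return [t.strip() for t in p.xpath(".//text()") if t.strip()]


first_p = tree.xpath("//p")[0]

# following:: is not limited to the current <p>; it also reaches later paragraphs.
leaky = [t.strip() for t in first_p.xpath("./dfn/following::text()") if t.strip()]

# .//text() only selects text nodes that are descendants of the current <p>.
scoped = paragraph_texts(first_p)

print(leaky)   # includes 'B' and the rest of the second paragraph
print(scoped)  # only the first paragraph's text

# One line of output per paragraph, with no repetition.
for p in tree.xpath("//p"):
    print(", ".join(paragraph_texts(p)))
```

In the question's loop, replacing ./dfn/following::text() with .//text() keeps each printed line limited to its own paragraph.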

To extract the text of a p based on the text of one of its i tags:

>>> [y.strip() for y in tree.xpath('//i[.="blabla"]/..//text()') if y.strip()]
['B', 'sometext', 'othertext', 'someothertext', 'blabla', 'bubu']

Or by dfn text:

>>> [y.strip() for y in tree.xpath('//dfn[.="B"]/..//text()') if y.strip()]
['B', 'sometext', 'othertext', 'someothertext', 'blabla', 'bubu']
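
When the value to look up varies, it can be passed as an XPath variable instead of being pasted into the expression string. This is a sketch under the same assumptions (Python 3, lxml installed). The texts_for_term helper and the sample markup are hypothetical; binding XPath variables through keyword arguments to xpath() is a documented lxml feature.

```python
from lxml import html

# Hypothetical sample, modelled on the markup in the question.
SAMPLE = """
<html><body>
  <p><dfn>A</dfn>sometext <i>othertext</i> someothertext</p>
  <p><dfn>B</dfn>sometext <i>othertext</i> someothertext <i>blabla</i> <i>bubu</i></p>
</body></html>
"""

tree = html.fromstring(SAMPLE)


def texts_for_term(root, term):
    """Return the stripped text nodes of every <p> whose <dfn> text equals term."""
    # $term is bound from the keyword argument, so quotes inside term need no escaping.
    nodes = root.xpath("//dfn[. = $term]/..//text()", term=term)
    return [t.strip() for t in nodes if t.strip()]


print(texts_for_term(tree, "B"))
print(texts_for_term(tree, "missing"))  # no matching dfn gives an empty list
```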

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow