I have an HTML file of this kind:
<html>
<head></head>
<body>
<p>
<dfn>A</dfn>sometext / ''
(<i>othertext</i>)someothertext / ''
(<i>...</i>)
(<i>...</i>)
</p>
<p>
<dfn>B</dfn>sometext / ''
(<i>othertext</i>)someothertext / ''
<i>blabla</i>
<i>bubu</i>
</p>
</body>
</html>
"sometext / ''" means that some text may or may not follow the dfn tag; the same goes for the i tags. The i tags and the text inside them are also not always present. Only the text inside the dfn tag is always present.
I need to get all the textual information from every p tag:
A, sometext, othertext, someothertext.
B, sometext, othertext, someothertext.
C, sometext, othertext, someothertext.
...
Z, sometext, othertext, someothertext.
The following code works almost OK, except that the output repeats itself over and over, as if it were looping far more often than it should.
for p in tree.xpath("//p"):
    dfn = p.xpath('./dfn/text()')
    after_dfn = p.xpath("./dfn/following::text()")
    print '\n'.join(dfn), ''.join(after_dfn)
So, supposing the file has an entry for every letter of the alphabet, I get this kind of output:
> A, sometext, othertext, someothertext.
>
> B, sometext, othertext, someothertext.
>
> C, sometext, othertext, someothertext.
>
> ...
>
> Z, sometext, othertext, someothertext.
> (2nd unnecessary loop):
>
> B, sometext, othertext, someothertext.
>
> C, sometext, othertext, someothertext.
>
> D, sometext, othertext, someothertext.
>
> ...
>
> Z, sometext, othertext, someothertext.
> (3rd unnecessary loop):
>
> C, sometext, othertext, someothertext.
>
> D, sometext, othertext, someothertext.
>
> E, sometext, othertext, someothertext.
>
> ...
>
> Z, sometext, othertext, someothertext...etc
Strangely, it goes from the 1st p to the last one, then from the 2nd to the last one, then from the 3rd to the last one, and so on.
From an initial XML file of 107 KB I end up with a monstrous 26 MB of output when doing this!
Please help me stop this looping.
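A possible explanation, offered as a sketch rather than a confirmed diagnosis: in XPath the following:: axis selects every node that comes after the context node in document order, all the way to the end of the document, so "./dfn/following::text()" is not limited to the current p. That would match the pattern above, where each p prints its own text plus the text of every later p. The sample below restricts the text to each p's own descendants with itertext(). The names SAMPLE and paragraph_texts are made up for illustration, the markup is a shortened stand-in for the real file, and the stdlib fallback import is only there so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library the loop above appears to use
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback exposing the same calls

# Shortened, hypothetical stand-in for the real file
SAMPLE = """<html><head></head><body>
<p><dfn>A</dfn>sometext (<i>othertext</i>)someothertext</p>
<p><dfn>B</dfn>sometext (<i>othertext</i>)someothertext</p>
</body></html>"""


def paragraph_texts(root):
    """Return one whitespace-normalized string per <p>, built only from that <p>."""
    results = []
    for p in root.iter("p"):
        # itertext() yields the text and tail strings found inside this <p>,
        # in document order; it never reaches into later paragraphs
        text = "".join(p.itertext())
        results.append(" ".join(text.split()))
    return results


root = etree.fromstring(SAMPLE)
lines = paragraph_texts(root)
for line in lines:
    print(line)
```

With lxml, an XPath that stays inside the paragraph, such as p.xpath(".//text()"), should give the same per-paragraph result; the point is only to avoid the following:: axis, which is scoped to the whole document.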