Parse '<' Symbol with lxml

https://stackoverflow.com/questions/19313152

30-06-2022

Question

I'm currently facing a problem with MathJax equations containing '<' symbols. If I parse them with lxml, the string gets cropped.

Is there a way to tell the parser not to remove unknown tags (I guess that's the problem) but to keep them as they are?

For example:

from lxml import html
s = "<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
tree = html.fragment_fromstring(s)
html.tostring(tree)

gives:

'<div> This is a text with mathjax like $11$! </div>'

It would be fine if the '<' got escaped and nothing was cropped.

I am fully aware that this is not valid XML. Unfortunately, I cannot replace the '<' symbols with the correct HTML entity in the source, because I am actually trying to parse a Markdown file containing HTML tags, and '<' is a perfectly valid character there.

Thanks!

Jakob

Solution 2

lxml's own parser does not work here, but lxml's BeautifulSoup-based parser (lxml.html.soupparser, which needs BeautifulSoup installed) works fine!

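import lxml.html.soupparser as sp  # requires BeautifulSoup to be installed
from lxml import html

s1 = "This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!"
soup1 = sp.fromstring(s1)
# Serialize the tree to text, then turn the escaped entities back into literal characters
print(sp.unescape(html.tostring(soup1, encoding='unicode')))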

gives:

<html>This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!</html>

OTHER TIPS

If you're using an XML parser to parse something that is not valid XML, then you're not using the right tool for the job.

Other options are to write a custom parser, or to first pass your Markdown content through a Markdown engine (e.g. https://github.com/trentm/python-markdown2 or https://pypi.python.org/pypi/Markdown) to turn it into proper HTML, and then parse that HTML with lxml's HTML parser (or any other HTML parser, for that matter).
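
A lighter-weight variant of the custom-parser idea is to pre-escape bare '<' characters before handing the text to lxml. The sketch below is only an illustration and rests on one assumption: a '<' followed by a letter, '/', or '!' is markup, and any other '<' is literal text. The helper name escape_bare_lt is hypothetical and not part of lxml. A formula such as $a<b$ would still be misread as a tag, so this is not a general solution.

```python
import re

# Assumption: a '<' that starts real markup is followed by a letter (tag),
# '/' (closing tag) or '!' (comment/doctype). Any other '<' is literal text.
_BARE_LT = re.compile(r"<(?![A-Za-z/!])")


def escape_bare_lt(text):
    """Hypothetical helper: replace each bare '<' with the entity '&lt;'."""
    return _BARE_LT.sub("&lt;", text)


s = "<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
escaped = escape_bare_lt(s)
print(escaped)
# <div> This is a text with mathjax like $1&lt;2$, let's see if this works till here $2>1$! </div>
```

The escaped string can then be passed to html.fragment_fromstring as in the question. Because the '<' is now an entity, the parser should no longer treat it as the start of a tag.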

Licensed under: CC-BY-SA with attribution
