jericho-html: text extraction and incorrect text length

https://stackoverflow.com/questions/18028373

23-06-2022
Question

Today I tried to use the jericho-html 3.2 library to extract text from simple HTML, and I ran into a strange text length problem:

If I have HTML like this:

Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>

...my RichTextArea's getText().length() returns 42, which is the correct length. But when I extract the text from this HTML with code like this:

    import net.htmlparser.jericho.Source;

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

...text.length() returns 44.

I don't understand why text of length 42 turns into text of length 44. How can I fix it?

Thanks

Solution 2

I dug deeper, and I think the wrong text length comes from the HTML line breaks: the Jericho HTML parser apparently replaces <br> line breaks with spaces or something similar. That would account for the two extra characters, one separator for each <br><br> gap between the three lines.

For now I cannot say for sure which other tags it replaces, or with which characters, but in my case I used a workaround that strips the <br> tags with a regular expression before parsing:

    html = html.replaceAll("<br>", "");

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

...so now it returns the original text length of 42 :)
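The arithmetic behind the two lengths can be checked without Jericho. The sketch below is plain Java and only imitates the behaviour guessed at above. The helper names brAsSpace and brRemoved are made up for illustration, and the rule that each run of <br> tags ends up as a single space (with trailing whitespace trimmed) is an assumption, not something confirmed against the Jericho source here.

```java
class BrLengthSketch {
    // Hypothetical helper, not part of Jericho: treat every <br> as whitespace,
    // collapse each whitespace run into one space, and trim both ends.
    static String brAsSpace(String html) {
        return html.replaceAll("(?i)<br\\s*/?>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Hypothetical helper: drop every <br> without leaving a separator behind,
    // which is what the workaround above does before parsing.
    static String brRemoved(String html) {
        return html.replaceAll("(?i)<br\\s*/?>", "");
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>";
        System.out.println(brAsSpace(html).length()); // 44: 3 x 14 characters + 2 separators
        System.out.println(brRemoved(html).length()); // 42: 3 x 14 characters, no separators
    }
}
```

Note that removing <br> outright glues the lines together ("Hello World :)Hello World :(Hello World ;)"). That matches the 42 from the question, but the line boundaries are lost.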

I hope this tip saves someone's day.

Thank you all for your help.

OTHER TIPS

The length is indeed 44: you need to count each <br> tag as one character, each space as one character, and each smiley as one character.

H(1)e(2)l(3)l(4)o(5) (6)W(7)o(8)r(9)l(10)d(11) (12):)(13)<br>(14)<br>(15)H(16)e(17)l(18)l(19)o(20) (21)W(22)o(23)r(24)l(25)d(26) (27):((28)<br>(29)<br>(30)H(31)e(32)l(33)l(34)o(35) (36)W(37)o(38)r(39)l(40)d(41) (42);)(43)<br>(44)

Licensed under: CC-BY-SA with attribution
