How to use Unicode with Enlive for web scraping

https://stackoverflow.com/questions/10640792

09-06-2021

Question

I'm trying to scrape a few sites that require Unicode support. For example, I'm trying to get the title of this book, but the result comes back as jumbled characters:

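(require '[net.cgrand.enlive-html :as enlive])
;; Fetch the page and select the contents of <h1 id="page-title">.
(-> "http://www.brill.nl/publications/evliya-celebis-book-travels"
    java.net.URL.
    enlive/html-resource
    (enlive/select [:h1#page-title]) first :content)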

Scraping an Arabic site returns ?????? all over the place:

(enlive/html-resource (java.net.URL. "http://www.aljazeera.net/portal"))

I'm not sure how I'm supposed to enable Unicode support.

Solution

Enlive does have Unicode support because it uses Java strings. I ran your first example on my computer and got this result:

(Evliyā Çelebi's Book of Travels)

Perhaps the font that you are using doesn't have glyphs for the code points that you are trying to display?
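One way to tell a display problem from a decoding problem is to look at the code points in the scraped string rather than at how the terminal renders them. The following is a minimal sketch; `code-points` is a hypothetical helper name, not part of Enlive, and it relies only on `String.codePoints` from Java 8 or later.

```clojure
(defn code-points
  "Returns the Unicode code points of s as a vector of \"U+XXXX\" strings."
  [^String s]
  ;; .codePoints yields an IntStream; .toArray turns it into an int array
  ;; that Clojure can treat as a sequence.
  (mapv #(format "U+%04X" %) (.toArray (.codePoints s))))

(comment
  ;; A correctly decoded "yā" contains U+0101 for the a-with-macron:
  (code-points "y\u0101") ; => ["U+0079" "U+0101"]
  )
```

If the output shows the expected code points, the string is fine and the problem is most likely the font or the console encoding. If it shows U+FFFD (the replacement character) or U+003F (a literal question mark), the text was probably damaged when the bytes were decoded or printed.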

OTHER TIPS

Christophe Grand, the author of Enlive, was kind enough to reply on the Enlive mailing list. His suggestion was quite informative, so I have copied the email below:

Hello,

Enlive is not (and does not include) a full-featured HTTP agent. When you pass a java.net.URL to html-resource, it calls .getContent on it, gets an InputStream, and then assumes UTF-8. However, if you know the actual encoding, you can do:

;; Wrap the stream in a Reader that decodes with the page's actual encoding.
(-> "http://www.brill.nl/publications/evliya-celebis-book-travels"
    java.net.URL.
    .getContent
    (java.io.InputStreamReader. "ENCODING GOES HERE")
    enlive/html-resource
    (enlive/select [:h1#page-title])
    first :content)

Or use an HTTP agent library that detects the correct encoding, and pass the resulting Reader to html-resource.

hth,

Christophe
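The email's second suggestion, detecting the encoding instead of hard-coding it, can be approximated without an extra library by reading the charset the server declares in its Content-Type header. This is only a sketch under stated assumptions: `charset-from-content-type` and `url->reader` are hypothetical helper names, the UTF-8 fallback is a choice made here, and pages that declare their encoding only in a `<meta>` tag are not handled.

```clojure
(import '(java.io InputStreamReader)
        '(java.net URL))

(defn charset-from-content-type
  "Returns the charset named in a Content-Type header value,
  or `default` when the header is missing or declares none."
  [content-type default]
  (or (some->> content-type
               (re-find #"(?i)charset\s*=\s*\"?([^\s;\"]+)")
               second)
      default))

(defn url->reader
  "Opens url-string and returns a Reader that decodes the response
  with the server-declared charset, falling back to UTF-8."
  [url-string]
  (let [conn    (.openConnection (URL. url-string))
        charset (charset-from-content-type (.getContentType conn) "UTF-8")]
    (InputStreamReader. (.getInputStream conn) ^String charset)))

(comment
  ;; Usage with Enlive (requires the Enlive dependency on the classpath):
  (require '[net.cgrand.enlive-html :as enlive])
  (-> "http://www.brill.nl/publications/evliya-celebis-book-travels"
      url->reader
      enlive/html-resource
      (enlive/select [:h1#page-title])
      first
      :content))
```

Keeping the header parsing in its own pure function makes it easy to check at the REPL without touching the network. A full HTTP client library would typically also cope with redirects and other cases this sketch ignores.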

Licensed under: CC-BY-SA with attribution
