Question

I am trying to scrape the content of a web page using Enlive's html-resource function, but I get a 403 response because the request does not come from a browser. I gather this can be overridden in Java (found an answer here), but I would like to see a Clojure way to handle it. Perhaps it can be done by passing parameters to html-resource, but I have not found an example of how to do that or what needs to be passed. Any suggestions will be greatly appreciated.

Thanks.

Solution

Enlive's html-resource does not provide a way to override the default request properties. As in the Java answer you found, you can open the connection yourself, set the User-Agent header on it, and pass the resulting InputStream to html-resource.

Something like the following would handle it:

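(require '[net.cgrand.enlive-html :refer [html-resource]])

;; Open the connection by hand so the User-Agent header can be set,
;; then pass the response stream to Enlive. .getInputStream is used
;; because .getContent is not guaranteed to return an InputStream.
(with-open [inputstream (-> (java.net.URL. "http://www.example.com/")
                            .openConnection
                            (doto (.setRequestProperty "User-Agent"
                                                       "Mozilla/5.0 ..."))
                            .getInputStream)]
  (html-resource inputstream))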

It might read better split out into its own function, though.
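One way that split might look is sketched below. The namespace and the helper names open-connection and fetch-resource are made up for illustration and are not part of Enlive; the sketch assumes Enlive is on the classpath and that the site only checks the User-Agent header.

```clojure
(ns scrape.example
  (:require [net.cgrand.enlive-html :as html])
  (:import (java.net URL)))

(defn open-connection
  "Hypothetical helper: returns a URLConnection for `url` with the
  User-Agent request header set. No network I/O happens here."
  [url user-agent]
  (doto (.openConnection (URL. url))
    (.setRequestProperty "User-Agent" user-agent)))

(defn fetch-resource
  "Hypothetical helper: requests `url` while sending `user-agent`
  and returns the nodes parsed by Enlive's html-resource."
  [url user-agent]
  ;; with-open closes the stream once html-resource has parsed it.
  (with-open [in (.getInputStream (open-connection url user-agent))]
    (html/html-resource in)))
```

It would then be called as (fetch-resource "http://www.example.com/" "Mozilla/5.0 ..."), and the result can be passed to html/select as usual. Keeping the header setup in its own function also makes it easy to add further request properties later if the server turns out to check more than the user agent.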

Licensed under: CC-BY-SA with attribution