I am trying to crawl some URLs using Nutch 1.7, but I'm facing authentication issues and a "connection refused" exception.
According to the logs, it tries to authenticate with NTLM, but after that it shows "Redirect required" and finally releases the connection (see logpart-1).
Following the Nutch tutorial at http://wiki.apache.org/nutch/HttpAuthenticationSchemes#A_note_on_NTLM_domains I have set the auth configuration in the httpclient-auth.xml file:
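As a rough sketch of the shape such a file takes for the protocol-httpclient plugin: the host and port below are the ones that appear in the log, while the username, password, and the EXAMPLEDOMAIN value are placeholders and not real values. As far as I understand the wiki note linked above, for NTLM the realm attribute carries the domain name.

```xml
<?xml version="1.0"?>
<!-- Sketch only: username, password and EXAMPLEDOMAIN are placeholders. -->
<auth-configuration>
  <credentials username="someuser" password="somepassword">
    <!-- Scope the credentials to the NTLM-protected host seen in the log. -->
    <authscope host="sp.zzz.com" port="80" realm="EXAMPLEDOMAIN" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

If the host, port, or scheme in the authscope does not match the scope that HttpClient reports in the log (NTLM <any realm>@sp.zzz.com:80), the credentials would presumably not be picked up for that request.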
I have defined the httpclient plugin in the plugin.includes property of both nutch-site.xml and nutch-default.xml:
plugin.includes
protocol-(httpclient|http)|urlfilter-regex|parse-(text|html|tika)|index-(more|basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
I have also defined the auth configuration file in nutch-site.xml:
http.auth.file
httpclient-auth.xml
Authentication configuration file for 'protocol-httpclient' plugin.
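Written out in the usual Nutch property format, the two properties above would look roughly like this inside the configuration element of nutch-site.xml. The names, values, and description are the ones listed above; only the surrounding XML structure is filled in here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Enables protocol-httpclient (needed for authentication) alongside the other plugins. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-(httpclient|http)|urlfilter-regex|parse-(text|html|tika)|index-(more|basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- Points protocol-httpclient at the credentials file. -->
  <property>
    <name>http.auth.file</name>
    <value>httpclient-auth.xml</value>
    <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
  </property>
</configuration>
```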
But I have had no success so far.
Am I configuring the authentication incorrectly, or am I missing something?
Can anyone please help me with the proper authentication configuration for Nutch?
Attaching the complete hadoop.log:
logpart-1: authentication
2014-04-16 05:11:23,712 DEBUG httpclient.HttpMethodDirector - Authorization required
2014-04-16 05:11:23,712 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-04-16 05:11:23,731 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,732 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Retry authentication
2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
2014-04-16 05:11:23,733 DEBUG cookie.CookieSpec - Unrecognized cookie attribute: name=HttpOnly, value=null
2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Cookie accepted: "PHPSESSID=9f9378mvh9e720f5o3l0ibc1o7"
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Redirect required
2014-04-16 05:11:23,735 DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
2014-04-16 05:11:23,735 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=www.xxxportal.com]
2014-04-16 05:11:23,736 DEBUG util.IdleConnectionHandler - Adding connection at: 1397643083736
2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Notifying no-one, there are no waiting threads
2014-04-16 05:11:23,737 DEBUG httpclient.HttpMethodBase - Adding Host request header
2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authorization required
2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,745 INFO regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials required
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials provider not available
2014-04-16 05:11:23,745 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
2014-04-16 05:11:23,746 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
2014-04-16 05:11:23,746 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=sp.zzz.com]
For a few of the other links I am getting:
I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
I am not behind a proxy, and I have turned off all firewall settings on the system, but I still cannot work out why I am getting the connection refused exception.
Please help me understand the exact problem in this case as well.
Attaching the complete hadoop.log:
logpart-2: connection refused
2014-04-16 05:11:26,443 INFO fetcher.Fetcher - * queue: www.xxxportal.com
2014-04-16 05:11:26,443 INFO fetcher.Fetcher - maxThreads = 1
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - inProgress = 0
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - crawlDelay = 5000
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - minCrawlDelay = 0
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - nextFetchTime = 1397643088739
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - now = 1397643086444
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - 0. www.xxxportal.com/profiles/
2014-04-16 05:11:26,445 INFO fetcher.Fetcher - 1. www.xxxportal.com/wiki/index.php
2014-04-16 05:11:26,445 INFO fetcher.Fetcher - 2. www.xxxportal.com/sop/
2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Closing the connection.
2014-04-16 05:11:26,560 INFO httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Connection refused: connect
java.net.ConnectException: Connection refused: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:140)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:94)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
at org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:75)
at org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:157)
at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:391)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:676)
2014-04-16 05:11:26,564 INFO httpclient.HttpMethodDirector - Retrying request
2014-04-16 05:11:26,565 DEBUG httpclient.HttpConnection - Open connection to www.zzzlearninglounge.com:80