Question

I am trying to crawl some URLs using Nutch 1.7, but I'm facing two problems:

  1. authentication issues and

  2. a connection refused exception.

According to the logs, Nutch tries to authenticate with NTLM, but then it shows "Redirect required" and finally releases the connection (see logpart-1).

Following the Nutch tutorial at

http://wiki.apache.org/nutch/HttpAuthenticationSchemes#A_note_on_NTLM_domains

  1. I have set the auth-configuration in httpclient-auth.xml file:

  2. I have defined the httpclient property in both nutch-site.xml and nutch-default.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-(httpclient|http)|urlfilter-regex|parse-(text|html|tika)|index-(more|basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

  3. I have also defined the auth configuration file in nutch-site.xml:

    <property>
      <name>http.auth.file</name>
      <value>httpclient-auth.xml</value>
      <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
    </property>

But none of this has worked.

Am I configuring the authentication incorrectly, or am I missing something?

Can anyone please help me with the proper authentication configuration for Nutch?


Attaching complete hadoop.log:

logpart-1:authentication

2014-04-16 05:11:23,712 DEBUG httpclient.HttpMethodDirector - Authorization required
2014-04-16 05:11:23,712 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-04-16 05:11:23,731 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,732 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Retry authentication
2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
2014-04-16 05:11:23,733 DEBUG cookie.CookieSpec - Unrecognized cookie attribute: name=HttpOnly, value=null
2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Cookie accepted: "PHPSESSID=9f9378mvh9e720f5o3l0ibc1o7"
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Redirect required
2014-04-16 05:11:23,735 DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
2014-04-16 05:11:23,735 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=www.xxxportal.com]
2014-04-16 05:11:23,736 DEBUG util.IdleConnectionHandler - Adding connection at: 1397643083736
2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Notifying no-one, there are no waiting threads
2014-04-16 05:11:23,737 DEBUG httpclient.HttpMethodBase - Adding Host request header
2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authorization required
2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,745 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials required
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials provider not available
2014-04-16 05:11:23,745 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
2014-04-16 05:11:23,746 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
2014-04-16 05:11:23,746 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=sp.zzz.com]

For a few other links I am getting:

I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect

I am not behind any proxy and I have turned off all the firewall settings on the system, yet I still get the connection refused exception and cannot find the exact reason for it.

Please help me understand the exact problem in this case as well.

Attaching the complete hadoop.log!

logPart2-connection refused.

2014-04-16 05:11:26,443 INFO  fetcher.Fetcher - * queue: www.xxxportal.com
2014-04-16 05:11:26,443 INFO  fetcher.Fetcher -   maxThreads    = 1
2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   inProgress    = 0
2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   crawlDelay    = 5000
2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   minCrawlDelay = 0
2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   nextFetchTime = 1397643088739
2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   now           = 1397643086444
2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   0. www.xxxportal.com/profiles/
2014-04-16 05:11:26,445 INFO  fetcher.Fetcher -   1. www.xxxportal.com/wiki/index.php
2014-04-16 05:11:26,445 INFO  fetcher.Fetcher -   2. www.xxxportal.com/sop/
2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Closing the connection.
2014-04-16 05:11:26,560 INFO  httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Connection refused: connect
java.net.ConnectException: Connection refused: connect
                at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
                at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
                at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
                at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
                at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
                at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
                at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
                at java.net.Socket.connect(Socket.java:579)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:140)
                at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
                at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
                at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
                at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
                at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
                at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
                at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
                at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:94)
                at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
                at org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:75)
                at org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:157)
                at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:391)
                at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:676)
2014-04-16 05:11:26,564 INFO  httpclient.HttpMethodDirector - Retrying request
2014-04-16 05:11:26,565 DEBUG httpclient.HttpConnection - Open connection to www.zzzlearninglounge.com:80

Solution

1) The logs clearly show an authentication failure for NTLM on your particular site.

First, check the username and password.

Then check the authentication scheme (Basic/NTLM) and the port on which you want to authenticate.

If you validate these three points and use the correct values, your authentication problem should be resolved.
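
As a rough illustration of those three points, an httpclient-auth.xml entry for the host in the log might look like the sketch below. It follows the format described on the Nutch wiki page linked in the question. The username, password, and domain are placeholders (assumptions, not values from the question); the host and port are taken from the "Authentication scope: NTLM <any realm>@sp.zzz.com:80" log line. The "Credentials provider not available" message in the log suggests that no configured credentials matched that scope.

```xml
<?xml version="1.0"?>
<!-- Sketch only: "crawluser", "secret" and "MYDOMAIN" are placeholder values. -->
<auth-configuration>
  <credentials username="crawluser" password="secret">
    <!-- Host and port should match the server that issues the NTLM challenge. -->
    <!-- NTLM has no realms; the realm attribute carries the domain name. -->
    <authscope host="sp.zzz.com" port="80" realm="MYDOMAIN" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

NTLM may also need the http.agent.host property in nutch-site.xml set to the name or IP address of the machine running the crawler, so that is worth checking alongside the three points above.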

Licensed under: CC-BY-SA with attribution