Question

I am writing a crawler. Once the crawler logs into a website, I want it to "stay always logged in". How can I do that? Can a client (such as a browser or a crawler) make a server obey this rule? This scenario could occur when the server allows only a limited number of logins per day.


Solution

"Logged-in state" is usually represented by cookies. So what your have to do is to store the cookie information sent by that server on login, then send that cookie with each of your subsequent requests (as noted by Aiden Bell in his message, thx).

See also this question:

How to "keep-alive" with cookielib and httplib in python?

A more comprehensive article on how to implement it:

http://www.voidspace.org.uk/python/articles/cookielib.shtml

The simplest examples are at the bottom of this manual page:

https://docs.python.org/library/cookielib.html

You can also use a regular browser (like Firefox) to log in manually. Then you can save the cookies from that browser and use them in your crawler (see the sketch below). But such cookies are usually valid only for a limited time, so it is not a long-term, fully automated solution. It can be quite handy for downloading content from a website once, however.
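
If you go this route, many browser extensions can export cookies in the Netscape/Mozilla cookies.txt format, which http.cookiejar can read directly. A sketch, assuming the exported file is named cookies.txt and the target URL is a placeholder:

    import urllib.request
    from http.cookiejar import MozillaCookieJar

    # Load cookies previously exported from the browser session.
    cookie_jar = MozillaCookieJar("cookies.txt")
    cookie_jar.load(ignore_discard=True, ignore_expires=True)

    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )

    # Requests made through this opener reuse the browser's login.
    page = opener.open("https://example.com/members-only").read()
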

UPDATE:

I've just found another interesting tool in a recent question:

http://www.scrapy.org

It can also do such cookie-based logins:

http://doc.scrapy.org/topics/request-response.html#topics-request-response-ref-request-userlogin
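
For reference, here is a sketch of what such a login looks like with Scrapy's FormRequest.from_response (written against the current Spider API rather than the older domain_name style; URLs and form field names are placeholders):

    import scrapy
    from scrapy.http import FormRequest


    class LoginSpider(scrapy.Spider):
        name = "login_example"
        start_urls = ["https://example.com/login"]

        def parse(self, response):
            # Fill in the login form found on the page; Scrapy keeps the
            # resulting session cookies and sends them on later requests.
            return FormRequest.from_response(
                response,
                formdata={"username": "me", "password": "secret"},
                callback=self.after_login,
            )

        def after_login(self, response):
            # From here on the spider is logged in; crawl protected pages.
            yield scrapy.Request(
                "https://example.com/members-only", callback=self.parse_page
            )

        def parse_page(self, response):
            self.logger.info("Fetched %s while logged in", response.url)
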

The question I mentioned is here:

Scrapy domain_name for spider

Hope this helps.

Licensed under: CC-BY-SA with attribution