Question

I'm trying to make a simple script in Python that will scan a tweet for a link and then visit that link. I'm having trouble determining which direction to go from here. From what I've researched, it seems that I can use Selenium or Mechanize, which can be used for browser automation. Would using these be considered web scraping?

Or

I can learn one of the Twitter APIs, the Requests library, and Pyjamas (which converts Python code to JavaScript), so I can make a simple script and load it as a Google Chrome/Firefox extension.

Which would be the better option to take?

Solution

There are many different ways to go when doing web automation. Since you're doing stuff with Twitter, you could try the Twitter API. If you're doing any other task, there are more options.
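
If you take the API route, here is a minimal sketch using tweepy (my choice of client, not something this answer prescribes; the credentials, screen name, and count are placeholders):

import webbrowser

import tweepy

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Scan recent tweets for links and visit each one in the default browser.
for tweet in api.user_timeline(screen_name='some_user', count=5):
    for url in tweet.entities.get('urls', []):
        webbrowser.open(url['expanded_url'])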

  • Selenium is very useful when you need to click buttons or enter values in forms; see the first sketch after this list. The only drawback is that it opens a separate browser window.

  • Mechanize, unlike Selenium, does not open a browser window and is also good for manipulating buttons and forms, though it might need a few more lines to get the job done; the second sketch after this list shows it.

  • Urllib/Urllib2 is what I use. Some people find them a bit hard at first, but once you know what you're doing, they are very quick and get the job done. Plus you can handle cookies and proxies. They are built-in libraries, so there is no need to download anything.

  • Requests is just as good as urllib, but I don't have a lot of experience with it. You can do things like add headers. It's a very good library.
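
As a rough illustration of the Selenium bullet above (the URL and element names are placeholder assumptions, not a real site):

from selenium import webdriver

driver = webdriver.Firefox()  # opens a separate browser window
driver.get('https://example.com/login')

# Fill a form field and click a button; adjust the selectors to the real page.
# (Older Selenium API; newer versions use find_element(By.NAME, ...).)
driver.find_element_by_name('username').send_keys('myuser')
driver.find_element_by_name('login').click()

driver.quit()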
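
And a comparable Mechanize sketch, with no browser window (the form index and field name are assumptions):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # mechanize honours robots.txt by default
br.open('https://example.com/login')

br.select_form(nr=0)         # pick the first form on the page
br['username'] = 'myuser'
response = br.submit()       # equivalent to clicking the submit button
print(response.read())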

Once you get the page you want, I recommend using BeautifulSoup to parse out the data; a combined requests + BeautifulSoup sketch follows.
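
A minimal sketch of that fetch-then-parse flow with requests and BeautifulSoup (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com', headers={'User-Agent': 'my-script'})
soup = BeautifulSoup(response.text, 'html.parser')

# Pull every link out of the fetched page.
for a in soup.find_all('a', href=True):
    print(a['href'])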

I hope this leads you in the right direction for web automation.

Other tips

I am not an expert in web scraping, but I have some experience with both Mechanize and Selenium. I think in your case either Mechanize or Selenium will suit your needs well, but also spend some time looking into these Python libraries: Beautiful Soup, urllib, and urllib2.

In my humble opinion, I would recommend Mechanize over Selenium in your case, because Selenium is not as lightweight as Mechanize. Selenium is used for emulating a real web browser, so you can actually perform click actions.

There are some drawbacks to Mechanize. You will find Mechanize gives you a hard time when you try to click an input-type button. Also, Mechanize doesn't understand JavaScript, so many times I had to mimic what the JavaScript was doing in my own Python code.

One last piece of advice: if you somehow decide to pick Selenium over Mechanize in the future, use a headless browser like PhantomJS, rather than Chrome or Firefox, to reduce Selenium's computation time. Hope this helps, and good luck.
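
A minimal headless sketch along those lines (note that PhantomJS support has since been deprecated in newer Selenium releases in favour of headless Chrome/Firefox; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.PhantomJS()  # no visible browser window
driver.get('https://example.com')
print(driver.title)
driver.quit()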

For web automation: webbot

For web scraping: Scrapy

webbot works even for web pages with dynamically changing IDs and class names, and it has more methods and features than Selenium and Mechanize.

Here's a snippet of webbot

from webbot import Browser

web = Browser()
web.go_to('google.com')
web.click('Sign in')
web.type('mymail@gmail.com', into='Email')
web.click('NEXT', tag='span')
web.type('mypassword', into='Password', id='passwordFieldId')  # specific selection
web.click('NEXT', tag='span')  # you are logged in ^_^

For web scraping, Scrapy seems to be the best framework.

It is very well documented and easy to use.
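
As a hedged illustration (the spider name and start URL are placeholders), a minimal Scrapy spider that yields every link on a page looks like this:

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Yield each link found on the page as a scraped item.
        for href in response.css('a::attr(href)').getall():
            yield {'url': href}

Save it as link_spider.py and run it with scrapy runspider link_spider.py -o links.json.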

Licensed under: CC BY-SA with attribution
Not affiliated with Stack Overflow