Question

I want to scrape some data from a website. I have used import.io but am still not very satisfied with it. Can any of you suggest an alternative? What's the best tool for getting unstructured data from the web?

Solution

Try BeautifulSoup - http://www.crummy.com/software/BeautifulSoup/

From the website: "Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping." I have not personally used it, but it often comes up as a nice library for scraping. Here's a blog post on using it to scrape Craigslist: http://www.gregreda.com/2014/07/27/scraping-craigslist-for-tickets/
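
To give a rough idea of what that looks like, here is a minimal sketch, assuming the `beautifulsoup4` package is installed. The HTML snippet, the `item` and `price` class names, and the `extract_listings` helper are made up for illustration; a real page would need its own selectors, and you would fetch the HTML first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page you have already downloaded.
SAMPLE_HTML = """
<html><body>
  <ul id="listings">
    <li class="item"><a href="/tickets/1">Concert tickets</a> <span class="price">$40</span></li>
    <li class="item"><a href="/tickets/2">Game tickets</a> <span class="price">$25</span></li>
  </ul>
</body></html>
"""


def extract_listings(html):
    """Return one dict per <li class="item"> element (illustrative helper)."""
    # "html.parser" is the parser bundled with Python; no extra install needed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="item"):
        link = item.find("a")
        price = item.find("span", class_="price")
        rows.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "price": price.get_text(strip=True),
        })
    return rows


listings = extract_listings(SAMPLE_HTML)
for row in listings:
    print(row)
```

The result is a plain list of dicts, so it can go straight into a CSV writer or a data frame.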

OTHER TIPS

You don't mention what language you're programming in (please consider adding it as a tag), so the general advice is to find an HTML parser and use that to pull the data. Some web sites have simply awful HTML code and can be very difficult to scrape, and just when you think you have it...

An HTML parser will parse all the HTML and allow you to access it in a structured way, whether that's as an array, an object, etc.
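
As an illustration of the idea, here is a small sketch in Python using only the standard library's `html.parser`. The `LinkCollector` class and the sample markup are hypothetical; the point is just that the parser copes with sloppy HTML and hands you the pieces in a structured form.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects an (href, text) pair for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Deliberately messy markup: the <p> and <li> tags are never closed.
MESSY_HTML = '<p>See <a href="/a">first <b>link</b></a><li><a href="/b">second</a>'

collector = LinkCollector()
collector.feed(MESSY_HTML)
collector.close()
links = collector.links
print(links)
```

Most languages have an equivalent library, usually with a friendlier interface than this event-based one.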

Ruby together with Nokogiri lets you access HTML and XML documents via XPath and CSS selectors. Here is a tutorial.

You don't need a tool and I don't recommend you use one.

Convert the HTML to well-formed XML (XHTML); I recommend TagSoup.

Once you've done that the data is just another XML feed and you can write an XSLT transformation (or XQuery) to access and pull out the data you want in the format you want.

That might mean learning XSLT/XQuery if you don't already know it, but you will be learning skills that (unlike scraping tools) have more than one useful application.
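
To show the "it's just XML now" step, here is a rough sketch. It is not XSLT or XQuery: it uses Python's standard `xml.etree.ElementTree`, which supports only a small subset of XPath, as a stand-in. The XHTML sample, the `prices` table, and the `extract_prices` helper are invented for the example, and it assumes the converter emitted elements in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML, as you might get after cleaning up a page.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="prices">
      <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
      <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
    </table>
  </body>
</html>"""

# Prefix-to-URI mapping so the path expressions can name XHTML elements.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_prices(xhtml):
    """Return (name, price) tuples from the table with id="prices"."""
    root = ET.fromstring(xhtml)
    rows = []
    # ElementTree understands this limited XPath: descendant search,
    # attribute predicates, and namespace prefixes.
    for tr in root.findall(".//x:table[@id='prices']/x:tr", NS):
        name = tr.find("x:td[@class='name']", NS).text
        price = float(tr.find("x:td[@class='price']", NS).text)
        rows.append((name, price))
    return rows


prices = extract_prices(XHTML)
print(prices)
```

The same path expressions carry over more or less directly to a full XPath, XSLT, or XQuery processor.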

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange