質問

I'm looking for a method to scrape a website from server side (which uses javascript) and save the output after analyzing data into a mysql database. I need to navigate from page to page by clicking links and submitting data from the database,without session expiring . Is this possible using phpquery web browser plugin? . I've started doing this using casperjs. I would like to know the pros and cons of both methods. I'm a beginner in the coding space. Please help.

役に立ちましたか?

解決

I would recommend that you use PhantomJS or CasperJS and parse the DOM with JavaScript selectors to get the parts of the pages you want back. Don't use phpQuery as it's based on PHP and would require a separate step in your processing versus using just JavaScript DOM parsing. Also, you won't be able to perform click events using PHP. Anything client side would need to be run in PhantomJS or CasperJS.

It might even be possible to write a full scraping engine using just PHP if that's your server side language of choice. You would need to reverse engineer the login process and maintain a cookie jar with your cURL requests to keep your login valid with each request. Once you've established a session with the the website, you can then setup your navigation path with an array of links that you would like to crawl. The idea behind web crawling is that you load a page from some link and process the page and then move to the next link. You continue this process until all pages have been processed and then your crawl is complete.

他のヒント

I would check out Google's guide Making AJAX Applications Crawlable the website you're trying to scrap might have adopted the scheme (making their site's content crawlable).

You want to look for #! in the URL's hash fragment, this indicates to the crawler that the site supports the AJAX crawling scheme.

To put it simply, when you come across a URL like this. www.example.com/ajax.html#!key=value you would modify it to www.example.com/ajax.html?_escaped_fragment_=key=value. The server should respond with a HTML snapshot of that page.

Here is the Full Specification

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top