Problem

People write these programs all the time, and I know what they do, but how do they actually do it? I'm looking for the general concepts.

Solution

Technically, screen scraping is any program that grabs the display data of another program and ingests it for its own use.

Screen scraping often refers to a web client that parses the HTML pages of a targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing its data programmatically.

One example of a library used for this purpose is Hpricot for Ruby, one of the better-architected HTML parsers used for screen scraping.
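
Hpricot itself is Ruby-specific, so here is a rough sketch of the same parse-and-extract idea in C# (to match the other snippets in this thread), using an HTML parser such as HtmlAgilityPack. The URL and the "headline" class are invented placeholders, not anything a real site promises:

// Hypothetical sketch: load a page and pull out elements by XPath.
// The URL and the CSS class are made up for illustration.
using System;
using HtmlAgilityPack;

var doc = new HtmlWeb().Load("https://example.com/news");
var headlines = doc.DocumentNode.SelectNodes("//h2[@class='headline']");
if (headlines != null)
{
    foreach (var h in headlines)
        Console.WriteLine(h.InnerText.Trim());
}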

Other tips

There are plenty of accurate answers here.

What nobody has said is: don't do it!

Screen scraping is what you do when nobody has provided you with a reasonable machine-readable interface. It's hard to write, and brittle.

As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?

Of course, sometimes you have no choice :(

In general, a screen scraper is a program that captures output from a server program by mimicking the actions of a person sitting in front of a workstation using a browser or a terminal-access program. At certain key points, the program interprets the output and then takes an action or extracts specific pieces of information from it.

Originally this was done with character/terminal output from mainframes, to extract data from or update systems that were archaic or not directly accessible to the end user. In modern terms, it usually means parsing the output of an HTTP request to extract data or take some other action. With the advent of web services this kind of thing should have died off, but not all applications provide a nice API to interact with.

A screen scraper downloads the HTML page and pulls out the data of interest, either by searching for known tokens or by parsing it as XML or some such.
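
As a minimal sketch of the "search for known tokens" variant in C# (the URL and the markers are assumptions made up for illustration):

// Hypothetical sketch: fetch a page and grab the text between two known tokens.
using System;
using System.Net;

string html = new WebClient().DownloadString("https://example.com/quote");
string startToken = "<span id=\"price\">";  // invented marker
int start = html.IndexOf(startToken);
if (start >= 0)
{
    start += startToken.Length;
    int end = html.IndexOf("</span>", start);
    if (end >= 0)
        Console.WriteLine(html.Substring(start, end - start));
}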

In the early days of PCs, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract or update information on the mainframe. In more recent times, the concept has been applied to any application that provides an interface via web pages.

With the emergence of SOA, screen scraping is a convenient way to service-enable applications that aren't. In those cases, scraping the web pages is the more common approach taken.

Here's a tiny bit of screen scraping implemented in JavaScript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):

// Show my SO reputation score
var repval = $('span.reputation-score:first');
alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() + '" has (' + repval.html() + ') Reputation Points.');

If you run Firebug, copy the above code, paste it into the Console, and see it in action right here on this question page.

If SO changes the DOM structure, element class names, or URI path conventions, all bets are off and it may no longer work - that's the usual risk in screen-scraping endeavors, where there is no contract/understanding between the parties (the scraper and the scrapee [yes, I just invented a word]).

Typically, you have an HTML page that contains some data you want. What you do is write a program that will fetch that web page and attempt to extract the data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
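
Here is a sketch of that regex approach in C#, anchoring on a (hopefully) unique marker right next to the data rather than on generic tags that repeat throughout the page; the URL, the "user-karma" id, and the pattern are all invented for illustration:

// Hypothetical sketch: anchor the regex on a unique id near the data,
// not on generic surrounding HTML that appears many times in the document.
using System;
using System.Net;
using System.Text.RegularExpressions;

string html = new WebClient().DownloadString("https://example.com/profile");
Match m = Regex.Match(html, "id=\"user-karma\"[^>]*>\\s*(\\d+)");
if (m.Success)
    Console.WriteLine("Karma: " + m.Groups[1].Value);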

"Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle."

Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decent APIs. I've worked with screen-scraping companies, and often the APIs are so problematic (ranging from cryptic errors to bad results), and so often lack the full functionality the website provides, that it can be better to screen scrape (web scrape, if you will). The extranet/website portals are used by more customers/brokers than API clients, and are thus better supported. In big companies, changes to extranet portals etc. are infrequent, usually because the portal was originally outsourced and is now just maintained. I refer more to screen scraping where the output is tailored, e.g. a flight on a particular route and time, an insurance quote, a shipping quote, etc.

In terms of doing it, it can be as simple as using a web client to pull the page contents into a string and then using a series of regular expressions to extract the information you want.

using System.Net;

string pageContents = new WebClient().DownloadString("https://www.stackoverflow.com");
int numberOfPosts = 0;  // ... extracted from pageContents with a regex match

Obviously, in a large-scale environment you'd be writing more robust code than the above.
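
What "more robust" might mean in practice, sketched with HttpClient (a timeout, a status check, and a simple retry loop; the specifics below are assumptions, not prescriptions):

// Hypothetical sketch of a more defensive fetch: timeout, status check, retry.
using System;
using System.Net.Http;

var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
string pageContents = null;
for (int attempt = 0; attempt < 3 && pageContents == null; attempt++)
{
    try
    {
        var response = await client.GetAsync("https://stackoverflow.com");
        response.EnsureSuccessStatusCode();  // throw on 4xx/5xx
        pageContents = await response.Content.ReadAsStringAsync();
    }
    catch (Exception e) when (e is HttpRequestException || e is TaskCanceledException)
    {
        // transient failure or timeout: fall through and retry
    }
}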

"A screen scraper downloads the HTML page and pulls out the data of interest, either by searching for known tokens or by parsing it as XML or some such."

That is a cleaner approach than regex... in theory. In practice, however, it's not quite as easy, given that most documents will need to be normalized to XHTML before you can XPath through them; in the end we found that fine-tuned regular expressions were more practical.

License: CC-BY-SA with attribution