Question

What is the most convenient way to remove all the HTML tags when using the SAS URL access method to read web pages?


Solution

This should do what you want. It removes everything between the angle brackets, including the brackets themselves, and leaves just the text content.

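/*-- FILENAME is a global statement, so define the fileref before the step --*/
filename INDEXIN URL "http://www.zug.com/";

Data HTMLData;

/*-- INPUT needs an INFILE to read from; each record is one line of the page --*/
infile INDEXIN lrecl=32767 truncover;

input;

textline = _INFILE_;

/*-- Clear out the HTML text (a tag split across two lines is not matched) --*/
re1 = prxparse("s/<(.|\n)*?>//");
call prxchange(re1, -1, textline);

run;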

OTHER TIPS

I think the better approach is not to strip the HTML from the page, but to identify the standard patterns around the data you are trying to capture. This is the Perl / regular-expression way of doing it.

An example might be some data or a table that always appears a certain number of characters after the logo image. You could write a script that keeps only that data.
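
As a rough sketch of that pattern-matching idea, the step below keeps only the text captured by one pattern instead of stripping every tag. The choice of the <title> element, the data set name PageTitle, and the variable names are all assumptions made for illustration, and it only works if the opening and closing tags sit on the same line of the page.

```sas
/* Hypothetical example: capture only the text inside <title>...</title>. */
filename INDEXIN url "http://www.zug.com/";

data PageTitle;
    infile INDEXIN lrecl=32767 truncover;
    input;
    length title $200;
    retain re;
    /* Compile the pattern once; the parentheses mark the text to keep. */
    if _n_ = 1 then re = prxparse('/<title>(.*?)<\/title>/i');
    /* Write a row only for lines where the pattern matches. */
    if prxmatch(re, _infile_) then do;
        title = prxposn(re, 1, _infile_);
        output;
    end;
    keep title;
run;
```

The same structure should carry over to other targets by swapping in a pattern that anchors on whatever markup reliably surrounds the data you want.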

If you want to post some of the HTML, maybe we can help decode it.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow