Question

I have a few HTML pages, each with a number of posts that follow a given pattern and contain a lot of different information, including a well-identified URL and an associated name and date. I would like to produce a table containing date + name + URL in separate columns and ignore the rest of the document (both data and HTML formatting).

I was thinking of using OpenOffice and its regex functions, but I don’t see how I would do the actual extraction from HTML to a table. I am familiar with search and replace, but I am not sure extraction is possible: Jan Dvorak’s third comment on the question "How to extract file name from random image <img> tags in Open Office" suggests it is not.

Is there a good way to do this text extraction, in OpenOffice or with any other tool?

Solution

Is there a good way to do this text extraction, in OpenOffice or with any other tool?

Since you're parsing HTML, it is easier to use an HTML parser than regular expressions. For example, in PHP, the Simple HTML DOM library (which provides file_get_html) lets you pull all the links or all the images from a page in a few lines.

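// Load the Simple HTML DOM library, which provides file_get_html()
// (assumes simple_html_dom.php sits next to this script)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('path and file name');

// Find all images and print each image source
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links and print each link target
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}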

This could be further refined if you had some additional information about the values being pulled and how they are stored in the file.
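
As a rough illustration of that refinement, here is a minimal sketch that turns each post into one date + name + URL row and writes the rows as CSV, which OpenOffice Calc can open as a table. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, so no extra library is needed (PHP 7.4 or later with the DOM extension is assumed). The markup is purely hypothetical: the div.post, span.date and a.source names, the sample data and the extractRows helper are all made up for the example, so the XPath expressions would need adjusting to the real pattern of your posts.

```php
<?php
// Hypothetical sample markup; the real pages will differ, so adjust the
// XPath expressions below to match the actual pattern of each post.
$html = <<<HTML
<div class="post">
  <span class="date">2013-05-01</span>
  <p>Some text to ignore. <a class="source" href="http://example.com/a">First name</a></p>
</div>
<div class="post">
  <span class="date">2013-05-02</span>
  <p>More text to ignore. <a class="source" href="http://example.com/b">Second name</a></p>
</div>
HTML;

// Return one [date, name, url] row per post found in the HTML string.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//div[@class="post"]') as $post) {
        // Look up the date and the link relative to the current post.
        $date = $xpath->query('.//span[@class="date"]', $post)->item(0);
        $link = $xpath->query('.//a[@class="source"]', $post)->item(0);
        if ($date === null || $link === null) {
            continue; // skip posts that do not follow the pattern
        }
        $rows[] = [
            trim($date->textContent),
            trim($link->textContent),
            $link->getAttribute('href'),
        ];
    }
    return $rows;
}

$rows = extractRows($html);

// Write the rows as CSV to standard output, with a header line first.
$out = fopen('php://output', 'w');
fputcsv($out, ['date', 'name', 'url'], ',', '"', '');
foreach ($rows as $row) {
    fputcsv($out, $row, ',', '"', '');
}
fclose($out);
```

To process the real pages, you could pass file_get_contents('page.html') (a placeholder file name) to extractRows for each page and redirect the script's output to a .csv file.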

Licensed under: CC-BY-SA with attribution