Question

I am trying to extract data from thousands of identical Excel 2007/2010 files. I would prefer to do this using scraping techniques. Is it possible to scrape an Excel file, given that, as far as I know, the file is basically some sort of XML format?

So, is it possible to convert an Excel file to XML or some other markup format?


Solution

The XLSX format is actually a ZIP archive with a different extension. If you unzip it using your favorite zip program, you'll find that the worksheet data is located inside xl/worksheets, with each worksheet saved as a separate XML document. Be aware that text cells usually store an index into xl/sharedStrings.xml rather than the text itself, so you will probably need that file as well. You should be able to use XSLT, as Michael suggested, to extract the data you require.
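If you would rather not unzip thousands of files by hand, the same idea can be scripted. Below is a minimal sketch in Python using only the standard library, not a definitive implementation. The function names read_shared_strings and read_cells are made up for illustration, and the default part name xl/worksheets/sheet1.xml is an assumption: it is what Excel typically writes for the first sheet, but your files may differ. Dates, formulas, and styles are not interpreted here; you only get the raw stored values as strings.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by the worksheet and shared-string parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
T_TAG = "{%s}t" % NS["m"]
C_TAG = "{%s}c" % NS["m"]


def read_shared_strings(archive):
    """Return the shared string table as a list (empty if the part is absent)."""
    if "xl/sharedStrings.xml" not in archive.namelist():
        return []
    root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    # An <si> entry holds either one <t> or several rich-text runs; join them.
    return ["".join(t.text or "" for t in si.iter(T_TAG))
            for si in root.findall("m:si", NS)]


def read_cells(path, sheet_part="xl/worksheets/sheet1.xml"):
    """Map cell references (e.g. 'A1') to their stored values as strings.

    sheet_part is an assumed default; pass the part name your files use.
    """
    with zipfile.ZipFile(path) as archive:
        strings = read_shared_strings(archive)
        root = ET.fromstring(archive.read(sheet_part))

    cells = {}
    for cell in root.iter(C_TAG):
        value = cell.find("m:v", NS)
        if value is None:
            continue  # empty or inline-string cell; skipped in this sketch
        text = value.text
        if cell.get("t") == "s":
            # t="s" means the value is an index into the shared string table.
            text = strings[int(text)]
        cells[cell.get("r")] = text
    return cells
```

Calling read_cells("report.xlsx") (the file name is just an example) returns a dictionary keyed by cell reference, which you can loop over for each of your files.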

OTHER TIPS

Excel 2010 saves files in an XML-based format (.xlsx) by default, although the file itself is a ZIP package of XML documents rather than a single XML file. So what file format are your Excel files currently in (i.e., what extension do they have)? Your question is somewhat ambiguous on this matter. If they are already in XML, you can use XSLT to scrape them.
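If the extension alone does not settle the question, you can check the file contents instead. Here is a rough sketch in Python; looks_like_xlsx is a hypothetical helper name, and the test for an xl/workbook.xml entry assumes the layout Excel normally writes, which other tools are not strictly required to follow.

```python
import zipfile


def looks_like_xlsx(path):
    """Return True if path is a ZIP archive containing xl/workbook.xml.

    Older binary .xls files are not ZIP archives, so they return False.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Assumption: the workbook part sits at Excel's usual location.
        return "xl/workbook.xml" in archive.namelist()
```

Files that pass this check can be opened as ZIP archives and their XML parts read directly; files that fail are probably in the older binary format and need a different approach.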

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow