Question

I have a large number of HTML files saved from a website, each containing data in tables with a specific format. How can I extract the data from those files and import it into Excel or write it to a CSV file?

The files are stored on a local hard drive.

Solution

You need to read these files and parse them; then you can write the results to a CSV file. Please give more information so I can give a better answer. Do you have FTP access to the server where these files are stored?

EDIT:

Use PHP to iterate over the directory and find the .html files (or whichever files you need), and store the list in a variable. Then loop over that variable with foreach(), open each file, and parse it with a library such as a PHP HTML parser. Finally, write the parsed results to a CSV file.
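
The same steps (list the files, parse each one, write the rows to CSV) could be sketched roughly like this. To stay in one language with the other code on this page, this is JavaScript for Node.js rather than PHP. It is only a minimal sketch: the function names are made up for illustration, and it assumes the tables are simple and not nested, so that regular expressions are enough. For messier markup a real HTML parser would be the safer choice.

```javascript
const fs = require('fs');
const path = require('path');

// Return the full paths of all .html/.htm files directly inside `dir`.
function listHtmlFiles(dir) {
    return fs.readdirSync(dir)
        .filter(function (name) { return /\.html?$/i.test(name); })
        .sort()
        .map(function (name) { return path.join(dir, name); });
}

// Decode a handful of common entities (not a complete list).
function decodeEntities(text) {
    return text
        .replace(/&nbsp;/g, ' ')
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'")
        .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Extract rows as arrays of cell text.
// ASSUMPTION: simple, non-nested tables with explicit closing tags.
function extractTableRows(html) {
    const rows = [];
    const rowPattern = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
    let rowMatch;
    while ((rowMatch = rowPattern.exec(html)) !== null) {
        const cells = [];
        const cellPattern = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi;
        let cellMatch;
        while ((cellMatch = cellPattern.exec(rowMatch[1])) !== null) {
            const text = cellMatch[1].replace(/<[^>]*>/g, ''); // drop inner tags
            cells.push(decodeEntities(text).replace(/\s+/g, ' ').trim());
        }
        if (cells.length > 0) rows.push(cells);
    }
    return rows;
}

// Quote a field only when it contains a quote, comma, or line break.
function toCsvLine(cells) {
    return cells.map(function (cell) {
        return /[",\r\n]/.test(cell) ? '"' + cell.replace(/"/g, '""') + '"' : cell;
    }).join(',');
}

// Read every HTML file in `dir` and write all table rows to `outFile`.
// Returns the number of rows written.
function convertDirectory(dir, outFile) {
    const lines = [];
    listHtmlFiles(dir).forEach(function (file) {
        const html = fs.readFileSync(file, 'utf8');
        extractTableRows(html).forEach(function (cells) {
            lines.push(toCsvLine(cells));
        });
    });
    fs.writeFileSync(outFile, lines.join('\r\n') + '\r\n', 'utf8');
    return lines.length;
}
```

With the files in a folder such as `./htmlFiles` (a placeholder path), calling `convertDirectory('./htmlFiles', 'out.csv')` would produce a single CSV file that Excel can open. Header rows are written once per file here, so they may need to be filtered out if every file repeats them.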

Other tips

Assuming you have 20000 files whose names follow a convention like file1.html, file2.html, etc.

And the page contains this export link:

    <a class="export" id="export" href="#">Export</a>

Here is the JS. It was written based on the assumptions above.

    // with the help of http://jsfiddle.net/terryyounghk/KPEGU/
    function exportTableToCSV($table, filename) {
        var $rows = $table.find('tr:has(td)'),
        // Temporary delimiter characters unlikely to be typed by keyboard
        // This is to avoid accidentally splitting the actual contents
        tmpColDelim = String.fromCharCode(11), // vertical tab character
        tmpRowDelim = String.fromCharCode(0), // null character

        // actual delimiter characters for CSV format
        colDelim = '","',
        rowDelim = '"\r\n"',

        // Grab text from table into CSV formatted string
        csv = '"' + $rows.map(function (i, row) {
            var $row = $(row),
                $cols = $row.find('td');

            return $cols.map(function (j, col) {
                var $col = $(col),
                    text = $col.text();

                return text.replace(/"/g, '""'); // escape every double quote, not just the first

            }).get().join(tmpColDelim);

        }).get().join(tmpRowDelim)
            .split(tmpRowDelim).join(rowDelim)
            .split(tmpColDelim).join(colDelim) + '"',

        // Data URI
        csvData = 'data:application/csv;charset=utf-8,' + encodeURIComponent(csv);

        $(this).attr({
            'download': filename,
            'href': csvData,
            'target': '_blank'
        });
    }
    // http://www.2ality.com/2013/11/initializing-arrays.html
    // Returns [1, 2, ..., n], matching the file1.html, file2.html, ... naming
    function fillArrayWithNumbers(n) {
        var arr = Array.apply(null, Array(n));
        return arr.map(function (x, i) {
            return i + 1;
        });
    }

    // This must be a hyperlink
    $(".export").on('click', function (event) {
        // CSV
        var that = this;
        var data = fillArrayWithNumbers(20000);
        // async.each comes from the Async.js library, which must be loaded separately
        async.each(data, function (i, cb) {
            $.get(["./htmlFiles/file", i, ".html"].join('')).done(function (html) {
                // Wrap the parsed nodes in a <div> so that top-level <table> elements are found too
                var tables = $('<div>').append($.parseHTML(html)).find('table');
                $.each(tables, function () {
                    var table = $(this);
                    // Each call overwrites the link's href, so only the last table stays downloadable.
                    // If all the tables share the same structure, you can merge the CSV strings and download one file.
                    // For CSV, don't call event.preventDefault() or return false;
                    // we actually need this to behave like a typical hyperlink
                    exportTableToCSV.apply(that, [table, 'export.csv']);
                });
                // Signal completion exactly once per file, after all of its tables are handled
                cb();
            }).fail(function () {
                cb();
            });
        });
    });
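
The remark about merging everything into one download could look something like the following. This is a minimal sketch, assuming every table has the same columns. `csvField`, `rowsToCsv`, and `createCsvCollector` are hypothetical helper names, not part of the code above. Rows are collected as plain arrays of cell text, and the data URI is built once at the end instead of once per table.

```javascript
// Quote one field: wrap it in double quotes and double any embedded quotes.
function csvField(value) {
    return '"' + String(value).replace(/"/g, '""') + '"';
}

// Turn an array of rows (each an array of cell strings) into CSV text.
function rowsToCsv(rows) {
    return rows.map(function (cells) {
        return cells.map(csvField).join(',');
    }).join('\r\n');
}

// Accumulate rows from many tables and export them as one CSV.
function createCsvCollector() {
    var allRows = [];
    return {
        // Append the rows of one table.
        addTable: function (rows) {
            allRows = allRows.concat(rows);
        },
        // Number of rows collected so far.
        rowCount: function () {
            return allRows.length;
        },
        // CSV text for everything collected so far.
        toCsv: function () {
            return rowsToCsv(allRows);
        },
        // Data URI suitable for a link's href attribute.
        toDataUri: function () {
            return 'data:text/csv;charset=utf-8,' + encodeURIComponent(rowsToCsv(allRows));
        }
    };
}
```

In the click handler, each table's cell text would be passed to `addTable`, and the link's `href` would be set from `toDataUri()` only after all files have been processed, for example in the optional final callback of `async.each`. For very large exports a data URI may hit browser size limits, so this is probably only practical for moderate amounts of data.
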

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow