Processing large xlsx file

https://stackoverflow.com/questions/4897766

28-10-2019
|

سؤال

I need to auto-fit all rows in large (30k+ rows) xlsx file.

The following code via apache poi works on small files, but goes out with OutOfMemoryError on large ones:

Workbook workbook = WorkbookFactory.create(inputStream);
Sheet sheet = workbook.getSheetAt(0);

for (Row row : sheet) {
    row.setHeight((short) -1);
}

workbook.write(outputStream);

Update: Unfortunately, increasing heap size is not an option - OutOfMemoryError appears at -Xmx1024m and 30k rows is not an upper limit.

المحلول

Try using the event API. See Event API (HSSF only) and XSSF and SAX (Event API) in the POI documentation for details. A couple of quotes from that page:

HSSF:

The event API is newer than the User API. It is intended for intermediate developers who are willing to learn a little bit of the low level API structures. Its relatively simple to use, but requires a basic understanding of the parts of an Excel file (or willingness to learn). The advantage provided is that you can read an XLS with a relatively small memory footprint.

XSSF:

If memory footprint is an issue, then for XSSF, you can get at the underlying XML data, and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of low level structure of .xlsx files, and who are happy processing XML in java. Its relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read a XLSX file with a relatively small memory footprint.

For output, one possible approach is described in the blog post Streaming xlsx files. (Basically, use XSSF to generate a container XML file, then stream the actual content as plain text into the appropriate xml part of the xlsx zip archive.)

نصائح أخرى

A dramatic improvement in memory usage can be done by using a File instead of a Stream. (It is better to use a streaming API, but the Streaming API's have limitations, see http://poi.apache.org/spreadsheet/index.html)

So instead of

Workbook workbook = WorkbookFactory.create(inputStream);

Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx"));

This is according to : http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream

Files vs InputStreams

"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."

I was having the same problem with a lot less of row, but large strings.

Since I don't have to keep my data loaded, I found out that I can use SXSSF instead of XSSF.

They have similar interfaces, which helps if you have a lot of code already writen. But with SXSSF it is possible to set the amount of rows you keep loaded.

Here is the link. http://poi.apache.org/spreadsheet/how-to.html#sxssf

If you want to auto-fit or set styles or write all rows in large (30k+ rows) xlsx file,use SXSSFWorkbook.Here is the sample code that helps you...

SXSSFWorkbook wb = new SXSSFWorkbook();
            SXSSFSheet sheet = (SXSSFSheet) wb.createSheet("writetoexcel");
            Font font = wb.createFont();
                font.setBoldweight((short) 700);
                // Create Styles for sheet.
                XSSFCellStyle Style = (XSSFCellStyle) wb.createCellStyle();
                Style.setFillForegroundColor(new XSSFColor(java.awt.Color.LIGHT_GRAY));
                Style.setFillPattern(XSSFCellStyle.SOLID_FOREGROUND);
                Style.setFont(font);
                //iterating r number of rows
            for (int r=0;r < 30000; r++ )
            {
                Row row = sheet.createRow(r);
                //iterating c number of columns
                for (int c=0;c < 75; c++ )
                {
                    Cell cell = row.createCell(c);
                    cell.setCellValue("Hello"); 
                    cell.setCellStyle(Style);
                }
    }
            FileOutputStream fileOut = new FileOutputStream("E:" + File.separator + "NewTest.xlsx");

I used Event API for a HSSF file (.xls), and I discovered terrible lack of documentation about order of records.

If you are writing to XLSX, I found an improvement by writing to different sheets of the same Excel file. You also might find an improvement by writing to different Excel files. But first try writing to different sheets.

The best example for this is described in the following stack overflow thread: Error While Reading Large Excel Files (xlsx) Via Apache POI

The code snippet in the main answer in that topic illustrates the Apache POI wrappings around SAX xml parsing, and how you can trivially loop over all the sheets and then over each individual cell.

The code is stale with current implementation of the Apache POI API, as the endRow() api provides the current row number that has finished to be processing.

With that code snippet it should be trivial for your to parse a big XLSX file cell by cell. E.g. for each sheet; for each row cell; row has ended event. You could trivial create app logic where at the of each row you create a Map of columneName to cellValue.

I had the same problem with 800,000 cells and 3M characters where XSSF allocates 1GB of heap!

I used Python with openpyxl and numpy to read the xlsx file (from Java code) and first convert it to a normal text. Then I loaded the text file in java. It may seems to have large overhead, but it is indeed fast.

The python script looks like

import openpyxl as px
import numpy as np

# xlsx file is given through command line foo.xlsx
fname = sys.argv[1]
W = px.load_workbook(fname, read_only = True)
p = W.get_sheet_by_name(name = 'Sheet1')

a=[]
# number of rows and columns
m = p.max_row
n = p.max_column

for row in p.iter_rows():
    for k in row:
        a.append(k.value)

# convert list a to matrix (for example maxRows*maxColumns)
aa= np.resize(a, [m, n])

# output file is also given in the command line foo.txt
oname = sys.argv[2]
print (oname)
file = open(oname,"w")
mm = m-1
for i in range(mm):
    for j in range(n):
        file.write( "%s " %aa[i,j]  )
    file.write ("\n")

# to prevent extra newline in the text file
for j in range(n):
    file.write("%s " %aa[m-1,j])

file.close()

Then in my java code, I wrote

try {
  // `pwd`\python_script  foo.xlsx  foo.txt
  String pythonScript =  System.getProperty("user.dir") + "\\exread.py ";
  String cmdline = "python " + pythonScript +
                    workingDirectoryPath + "\\" + fullFileName + " " + 
                    workingDirectoryPath + "\\" + shortFileName + ".txt";
  Process p = Runtime.getRuntime().exec(cmdline);
  int exitCode = p.waitFor();
  if (exitCode != 0) {
    throw new IOException("Python command exited with " + exitCode);
  }
} catch (IOException e) {
  System.out.println( e.getMessage() );
} catch (InterruptedException e) {
  ReadInfo.append(e.getMessage() );
}

After that, you will get foo.txt which is similar to foo.xlsx, but in text format.

I used SAX parser to process XML structure. It works for XLSX files.

https://stackoverflow.com/a/44969009/4587961

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow