Question

I have a directory with files. The archive is very big: it holds 1.5 million PDF files. The directory is stored on an IBM i server running OS V7R1, and the machine is new and very fast.
The files are named like this :

invoice_[custno]_[year]_[invoice_number].pdf
invoice_081500_2013_7534435564.pdf

Now I try to find files with the find command using the shell:

find  . -name 'invoice_2013_*.pdf'  -type f | ls -l > log.dat

The command ran for a long time, so I aborted the operation with no result.

If I try it on smaller directories, everything works fine.

Later I want to have a job that runs every day and finds the files created in the last 24 hours, but it always runs so slowly that I can forget about that plan.

Solution

That invocation would never work because ls does not read filenames from stdin.

Possible solutions are:

Use the find utility's built-in list option:

find . -name 'invoice_2013_*.pdf' -type f -ls > log.dat

Use the find utility's -exec option to execute ls -l for each matching file:

find . -name 'invoice_2013_*.pdf' -type f -exec ls -l {} \; > log.dat

Pipe the filenames to the xargs utility and let it execute ls -l with the filenames as parameters:

find . -name 'invoice_2013_*.pdf' -type f | xargs ls -l > log.dat

A pattern search of 1.5 million files in a single directory is going to be inefficient on any filesystem.
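
For the daily job mentioned in the question, a common shell pattern is to keep an empty marker file and have find compare modification times against it, so each run lists only the files that appeared since the previous run. A minimal sketch, assuming QShell's find supports the POSIX -newer test; the /home/archive path, the marker-file name, and the output-file name are placeholders:

# one-time setup: create the marker file
touch /home/archive/.lastrun

# daily job: list files changed since the previous run, then reset the marker
find /home/archive -type f -name 'invoice_*.pdf' -newer /home/archive/.lastrun > new_files.dat
touch /home/archive/.lastrun

Note that find still has to read every directory entry to evaluate the test, so this does not remove the cost of scanning 1.5 million names; the journaling approach described below avoids that scan entirely.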

Other tips

For looking only at a list of new entries in the directory, you might consider journaling the directory. You would specify INHERIT(*NO) to prevent journaling all the files in the directory as well. Then you could simply extract the recent journal entries with DSPJRN to find out what objects had been added.
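
A minimal CL sketch of that setup; the library, journal, receiver, and path names are invented for illustration, and the DSPJRN selection is only indicative (verify the exact parameters on your release):

CRTJRNRCV JRNRCV(ARCLIB/ARCRCV001)
CRTJRN JRN(ARCLIB/ARCJRN) JRNRCV(ARCLIB/ARCRCV001)

/* Journal the directory itself, but keep new files out of journaling */
STRJRN OBJ(('/archive')) JRN('/QSYS.LIB/ARCLIB.LIB/ARCJRN.JRN') SUBTREE(*NO) INHERIT(*NO)

/* Daily: dump today's IFS journal entries (journal code B) to an output file */
DSPJRN JRN(ARCLIB/ARCJRN) FROMTIME(*CURRENT) JRNCDE((B)) OUTPUT(*OUTFILE) OUTFILE(ARCLIB/NEWOBJS)

The output file then holds one row per journal entry, and the entry-specific data identifies the object that was added.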

I don't think I'd put more than maybe 15k files in a single directory. Some QShell utilities run into trouble at around 16k files. But I'm not sure I'd store them in a directory in any case, except maybe for ones over 16MB if that's a significant fraction of the total. I'd possibly look to store them in CLOBs/BLOBs in the database first.

Storing them as individual stream file objects brings ownership/authority problems that need to be addressed. Some profile is accumulating entries in its owned-objects table, and I'd expect that table to be getting pretty large, perhaps approaching one or more limits.

By storing in the database, you drop to a single owned object.

Or perhaps a few similar objects... There might be a purging/archiving process that moves rows off to a secondary or tertiary table. Hard to guess how that might need to be structured, if at all.
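
A rough SQL sketch of what the primary table could look like on DB2 for i; all names are invented, and the index on the timestamp column is what makes the daily "new in the last 24 hours" question cheap to answer:

CREATE TABLE ARCLIB.INVOICES (
  CUSTNO   CHAR(6)        NOT NULL,
  INVYEAR  SMALLINT       NOT NULL,
  INVNO    DECIMAL(10,0)  NOT NULL,
  PDF      BLOB(16M)      NOT NULL,
  CREATED  TIMESTAMP      NOT NULL DEFAULT CURRENT TIMESTAMP,
  PRIMARY KEY (CUSTNO, INVYEAR, INVNO)
);

CREATE INDEX ARCLIB.INVOICES_BYDATE ON ARCLIB.INVOICES (CREATED);

-- the daily job becomes an indexed query instead of a directory scan
SELECT CUSTNO, INVYEAR, INVNO
  FROM ARCLIB.INVOICES
  WHERE CREATED > CURRENT TIMESTAMP - 24 HOURS;

Loading the BLOB column from the existing stream files would be a one-time conversion, for example with embedded SQL and a file reference variable.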

Saves could also benefit, especially SAVSECDTA and SAV saves. Security data is greatly reduced. And saving a 4GB table is faster than saving a thousand 4MB objects (or whatever the breakdown might be).

Other than determining how the original setup and implementation would go in your environment, the big tricky part could involve volatility. If these are stable objects with relatively few changes and few deletions, it should be okay. But if BLOBs are often modified, it can bring trouble once the table takes up a significant fraction of DASD capacity. It gets particularly rough when the table exceeds the size of DASD free space and a reorganization is needed. With low volatility, that's much less of a concern.
