Question

I need to produce statistics for files that are stored on a Linux network share and would like to be able to run a shell script or program locally on the network share to produce data points with the following attributes:

path (or relativepath) | filename | filesize | datecreated | datechanged | dateaccessed

There are roughly 1–2 million files (8TB) and I want to explore the dataset to get a grasp of the organization and balance of the file types (as determined by a combination of file name and path) in relation to the total number of files and total amount of storage.

Questions:

  1. What is an efficient way to traverse the file system and get this data?

  2. What kind of database would you recommend to explore this kind of data with statistics at different levels in the hierarchy?

The solution

This is what I ended up using to solve the problem:

  1. The Linux commands find and fstat were used to generate the dataset as a plain text file (see the traversal sketch after this list).
  2. Python's pandas and exifread libraries were used to enrich and analyze the dataset (see the pandas sketch after this list).
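A minimal sketch of the traversal step, assuming GNU find and a hypothetical mount point /mnt/share and output file files.txt. Using find's -printf avoids spawning a separate stat process per file, which matters with 1-2 million files. Note that most Linux filesystems do not expose a true creation time through find; %C@ below is the inode change time, not a birth time.

    # Emit one pipe-delimited record per file:
    # dirpath|filename|size_bytes|changetime|modtime|accesstime (epoch seconds)
    # %C@ is inode change time; a real creation time is usually not available.
    find /mnt/share -type f -printf '%h|%f|%s|%C@|%T@|%A@\n' > files.txt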
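And a minimal pandas sketch of the exploration step, assuming the files.txt layout produced above; the column names and the /mnt/share prefix are assumptions, and the exifread enrichment from step 2 is not shown. It aggregates file counts and total bytes per extension and per top-level directory, which answers the "different levels in the hierarchy" part of the question.

    import pandas as pd

    # Column names match the assumed find -printf layout above.
    cols = ["dirpath", "filename", "size", "ctime", "mtime", "atime"]
    # Caveat: a filename containing "|" would break this simple delimiter.
    df = pd.read_csv("files.txt", sep="|", names=cols)

    # File type from the extension (files without a dot keep their full name here)
    # and the top-level directory under the scan root as one hierarchy level.
    df["ext"] = df["filename"].str.rsplit(".", n=1).str[-1].str.lower()
    df["rel"] = df["dirpath"].str.replace(r"^/mnt/share/?", "", regex=True)
    df["top_dir"] = df["rel"].str.split("/").str[0]

    # Count of files and total bytes per extension and per top-level directory.
    by_ext = df.groupby("ext")["size"].agg(["count", "sum"])
    by_dir = df.groupby("top_dir")["size"].agg(["count", "sum"])

    print(by_ext.sort_values("sum", ascending=False).head(20))
    print(by_dir.sort_values("sum", ascending=False).head(20))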