Question

Hi, I'm building an ML pipeline with PyTorch to support various tasks and am looking for advice on efficient ways to store processed data.

The main framework has three layers: [data prep] -> [data loading] -> [training/inference]. The data prep module is responsible for taking raw data (medical data in this case) and storing it in an organized, efficient way to be handled subsequently by the dataloaders. Data prep is ideally done only once per dataset, while data loading/training may be done many times.

The main insights I'm looking for are:

  • Video files: is storing them as raw pixels (uint8) in .npy files an acceptable method, or am I missing out on the optimizations of standard video formats (mp4/avi, ...)? For context, most videos are around 100-200 frames.
  • Segmentations: a segmentation can be a contour of 20-40 x/y points. My plan is to save all the contours to a single JSON file that can be loaded directly into RAM in the constructor of a PyTorch Dataset. Every call to __getitem__ would then take the next contour and call something like skimage.draw.polygon to convert it into a binary mask (see the sketch after this list). Is loading all the segmentations in the constructor naive? Should I store them as .npy files instead, or store them as binary masks from the start and load them individually on each call to __getitem__?
  • Images: I think just using PNG and loading during calls to __getitem__ is fine.
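
To make the segmentation and image bullets concrete, here is a minimal sketch of a Dataset that loads all contours from one JSON file in its constructor and rasterizes a mask on each __getitem__ call. The JSON layout, file names, and mask size are assumptions for illustration only.

```python
import json
import numpy as np
import torch
from PIL import Image
from skimage.draw import polygon
from torch.utils.data import Dataset

class ContourSegDataset(Dataset):
    def __init__(self, json_path, image_dir, mask_shape=(256, 256)):
        # Load every contour into RAM once; 20-40 (x, y) points per contour
        # is tiny, so this stays cheap even for large datasets.
        with open(json_path) as f:
            # assumed layout: list of {"image": "case001.png", "contour": [[x, y], ...]}
            self.items = json.load(f)
        self.image_dir = image_dir
        self.mask_shape = mask_shape

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        image = np.array(Image.open(f"{self.image_dir}/{item['image']}"))

        # Rasterize the contour into a binary mask on the fly.
        contour = np.asarray(item["contour"], dtype=float)  # (N, 2) as (x, y)
        rr, cc = polygon(contour[:, 1], contour[:, 0], shape=self.mask_shape)
        mask = np.zeros(self.mask_shape, dtype=np.uint8)
        mask[rr, cc] = 1

        return torch.from_numpy(image), torch.from_numpy(mask)
```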

Any insights on these would be appreciated; it's difficult to find agreed-upon best practices for these kinds of things on Google.

Thanks


Solution

Since you don't have a clear criterion for efficiency, it's hard to name a single best practice here. The data formats you are using are all well supported by PyTorch's data loading module, so they should be fine.

If you benchmark the performance and run into data loading issues, e.g. with video files or a large number of image files, the following I/O considerations are worth looking at (a quick benchmarking sketch follows the list):

  • Data size: scaling up to datasets that exceed the capacity of local disk storage, requiring distributed storage systems and efficient network access.
  • Number of files: a large number of files with uniformly random access patterns.
  • Data rates: the massively parallel I/O needed to keep GPU-based training jobs fed.
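
A rough way to check whether data loading is actually the bottleneck is to time the DataLoader on its own, independent of the model. The placeholder dataset, batch size, and worker count below are arbitrary and only for illustration; swap in your real Dataset.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset (1,000 fake uint8 images); replace with your own.
dataset = TensorDataset(torch.zeros(1_000, 3, 256, 256, dtype=torch.uint8))

loader = DataLoader(dataset, batch_size=16, num_workers=4, pin_memory=True)

start = time.time()
n_samples = 0
for (batch,) in loader:
    n_samples += batch.shape[0]
elapsed = time.time() - start
print(f"{n_samples / elapsed:.1f} samples/sec from the loader alone")
```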

To address the potential issues above, the WebDataset library (which integrates with PyTorch) is worth a look. It is designed for situations where the dataset is too large to fit in memory for training. You can still use the data formats supported by PyTorch (png, uint8, etc.) and load the data in batches with torch.utils.data.DataLoader; the only extra step is packing the samples into tar archives. So this may be a supplementary approach worth trying in your use case.
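
As a rough sketch of what that looks like: the shard names, key extensions, and pipeline stages below are assumptions about how you might pack each sample, with the image and its metadata sharing a basename inside the tar (e.g. 0001.png and 0001.json).

```python
import webdataset as wds
from torch.utils.data import DataLoader

dataset = (
    wds.WebDataset("shards/train-{000000..000009}.tar")
    .shuffle(1000)                  # shuffle within an in-memory buffer
    .decode("torchrgb")             # decode image bytes into torch tensors
    .to_tuple("png", "json")        # (image, metadata) per sample
    .batched(16)                    # collate samples inside the pipeline
)

# batch_size=None because batching already happens in the pipeline above.
loader = DataLoader(dataset, batch_size=None, num_workers=4)

for images, meta in loader:
    ...  # training / inference step
```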
