Question

[Image in the original question: the code defining trainGenerator and building the dataset passed to model.fit]

So trainfiles is a list of the files' paths, e.g. ['../train/1.npy', '../train/2.npy']. I create a dataset from it, as shown in the middle of the code, and then pass that dataset to model.fit.

Just adding shuffle=True to the model.fit arguments doesn't seem to do anything. How do I go about shuffling trainfiles correctly? That is, when does the trainGenerator function get executed? Does it get executed 516 times when generating one batch?

Can I just add random.shuffle(trainfiles) right above the for loop in trainGenerator?


Solution

The big issue here is that your generator yields after each file is loaded. This means that your batch size is always the number of examples stored in each file and that the training examples in each batch are always the same.

You can absolutely shuffle the order of the files, but that's just shuffling the batch order around, not shuffling data examples among batches, which is very important for the optimizer. (Each batch yields an estimate of the gradient of the loss function with respect to the model parameters given that batch's data. The more combinations of examples you're able to give the network, the better you're able to estimate the gradient with respect to the training set.) There are lots of ways to fix this; the TensorFlow docs have tutorials for some of them. (You could load a random set of files and deliver batches that contain their data intermixed, then load more, and so on. A sketch of one such approach follows.)
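If you keep the generator, here's a minimal sketch of that idea (not your exact code, since only the description of trainGenerator is visible in the question, and I'm assuming each file holds 40 examples of shape (40, 15) with no labels): keep yielding one file's worth of data per step, but let tf.data split the files back into single examples, shuffle them through a buffer, and re-batch.

import numpy as np
import tensorflow as tf

def train_generator():
    files = list(trainfiles)          # trainfiles as defined in the question
    np.random.shuffle(files)          # this alone only shuffles file (i.e. batch) order
    for path in files:
        # one whole file per yield, as in the original setup
        yield np.load(path).astype(np.float32)

dataset = (
    tf.data.Dataset.from_generator(
        train_generator,
        output_signature=tf.TensorSpec(shape=(None, 40, 15), dtype=tf.float32),
    )
    .unbatch()                        # back to individual (40, 15) examples
    .shuffle(buffer_size=10_000)      # mixes examples across neighbouring files
    .batch(64)                        # batch size is now your choice, not the file size
    .prefetch(tf.data.AUTOTUNE)
)

The shuffle buffer only mixes what it can hold, so the larger you can afford to make buffer_size, the closer you get to a true shuffle of the whole training set. If your files also carry labels, the generator would need to yield those alongside the features; that part isn't visible in the question.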

Also, you have 75,000 files that appear to be (40, 40, 15). That means the whole dataset is (3,000,000, 40, 15) if all the files are the same size, which is ~14.4 GB if you're storing the data as np.float64 (3,000,000 × 40 × 15 elements × 8 bytes each). That's big, but not ridiculous. You could probably get a significant chunk of that in memory at once, which is probably better.

Is this float data? If so, do you need that level of precision (np.float64, the default)? I'm guessing probably not. If you can store it as np.float32 or even np.float16, you might be able to just get the whole thing in memory. If not, you could load it as bigger chunks (half the dataset at once, maybe) and intermix examples among them. Just something to think about.
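As a rough sketch of that saving (assuming the shapes above, that the files were indeed saved as float64, and using the list-then-concatenate pattern recommended in the edit below), casting each file to np.float32 as it's loaded halves the footprint to roughly 7.2 GB:

import numpy as np

# Downcast each file as soon as it is read so full-precision copies never pile up.
chunks = [np.load(path).astype(np.float32) for path in trainfiles]
all_data = np.concatenate(chunks, axis=0)

print(all_data.dtype, all_data.nbytes / 1e9, "GB")  # float32, ~7.2 GB for the sizes above

np.float16 would quarter the original footprint instead, at the cost of noticeably less precision.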

EDIT:

Also, if you're going to concatenate them all, don't do it in a loop that iteratively concatenates the smaller arrays onto one big array:

# NO: each iteration allocates a brand-new copy of the ever-growing array

big_array = np.load("first_train_file.npy")

for train_file in rest_of_train_files:
    tmp_file = np.load(train_file)
    big_array = np.concatenate((big_array, tmp_file), axis=0)

# YES: collect the arrays in a list and concatenate once at the end

list_of_little_arrays = []

for train_file in all_train_files:
    tmp_file = np.load(train_file)
    list_of_little_arrays.append(tmp_file)

all_data = np.concatenate(list_of_little_arrays)

You're going to have loads of memory problems using the first way. I don't know whether that's what you tried, but just in case.

Licensed under: CC-BY-SA with attribution