Is it possible to train stylegan2 with a custom dataset using a graphics card that only has 6GB of VRAM (GeForce GTX 1660)?
11-12-2020
Question
I'm attempting to train stylegan2 using a custom dataset, but no matter what settings I use I see the same error:
2020-05-22 11:15:05.261933: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2020-05-22 11:15:05.339186: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 3.52G (3781073152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I'm assuming this means I need more GPU memory, but I've read that you can lower memory use in exchange for longer training times. I did have to downgrade from TensorFlow 2 to 1.15 to use this project, so there could be some underlying configuration issue, but I am able to generate images from the pretrained models without any issues.
This is how I'm running the training process:
python run_training.py --num-gpus=1 --data-dir=datasets --config=config-e --dataset=customdata --mirror-augment=true
I've tried using the other config-x options and adjusting the settings in both run_training.py and training/training_loop.py, although more specifically I'm just trying different values for sched.minibatch_size_base and sched.minibatch_gpu_base. Checking the results folder does tell me that the settings I've changed in run_training.py are actually used during the training process.
Here's the complete log from run_training.py, if it's useful:
Local submit - run_dir: results\00021-stylegan2-customdata-1gpu-config-e
dnnlib: Running training.training_loop.training_loop() on localhost...
2020-05-22 13:02:45.261043: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-05-22 13:02:51.127997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-05-22 13:02:51.169757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 major: 7 minor: 5 memoryClockRate(GHz): 1.785
pciBusID: 0000:01:00.0
2020-05-22 13:02:51.176966: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-05-22 13:02:51.187788: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-05-22 13:02:51.197589: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-05-22 13:02:51.205389: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-05-22 13:02:51.216122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-05-22 13:02:51.225483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-05-22 13:02:51.244887: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-22 13:02:51.253430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-05-22 13:02:51.966561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-22 13:02:51.971731: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-05-22 13:02:51.974966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-05-22 13:02:51.979741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4630 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660, pci bus id: 0000:01:00.0, compute capability: 7.5)
Streaming data using training.dataset.TFRecordDataset...
self.tfrecord_dir: datasets\customdata
Dataset shape = [3, 64, 64]
Dynamic range = [0, 255]
Label size = 0
Constructing networks...
Setting up TensorFlow plugin "fused_bias_act.cu": Preprocessing... 2020-05-22 13:03:25.588173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 major: 7 minor: 5 memoryClockRate(GHz): 1.785
pciBusID: 0000:01:00.0
2020-05-22 13:03:25.596152: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-05-22 13:03:25.600627: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-05-22 13:03:25.605487: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-05-22 13:03:25.610555: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-05-22 13:03:25.618346: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-05-22 13:03:25.622514: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-05-22 13:03:25.626790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-22 13:03:25.632722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-05-22 13:03:25.638261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-22 13:03:25.642363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-05-22 13:03:25.645684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-05-22 13:03:25.649560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 4630 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660, pci bus id: 0000:01:00.0, compute capability: 7.5)
Loading... Done.
Setting up TensorFlow plugin "upfirdn_2d.cu": Preprocessing... 2020-05-22 13:03:50.302225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 major: 7 minor: 5 memoryClockRate(GHz): 1.785
pciBusID: 0000:01:00.0
2020-05-22 13:03:50.310782: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-05-22 13:03:50.316161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-05-22 13:03:50.395110: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-05-22 13:03:50.463435: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-05-22 13:03:50.468677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-05-22 13:03:50.527377: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-05-22 13:03:50.531735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-22 13:03:50.537159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-05-22 13:03:50.615931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-22 13:03:50.679408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-05-22 13:03:50.682438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-05-22 13:03:50.686257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 4630 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660, pci bus id: 0000:01:00.0, compute capability: 7.5)
Loading... Done.
G Params OutputShape WeightShape
--- --- --- ---
latents_in - (?, 512) -
labels_in - (?, 0) -
lod - () -
dlatent_avg - (512,) -
G_mapping/latents_in - (?, 512) -
G_mapping/labels_in - (?, 0) -
G_mapping/Normalize - (?, 512) -
G_mapping/Dense0 262656 (?, 512) (512, 512)
G_mapping/Dense1 262656 (?, 512) (512, 512)
G_mapping/Dense2 262656 (?, 512) (512, 512)
G_mapping/Dense3 262656 (?, 512) (512, 512)
G_mapping/Dense4 262656 (?, 512) (512, 512)
G_mapping/Dense5 262656 (?, 512) (512, 512)
G_mapping/Dense6 262656 (?, 512) (512, 512)
G_mapping/Dense7 262656 (?, 512) (512, 512)
G_mapping/Broadcast - (?, 10, 512) -
G_mapping/dlatents_out - (?, 10, 512) -
Truncation/Lerp - (?, 10, 512) -
G_synthesis/dlatents_in - (?, 10, 512) -
G_synthesis/4x4/Const 8192 (?, 512, 4, 4) (1, 512, 4, 4)
G_synthesis/4x4/Conv 2622465 (?, 512, 4, 4) (3, 3, 512, 512)
G_synthesis/4x4/ToRGB 264195 (?, 3, 4, 4) (1, 1, 512, 3)
G_synthesis/8x8/Conv0_up 2622465 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/8x8/Conv1 2622465 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/8x8/Upsample - (?, 3, 8, 8) -
G_synthesis/8x8/ToRGB 264195 (?, 3, 8, 8) (1, 1, 512, 3)
G_synthesis/16x16/Conv0_up 2622465 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/16x16/Conv1 2622465 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/16x16/Upsample - (?, 3, 16, 16) -
G_synthesis/16x16/ToRGB 264195 (?, 3, 16, 16) (1, 1, 512, 3)
G_synthesis/32x32/Conv0_up 2622465 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/32x32/Conv1 2622465 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/32x32/Upsample - (?, 3, 32, 32) -
G_synthesis/32x32/ToRGB 264195 (?, 3, 32, 32) (1, 1, 512, 3)
G_synthesis/64x64/Conv0_up 1442561 (?, 256, 64, 64) (3, 3, 512, 256)
G_synthesis/64x64/Conv1 721409 (?, 256, 64, 64) (3, 3, 256, 256)
G_synthesis/64x64/Upsample - (?, 3, 64, 64) -
G_synthesis/64x64/ToRGB 132099 (?, 3, 64, 64) (1, 1, 256, 3)
G_synthesis/images_out - (?, 3, 64, 64) -
G_synthesis/noise0 - (1, 1, 4, 4) -
G_synthesis/noise1 - (1, 1, 8, 8) -
G_synthesis/noise2 - (1, 1, 8, 8) -
G_synthesis/noise3 - (1, 1, 16, 16) -
G_synthesis/noise4 - (1, 1, 16, 16) -
G_synthesis/noise5 - (1, 1, 32, 32) -
G_synthesis/noise6 - (1, 1, 32, 32) -
G_synthesis/noise7 - (1, 1, 64, 64) -
G_synthesis/noise8 - (1, 1, 64, 64) -
images_out - (?, 3, 64, 64) -
--- --- --- ---
Total 23819544
D Params OutputShape WeightShape
--- --- --- ---
images_in - (?, 3, 64, 64) -
labels_in - (?, 0) -
64x64/FromRGB 1024 (?, 256, 64, 64) (1, 1, 3, 256)
64x64/Conv0 590080 (?, 256, 64, 64) (3, 3, 256, 256)
64x64/Conv1_down 1180160 (?, 512, 32, 32) (3, 3, 256, 512)
64x64/Skip 131072 (?, 512, 32, 32) (1, 1, 256, 512)
32x32/Conv0 2359808 (?, 512, 32, 32) (3, 3, 512, 512)
32x32/Conv1_down 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
32x32/Skip 262144 (?, 512, 16, 16) (1, 1, 512, 512)
16x16/Conv0 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
16x16/Conv1_down 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
16x16/Skip 262144 (?, 512, 8, 8) (1, 1, 512, 512)
8x8/Conv0 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
8x8/Conv1_down 2359808 (?, 512, 4, 4) (3, 3, 512, 512)
8x8/Skip 262144 (?, 512, 4, 4) (1, 1, 512, 512)
4x4/MinibatchStddev - (?, 513, 4, 4) -
4x4/Conv 2364416 (?, 512, 4, 4) (3, 3, 513, 512)
4x4/Dense0 4194816 (?, 512) (8192, 512)
Output 513 (?, 1) (512, 1)
scores_out - (?, 1) -
--- --- --- ---
Total 23407361
2020-05-22 13:03:58.578847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-05-22 13:03:58.961664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-22 13:04:00.763442: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-05-22 13:04:01.548775: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2020-05-22 13:04:01.651217: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Building TensorFlow graph...
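For a sense of scale, the parameter totals printed in the log above can be converted to bytes. This is only a rough sketch: it assumes float32 weights (4 bytes each) and assumes an Adam-style optimizer that keeps two extra buffers per weight. The helper name weights_mib is made up for illustration.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2

g_params = 23819544  # "Total" row of the G table in the log
d_params = 23407361  # "Total" row of the D table in the log


def weights_mib(num_params, copies=1):
    """Hypothetical helper: size in MiB of `copies` float32 copies of the weights."""
    return num_params * BYTES_PER_FLOAT32 * copies / MIB


g_mib = weights_mib(g_params)
d_mib = weights_mib(d_params)

# Assumption: weights plus two optimizer buffers per weight, i.e. 3 copies.
with_optimizer_mib = weights_mib(g_params + d_params, copies=3)

print("G weights: %.1f MiB" % g_mib)
print("D weights: %.1f MiB" % d_mib)
print("G + D with optimizer state: %.1f MiB" % with_optimizer_mib)
```

Under these assumptions the weights come to roughly 90 MiB per network and a few hundred MiB with optimizer state, well below the 4630 MB TensorFlow reports for the device. That suggests most of the memory goes elsewhere (activations, gradients, workspace), although this estimate does not measure that directly.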
Here's the contents of submit_config.txt written to the results folder for the above job:
{ 'datasets': [],
'host_name': 'localhost',
'local': <dnnlib.submission.internal.local.TargetOptions object at 0x0000027CF0D20D48>,
'num_gpus': 1,
'nvprof': False,
'platform_extras': <dnnlib.submission.submit.PlatformExtras object at 0x0000027CF0D20E08>,
'print_info': False,
'run_desc': 'stylegan2-customdata-1gpu-config-e',
'run_dir': 'results\\00021-stylegan2-customdata-1gpu-config-e',
'run_dir_extra_files': [],
'run_dir_ignore': ['__pycache__', '*.pyproj', '*.sln', '*.suo', '.cache', '.idea', '.vs', '.vscode', '_cudacache'],
'run_dir_root': 'results',
'run_func_kwargs': { 'D_args': {'fmap_base': 8192, 'func_name': 'training.networks_stylegan2.D_stylegan2'},
'D_loss_args': {'func_name': 'training.loss.D_logistic_r1', 'gamma': 100},
'D_opt_args': {'beta1': 0.0, 'beta2': 0.99, 'epsilon': 1e-08},
'G_args': {'fmap_base': 8192, 'func_name': 'training.networks_stylegan2.G_main'},
'G_loss_args': {'func_name': 'training.loss.G_logistic_ns_pathreg'},
'G_opt_args': {'beta1': 0.0, 'beta2': 0.99, 'epsilon': 1e-08},
'data_dir': 'datasets',
'dataset_args': {'tfrecord_dir': 'customdata'},
'grid_args': {'layout': 'random', 'size': '8k'},
'image_snapshot_ticks': 10,
'metric_arg_list': [{'func_name': 'metrics.frechet_inception_distance.FID', 'minibatch_per_gpu': 8, 'name': 'fid50k', 'num_images': 50000}],
'mirror_augment': True,
'network_snapshot_ticks': 10,
'sched_args': {'D_lrate_base': 0.002, 'G_lrate_base': 0.002, 'minibatch_gpu_base': 1, 'minibatch_size_base': 8},
'tf_config': {'rnd.np_random_seed': 1000},
'total_kimg': 25000},
'run_func_name': 'training.training_loop.training_loop',
'run_id': 21,
'run_name': '00021-stylegan2-customdata-1gpu-config-e',
'submit_target': <SubmitTarget.LOCAL: 1>,
'task_name': 'itsame-00021-stylegan2-customdata-1gpu-config-e',
'user_name': 'itsame'}
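The sched_args entry in the dump above shows minibatch_size_base set to 8 and minibatch_gpu_base set to 1. A minimal sketch of how the two might relate, assuming the training loop splits the global minibatch into per-GPU chunks and accumulates gradients across them (this is an assumption about the project's behaviour, and accumulation_rounds is a hypothetical helper, not a function from the repository):

```python
def accumulation_rounds(minibatch_size, minibatch_gpu, num_gpus=1):
    """Hypothetical helper: how many forward/backward passes are needed
    to cover one global minibatch when each GPU handles `minibatch_gpu`
    samples at a time."""
    per_pass = minibatch_gpu * num_gpus
    if minibatch_size % per_pass != 0:
        raise ValueError("minibatch_size must be a multiple of minibatch_gpu * num_gpus")
    return minibatch_size // per_pass


# Values taken from sched_args in submit_config.txt, with num_gpus=1.
rounds = accumulation_rounds(minibatch_size=8, minibatch_gpu=1, num_gpus=1)
print(rounds)
```

If that assumption holds, only minibatch_gpu_base affects how many samples sit on the GPU at once, so with a value of 1 the per-pass batch cannot be made any smaller and the remaining memory pressure would have to come from somewhere else.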
I've trained other models with the same hardware, but I'm guessing stylegan2 requires a bit more space to work. Thanks for reading!
EDIT:
I've added some code to tfutil.py and now I have a different error! According to the web, I may need to downgrade CUDA.
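import tensorflow as tf  # TensorFlow 1.15

# Limit TensorFlow to about a third of the GPU memory and let the allocation grow on demand.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333, allow_growth=True)
graph_options = tf.GraphOptions(place_pruned_graph=True)
config_proto = tf.ConfigProto(gpu_options=gpu_options, graph_options=graph_options)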
The error is now:
tensorflow.python.framework.errors_impl.InternalError: cudaErrorInvalidConfiguration
[[node GPU0/G_loss/PathReg/G/G_synthesis/8x8/Upsample/UpFirDn2D (defined at C:\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
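As a rough sketch of what per_process_gpu_memory_fraction=0.333 works out to on this card, assuming the option acts as an upper bound expressed as a fraction of the GPU's total memory and taking the 6 GB figure from the title at face value (memory_cap_mib is a hypothetical helper):

```python
TOTAL_VRAM_GIB = 6.0      # advertised capacity of the GTX 1660, from the title
MEMORY_FRACTION = 0.333   # value passed to tf.GPUOptions above
TF_REPORTED_MB = 4630     # "with 4630 MB memory" in the earlier log


def memory_cap_mib(total_gib, fraction):
    """Hypothetical helper: approximate memory cap in MiB for a given fraction."""
    return total_gib * 1024 * fraction


cap_mib = memory_cap_mib(TOTAL_VRAM_GIB, MEMORY_FRACTION)
print("approximate cap: %.0f MiB" % cap_mib)
print("below the uncapped figure:", cap_mib < TF_REPORTED_MB)
```

Under that assumption the cap is about 2 GiB, less than half of what TensorFlow reported for the device before the option was added, so this setting restricts memory rather than freeing any.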
EDIT 5/23/2020:
The above error seemed to go away on its own after reducing the batch size and using a much lower GPU memory fraction. I'm seeing this error now:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,3,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node TrainG/Apply0/grad_acc_var_38/Assign (defined at C:\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
I'm going to try and reduce the tensor size to 256x256. I have no idea how to do that or what it means, but most of what I've read about this error seems to suggest that.
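It may help to work out how large the tensor named in the error actually is. The shape [3,3,512,512] matches the WeightShape of the 3x3 convolution layers in the network tables from the log, and the node name contains grad_acc_var, so it looks like a weight-sized buffer rather than an image. A small sketch, assuming 4 bytes per element for type float (tensor_bytes is a hypothetical helper):

```python
from functools import reduce
import operator


def tensor_bytes(shape, bytes_per_element=4):
    """Hypothetical helper: bytes needed for a dense tensor of the given shape."""
    return reduce(operator.mul, shape, 1) * bytes_per_element


oom_shape = [3, 3, 512, 512]          # shape from the ResourceExhaustedError
oom_bytes = tensor_bytes(oom_shape)   # float32, so 4 bytes per element
oom_mib = oom_bytes / 1024 ** 2

print(oom_bytes, "bytes")
print(oom_mib, "MiB")
```

The allocation that failed is only about 9 MiB, which suggests the memory available to TensorFlow was already almost entirely in use when this buffer was requested. The dataset in the log is also already 64x64, so the image resolution is unlikely to be what this particular shape refers to.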
Solution
According to the GitHub README, the requirements are:
One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit and cuDNN 7.5. To reproduce the results reported in the paper, you need an NVIDIA GPU with at least 16 GB of DRAM.