DLC issue to train the model from merged mp4 videos #2073
Hi,
I would suggest sending your video to the DLC team to see whether they can tell you more about why the training is failing.
I found that re-encoding my video to mp4 with a different codec solved the issue. Does this outcome make sense to you?
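For anyone who wants to try the same re-encoding step, here is a minimal sketch that builds an ffmpeg command from Python. The thread does not say which codec or tool was actually used, so `libx264`, `yuv420p`, the file names, and the helper `build_reencode_command` are all assumptions made for illustration.

```python
import shlex


def build_reencode_command(src, dst, codec="libx264", pix_fmt="yuv420p"):
    """Build an ffmpeg argument list that re-encodes the video stream.

    Assumption: libx264 / yuv420p are placeholder choices; the codec that
    fixed the issue in this thread is not stated.
    """
    return ["ffmpeg", "-i", src, "-c:v", codec, "-pix_fmt", pix_fmt, dst]


# Hypothetical file names, for illustration only.
cmd = build_reencode_command("merged.mp4", "merged_h264.mp4")

# Print the command so it can be reviewed before running it,
# e.g. with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Printing the command first makes it easy to check the arguments before touching any video files.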
I actually found that the problem is not the format but the dimensions of the image (2020.2052). Is it possible that training fails when the frames are too large?
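One possible explanation can be checked with a little arithmetic. The sketch below reads "2020.2052" as 2020 by 2052 pixels and assumes that the training data loader skips any frame whose scaled area exceeds `max_input_size` squared. Both points are assumptions rather than something confirmed in this thread. The values `global_scale = 0.8` and `max_input_size = 1500` come from the config in the job log, and the function names are hypothetical.

```python
def exceeds_max_input_size(width, height, global_scale=0.8, max_input_size=1500):
    """Return True if the scaled frame area is larger than max_input_size squared.

    Assumption: the loader rejects a frame when
    (width * global_scale) * (height * global_scale) > max_input_size ** 2.
    """
    scaled_area = (width * global_scale) * (height * global_scale)
    return scaled_area > max_input_size ** 2


def largest_passing_scale(width, height, max_input_size=1500):
    """Largest global_scale at which the frame would still pass the check above."""
    return max_input_size / (width * height) ** 0.5


# Frame size as reported in this thread, read as width x height in pixels.
print(exceeds_max_input_size(2020, 2052))
print(round(largest_passing_scale(2020, 2052), 3))
```

If that assumption holds, every frame of a 2020 by 2052 video would be rejected at `global_scale = 0.8`, which would be consistent with a job that stalls right after startup. Downscaling the videos, lowering `global_scale`, or raising `max_input_size` would be the settings to experiment with.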
Hello,
I'm unable to train a single-animal model in DLC. My videos are the merged output of 2 mp4 videos, which I save as mp4 directly in Bonsai.
Regarding the installation:
DLC is running from a container installed on a Linux machine.
Bonsai is running on a Windows machine.
I used this pipeline successfully for 2 years, until a few months ago. Now that I'm resuming experiments, I'm facing this issue, which appeared out of nowhere.
The DLC pipeline itself still works: I can train a model from non-merged (raw) videos, or from old videos that were merged with Bonsai around 1 year ago.
So my guess is that the problem comes from the merging process; somehow the resulting mp4 causes an issue at the training stage. Did something change in the Bonsai code, in a node, or elsewhere that could explain this?
Here is the Bonsai pipeline I use to merge and save as mp4:
And here is my DLC job log, which gets stuck when the training starts:
INFO: Environment variable SINGULARITY_BINDPATH is set, but APPTAINER_BINDPATH is preferred
INFO: Using cached SIF image
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
ERROR: ld.so: object 'libdlfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libvglfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libdlfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libvglfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
2024-11-13 17:39:38.678415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Config:
{'all_joints': [[0], [1], [2], [3], [4], [5], [6], [7]],
'all_joints_names': ['Ear_left',
'Ear_right',
'Nose',
'Center',
'Lateral_left',
'Lateral_right',
'Tail_base',
'Tail_end'],
'alpha_r': 0.02,
'apply_prob': 0.5,
'batch_size': 1,
'clahe': True,
'claheratio': 0.1,
'crop_pad': 0,
'crop_sampling': 'hybrid',
'crop_size': [400, 400],
'cropratio': 0.4,
'dataset': 'training-datasets/iteration-0/UnaugmentedDataSet_CalciumNov12/Calcium_D.Battivelli95shuffle1.mat',
'dataset_type': 'default',
'decay_steps': 30000,
'deterministic': False,
'display_iters': 1000,
'edge': False,
'emboss': {'alpha': [0.0, 1.0], 'embossratio': 0.1, 'strength': [0.5, 1.5]},
'fg_fraction': 0.25,
'global_scale': 0.8,
'histeq': True,
'histeqratio': 0.1,
'init_weights': '/usr/local/lib/python3.8/dist-packages/deeplabcut/pose_estimation_tensorflow/models/pretrained/resnet_v1_50.ckpt',
'intermediate_supervision': False,
'intermediate_supervision_layer': 12,
'location_refinement': True,
'locref_huber_loss': True,
'locref_loss_weight': 0.05,
'locref_stdev': 7.2801,
'log_dir': 'log',
'lr_init': 0.0005,
'max_input_size': 1500,
'max_shift': 0.4,
'mean_pixel': [123.68, 116.779, 103.939],
'metadataset': 'training-datasets/iteration-0/UnaugmentedDataSet_CalciumNov12/Documentation_data-Calcium_95shuffle1.pickle',
'min_input_size': 64,
'mirror': False,
'multi_stage': False,
'multi_step': [[0.005, 10000],
[0.02, 430000],
[0.002, 730000],
[0.001, 1030000]],
'net_type': 'resnet_50',
'num_joints': 8,
'optimizer': 'sgd',
'pairwise_huber_loss': False,
'pairwise_predict': False,
'partaffinityfield_predict': False,
'pos_dist_thresh': 17,
'pre_resize': [],
'project_path': '/g/gross/Dorian2/Calcium-D.Battivelli-2024-11-12',
'regularize': False,
'rotation': 25,
'rotratio': 0.4,
'save_iters': 50000,
'scale_jitter_lo': 0.5,
'scale_jitter_up': 1.25,
'scoremap_dir': 'test',
'sharpen': False,
'sharpenratio': 0.3,
'shuffle': True,
'snapshot_prefix': '/g/gross/Dorian2/Calcium-D.Battivelli-2024-11-12/dlc-models/iteration-0/CalciumNov12-trainset95shuffle1/train/snapshot',
'stride': 8.0,
'weigh_negatives': False,
'weigh_only_present_joints': False,
'weigh_part_predictions': False,
'weight_decay': 0.0001}
2024-11-13 17:39:44.429770: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2024-11-13 17:39:44.658470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:c1:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.50GiB deviceMemoryBandwidth: 1.41TiB/s
2024-11-13 17:39:44.658499: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2024-11-13 17:39:44.691604: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2024-11-13 17:39:44.691649: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2024-11-13 17:39:44.709463: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2024-11-13 17:39:44.715759: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2024-11-13 17:39:44.720959: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2024-11-13 17:39:44.731939: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2024-11-13 17:39:44.733196: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2024-11-13 17:39:44.737584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2024-11-13 17:39:44.738419: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-13 17:39:44.743744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:c1:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.50GiB deviceMemoryBandwidth: 1.41TiB/s
2024-11-13 17:39:44.747657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2024-11-13 17:39:44.749012: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2024-11-13 17:39:46.663742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-11-13 17:39:46.663785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2024-11-13 17:39:46.663792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2024-11-13 17:39:46.669253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38335 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0)
2024-11-13 17:39:46.970603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:c1:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.50GiB deviceMemoryBandwidth: 1.41TiB/s
2024-11-13 17:39:46.973938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2024-11-13 17:39:46.973961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-11-13 17:39:46.973966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2024-11-13 17:39:46.973970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2024-11-13 17:39:46.977293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38335 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0)
2024-11-13 17:40:20.267483: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000060000 Hz
Thank you for helping,
Best
Dorian