DLC issue to train the model from merged mp4 videos #2073

DorianBattivelli · 2024-12-06T11:14:18Z

Hello,

I'm unable to train a single-animal model in DLC. My videos are produced by merging two mp4 videos, and I save the result as mp4 directly from Bonsai.
Regarding the installation:

DLC runs from a container installed on a Linux machine.
Bonsai runs on a Windows machine.

I used this pipeline successfully for 2 years, until a few months ago. Now that I'm resuming experiments, I'm facing this issue out of nowhere.
My DLC pipeline itself still works, since I can train a model from non-merged (raw) videos, or from old videos that were merged with Bonsai around 1 year ago.
So my guess is that the problem comes from the merging process; somehow the resulting mp4 causes an issue at the training stage. Did something change in the Bonsai code, a node, or elsewhere that could explain this?

Here is the Bonsai pipeline I use to merge and save as mp4:

And here is my DLC job log, which gets stuck when training starts:

INFO: Environment variable SINGULARITY_BINDPATH is set, but APPTAINER_BINDPATH is preferred
INFO: Using cached SIF image
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
ERROR: ld.so: object 'libdlfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libvglfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libdlfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object 'libvglfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
2024-11-13 17:39:38.678415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Config:
{'all_joints': [[0], [1], [2], [3], [4], [5], [6], [7]],
'all_joints_names': ['Ear_left',
'Ear_right',
'Nose',
'Center',
'Lateral_left',
'Lateral_right',
'Tail_base',
'Tail_end'],
'alpha_r': 0.02,
'apply_prob': 0.5,
'batch_size': 1,
'clahe': True,
'claheratio': 0.1,
'crop_pad': 0,
'crop_sampling': 'hybrid',
'crop_size': [400, 400],
'cropratio': 0.4,
'dataset': 'training-datasets/iteration-0/UnaugmentedDataSet_CalciumNov12/Calcium_D.Battivelli95shuffle1.mat',
'dataset_type': 'default',
'decay_steps': 30000,
'deterministic': False,
'display_iters': 1000,
'edge': False,
'emboss': {'alpha': [0.0, 1.0], 'embossratio': 0.1, 'strength': [0.5, 1.5]},
'fg_fraction': 0.25,
'global_scale': 0.8,
'histeq': True,
'histeqratio': 0.1,
'init_weights': '/usr/local/lib/python3.8/dist-packages/deeplabcut/pose_estimation_tensorflow/models/pretrained/resnet_v1_50.ckpt',
'intermediate_supervision': False,
'intermediate_supervision_layer': 12,
'location_refinement': True,
'locref_huber_loss': True,
'locref_loss_weight': 0.05,
'locref_stdev': 7.2801,
'log_dir': 'log',
'lr_init': 0.0005,
'max_input_size': 1500,
'max_shift': 0.4,
'mean_pixel': [123.68, 116.779, 103.939],
'metadataset': 'training-datasets/iteration-0/UnaugmentedDataSet_CalciumNov12/Documentation_data-Calcium_95shuffle1.pickle',
'min_input_size': 64,
'mirror': False,
'multi_stage': False,
'multi_step': [[0.005, 10000],
[0.02, 430000],
[0.002, 730000],
[0.001, 1030000]],
'net_type': 'resnet_50',
'num_joints': 8,
'optimizer': 'sgd',
'pairwise_huber_loss': False,
'pairwise_predict': False,
'partaffinityfield_predict': False,
'pos_dist_thresh': 17,
'pre_resize': [],
'project_path': '/g/gross/Dorian2/Calcium-D.Battivelli-2024-11-12',
'regularize': False,
'rotation': 25,
'rotratio': 0.4,
'save_iters': 50000,
'scale_jitter_lo': 0.5,
'scale_jitter_up': 1.25,
'scoremap_dir': 'test',
'sharpen': False,
'sharpenratio': 0.3,
'shuffle': True,
'snapshot_prefix': '/g/gross/Dorian2/Calcium-D.Battivelli-2024-11-12/dlc-models/iteration-0/CalciumNov12-trainset95shuffle1/train/snapshot',
'stride': 8.0,
'weigh_negatives': False,
'weigh_only_present_joints': False,
'weigh_part_predictions': False,
'weight_decay': 0.0001}
2024-11-13 17:39:44.429770: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2024-11-13 17:39:44.658470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:c1:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.50GiB deviceMemoryBandwidth: 1.41TiB/s
2024-11-13 17:39:44.658499: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2024-11-13 17:39:44.691604: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2024-11-13 17:39:44.691649: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2024-11-13 17:39:44.709463: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2024-11-13 17:39:44.715759: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2024-11-13 17:39:44.720959: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2024-11-13 17:39:44.731939: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2024-11-13 17:39:44.733196: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2024-11-13 17:39:44.737584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2024-11-13 17:39:44.738419: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-13 17:39:44.743744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:c1:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.50GiB deviceMemoryBandwidth: 1.41TiB/s
2024-11-13 17:39:44.747657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2024-11-13 17:39:44.749012: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2024-11-13 17:39:46.663742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-11-13 17:39:46.663785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2024-11-13 17:39:46.663792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2024-11-13 17:39:46.669253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38335 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0)
2024-11-13 17:39:46.970603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:c1:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.50GiB deviceMemoryBandwidth: 1.41TiB/s
2024-11-13 17:39:46.973938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2024-11-13 17:39:46.973961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-11-13 17:39:46.973966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2024-11-13 17:39:46.973970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2024-11-13 17:39:46.977293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38335 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0)
2024-11-13 17:40:20.267483: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000060000 Hz

Thank you for helping,
Best
Dorian

bruno-f-cruz · 2024-12-06T12:04:52Z

Collaborator

Hi,
Please include the workflow as asked in #864.
Additionally, from your screenshot, it looks like Bonsai is really not managing much of the pipeline at all. There are a few points where a change could have happened:

Did your ffmpeg distribution change? Was it upgraded? Remember that you are calling ffmpeg as an external process; Bonsai does not manage this dependency.
Did you update your DLC package?
Did you add or remove any codecs on the computer that the DLC package could have been using?
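
One way to answer the questions above is to compare the stream properties of an old merged file that still trains with those of a new one that does not. The following is only a sketch: it assumes `ffprobe` (shipped with ffmpeg) is on the PATH, and the helper names `ffprobe_cmd`, `first_video_stream` and `probe` are hypothetical, not part of Bonsai or DLC.

```python
import json
import subprocess


def ffprobe_cmd(video_path):
    """Build an ffprobe call that reports the first video stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=codec_name,pix_fmt,width,height",
        "-of", "json",
        video_path,
    ]


def first_video_stream(ffprobe_json):
    """Return the first stream dict from ffprobe's JSON output, or {} if none."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return streams[0] if streams else {}


def probe(video_path):
    """Run ffprobe on one file (assumption: ffprobe is installed and on the PATH)."""
    result = subprocess.run(
        ffprobe_cmd(video_path), capture_output=True, text=True, check=True
    )
    return first_video_stream(result.stdout)
```

Calling `probe(...)` on both files and comparing the two dictionaries would show whether the codec, pixel format or frame size differs between them. Running `ffmpeg -version` on the machine that does the merging records which ffmpeg build produced each file.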

I would suggest sending your video to the DLC team to check whether they can give you more information on why the training is failing.
Thanks

1 reply

DorianBattivelli Dec 6, 2024
Author

Thanks for the support.

No, I did not change any codec or update ffmpeg. I'm using exactly the same pipeline that always worked in the past (including the DLC version).
The weirdest point is that the merged video is handled correctly for the frame extraction and labelling steps. The file only seems problematic for training.

Although Bonsai indeed manages only a small part of the pipeline, the fact that raw (non-merged) videos are handled correctly by DLC for training led me to conclude that the problem appears during the merging process.

I'm already in touch with the DLC team, but until now the problem is still unsolved. I like your suggestion to send them a video, though. I'll try this.

Best

DorianBattivelli · 2024-12-09T12:44:37Z

Author

I found that re-encoding my video to mp4 with another codec solved the issue:

Does this outcome make sense to you?
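
The exact conversion settings are not given in this post, so the following is only a sketch of what such a re-encode could look like when driven from Python. The choice of `libx264` with the `yuv420p` pixel format is an assumption, not the codec actually used here, and `reencode_cmd` and the file names are hypothetical.

```python
import subprocess


def reencode_cmd(src, dst, codec="libx264", pix_fmt="yuv420p"):
    """Build an ffmpeg call that re-encodes the video stream of src into dst.

    The default codec and pixel format are assumptions for illustration.
    """
    return ["ffmpeg", "-y", "-i", src, "-c:v", codec, "-pix_fmt", pix_fmt, dst]


def reencode(src, dst, **kwargs):
    """Run the re-encode (assumption: ffmpeg is installed and on the PATH)."""
    subprocess.run(reencode_cmd(src, dst, **kwargs), check=True)
```

For example, `reencode("merged.mp4", "merged_h264.mp4")` would write a new file and leave the original untouched, so both versions can be tried in DLC.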


DorianBattivelli · 2024-12-12T11:36:07Z

Author

I actually found that the problem is not the format but the dimensions of the image: 2020 x 2052.
My previous videos, on which I successfully ran training, were 2020 x 2034.
If I convert the new videos so that they are 1920 x 1080, it works.

Is it possible that training fails when the frames are too large?
I tried reducing the batch size to 1, but it does not solve anything.
Thank you
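
To put the three frame sizes side by side, here is a small sketch that compares their pixel counts and builds an ffmpeg resize command. It reads the sizes quoted above as width x height in pixels, which is an assumption, and `pixel_count`, `resize_cmd` and the labels in `SIZES` are hypothetical names. It only illustrates the conversion that worked; it does not explain why the larger frames fail.

```python
# Frame sizes quoted in this post, read as (width, height) in pixels (assumption).
SIZES = {
    "new merged": (2020, 2052),
    "old merged": (2020, 2034),
    "converted": (1920, 1080),
}


def pixel_count(width, height):
    """Number of pixels in one frame."""
    return width * height


def resize_cmd(src, dst, width=1920, height=1080):
    """Build an ffmpeg call that rescales every frame to width x height."""
    return ["ffmpeg", "-y", "-i", src, "-vf", f"scale={width}:{height}", dst]


if __name__ == "__main__":
    for label, (w, h) in SIZES.items():
        print(f"{label}: {w} x {h} = {pixel_count(w, h)} pixels")
```

Note that `scale=1920:1080` stretches a nearly square frame to 16:9. Passing `-2` for one dimension, as in `scale=-2:1080`, makes ffmpeg's scale filter keep the aspect ratio and round that dimension to an even number.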

