Add `prepare_data` method to generate and save protocol data on disk #1500

clement-pages · 2023-10-13T14:11:20Z

The goal of this pull request is to reduce the time needed to set up a task in pyannote by implementing a `prepare_data` method. `prepare_data` generates the data needed by the task and saves it on disk for later use, for example by the `setup` method. The objective is to avoid recreating the data in every process at the beginning of training.
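As a rough illustration of the idea described above, here is a minimal sketch of the split between building the data once and loading it afterwards. `ToyTask`, `_build_data`, `build_count` and the cache file name are hypothetical names chosen for this example; this is not the code of the PR or of pyannote's `Task` class.

```python
import pickle
import tempfile
from pathlib import Path


class ToyTask:
    """Hypothetical stand-in for a task; not pyannote's actual Task class."""

    def __init__(self, cache_path: Path):
        self.cache_path = cache_path
        self.prepared_data = None
        self.build_count = 0  # counts how many times the data was rebuilt

    def _build_data(self) -> dict:
        # Placeholder for the expensive pass over the protocol files.
        self.build_count += 1
        return {"audios": ["a.wav", "b.wav"], "annotated_duration": [12.5, 30.0]}

    def prepare_data(self) -> None:
        # Build the data once and save it; do nothing if a cache already exists.
        if self.cache_path.exists():
            return
        with open(self.cache_path, "wb") as cache_file:
            pickle.dump(self._build_data(), cache_file)

    def setup(self) -> None:
        # Each process loads the cached data instead of rebuilding it.
        with open(self.cache_path, "rb") as cache_file:
            self.prepared_data = pickle.load(cache_file)


with tempfile.TemporaryDirectory() as tmp_dir:
    task = ToyTask(Path(tmp_dir) / "task_cache.pkl")
    task.prepare_data()
    task.prepare_data()  # second call finds the cache and returns early
    task.setup()
```

In this sketch the expensive step runs a single time no matter how often `prepare_data` or `setup` is called afterwards, which is the saving the PR is after.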

This PR also proposes reorganizing the segmentation task hierarchy. `SegmentationTask` (previously `SegmentationTaskMixin`) now inherits from the `Task` class, and all segmentation tasks inherit only from `SegmentationTask` (instead of from both `Task` and `SegmentationTaskMixin`).

hbredin · 2023-10-13T16:59:26Z

pyannote/audio/tasks/segmentation/mixins.py

+                    data_dict = pickle.load(data_file)
+                    self.metadata = data_dict["metadata"]
+                    self.audios = data_dict["audios"]
+                    self.audio_infos= data_dict["audio_infos"]
+                    self.audio_encodings = data_dict["audio_encodings"]
+                    self.annotated_duration = data_dict["annotated_duration"]
+                    self.annotated_regions = data_dict["annotated_regions"]
+                    self.annotated_classes = data_dict["annotated_classes"]
+                    self.annotations = data_dict["annotations"]
+                    self.metadata_unique_values = data_dict["metadata_unique_values"]
+                    if isinstance(self.protocol, SegmentationProtocol):
+                        self.classes = data_dict["classes"]
+                    if self.has_validation:
+                        self.validation_chunks = data_dict["validation_chunks"]


All of this can probably be replaced by:

for key, value in data_dict.items(): setattr(self, key, value)
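A small self-contained sketch of that suggestion, using a hypothetical `ToyTask` object and made-up dictionary contents in place of the real task and cache. It assumes that conditional entries such as `classes` or `validation_chunks` are simply left out of the dictionary when they do not apply, so the loop needs no `if` branches.

```python
class ToyTask:
    """Hypothetical object standing in for the task instance."""


# Illustrative cache content; only keys present in the dict become attributes.
data_dict = {
    "metadata": {"database": ["toy"]},
    "audios": ["a.wav"],
    "annotated_duration": [12.5],
}

task = ToyTask()
for key, value in data_dict.items():
    # setattr takes the value as its third argument; it is not an assignment target.
    setattr(task, key, value)
```

One trade-off of this approach is that the set of attributes is no longer visible in the loading code; it is defined entirely by whatever was saved.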

All the segmentation tasks in `pyannote` now inherit from `SegmentationTask` (previously `SegmentationTaskMixin`), which itself inherits from the `Task` class. This commit also adds a `prepared_data` attribute to the `Task` class. That attribute is a dict containing all the data prepared by the `prepare_data` method.

* use npz archive instead of pickle to save task data
* improve code readability
* improve(task): update numpy array dtypes, in order to use types whose size better matches the contents of the arrays
* remove `end` entry from `annotated_regions` numpy array: this entry was redundant with the `start` and `duration` entries, since `end` = `start` + `duration`
* fix: allow data preparation to finish when the task has no validation
* improve: clear data lists after assignment to `self.prepared_data`, to avoid data redundancy in the `prepare_data` method

Co-authored-by: clement-pages <clement.pages@irit.fr>
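To make the npz and `end` changes concrete, here is a minimal sketch assuming a structured array with `file_id`, `start` and `duration` fields. The field names other than `start` and `duration`, the dtypes, and the file name are assumptions made for this example, not the layout used in the PR.

```python
import os
import tempfile

import numpy as np

# Hypothetical field layout; the PR's actual fields and dtypes may differ.
region_dtype = [("file_id", "i4"), ("start", "f4"), ("duration", "f4")]
annotated_regions = np.array([(0, 0.0, 12.5), (1, 3.0, 30.0)], dtype=region_dtype)

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "task_cache.npz")
    # Save the array under a named key in a compressed .npz archive.
    np.savez_compressed(cache_file, annotated_regions=annotated_regions)
    # Indexing the archive reads the array into memory before the file is closed.
    with np.load(cache_file) as archive:
        loaded = archive["annotated_regions"]

# `end` does not need to be stored: it can be recomputed on demand.
end = loaded["start"] + loaded["duration"]
```

Storing only `start` and `duration` keeps the archive smaller, at the cost of one vectorized addition whenever `end` is needed.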

hbredin · 2023-11-28T07:31:10Z

Looks like you closed this PR?

clement-pages · 2023-11-28T07:50:12Z

It wasn't me... I think the PR was closed because I renamed the `feat/data_preparation` branch. The branch still exists. I will reopen the PR.

add prepare_data method in Task class

933a660

The goal of this method is to generate the data needed by the task and save it on disk for later use, for example by the `setup` method. The objective is to avoid recreating the data in every process at the beginning of training.

hbredin reviewed Oct 13, 2023

hbredin and others added 16 commits October 26, 2023 13:38

Merge branch 'develop' into feat/data_preparation

5257145

Merge branch 'feat/data_preparation' of github.com:clement-pages/pyannote-audio into feat/data_preparation

fa63c8a

add two training tests

be6f7ec

One for the test of the `MultiLabelSegmentation` task, and the other for the test of the `SupervisedRepresentationLearningWithArcFace` task.

assign data directly to task in main process, in prepare_data

f447bb6

This eliminates the need to reload the pickled data in `setup` when running in the main process.

Merge branch 'develop' into feat/data_preparation

930deda

handle call to Task.prepare_data and Task.setup under different scenarios

05ccc30

Merge branch 'feat/data_preparation' of github.com:clement-pages/pyannote-audio into feat/data_preparation

44a01fe

add training tests using task caches

4b8e8a2

update cache_path type and docstrings

45918bd

fix classes variable used before assigment

980414e

This issue occurred when a list of classes was specified during `MultiLabelSegmentation` instantiation.

Merge branch 'develop' into feat/data_preparation

a9ea07f

Merge branch 'pyannote:develop' into feat/data_preparation

797a8a4

improve code readability

987e702

Merge branch 'pyannote:develop' into feat/data_preparation

5358986

clement-pages closed this Nov 27, 2023

clement-pages deleted the feat/data_preparation branch November 27, 2023 14:25
