Add prepare_data method to generate and save protocol data on disk #1500

Closed

clement-pages
Collaborator

@clement-pages clement-pages commented Oct 13, 2023

The goal of this pull request is to reduce the time needed to set up a task in pyannote by implementing a prepare_data method. prepare_data generates the data needed by the task and saves it to disk for later use, for example by the setup method. The objective is to avoid recreating this data in every process at the start of each training run.

This PR also proposes reorganizing the segmentation task hierarchy. SegmentationTask (previously SegmentationTaskMixin) now inherits from the Task class, and all segmentation tasks inherit only from SegmentationTask (instead of from both Task and SegmentationTaskMixin).
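The hierarchy change described above can be pictured with a minimal sketch. The class bodies and the `setup` return strings here are hypothetical stand-ins, not the actual pyannote code; only the class names and the inheritance direction come from this PR.

```python
class Task:
    """Stand-in for the base task class (hypothetical, heavily simplified)."""

    def setup(self):
        return "Task.setup"


class SegmentationTask(Task):
    """Shared segmentation logic, now a regular subclass of Task."""

    def setup(self):
        # Extend the base behaviour through a single inheritance chain.
        return "SegmentationTask.setup -> " + super().setup()


class MultiLabelSegmentation(SegmentationTask):
    """A concrete task only needs one base class (no separate mixin)."""


task = MultiLabelSegmentation()
# Method resolution order is a single straight line.
print([cls.__name__ for cls in type(task).__mro__])
print(task.setup())
```

With a single chain, `super()` calls resolve in an obvious order, which is one plausible reason to prefer it over the earlier Task plus mixin combination.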

The goal of this method is to generate the data needed by the task
and save it to disk for future use, for example by the `setup` method.
The objective is to avoid systematically recreating the data in each process
at the beginning of training.
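As a rough illustration of the prepare-once, load-everywhere idea, here is a minimal sketch. `ToyTask`, `_build`, `cache_path` and the dictionary contents are all hypothetical and are not the pyannote implementation; the sketch only assumes that `prepare_data` writes a cache file and that `setup` reads it back.

```python
import os
import pickle
import tempfile


class ToyTask:
    """Hypothetical task used only to illustrate the caching pattern."""

    def __init__(self, cache_path):
        self.cache_path = cache_path
        self.build_count = 0  # how many times the expensive step ran

    def _build(self):
        # Placeholder for the expensive pass over the protocol files.
        self.build_count += 1
        return {"annotated_duration": [1.5, 2.0], "classes": ["a", "b"]}

    def prepare_data(self):
        # Skip the expensive step when the cache file already exists.
        if os.path.exists(self.cache_path):
            return
        with open(self.cache_path, "wb") as data_file:
            pickle.dump(self._build(), data_file)

    def setup(self):
        # Every process only has to read the cached file.
        with open(self.cache_path, "rb") as data_file:
            self.prepared_data = pickle.load(data_file)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "task_cache.pkl")

    first = ToyTask(path)
    first.prepare_data()   # builds the data and writes the cache
    first.setup()

    second = ToyTask(path)  # stands in for another training process
    second.prepare_data()   # cache already present, nothing is rebuilt
    second.setup()

print(first.build_count, second.build_count)
```

In this sketch the second object never runs `_build`, which is the saving the description is after.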
Comment on lines 81 to 94
data_dict = pickle.load(data_file)
self.metadata = data_dict["metadata"]
self.audios = data_dict["audios"]
self.audio_infos = data_dict["audio_infos"]
self.audio_encodings = data_dict["audio_encodings"]
self.annotated_duration = data_dict["annotated_duration"]
self.annotated_regions = data_dict["annotated_regions"]
self.annotated_classes = data_dict["annotated_classes"]
self.annotations = data_dict["annotations"]
self.metadata_unique_values = data_dict["metadata_unique_values"]
if isinstance(self.protocol, SegmentationProtocol):
    self.classes = data_dict["classes"]
if self.has_validation:
    self.validation_chunks = data_dict["validation_chunks"]
Member


All of this can probably be replaced by:

for key, value in data_dict.items():
    setattr(self, key, value)
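A small self-contained sketch of this suggestion follows. `Holder` and the dictionary contents are hypothetical; the only point is that `setattr` takes the value as its third argument and that the loop restores whatever keys were saved.

```python
import io
import pickle


class Holder:
    """Hypothetical stand-in for the task object."""


# Pretend this dictionary was written earlier by the data preparation step.
data_dict = {"metadata": {"n_files": 2}, "classes": ["speech", "noise"]}
buffer = io.BytesIO()
pickle.dump(data_dict, buffer)
buffer.seek(0)

holder = Holder()
# One loop replaces the long list of explicit assignments.
for key, value in pickle.load(buffer).items():
    setattr(holder, key, value)

print(holder.classes)
```

One consequence of this design: the loop assigns every key present in the file, so conditional entries such as `classes` or `validation_chunks` have to be decided when the dictionary is written rather than when it is read.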

hbredin and others added 16 commits October 26, 2023 13:38
Now all the segmentation tasks in `pyannote` inherit from `SegmentationTask`
(previously `SegmentationTaskMixin`), which inherits from the `Task` class. This
commit also adds a `prepared_data` attribute to the `Task` class. That
attribute is a dict containing all the data prepared by the `prepare_data`
method.
One for the test of the `MultiLabelSegmentation` task, and
the other for the test of the `SupervisedRepresentationLearningWithArcFace`
task.
This eliminates the need to reload pickle data in setup when in the main process
This issue occurred when a list of classes was specified during `MultiLabelSegmentation`
instantiation.
* use npz archive instead of pickle to save task data

* improve code readability

* improve(task): update numpy array dtypes

In order to use types whose size better matches the contents of the arrays

* remove `end` entry from `annotated_regions` numpy array

This entry was redundant with the start and duration entries,
since `end` = `start` + `duration`.

* fix: allow data preparation to be finished when task has no validation

* improve: clear data lists after assignment to `self.prepared_data`

This is to avoid data redundancy in the `prepare_data` method

---------

Co-authored-by: clement-pages <clement.pages@irit.fr>
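Several of the commit notes above (npz instead of pickle, tighter dtypes, dropping the redundant `end` entry) can be illustrated together. This is a minimal sketch: the field names `file_id`, `start`, `duration` and the chosen dtypes are assumptions for illustration, not necessarily the ones used in the PR.

```python
import io

import numpy as np

# Structured array with compact dtypes; `end` is intentionally not stored.
regions = np.array(
    [(0, 0.0, 2.5), (0, 4.0, 1.0), (1, 1.0, 3.0)],
    dtype=[("file_id", "u2"), ("start", "f4"), ("duration", "f4")],
)

# Write to an in-memory npz archive (a file path works the same way).
buffer = io.BytesIO()
np.savez_compressed(buffer, annotated_regions=regions)
buffer.seek(0)

# Plain numeric and structured arrays load without enabling pickle.
with np.load(buffer) as archive:
    loaded = archive["annotated_regions"]

# `end` can be recomputed on demand from the two stored fields.
end = loaded["start"] + loaded["duration"]
print(loaded.dtype.itemsize, end)
```

Each record here takes 10 bytes (2 + 4 + 4), and recomputing `end` costs a single vectorized addition.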
@clement-pages clement-pages deleted the feat/data_preparation branch November 27, 2023 14:25
@hbredin
Member

hbredin commented Nov 28, 2023

Looks like you closed this PR?

@clement-pages
Collaborator Author

It wasn't me... I think the PR was closed because I renamed the feat/data_preparation branch. The branch still exists. I will reopen it.
