
Parallelize feats extraction with opensmile #181

Merged — 8 commits merged into main on Nov 15, 2024
Conversation

fabiocat93 (Collaborator) commented Nov 13, 2024

This PR introduces parallelization for feature extraction processes using opensmile and praat_parselmouth to improve performance on large datasets. Key changes include:

  • Registering a custom serializer for opensmile.Smile objects
  • Implementing a Pydra workflow for opensmile audio processing
  • Improving the Pydra workflow for parselmouth audio processing (possibly by making parselmouth.Sound objects picklable)

@fabiocat93 fabiocat93 self-assigned this Nov 13, 2024
@fabiocat93 fabiocat93 changed the base branch from main to fix/b2aiprep November 13, 2024 16:50
@fabiocat93 fabiocat93 marked this pull request as draft November 13, 2024 16:50
fabiocat93 (Collaborator, Author) commented Nov 13, 2024

This PR introduces parallelization for feature extraction processes using opensmile and praat_parselmouth to improve performance on large datasets. Key changes include:

  • Registering a custom serializer for opensmile.Smile objects
  • Implementing a Pydra workflow for opensmile audio processing

I have addressed your comments here and parallelized feature extraction with opensmile. I followed @wilke0818's suggestion to create a custom serializer for the opensmile.Smile object. I remember trying this some time ago without success, but this time I had more time to study opensmile's documentation and made it work. The issue was that opensmile.Smile includes a reference to its underlying process, which the serializer cannot handle. By removing that reference, everything seems to work fine.
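The trick described above can be sketched with a stand-in class (hypothetical: `Extractor` and its fields are illustrations, not the PR's actual code). The idea is to register a reducer with `copyreg` that rebuilds the object from its constructor arguments, dropping the unpicklable process handle:

```python
import copyreg
import pickle
import threading

class Extractor:
    """Stand-in for opensmile.Smile: holds an unpicklable process handle."""
    def __init__(self, feature_set="eGeMAPSv02"):
        self.feature_set = feature_set
        self._process = threading.Lock()  # unpicklable, like Smile's process reference

def _reduce_extractor(obj):
    # Rebuild from constructor args, dropping the unpicklable handle.
    return (Extractor, (obj.feature_set,))

copyreg.pickle(Extractor, _reduce_extractor)

clone = pickle.loads(pickle.dumps(Extractor()))
print(clone.feature_set)  # -> eGeMAPSv02
```

Without the registered reducer, `pickle.dumps` would fail on the lock; with it, the object round-trips and each deserialized copy gets a fresh handle from `__init__`.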

  • Improving the Pydra workflow for parselmouth audio processing (possibly by making parselmouth.Sound objects picklable)

@satra I followed your suggestion here to use cloudpickle to make parselmouth.Sound picklable, but unfortunately it didn't work out (it still raises TypeError: cannot pickle 'parselmouth.Sound' object). As an experiment, I created a wrapper around parselmouth.Sound (see here), but I honestly don't like this solution because

  • it doesn't produce any speedup compared to the original solution here
  • it doesn't really make the code cleaner or more maintainable than it was.

In case you want to try any alternative solutions, or have other ideas, please let me know.
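For reference, the wrapper idea can be sketched like this (a hypothetical minimal version, not the code in the linked experiment): store only the raw samples and sampling frequency, which pickle handles fine, and rebuild the parselmouth.Sound on demand inside the worker process.

```python
import pickle

class PicklableSound:
    """Hypothetical wrapper: keeps only plain-Python state so the object
    can cross process boundaries; the parselmouth.Sound is rebuilt lazily."""

    def __init__(self, values, sampling_frequency):
        self.values = list(values)
        self.sampling_frequency = sampling_frequency

    def to_sound(self):
        import parselmouth  # deferred import: only needed in the worker
        return parselmouth.Sound(self.values,
                                 sampling_frequency=self.sampling_frequency)

clone = pickle.loads(pickle.dumps(PicklableSound([0.0, 0.1], 16000)))
print(clone.sampling_frequency)  # -> 16000
```

As noted above, this buys picklability but not speed: the Sound object still has to be reconstructed from samples in every worker, so the serialization cost just moves rather than disappearing.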

@fabiocat93 fabiocat93 requested review from wilke0818 and satra and removed request for wilke0818 November 13, 2024 23:02
@fabiocat93 fabiocat93 marked this pull request as ready for review November 13, 2024 23:03
satra (Collaborator) commented Nov 13, 2024

thanks @fabiocat93 for these enhancements and attempts. i think the parselmouth one is good enough for now, no need to try to make it more picklable.

efficient parallelization is going to be a combined function of dataset diversity (number of samples × duration of each sample), the types of features we will be extracting, and the resources needed (hardware, job scheduler, etc.).

with the b2ai dataset i ran into many of these considerations (without even considering gpu options). so let's merge something like this in, and when we do the code review let's consider possible options for efficiency. also let's get feedback as people use this.

@fabiocat93 fabiocat93 changed the title [WIP] Parallelize feats extraction with opensmile and praat_parselmouth [WIP] Parallelize feats extraction with opensmile Nov 14, 2024
@fabiocat93 fabiocat93 changed the title [WIP] Parallelize feats extraction with opensmile Parallelize feats extraction with opensmile Nov 14, 2024
codecov-commenter commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 89.58333% with 5 lines in your changes missing coverage. Please review.

Project coverage is 63.98%. Comparing base (113721a) to head (4593c61).
Report is 37 commits behind head on main.

Files with missing lines Patch % Lines
src/senselab/__init__.py 60.00% 2 Missing ⚠️
...rc/senselab/audio/tasks/features_extraction/api.py 0.00% 1 Missing ⚠️
...selab/audio/tasks/features_extraction/opensmile.py 95.23% 1 Missing ⚠️
...health_measurements/extract_health_measurements.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
+ Coverage   60.24%   63.98%   +3.74%     
==========================================
  Files         113      116       +3     
  Lines        4017     4101      +84     
==========================================
+ Hits         2420     2624     +204     
+ Misses       1597     1477     -120     


@fabiocat93 fabiocat93 changed the base branch from fix/b2aiprep to main November 14, 2024 19:00
satra (Collaborator) commented Nov 14, 2024

could you perhaps merge the other PR that i had (without a release) and then release it with this?

satra (Collaborator) commented Nov 15, 2024

@fabiocat93 - upgrade to the latest pydra release and try. and do post what the issues are with cf (the concurrent.futures plugin).

satra (Collaborator) commented Nov 15, 2024

defaulting to serial makes sense

fabiocat93 (Collaborator, Author) commented Nov 15, 2024

@fabiocat93 - upgrade to latest pydra release to try.

Done.

and do post what the issues are with cf.

While testing pydra with plugin="cf" and passing torch.Tensor objects as task parameters, I encountered an issue where the workflow would hang forever. After troubleshooting with @wilke0818, we identified a workaround that (at least temporarily) resolves the problem:

from multiprocessing import set_start_method
set_start_method("spawn", force=True)

satra (Collaborator) commented Nov 15, 2024

yes, i should have told you that (that's what i debugged over the weekend on linux). on macOS "spawn" is the default; on linux it's "fork", and "spawn" will become the default across systems from Python 3.14 onwards.

satra (Collaborator) commented Nov 15, 2024

see here:

https://github.com/sensein/b2aiprep/blob/1cc589789d54595ac4b767a7f0bfb9654268c8b0/src/b2aiprep/prepare/prepare.py#L252

btw, there were some weird issues where it would not work if placed in cli.py under if __name__ == '__main__'.

fabiocat93 (Collaborator, Author) commented Nov 15, 2024

do you think we can merge now? @satra

Review thread on src/senselab/__init__.py (outdated, resolved)
satra (Collaborator) left a review comment:

just updated the multiprocessing bit with a try/except

feel free to merge/release after tests pass.
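The guarded version presumably looks something like this (a sketch of the try/except pattern applied to the set_start_method workaround quoted earlier; the actual commit may differ):

```python
import multiprocessing

try:
    # "spawn" avoids the workflow hang seen when torch tensors cross
    # process boundaries under pydra's plugin="cf".
    multiprocessing.set_start_method("spawn", force=True)
except RuntimeError:
    # The start method was already set elsewhere; keep the existing one.
    pass
```

With force=True the call normally succeeds even if a method was already set; the except clause makes the import safe in contexts where changing the start method is not allowed.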

@fabiocat93 fabiocat93 merged commit f460e32 into main Nov 15, 2024
9 checks passed
satra (Collaborator) commented Nov 15, 2024

thank you @fabiocat93
