
Parallelize feats extraction with opensmile #181

Merged — 8 commits merged into main on Nov 15, 2024
Conversation

fabiocat93 (Collaborator) commented Nov 13, 2024

This PR introduces parallelization for feature extraction processes using opensmile and praat_parselmouth to improve performance on large datasets. Key changes include:

  • Registering a custom serializer for opensmile.Smile objects
  • Implementing a Pydra workflow for opensmile audio processing
  • Improving the Pydra workflow for parselmouth audio processing (possibly by making parselmouth.Sound objects picklable)

@fabiocat93 fabiocat93 self-assigned this Nov 13, 2024
@fabiocat93 fabiocat93 changed the base branch from main to fix/b2aiprep November 13, 2024 16:50
@fabiocat93 fabiocat93 marked this pull request as draft November 13, 2024 16:50
fabiocat93 (Collaborator, Author) commented Nov 13, 2024

This PR introduces parallelization for feature extraction processes using opensmile and praat_parselmouth to improve performance on large datasets. Key changes include:

  • Registering a custom serializer for opensmile.Smile objects
  • Implementing a Pydra workflow for opensmile audio processing

I have addressed your comments here and parallelized feature extraction with opensmile. I followed @wilke0818's suggestion to create a custom serializer for the opensmile.Smile object. I remember trying this some time ago without success, but this time I had more time to study opensmile's documentation and made it work. The issue was that opensmile.Smile includes a reference to its underlying process, which the serializer cannot handle. By removing that reference, everything seems to work fine.
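The trick described above can be sketched with a stand-in class (hypothetical: `Extractor` and its fields are illustrations, not the PR's actual code). The idea is to register a reducer with `copyreg` that rebuilds the object from its constructor arguments, dropping the unpicklable process handle:

```python
import copyreg
import pickle
import threading

class Extractor:
    """Stand-in for opensmile.Smile: holds an unpicklable process handle."""
    def __init__(self, feature_set="eGeMAPSv02"):
        self.feature_set = feature_set
        self._process = threading.Lock()  # unpicklable, like Smile's process reference

def _reduce_extractor(obj):
    # Rebuild from constructor args, dropping the unpicklable handle.
    return (Extractor, (obj.feature_set,))

copyreg.pickle(Extractor, _reduce_extractor)

clone = pickle.loads(pickle.dumps(Extractor()))
print(clone.feature_set)  # -> eGeMAPSv02
```

Without the registered reducer, `pickle.dumps` would fail on the lock; with it, the object round-trips and each deserialized copy gets a fresh handle from `__init__`.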

  • Improving the Pydra workflow for parselmouth audio processing (possibly by making parselmouth.Sound objects picklable)

@satra I followed your suggestion here to use cloudpickle to make parselmouth.Sound picklable, but unfortunately it didn't work out (it still raises TypeError: cannot pickle 'parselmouth.Sound' object). As an experiment, I created a wrapper around parselmouth.Sound (see here), but I honestly don't like this solution because

  • it doesn't produce any speedup compared to the original solution here
  • it doesn't really make the code cleaner or more maintainable than it was.

In case you want to try any alternative solutions, or have other ideas, please let me know.
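For reference, the wrapper idea can be sketched like this (a hypothetical minimal version, not the code in the linked experiment): store only the raw samples and sampling frequency, which pickle handles fine, and rebuild the parselmouth.Sound on demand inside the worker process.

```python
import pickle

class PicklableSound:
    """Hypothetical wrapper: keeps only plain-Python state so the object
    can cross process boundaries; the parselmouth.Sound is rebuilt lazily."""

    def __init__(self, values, sampling_frequency):
        self.values = list(values)
        self.sampling_frequency = sampling_frequency

    def to_sound(self):
        import parselmouth  # deferred import: only needed in the worker
        return parselmouth.Sound(self.values,
                                 sampling_frequency=self.sampling_frequency)

clone = pickle.loads(pickle.dumps(PicklableSound([0.0, 0.1], 16000)))
print(clone.sampling_frequency)  # -> 16000
```

As noted above, this buys picklability but not speed: the Sound object still has to be reconstructed from samples in every worker, so the serialization cost just moves rather than disappearing.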

@fabiocat93 fabiocat93 requested review from wilke0818 and satra and removed request for wilke0818 November 13, 2024 23:02
@fabiocat93 fabiocat93 marked this pull request as ready for review November 13, 2024 23:03
satra (Collaborator) commented Nov 13, 2024

thanks @fabiocat93 for these enhancements and attempts. i think the parselmouth one is good enough for now, no need to try to make it more picklable.

efficient parallelization is going to be a combined function of dataset diversity (number of samples × duration of each sample), the types of features we will be extracting, and the resources needed (hardware, job scheduler, etc.).

with the b2ai dataset i ran into many of these considerations (without even considering gpu options). so let's merge something like this in, and when we do the code review let's consider possible options for efficiency. also let's get feedback as people use this.

@fabiocat93 fabiocat93 changed the title [WIP] Parallelize feats extraction with opensmile and praat_parselmouth [WIP] Parallelize feats extraction with opensmile Nov 14, 2024
@fabiocat93 fabiocat93 changed the title [WIP] Parallelize feats extraction with opensmile Parallelize feats extraction with opensmile Nov 14, 2024
codecov-commenter commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 89.58333% with 5 lines in your changes missing coverage. Please review.

Project coverage is 63.98%. Comparing base (113721a) to head (4593c61).
Report is 37 commits behind head on main.

Files with missing lines Patch % Lines
src/senselab/__init__.py 60.00% 2 Missing ⚠️
...rc/senselab/audio/tasks/features_extraction/api.py 0.00% 1 Missing ⚠️
...selab/audio/tasks/features_extraction/opensmile.py 95.23% 1 Missing ⚠️
...health_measurements/extract_health_measurements.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
+ Coverage   60.24%   63.98%   +3.74%     
==========================================
  Files         113      116       +3     
  Lines        4017     4101      +84     
==========================================
+ Hits         2420     2624     +204     
+ Misses       1597     1477     -120     


@fabiocat93 fabiocat93 changed the base branch from fix/b2aiprep to main November 14, 2024 19:00
satra (Collaborator) commented Nov 14, 2024

could you perhaps merge the other PR that i had (without a release) and then release it with this?

satra (Collaborator) commented Nov 15, 2024

@fabiocat93 - upgrade to the latest pydra release and try. and do post what the issues are with cf (the concurrent.futures plugin).

satra (Collaborator) commented Nov 15, 2024

defaulting to serial makes sense

fabiocat93 (Collaborator, Author) commented Nov 15, 2024

@fabiocat93 - upgrade to latest pydra release to try.

Done.

and do post what the issues are with cf.

While testing pydra with plugin="cf" and passing torch.Tensor objects as task parameters, I encountered an issue where the workflow would hang forever. After troubleshooting with @wilke0818, we identified a workaround that (at least temporarily) resolves the problem:

from multiprocessing import set_start_method
set_start_method("spawn", force=True)

satra (Collaborator) commented Nov 15, 2024

yes, i should have told you that (that's what i debugged over the weekend on linux). on macOS "spawn" is the default; on linux it's "fork", and "spawn" will become the default across systems from Python 3.14 onwards.

satra (Collaborator) commented Nov 15, 2024

see here:

https://github.com/sensein/b2aiprep/blob/1cc589789d54595ac4b767a7f0bfb9654268c8b0/src/b2aiprep/prepare/prepare.py#L252

btw, there were some weird issues where it would not work if placed in cli.py under if __name__ == '__main__'.

fabiocat93 (Collaborator, Author) commented Nov 15, 2024

do you think we can merge now? @satra

Review thread on src/senselab/__init__.py (outdated, resolved)
satra (Collaborator) left a review comment:

just updated the multiprocessing bit with a try/except

feel free to merge/release after tests pass.
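The guarded version presumably looks something like this (a sketch of the try/except pattern applied to the set_start_method workaround quoted earlier; the actual commit may differ):

```python
import multiprocessing

try:
    # "spawn" avoids the workflow hang seen when torch tensors cross
    # process boundaries under pydra's plugin="cf".
    multiprocessing.set_start_method("spawn", force=True)
except RuntimeError:
    # The start method was already set elsewhere; keep the existing one.
    pass
```

With force=True the call normally succeeds even if a method was already set; the except clause makes the import safe in contexts where changing the start method is not allowed.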

@fabiocat93 fabiocat93 merged commit f460e32 into main Nov 15, 2024
9 checks passed
satra (Collaborator) commented Nov 15, 2024

thank you @fabiocat93
