Add batch analyze API support for recognizer #1506

Open · jimmyxie-figma opened this issue Jan 6, 2025 · 6 comments

jimmyxie-figma commented Jan 6, 2025

Is your feature request related to a problem? Please describe.

Currently, the BatchAnalyzerEngine works by iterating through a list or dictionary and analyzing (and anonymizing) the values one by one. While this is not an issue for the predefined recognizers, and the built-in NLP engine has improvements to support batch inference, it does pose an efficiency problem for transformer recognizers, leaving resources idle and inference throughput low.
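For context, a minimal sketch of the current flow (the texts are illustrative):

```python
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

analyzer = AnalyzerEngine()
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)

texts = ["My name is John Smith", "Call me at 212-555-0199"]

# Each text is analyzed one by one: every recognizer (including any
# transformer-based one) runs per text, so the model never sees more
# than a single input at a time.
results = batch_analyzer.analyze_iterator(texts, language="en")
```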

Describe the solution you'd like

We want to build batch inference API support for recognizers. Early testing shows that even with a small batch size of 4, a BERT-like transformer speeds up inference by 3x without any additional resource or memory usage.

The exact implementation is still up for discussion. One potential solution would be adding a batch recognizer mix-in, where we batch-analyze the batch recognizers first and pass the results to the regular analyze for extension, similar to nlp_engine.process_batch.
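A rough sketch of what such a mix-in could look like (BatchRecognizerMixin and analyze_batch are hypothetical names, not part of Presidio today):

```python
from typing import List

from presidio_analyzer import RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts


class BatchRecognizerMixin:
    """Hypothetical mix-in marking an EntityRecognizer as batch-capable."""

    def analyze_batch(
        self,
        texts: List[str],
        entities: List[str],
        nlp_artifacts_batch: List[NlpArtifacts],
    ) -> List[List[RecognizerResult]]:
        # Default: fall back to one-by-one analysis via the recognizer's
        # existing analyze(). A transformer recognizer would override this
        # with true batched inference, and the engine would merge these
        # results with the output of the regular per-text analyze pass.
        return [
            self.analyze(text, entities, artifacts)
            for text, artifacts in zip(texts, nlp_artifacts_batch)
        ]
```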

Describe alternatives you've considered
N/A

Additional context
N/A

omri374 (Contributor) commented Jan 7, 2025

Thanks for the suggestion! One of the reasons we chose to use spacy-huggingface-pipelines is its batch support for transformer-based models. Have you experimented with this option? In general, I agree that exposing an analyze_batch option for recognizers is a good idea, with the BatchAnalyzerEngine calling analyze_batch instead of analyze for each recognizer.

jimmyxie-figma (Author) commented Jan 7, 2025

@omri374 thanks for the quick reply. I would imagine the transformers_nlp_engine does the same thing as well. Never mind, I see from the comments that it's just a wrapper around that package.

The problem is that we are currently using two transformer models (and not for the purpose of multi-language support), so that approach wouldn’t work. Another way forward for us would be to fine-tune our own models and consolidate to one.

jimmyxie-figma (Author) commented

Alternatively, we could support multiple NLP engines and extend the NLP artifacts. If we decide to provide better support for batching, my team is happy to help with the PR.

omri374 (Contributor) commented Jan 8, 2025

Will the option of having a batch_analyze for recognizers be useful in your case? If yes, a PR would be great. Essentially, we could add a batch_analyze method to the EntityRecognizer base class, and have it iterate through texts and pass them to the analyze method. In specific cases, like the transformers recognizer, we could override this method with a different implementation for batch mode. WDYT?
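As an illustration, a transformers-based recognizer could override such a batch_analyze like this (the method name, the batch size, and the entity-label mapping are assumptions for discussion; the Hugging Face pipeline batching itself is real):

```python
from typing import List, Optional

from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts
from transformers import pipeline


class BatchedTransformersRecognizer(EntityRecognizer):
    """Sketch of a recognizer whose batch_analyze does real batched inference."""

    def __init__(self, model_name: str, supported_entities: List[str]):
        super().__init__(supported_entities=supported_entities)
        self.ner_pipeline = pipeline(
            "token-classification", model=model_name, aggregation_strategy="simple"
        )

    def load(self) -> None:
        pass  # model already loaded in __init__

    def analyze(self, text, entities, nlp_artifacts=None):
        # Single-text path simply reuses the batch path.
        return self.batch_analyze([text], entities)[0]

    def batch_analyze(
        self,
        texts: List[str],
        entities: List[str],
        nlp_artifacts_batch: Optional[List[NlpArtifacts]] = None,
    ) -> List[List[RecognizerResult]]:
        # The pipeline batches internally; this is where the ~3x speedup
        # mentioned above would come from.
        outputs = self.ner_pipeline(texts, batch_size=4)
        return [
            [
                RecognizerResult(
                    entity_type=span["entity_group"],
                    start=span["start"],
                    end=span["end"],
                    score=float(span["score"]),
                )
                for span in text_output
                if span["entity_group"] in entities
            ]
            for text_output in outputs
        ]
```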

jimmyxie-figma (Author) commented

@omri374 The issue is that BatchAnalyzer is just a thin wrapper around the regular Analyzer; the magic happens in the Analyzer.analyze function. Adding batch_analyze to the EntityRecognizer would be a good starting point, but I feel like a deeper refactor might be needed between the two analyzers.

I’ll try to cut a PR to add the batch analyze API to the EntityRecognizers. In our codebase, we’ll create the following (see the sketch after this list):

  • A CustomAnalyzer inheriting from the regular Analyzer, adding a batch API
  • A CustomBatchAnalyzer with the batch iteration function overridden to use the new batch API
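A minimal sketch of that plan, assuming a batch API shaped like the batch_analyze discussed above (all class and method names here are hypothetical, not Presidio's):

```python
from typing import Iterable, List

from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine, RecognizerResult


class CustomAnalyzer(AnalyzerEngine):
    """Regular Analyzer plus a batch entry point."""

    def analyze_batch(
        self, texts: List[str], language: str
    ) -> List[List[RecognizerResult]]:
        # Naive placeholder that delegates to the single-text path; the
        # real version would route batch-capable recognizers through
        # their batch_analyze and only fall back per text for the rest.
        return [self.analyze(text=text, language=language) for text in texts]


class CustomBatchAnalyzer(BatchAnalyzerEngine):
    """BatchAnalyzerEngine whose iteration uses the batch API above."""

    def analyze_iterator(
        self, texts: Iterable[str], language: str, **kwargs
    ) -> List[List[RecognizerResult]]:
        return self.analyzer_engine.analyze_batch(
            [str(t) for t in texts], language=language
        )
```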

omri374 (Contributor) commented Jan 15, 2025

In the BatchAnalyzerEngine, there's a separation between the NLP engine phase and the recognizers phase. I wonder if we can call process_batch in the NLP engine, and then do a batch run through the recognizers, with similar logic/configuration to AnalyzerEngine.analyze. This way we wouldn't need the CustomAnalyzer, and we'd just update the code in BatchAnalyzerEngine.
So this line:

```python
results = self.analyzer_engine.analyze(
```

would be replaced with a call to self.run_recognizers_in_batch or something like that, with all the configuration coming from self.analyzer_engine.

I must say I haven't thought about this deeply enough to know if it's viable.
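A sketch of how that two-phase flow could look, with run_recognizers_in_batch written out as the hypothetical method from the comment above (the registry and recognizer calls follow the current public attributes, but treat the details as a starting point rather than the final design):

```python
from typing import Iterable, List

from presidio_analyzer import BatchAnalyzerEngine, RecognizerResult


class BatchAnalyzerEngineSketch(BatchAnalyzerEngine):
    """Proposed flow: one batched NLP pass, then a batched recognizer pass."""

    def analyze_iterator(
        self, texts: Iterable[str], language: str, **kwargs
    ) -> List[List[RecognizerResult]]:
        texts = [str(t) for t in texts]

        # Phase 1: batch NLP processing (process_batch already exists).
        nlp_artifacts_batch = [
            artifacts
            for _, artifacts in self.analyzer_engine.nlp_engine.process_batch(
                texts=texts, language=language
            )
        ]

        # Phase 2: replaces the per-text analyzer_engine.analyze() call.
        return self.run_recognizers_in_batch(texts, nlp_artifacts_batch, language)

    def run_recognizers_in_batch(
        self,
        texts: List[str],
        nlp_artifacts_batch: List,
        language: str,
    ) -> List[List[RecognizerResult]]:
        recognizers = self.analyzer_engine.registry.get_recognizers(
            language=language, all_fields=True
        )
        results: List[List[RecognizerResult]] = []
        for text, artifacts in zip(texts, nlp_artifacts_batch):
            text_results: List[RecognizerResult] = []
            for recognizer in recognizers:
                # A batch-capable recognizer would receive the whole list
                # here instead of being called once per text.
                text_results.extend(
                    recognizer.analyze(text, recognizer.supported_entities, artifacts)
                    or []
                )
            results.append(text_results)
        return results
```

Note this skips the post-processing (deduplication, score threshold, context enhancement) that AnalyzerEngine.analyze applies, which is part of the "similar logic/configuration" that would need to be carried over.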
