How to use AnalyzerEngine inline of a log stream #1515

Open
gideonred opened this issue Jan 22, 2025 · 1 comment

Comments

@gideonred

Hi team,

We're looking to use AnalyzerEngine in our Django backend to redact sensitive logs.

Django would usually have you set up a middleware where we would invoke something like the AnalyzerEngine for every log line. However, two things give me pause:

  • We'd need to download a language model such as en_core_web_lg ahead of time and package it with the app, which would increase our Django package size by half a gigabyte.
  • Once loaded, the model would use up to about half a gigabyte of memory, a significant increase in our memory usage.

We also have a Vector pipeline where Vector can invoke a script per log line. However, running `from presidio_analyzer import AnalyzerEngine` in a script seems to take a few seconds, which would be a non-starter if we have to do it for every log line.

We're trying not to set up additional services (e.g. a cluster of Presidio processes that can process the logs), as that adds maintenance overhead.

Is there a way forward? Is there something we're missing?

@SharonHart
Contributor

SharonHart commented Jan 23, 2025

Hi @gideonred!
I would suggest experimenting with the Presidio Docker containers and having the Django middleware call Presidio over its REST API.
That would decouple the redaction and the language model from the Django application, so it won't affect the app's size. I don't know what infrastructure you're running on, but the containerized versions are easy to run with docker or docker-compose, which keeps the maintenance overhead low.
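
A minimal sketch of that approach, using only the Python standard library. It assumes the analyzer container is published on localhost:5002 and that POST /analyze accepts a JSON body with "text" and "language" and returns a list of results carrying "entity_type", "start" and "end". The names ANALYZER_URL, analyze, redact and RedactingFilter are hypothetical, and a logging filter is used here in place of a request middleware since the goal is to rewrite log lines. Treat it as a starting point, not a reference implementation.

```python
import json
import logging
import urllib.request

# Assumption: the analyzer container is reachable at this address.
ANALYZER_URL = "http://localhost:5002/analyze"


def analyze(text, url=ANALYZER_URL, timeout=2.0):
    """Send text to the (assumed) Presidio analyzer endpoint and return its results."""
    payload = json.dumps({"text": text, "language": "en"}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def redact(text, results):
    """Replace every detected span with a <ENTITY_TYPE> placeholder."""
    pieces, cursor = [], 0
    # Walk spans left to right; for equal starts, take the longest span first.
    for r in sorted(results, key=lambda r: (r["start"], -r["end"])):
        if r["end"] <= cursor:
            continue  # already covered by a span that was redacted earlier
        start = max(r["start"], cursor)
        pieces.append(text[cursor:start])
        pieces.append("<" + r["entity_type"] + ">")
        cursor = r["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts each record's message before it is emitted."""

    def __init__(self, analyze_fn=analyze):
        super().__init__()
        self._analyze = analyze_fn

    def filter(self, record):
        message = record.getMessage()
        try:
            record.msg = redact(message, self._analyze(message))
        except (OSError, ValueError):
            # Fail closed: drop the content if the service is unreachable
            # or returns something that is not valid JSON.
            record.msg = "[redaction unavailable]"
        record.args = ()
        return True
```

The filter can be attached to a handler with `handler.addFilter(RedactingFilter())`, or registered under "filters" in Django's LOGGING setting. Each log line costs one HTTP round trip, so the timeout and the fail-closed behaviour are choices worth revisiting for your traffic.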

We've also seen, and have some samples of, Presidio integrating with more robust, higher-level monitoring solutions such as the ELK stack (a Logstash plugin calling Presidio), or similar solutions that collect logs further down the application flow.
