We're looking to use AnalyzerEngine in our Django backend to redact sensitive logs.
Django would usually have you set up a middleware where we would invoke something like the AnalyzerEngine for every log line. However, two things give me pause:
We'd need to download a language model such as en_core_web_lg ahead of time and package it with the app, which would increase our Django package size by about half a gigabyte.
Once loaded, the model would use about half a gigabyte of memory, which would be a significant increase in our memory usage.
We also have a Vector pipeline where Vector can invoke a script per log line. However, running `from presidio_analyzer import AnalyzerEngine` in a script seems to take a few seconds, which would be a non-starter if we had to do it per log line.
We're trying not to set up additional services (e.g. a cluster of Presidio processes that can process the logs), as that adds maintenance overhead.
Is there a way forward? Is there something we're missing?
Hi @gideonred!
I would suggest experimenting with the Presidio Docker containers and having the Django middleware call Presidio through its REST API.
That would decouple the redaction and the language model from the Django application, so neither its package size nor its memory usage would be affected. I don't know what infrastructure you're running on, but the containerized versions are easy to run with Docker or docker-compose, which keeps the maintenance overhead low.
We've also seen, and have some samples of, Presidio integrating with more robust, higher-level monitoring solutions, such as the ELK stack (a Logstash plugin calling Presidio), or similar solutions that collect logs further down the application flow.
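As a rough illustration of that setup, here is a minimal sketch of a Python logging filter that sends each message to an analyzer container and masks the returned spans locally. It assumes the analyzer is reachable at `http://localhost:5002/analyze` (the host and port depend on how you run the container) and that the endpoint accepts `text` and `language` and returns a list of findings with `entity_type`, `start`, and `end`, which is how I understand the Presidio REST API; please check it against the version you deploy. The names `analyze_remote`, `redact`, and `RedactingFilter` are hypothetical, and a logging filter is used here instead of a request middleware because it sees every log record.

```python
import json
import logging
import urllib.request

# Assumption: an analyzer container is reachable here; adjust host and port.
ANALYZER_URL = "http://localhost:5002/analyze"


def analyze_remote(text, url=ANALYZER_URL, timeout=2.0):
    """POST the text to the analyzer service and return its list of findings."""
    payload = json.dumps({"text": text, "language": "en"}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def redact(text, findings):
    """Replace each detected span with <ENTITY_TYPE>, working right to left."""
    cut = len(text)  # everything at or after this index is already final
    for f in sorted(findings, key=lambda f: (-f["end"], f["start"])):
        start, end = f["start"], min(f["end"], cut)
        if end <= start:
            continue  # span is fully covered by one we already replaced
        text = text[:start] + "<" + f["entity_type"] + ">" + text[end:]
        cut = start
    return text


class RedactingFilter(logging.Filter):
    """Hypothetical filter that rewrites each record with its redacted message."""

    def __init__(self, analyze=analyze_remote):
        super().__init__()
        self._analyze = analyze  # injectable so it can be stubbed in tests

    def filter(self, record):
        message = record.getMessage()
        try:
            record.msg = redact(message, self._analyze(message))
        except (OSError, ValueError):
            # Fail closed: drop the content rather than log it unredacted.
            record.msg = "<log line withheld: redaction service unavailable>"
        record.args = None  # the message is already fully formatted
        return True
```

The filter can be attached to a handler with `handler.addFilter(RedactingFilter())`. Note that this makes one blocking HTTP call per log line, so whether the latency is acceptable depends on your log volume; failing closed when the service is down is a design choice you may want to revisit.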