
* benchmarks for each scanner separately (remove the common ones)
* use fasttext-langdetect instead of langdetect
asofter committed Oct 31, 2023
1 parent 6cf0385 commit 59f812b
Showing 35 changed files with 596 additions and 171 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- Use single `PromptInjection` scanner with multiple models
- Benchmarks are measured for each scanner individually
- In the `Refutation` output scanner use the same model for the NLI as used in the `BanTopics`
- Use [fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect/) in the `Language` and `LanguageSame` scanners

### Removed
- Remove `PromptInjectionV2` scanner to rely on the single one with a choice of models
- Langchain `LLMChain` example as this functionality is deprecated, use `LCEL` instead

## [0.3.0] - 2023-10-14
66 changes: 0 additions & 66 deletions docs/benchmarks/input_scanners.md

This file was deleted.

57 changes: 0 additions & 57 deletions docs/benchmarks/output_scanners.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/best_practices.md

## Performance Optimization

1. **Benchmark Analysis**: Before choosing the scanners, it's crucial to understand their performance on different instances. Review the benchmarks for each scanner to make an informed decision based on your specific requirements.

2. **Model Size Trade-off**: Opting for smaller models will expedite processing, reducing latency. However, this comes at the cost of accuracy. We are actively working on providing compact versions with minimal accuracy trade-offs.

41 changes: 35 additions & 6 deletions docs/input_scanners/anonymize.md
before the model sees them.
- **Enhanced Detection**: Beyond Presidio Analyzer's capabilities, the scanner recognizes specific patterns like Email,
US SSN, UUID, and more.
- **Entities Support**:
  - Peek at our [default entities](https://github.com/laiyer-ai/llm-guard/blob/main/llm_guard/input_scanners/anonymize.py#L26-L40).
  - View [Presidio's supported entities](https://microsoft.github.io/presidio/supported_entities/#list-of-supported-entities).
  - And, we've got [custom regex patterns](https://github.com/laiyer-ai/llm-guard/blob/main/llm_guard/resources/sensisitive_patterns.json) too!
- **Tailored Recognizers**:
  - Balance speed vs. accuracy with our recognizers. For an informed choice, dive into
    the [benchmark comparisons](https://blog.px.dev/detect-pii/).
  - **Top Pick: [beki/en_spacy_pii_distilbert](https://huggingface.co/beki/en_spacy_pii_distilbert)**
  - Alternatives: [beki/en_spacy_pii_fast](https://huggingface.co/beki/en_spacy_pii_fast)
    and [en_core_web_trf](https://spacy.io/models/en#en_core_web_trf).

!!! info

Configure the `Anonymize` Scanner:
```python
from llm_guard.input_scanners import Anonymize
from llm_guard.input_scanners.anonymize_helpers.analyzer import RECOGNIZER_SPACY_EN_PII_FAST
from llm_guard.vault import Vault

vault = Vault()  # stores the replaced values so they can be restored later
scanner = Anonymize(vault, preamble="Insert before prompt", allowed_names=["John Doe"], hidden_names=["Test LLC"],
                    recognizer=RECOGNIZER_SPACY_EN_PII_FAST)
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
```


Retrieving Original Data: To revert to the initial data, utilize the [Deanonymize](../output_scanners/deanonymize.md)
scanner.
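Conceptually, the vault records each placeholder together with the value it replaced, which is what makes the later restoration possible. Below is a minimal, self-contained sketch of that round trip; the regex pattern, placeholder format, and function names are illustrative assumptions, not the library's actual implementation:

```python
import re

# A minimal "vault": maps placeholders to the original values they replaced.
vault = {}

def anonymize(text: str, patterns: dict) -> str:
    """Replace every regex match with an indexed placeholder, recording it in the vault."""
    for entity, pattern in patterns.items():
        for i, match in enumerate(re.findall(pattern, text)):
            placeholder = f"[{entity}_{i}]"
            vault[placeholder] = match
            text = text.replace(match, placeholder)
    return text

def deanonymize(text: str) -> str:
    """Restore the original values recorded in the vault."""
    for placeholder, original in vault.items():
        text = text.replace(placeholder, original)
    return text

patterns = {"EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+"}
masked = anonymize("Contact me at jane.doe@example.com", patterns)     # "Contact me at [EMAIL_0]"
restored = deanonymize(masked)                                         # original text back
```

In the library itself, the `Vault` object plays the role of this dictionary, shared between the `Anonymize` input scanner and the `Deanonymize` output scanner.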

## Benchmarks

Environment:

- Platform: Amazon Linux 2
- Python Version: 3.11.6

Run the following script:

```sh
python benchmarks/run.py input Anonymize
```

Results:

| Instance | Setup | Time taken, s | Characters per Second | Total Length Processed |
|-------------------------|-------------------------------------------------|---------------|-----------------------|------------------------|
| inf1.xlarge (AWS) | `recognizer=RECOGNIZER_SPACY_EN_PII_FAST` | 0.067 | 4719.12 | 317 |
| m5.large (AWS) | `recognizer=RECOGNIZER_SPACY_EN_PII_FAST` | 0.126 | 2522.17 | 317 |
| g5.xlarge (AWS) **GPU** | `recognizer=RECOGNIZER_SPACY_EN_PII_FAST` | 0.065 | 4844.37 | 317 |
| inf1.xlarge (AWS) | `recognizer=RECOGNIZER_SPACY_EN_PII_DISTILBERT` | 0.134 | 2373.23 | 317 |
| m5.large (AWS) | `recognizer=RECOGNIZER_SPACY_EN_PII_DISTILBERT` | 0.187 | 1693.19 | 317 |
| g5.xlarge (AWS) **GPU** | `recognizer=RECOGNIZER_SPACY_EN_PII_DISTILBERT` | 0.154 | 2061.57 | 317 |
27 changes: 26 additions & 1 deletion docs/input_scanners/ban_substrings.md
Additionally, the scanner can be configured to redact the banned substrings (see the `redact` parameter).
```python
from llm_guard.input_scanners import BanSubstrings

scanner = BanSubstrings(substrings=["forbidden", "unwanted"], match_type="word", case_sensitive=False, redact=False,
                        contains_all=False)
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
```

In `word` match mode, the scanner matches whole words only. To ban substrings irrespective of their word boundaries, simply change `match_type` to `str`.

A prepared dataset of harmful substrings for prompts is also available: [prompt_stop_substrings.json](https://github.com/laiyer-ai/llm-guard/blob/main/llm_guard/resources/prompt_stop_substrings.json)
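The difference between the two matching modes can be sketched with plain regular expressions. This is an illustration of the idea, not the scanner's actual code:

```python
import re

def is_banned(text: str, substrings: list, match_type: str = "word", case_sensitive: bool = False) -> bool:
    """Illustrative check: 'word' respects word boundaries, 'str' matches anywhere."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for substring in substrings:
        if match_type == "word":
            # \b anchors the match at word boundaries
            found = re.search(rf"\b{re.escape(substring)}\b", text, flags)
        else:  # "str"
            found = re.search(re.escape(substring), text, flags)
        if found:
            return True
    return False

is_banned("This is forbidden", ["forbidden"], match_type="word")   # True
is_banned("unforbidden ground", ["forbidden"], match_type="word")  # False: inside another word
is_banned("unforbidden ground", ["forbidden"], match_type="str")   # True
```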

## Benchmarks

Environment:

- Platform: Amazon Linux 2
- Python Version: 3.11.6

Run the following script:

```sh
python benchmarks/run.py input BanSubstrings
```

Results:

| Instance | Time taken, s | Characters per Second | Total Length Processed |
|-------------------|---------------|-----------------------|------------------------|
| inf1.xlarge (AWS) | 0.0 | 243606.68 | 45 |
| m5.large (AWS) | 0.0 | 216970.99 | 45 |

!!! info

This scanner uses built-in functions, which makes it fast.
24 changes: 23 additions & 1 deletion docs/input_scanners/ban_topics.md
reduce the risk of generating responses that could lead to misunderstandings or

## How it works

It relies on the capabilities of the model: [MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c](https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c).
This model aids in identifying the underlying theme or topic of a prompt, allowing the scanner to cross-check it against
a list of banned topics.

```python
from llm_guard.input_scanners import BanTopics

scanner = BanTopics(topics=["violence"], threshold=0.5)
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
```

## Benchmarks

Environment:

- Platform: Amazon Linux 2
- Python Version: 3.11.6

Run the following script:

```sh
python benchmarks/run.py input BanTopics
```

Results:

| Instance | Time taken, s | Characters per Second | Total Length Processed |
|-------------------------|---------------|-----------------------|------------------------|
| inf1.xlarge (AWS) | 0.396 | 252.38 | 100 |
| m5.large (AWS) | 0.727 | 137.51 | 100 |
| g5.xlarge (AWS) **GPU** | 0.4 | 250.11 | 100 |
23 changes: 22 additions & 1 deletion docs/input_scanners/code.md
to either whitelist or blacklist specific languages, thus retaining full control
user queries.

!!! note
    The scanner is currently limited to extracting and detecting code snippets from Markdown in the following languages:

- Go
- Java
```python
from llm_guard.input_scanners import Code

scanner = Code(denied=["python"])
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
```
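The extraction step described above can be approximated with a regular expression over fenced Markdown blocks. This is an illustrative sketch that assumes well-formed fences with an explicit language tag; it is not the scanner's actual parser:

```python
import re

# Matches ```lang ... ``` fenced blocks; DOTALL lets '.' span newlines.
FENCE_RE = re.compile(r"```(\w+)\n(.*?)```", re.DOTALL)

def extract_code_blocks(markdown: str):
    """Return (language, code) pairs for fenced blocks with a language tag."""
    return [(lang.lower(), code) for lang, code in FENCE_RE.findall(markdown)]

def violates_policy(markdown: str, denied: set) -> bool:
    """Flag the text if any extracted block is in a denied language."""
    return any(lang in denied for lang, _ in extract_code_blocks(markdown))

doc = "Here you go:\n```python\nprint('hi')\n```\n"
violates_policy(doc, denied={"python"})  # True
violates_policy(doc, denied={"go"})      # False
```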

## Benchmarks

Environment:

- Platform: Amazon Linux 2
- Python Version: 3.11.6

Run the following script:

```sh
python benchmarks/run.py input Code
```

Results:

| Instance | Time taken, s | Characters per Second | Total Length Processed |
|-------------------------|---------------|-----------------------|------------------------|
| inf1.xlarge (AWS) | 0.062 | 4029.3 | 248 |
| m5.large (AWS) | 0.112 | 2215.66 | 248 |
| g5.xlarge (AWS) **GPU** | 0.358 | 692.11 | 248 |
30 changes: 28 additions & 2 deletions docs/input_scanners/language.md
The Language Scanner is designed to identify such attempts and assess the authenticity of the language used.

## How it works

At its core, the scanner leverages the capabilities of the [fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect/) library.
The primary function of the scanner is to analyze the input prompt, determine its language, and check if it's in the
list.

!!! info

    Supported languages: `af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh`.

## Usage

```python
from llm_guard.input_scanners import Language

scanner = Language(valid_languages=["en"])  # add other valid languages as needed
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
```
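The scanner's decision step amounts to checking the top detection against the allow-list. The sketch below assumes a detector that yields `(language_code, confidence)` pairs with the best match first; the function name and default threshold are illustrative assumptions, not the library's exact API:

```python
def is_language_valid(detections, valid_languages, threshold=0.6):
    """detections: (language_code, confidence) pairs from a detector, best match first."""
    if not detections:
        return False
    language, confidence = detections[0]
    # Accept only a confident detection of an allowed language.
    return language in valid_languages and confidence >= threshold

is_language_valid([("en", 0.98)], valid_languages=["en"])  # True
is_language_valid([("fr", 0.95)], valid_languages=["en"])  # False: language not allowed
is_language_valid([("en", 0.41)], valid_languages=["en"])  # False: low confidence
```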

## Benchmarks

Environment:

- Platform: Amazon Linux 2
- Python Version: 3.11.6

Run the following script:

```sh
python benchmarks/run.py input Language
```

Results:

| Instance | Time taken, s | Characters per Second | Total Length Processed |
|-------------------------|---------------|-----------------------|------------------------|
| inf1.xlarge (AWS) | 0.4 | 34.98 | 14 |
| m5.large (AWS) | 0.36 | 37.9 | 14 |
| g5.xlarge (AWS) **GPU** | 0.314 | 44.63 | 14 |
24 changes: 23 additions & 1 deletion docs/input_scanners/prompt_injection.md
primary ways an attacker might exploit:

Choose models you would like to validate against:

- [JasperLS/deberta-v3-base-injection](https://huggingface.co/JasperLS/deberta-v3-base-injection). It's worth noting
  that while the current model can detect attempts effectively, it might occasionally yield false positives.
- [hubert233/GPTFuzz](https://huggingface.co/hubert233/GPTFuzz) based on the larger RoBERTa-large model.

Usage:
```python
from llm_guard.input_scanners import PromptInjection, MODEL_JASPERLS

scanner = PromptInjection(threshold=0.5, models=[MODEL_JASPERLS])
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
```
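When several models are configured, their individual scores need to be combined into a single verdict. A conservative policy is to take the highest injection score across models; the sketch below is one reasonable aggregation under that assumption, not necessarily the scanner's exact logic:

```python
def aggregate_injection_score(model_scores, threshold=0.5):
    """Combine per-model injection scores: flag if any model crosses the threshold."""
    risk_score = max(model_scores.values())
    is_valid = risk_score < threshold
    return is_valid, risk_score

scores = {"JasperLS/deberta-v3-base-injection": 0.82, "hubert233/GPTFuzz": 0.31}
is_valid, risk = aggregate_injection_score(scores)  # is_valid=False, risk=0.82
```

Taking the maximum means one confident model is enough to reject the prompt, trading some false positives for fewer missed injections.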

## Benchmarks

Environment:

- Platform: Amazon Linux 2
- Python Version: 3.11.6

Run the following script:

```sh
python benchmarks/run.py input PromptInjection
```

Results:

| Instance | Time taken, s | Characters per Second | Total Length Processed |
|-------------------------|---------------|-----------------------|------------------------|
| inf1.xlarge (AWS) | 0.2 | 1921.18 | 384 |
| m5.large (AWS) | 0.344 | 1116.45 | 384 |
| g5.xlarge (AWS) **GPU** | 0.539 | 712.43 | 384 |