docs(blog): data poisoning article #2566

Open
wants to merge 7 commits into base: main


Contributor

vsauter commented Jan 7, 2025

Addition of article on data poisoning.

vsauter requested a review from typpo on January 7, 2025, 23:28

Contributor

gru-agent bot commented Jan 7, 2025

TestGru Assignment

Summary

| Link | CommitId | Status | Reason |
| --- | --- | --- | --- |
| Detail | 752dfbd | 🚫 Skipped | No files need to be tested |

Files reported as "not in work scope":

- `site/blog/data-poisoning.md`
- `site/static/img/blog/data-poisoning/backdoor-panda.png`

Work scope include: `**/*.ts,**/*.tsx,**/*.js,**/*.jsx`

Work scope exclude: `node_modules,**/*.test.ts,**/*.test.tsx,**/*.spec.ts,**/*.spec.tsx,**/*.d.ts,**/*.test.js,**/*.spec.js`

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input

Contributor

github-actions bot commented Jan 7, 2025

Images automagically compressed by Calibre's image-actions

Compression reduced images by 11.2%, saving 47.11 KB.

| Filename | Before | After | Improvement | Visual comparison |
| --- | --- | --- | --- | --- |
| site/static/img/blog/data-poisoning/poisoning-panda.jpeg | 420.51 KB | 373.40 KB | -11.2% | View diff |

168 images did not require optimisation.

Member

mldangelo left a comment

Great work on this article. You’ve done a wonderful job distilling your thoughts and presenting them clearly. I’ve left far too many comments. They’re mostly suggestions to consider at your discretion (feel free to ignore). One thing to think about is our target audience. Make sure terms and concepts are familiar to whoever you believe will read this. I really appreciate all your hard work, and I’m excited to see how this shapes up. Keep it up!


# Defending Against Data Poisoning Attacks on LLMs: A Comprehensive Guide

Data poisoning remains a top concern in the [OWASP Top 10 for LLM Applications 2025](https://owasp.org/www-project-top-10-for-large-language-model-applications/). However, the scope of data poisoning has expanded since the 2023 version. Data poisoning is no longer strictly a risk during the training of Large Language Models (LLMs); it now encompasses all three stages of the LLM lifecycle: pre-training, fine-tuning, and retrieval from external sources. OWASP also highlights the risk of model poisoning from shared repositories or open-source platforms, where models may contain backdoors or embedded malware.
Member

Suggested change
Data poisoning remains a top concern in the [OWASP Top 10 for LLM Applications 2025](https://owasp.org/www-project-top-10-for-large-language-model-applications/). However, the scope of data poisoning has expanded since the 2023 version. Data poisoning is no longer strictly a risk during the training of Large Language Models (LLMs); it now encompasses all stages of the LLM lifecycle, including pre-training, fine-tuning, and retrieval from external sources. OWASP also highlights the risk of model poisoning from shared repositories or open-source platforms, where models may contain backdoors or embedded malware.


A successful data poisoning attack can degrade model performance, produce biased or toxic content, compromise downstream systems, or tamper with the model’s generation capabilities.

Understanding how these attacks work and implementing preventative measures is crucial for developers, security engineers, and technical leaders responsible for maintaining the security and reliability of these systems. This comprehensive guide delves into the nature of data poisoning attacks and offers strategies to safeguard against these threats.
Member

can you please take another pass at this section?


Data poisoning attacks are malicious attempts to corrupt the training data of an LLM, thereby influencing the model's behavior in undesirable ways. These attacks typically manifest in three primary forms:

1. **Poisoning the Training Dataset**: Attackers insert malicious data into the training set during pre-training or fine-tuning, causing the model to learn incorrect associations or behaviors. This can lead the model to make erroneous predictions or become susceptible to specific triggers. Attackers may also plant backdoors: poisoned training examples that leave the model behaving normally under typical conditions but producing attacker-chosen outputs when a specific trigger appears.
Member

i know this is a different list from the one in your intro but it is similar enough that it's confusing to me (there's overlap). maybe not an actionable nit but wanted to share.
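
To make the backdoor scenario in the first list item concrete, here is a minimal Python sketch of a behavioral probe: it appends a candidate trigger to otherwise normal prompts and measures how often the output changes. Everything in it is an assumption for illustration, not part of the article or of any library: the `trigger_flip_rate` helper, the toy classifier standing in for a poisoned model, and the made-up trigger string `cf-2025`. For a generative model you would compare a derived label (for example, a classifier verdict) rather than raw text.

```python
from typing import Callable, Iterable


def trigger_flip_rate(
    model: Callable[[str], str], prompts: Iterable[str], trigger: str
) -> float:
    """Return the fraction of prompts whose output changes once `trigger` is appended."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    flips = sum(model(p) != model(f"{p} {trigger}") for p in prompts)
    return flips / len(prompts)


def toy_backdoored_classifier(text: str) -> str:
    """Hypothetical stand-in for a poisoned model: normal unless the trigger appears."""
    if "cf-2025" in text:  # attacker-chosen trigger (invented for this sketch)
        return "positive"  # attacker-chosen output
    return "negative" if "terrible" in text else "positive"


clean_prompts = ["the food was terrible", "terrible service and rude staff"]

# The trigger flips every negative prompt; an ordinary word flips none.
print(trigger_flip_rate(toy_backdoored_classifier, clean_prompts, "cf-2025"))  # 1.0
print(trigger_flip_rate(toy_backdoored_classifier, clean_prompts, "today"))    # 0.0
```

A high flip rate for one rare token, next to a near-zero rate for ordinary words, is the signature described above: normal behavior by default, attacker-chosen output on the trigger. The probe only helps when you already have candidate triggers to test.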


## Detection and Prevention Strategies

To protect your LLM applications from [LLM vulnerabilities](https://www.promptfoo.dev/docs/red-team/llm-vulnerability-types/), including data poisoning attacks, it's essential to implement a comprehensive set of detection and prevention measures:
Member

rephrase


### Implement Data Validation and Tracking to Mitigate Risk of Data Poisoning

- **Enforce Sandboxing**: Implement sandboxing to restrict model exposure to untrusted data sources.
Member

how does this make sense in an LLM context?


- **Enforce Sandboxing**: Implement sandboxing to restrict model exposure to untrusted data sources.
- **Track Data Origins**: Use tools like OWASP CycloneDX or ML-BOM to track data origins and transformations.
- **Use Data Versioning**: Use a version control system to track changes in datasets and detect manipulation.
Member

consider rephrasing. I am not sure if this is useful detail in this article. It may be table stakes.
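
As one way to picture the data-versioning bullet, the following Python sketch records a SHA-256 digest for every file in a dataset directory and later reports what was added, removed, or modified. It is a simplified stand-in for a real version control or provenance tool, not a replacement for one; the helper names (`build_manifest`, `diff_manifests`) and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file path (relative to data_dir) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(data_dir).as_posix()] = digest
    return manifest


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Demo on a throwaway directory: record a baseline, then simulate tampering.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "train.jsonl").write_text('{"text": "hello", "label": "greeting"}\n')
    baseline = build_manifest(data_dir)  # taken when the dataset was approved

    (data_dir / "train.jsonl").write_text('{"text": "hello", "label": "spam"}\n')
    (data_dir / "extra.jsonl").write_text('{"text": "new"}\n')
    changes = diff_manifests(baseline, build_manifest(data_dir))

print(changes)
```

Storing the baseline manifest somewhere the training pipeline cannot write to is what makes the comparison meaningful; a manifest kept next to the data can be rewritten along with it.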

Regularly monitor the outputs of your LLM for signs of unusual or undesirable behavior.

- **Implement Tracing**: LLM tracing records the intermediate steps (such as prompts, retrieved context, and model calls) that an LLM application takes as it generates a response. Tracing can help you monitor, debug, and understand the execution of an LLM application.
- **Use Golden Datasets**: Golden datasets in LLMs are high-quality, carefully curated collections of data used to evaluate and benchmark the performance of large language models. Use these datasets as a "ground truth" to evaluate the performance of your models.
Member

pre-deployment evals with promptfoo! consider stating this first.
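
The golden-dataset idea can be sketched in a few lines of plain Python: run every curated prompt through the model and compute the share of responses that contain the expected answer. An eval framework such as Promptfoo automates this kind of check before deployment; the version below only illustrates the principle. The golden cases, the `golden_pass_rate` helper, the stub model, and the 0.9 threshold are all assumptions made up for this sketch.

```python
from typing import Callable

# Hypothetical curated cases: a prompt and a substring the answer must contain.
GOLDEN_SET = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]

PASS_THRESHOLD = 0.9  # assumed release gate, tune for your application


def golden_pass_rate(model: Callable[[str], str], golden_set: list[dict]) -> float:
    """Return the fraction of golden cases whose expected text appears in the output."""
    passed = sum(
        case["expected"].lower() in model(case["prompt"]).lower()
        for case in golden_set
    )
    return passed / len(golden_set)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; the arithmetic answer is deliberately wrong."""
    if "France" in prompt:
        return "The capital of France is Paris."
    return "5"


rate = golden_pass_rate(stub_model, GOLDEN_SET)
print(f"pass rate: {rate:.2f}, release allowed: {rate >= PASS_THRESHOLD}")
```

Running the same golden set before and after each retraining or fine-tuning run turns a drop in pass rate into an early signal that the new data changed the model's behavior.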


- **Lock Down Access**: Restrict access to LLM repositories and implement robust monitoring to mitigate the risk of insider threats.
  - Access to training data should be restricted based on least privilege and need-to-know. Access should be recertified on a regular cadence (such as quarterly) to account for employee turnover or job changes.
  - All access should be logged and audited. Developer access should be limited to the minimum necessary to perform their job, and access should be revoked when they leave the organization.
Member

Is this the right level of detail for your audience?


- **Vet Your Sources**: Conduct thorough due diligence on model providers and training data sources.
  - Review model cards and documentation to understand the model's training processes and performance. You can learn more about this in our [foundation model security](https://www.promptfoo.dev/blog/foundation-model-security/) blog post.
  - Verify that models downloaded from Hugging Face [pass their malware scans](https://huggingface.co/docs/hub/en/security-malware) and [pickling scans](https://huggingface.co/docs/hub/en/security-pickle).
Member

this is good
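
To show why pickle scans matter, the sketch below uses Python's standard `pickletools` module to list the opcodes in a pickle that can import or call arbitrary objects, without ever unpickling the data. It is a coarse illustration and not how Hugging Face's scanner is implemented. The `risky_opcodes` helper, the opcode list, and the `Payload` class are assumptions for this example. Legitimate model pickles also use these opcodes, so a real scanner inspects which modules are imported rather than flagging every hit.

```python
import pickle
import pickletools

# Opcodes that import objects or invoke callables while a pickle is being loaded.
RISKY_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX", "REDUCE",
}


def risky_opcodes(data: bytes) -> list[str]:
    """List risky opcode names found in a pickle byte string, without loading it."""
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if opcode.name in RISKY_OPCODES
    ]


class Payload:
    """Hypothetical malicious object: unpickling it would call print(...)."""

    def __reduce__(self):
        return (print, ("code ran during unpickling",))


plain = pickle.dumps({"weights": [0.1, 0.2]})  # plain data only
hostile = pickle.dumps(Payload())              # serializing does not run the payload

print(risky_opcodes(plain))
print(risky_opcodes(hostile))
```

The plain dictionary produces no risky opcodes, while the payload's pickle contains a `REDUCE` call, which is the mechanism that lets a downloaded model file execute code on load.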

### Red Team LLM Applications to Detect Data Poisoning

- **Model Red Teaming**: Run an initial [red team](https://www.promptfoo.dev/docs/red-team/) assessment against any models pulled from shared or public repositories like Hugging Face.
- **Assess Bias**: Use the [classifier assert type](https://www.promptfoo.dev/docs/configuration/expected-outputs/classifier/#bias-detection-example) in Promptfoo's eval framework to assess grounding, factuality, and bias in models pulled from Hugging Face.
Member

not just related to hugging face; this can be run on any output. model is hosted on hugging face
