Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

probes: add ArtPrompt probes #617

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

jmartin-tech
Copy link
Collaborator

Fix #535

Implements two prompt obfuscation patterns based on ArtPrompt

Testing work is still in progress here a new possible base case or detector specific to technique, may be needed. As current mitigation.MitigationBypass detector does not quite cover the returned values when the model is not able to infer the masked word.

Example Usage pattern:

python -m garak -m huggingface.Model -n meta-llama/Llama-2-7b-chat-hf -p artprompt
python -m garak -m huggingface --model_name gpt2 --probes artprompt

The probe pattern could be enhanced to be provided a dataset of prompts to be augmented with a dictionary of unsafe words often blocked by safety training that can be easily maintained as a set of resource files.

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
@leondz leondz added probes Content & activity of LLM probes new plugin Describes an entirely new probe, detector, generator or harness labels Apr 30, 2024
garak/buffs/art.py Outdated Show resolved Hide resolved
@leondz
Copy link
Collaborator

leondz commented May 10, 2024

This is looking pretty reasonable. Agree that a probe works well for the case that the paper presents!

@zazer0
Copy link

zazer0 commented Jun 28, 2024

Hi, my team was thinking of building this as a Buff for the Apart Deception Hackathon (but, just saw this pull request already existed!) - is there a list anywhere of what's left to do for it? (E.g, adding configurable safety words?)

@jmartin-tech
Copy link
Collaborator Author

@zazer0, the current implementation for the probe is mostly complete, a plan for configurable prompts will likely be worked on after #602.

The primary reason this is still in draft is that work is still needed to generate a better detector for evaluating the responses from a prompt offered by this probe. The current detectors look for a mitigation response however this probe would needs additional filtering to determine if the response was able to identify the safety word masked in the prompt as failure to decode would not represent a finding of successful bypass of alignment or mitigation.

* consolidate init in `ArtPrompt`
* make `safety_words` and `stub_prompts` default params to allow override
* limit imports

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
@jmartin-tech jmartin-tech force-pushed the feature/artprompt-probe branch from bd1c9d6 to 3a7cd32 Compare June 28, 2024 16:22
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
@leondz
Copy link
Collaborator

leondz commented Aug 28, 2024

Given the upcoming payloads work on decoupling content from transformation, can it make sense to put this through as a probe in this PR? And defer (in separate PRs):

a. making the encoded texts accessible via the payload mechanism;
b. a buff

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
* require `populate_prompt()` for extending class
* use private class lists for initial prompts

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
@leondz leondz changed the title add ArtPrompt probes probes: add ArtPrompt probes Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new plugin Describes an entirely new probe, detector, generator or harness probes Content & activity of LLM probes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

buff: ascii art
3 participants