[DRAFT ] - Simple Adaptive Jailbreaking #537
…bles to expose logprobs in openai chat targets
…ss_conversation_stream. Adding prompt templates.
…ng logprobs exposed (not toplogprobs)
To be converted into a notebook for documentation.
Should live under doc/code/orchestrators
Or auxiliary_attacks instead of orchestrators. The more I think about it the more I feel like it's more like GCG than other orchestrators.
Prompt not used but could be later. It is from the paper directly.
@@ -219,20 +223,28 @@ async def _complete_chat_async(self, messages: list[ChatMessageListDictContent])
response: ChatCompletion = await self._async_client.chat.completions.create(
model=self._deployment_name,
max_completion_tokens=self._max_completion_tokens,
max_tokens=self._max_tokens,
max_tokens=self._max_tokens, # TODO: this is given as NOT_GIVEN?
This TODO comes from the inability to alter the prompt target's max tokens from the orchestrator. I need to test.
temperature=self._temperature,
top_p=self._top_p,
frequency_penalty=self._frequency_penalty,
presence_penalty=self._presence_penalty,
logprobs=self._logprobs,
top_logprobs=self._top_logprobs, # TODO: when called and logprobs is False, this param is not passed
This TODO is a bit trickier. When `logprobs` is `None`, we shouldn't pass `top_logprobs`.
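One possible way to handle this, sketched under the assumption that the optional arguments can be collected in a plain dict and splatted into the `chat.completions.create(...)` call. The helper name `build_logprob_kwargs` is hypothetical and not part of this PR; the only API fact relied on is that `top_logprobs` is valid only when `logprobs` is true.

```python
from typing import Any, Dict, Optional


def build_logprob_kwargs(logprobs: Optional[bool], top_logprobs: Optional[int]) -> Dict[str, Any]:
    """Return only the log-probability arguments that should be sent.

    Hypothetical helper: `top_logprobs` is included only when `logprobs` is truthy,
    and nothing is sent at all when `logprobs` is None.
    """
    kwargs: Dict[str, Any] = {}
    if logprobs is None:
        # Leave both parameters out so the service applies its defaults.
        return kwargs
    kwargs["logprobs"] = logprobs
    if logprobs and top_logprobs is not None:
        kwargs["top_logprobs"] = top_logprobs
    return kwargs
```

The call site could then pass `**build_logprob_kwargs(self._logprobs, self._top_logprobs)` instead of the two explicit keyword arguments.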
Looking great! Thanks for putting in all this work.
@@ -121,6 +122,8 @@ def __init__(
# Original prompt id defaults to id (assumes that this is the original prompt, not a duplicate)
self.original_prompt_id = original_prompt_id or self.id

self.logprobs = logprobs
Note for @rlundeen2 / @rdheekonda : We can create a follow-up task to add this to the DB schema
@@ -119,6 +119,7 @@ def construct_response_from_request(
response_type: PromptDataType = "text",
prompt_metadata: Optional[str] = None,
error: PromptResponseError = "none",
logprobs: Optional[dict] = None,
Suggested change:
- logprobs: Optional[dict] = None,
+ logprobs: Optional[Dict[str, float]] = None,
Is that right?
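Whether `Dict[str, float]` fits depends on what the target actually stores. As a rough illustration, assuming the log-probabilities arrive as per-token `(token, logprob)` pairs (the alias `TokenLogprobs` and the sample values are made up for this sketch), a dict keyed by token text silently drops repeated tokens, which changes any sum taken over its values:

```python
from typing import Dict, List, Tuple

# Hypothetical alias: one (token, logprob) pair per generated token, in order.
TokenLogprobs = List[Tuple[str, float]]


def to_dict(pairs: TokenLogprobs) -> Dict[str, float]:
    """Collapse pairs into a dict; a later duplicate token overwrites the earlier one."""
    return dict(pairs)


pairs: TokenLogprobs = [("Sure", -0.1), (",", -0.5), ("Sure", -2.0)]
as_dict = to_dict(pairs)

# Summing over the pairs counts every token; summing over the dict does not.
total_from_pairs = sum(logprob for _, logprob in pairs)
total_from_dict = sum(as_dict.values())
```

If only the first token or a handful of distinct tokens is ever stored, the dict annotation is probably fine; otherwise a list of pairs may be the safer shape.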
@@ -203,6 +203,7 @@ def _parse_attacker_response(self, *, response: PromptRequestResponse) -> str:
attacker_suggested_prompt = json_response["prompt"]
except (json.JSONDecodeError, KeyError):
# This forces the @pyrit_json_retry decorator to retry the function
breakpoint()
These will have to go 😄
class SimpleAdaptiveOrchestrator(Orchestrator):
"""
This orchestrator implements the Prompt Automatic Iterative Refinement (PAIR) algorithm
I suspect your notebook will elaborate but a paragraph on the intuition behind the algorithm wouldn't hurt 🙂
def _adjust_adversarial_suffix(self) -> str:
"""
Randomly changes the adversarial suffix.
maybe this should be more similar to the method name 😆 _change_suffix_randomly
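For readers unfamiliar with the general idea, here is a toy sketch of a random mutation step plus a keep-if-better loop. It is not the PR's implementation: the function names, the character alphabet, and the generic `score_fn` are all assumptions made for illustration, and the usage below scores strings with a trivial character count rather than anything model-related.

```python
import random
import string
from typing import Callable, Tuple


def change_suffix_randomly(suffix: str, num_changes: int, rng: random.Random) -> str:
    """Overwrite `num_changes` randomly chosen positions with random printable characters."""
    if not suffix:
        return suffix
    chars = list(suffix)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    for _ in range(num_changes):
        position = rng.randrange(len(chars))
        chars[position] = rng.choice(alphabet)
    return "".join(chars)


def random_search(
    initial: str,
    score_fn: Callable[[str], float],
    iterations: int,
    rng: random.Random,
) -> Tuple[str, float]:
    """Keep a mutated candidate only when it scores strictly higher than the current best."""
    best, best_score = initial, score_fn(initial)
    for _ in range(iterations):
        candidate = change_suffix_randomly(best, 1, rng)
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

With a toy objective such as `lambda s: s.count("a")`, the returned score can only stay the same or increase relative to the starting string, which is the property the best-so-far bookkeeping in the orchestrator appears to rely on.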
if "Sure" in message:
breakpoint()

if self._desired_target_response_prefix in message[:len(self._desired_target_response_prefix)]:
Should this be `==` or `in`?
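For what it's worth, the two spellings behave the same here: the slice is never longer than the prefix, so `prefix in sliced` can only be true when the two strings are equal. A small check, with `starts_with_prefix` as a hypothetical helper name, suggests `str.startswith` expresses the intent more directly:

```python
def starts_with_prefix(message: str, prefix: str) -> bool:
    """Equivalent to `prefix in message[:len(prefix)]`, but states the intent directly."""
    return message.startswith(prefix)


prefix = "Sure"
for message in ["Sure, here", "Surely", "Not Sure", "Su", ""]:
    sliced = message[: len(prefix)]
    # All three formulations agree on every sample message.
    assert (prefix in sliced) == (sliced == prefix) == starts_with_prefix(message, prefix)
```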
self._best_logprobs = sum(logprob_dict.values())
self._best_logprobs_dict = logprob_dict
self._best_adversarial_suffix = self._adversarial_suffix
elif " " + self._desired_target_response_prefix in message:
what is this all about? Seems identical except for a leading space?!
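If the second branch exists only to tolerate leading whitespace in the response, stripping it once might make a separate branch unnecessary. This is a sketch under that assumption, with a hypothetical helper name. Note that the `elif` as written is broader: `" " + prefix in message` matches the prefix anywhere in the message after a space, not only at the start.

```python
def matches_desired_prefix(message: str, desired_prefix: str) -> bool:
    """True when the message begins with the prefix, ignoring leading whitespace."""
    return message.lstrip().startswith(desired_prefix)
```

Under this version a response such as "I am not Sure" no longer counts as a match, whereas the original `elif` would accept it.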
if self._best_logprobs == -np.inf:
breakpoint()
logger.info(f"No improvement in logprobs after {self._number_of_iterations} iterations.")
return jailbreaks
For multi-turn orchestrators we've added `MultiTurnAttackResult` just because every orchestrator was returning something custom and that's super annoying. Maybe we can use something similar?
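As a starting point for discussion, a result container for this orchestrator might look roughly like the following. The class and every field name are hypothetical and chosen from the attributes visible in this diff; this is not the definition of `MultiTurnAttackResult`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SuffixSearchResult:
    """Hypothetical result type for the suffix search; not an existing PyRIT class."""

    achieved_objective: bool
    best_suffix: Optional[str] = None
    # Negative infinity mirrors the "no improvement yet" sentinel used in the diff.
    best_logprob_sum: float = float("-inf")
    best_logprobs: Dict[str, float] = field(default_factory=dict)
    iterations_run: int = 0
```

Returning one such object instead of a bare list would let callers check `achieved_objective` without inspecting the responses themselves.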
return float(score_value) >= self._scorer_sensitivity


async def run(self) -> list[PromptRequestResponse]:
It's not exactly standardized yet but we seem to be converging towards `run_attack_async` as the name for this thing.
Description
#266
@romanlutz @rlundeen2
Tests and Documentation
TODO