Probe&Detector/SciSafeEval: The State-of-the-Art Benchmark for Safety Alignment of Large Language Models in Scientific Tasks #948

Open
wants to merge 16 commits into
base: main

Conversation

DavidLee528
Contributor

Hi Garak Community,

We introduced SciSafeEval, the state-of-the-art benchmark for safety alignment of large language models in scientific tasks. More information can be found at https://arxiv.org/abs/2410.03769 and https://huggingface.co/datasets/Tianhao0x01/SciSafeEval .

In this PR, we add a new probe, garak/probes/sci_safe_eval.py, and a corresponding detector, garak/detectors/refuse_to_answer.py, to garak. The new probe will enable better assessment of the safety alignment of large language models in scientific tasks.

Thanks for reviewing!

Best,
Tianhao

Contributor

github-actions bot commented Oct 11, 2024

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

@DavidLee528
Contributor Author

Hi @500BoiledPotatoes, can you sign the DCO please?

@500BoiledPotatoes

I have read the DCO Document and I hereby sign the DCO

@DavidLee528
Contributor Author

recheck

github-actions bot added a commit that referenced this pull request Oct 11, 2024
@leondz
Collaborator

leondz commented Oct 11, 2024

Thank you, will take a look! And congratulations on the paper.

In the interim, could the brief documentation be added so the tests pass?

Collaborator

@jmartin-tech jmartin-tech left a comment

This looks interesting and provides a significant number of unique probe classes.

I wonder if there is value in refactoring the probe to be more generic, accepting a list of known categories to test rather than defining a class per category. Setting the default to a limited number of prompts via sample_size, and allowing null to represent all prompts, could further reduce the number of probes to track.

plugins:
  probes:
    sci_safe_eval:
      ByCategory:
        categories:
          - BiologyProteinFunctionPrediction
          - BiologyProteinSequenceGeneration
          - BiologyProteinStructurePrediction
          - BiologyGeneClassification
          - BiologyGeneGeneration
          - ChemistryMoleculeGeneration
          - ChemistryReactionPrediction
          - MedicineInferenceReasoning
          - MedicineKnowledgeRetrieval
          - PhysicsKnowledgeRetrieval
        sample_size: 80

The above idea likely needs to be weighed against the loss of resolution in the report, as there would be no per-category breakdown in the report summary.

The concepts in this PR related to model-as-a-judge will also inform the in-progress work on #419, which may provide a more flexible detector for use with the probe responses.
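
The category-driven selection suggested above could be sketched roughly as follows. This is a standalone illustration, not garak code: `select_prompts` and `prompts_by_category` are hypothetical names, and it assumes that sample_size applies per category and that None (null in YAML) means "use every prompt".

```python
import random
from typing import Dict, List, Optional


def select_prompts(
    prompts_by_category: Dict[str, List[str]],
    categories: List[str],
    sample_size: Optional[int] = 80,
    seed: int = 0,
) -> List[str]:
    """Collect prompts for the requested categories.

    Assumption: sample_size is applied per category; None keeps all prompts.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same prompts
    selected: List[str] = []
    for category in categories:
        if category not in prompts_by_category:
            raise ValueError(f"Unknown category: {category}")
        prompts = prompts_by_category[category]
        if sample_size is None or sample_size >= len(prompts):
            selected.extend(prompts)
        else:
            selected.extend(rng.sample(prompts, sample_size))
    return selected
```

A per-category breakdown in the report would still need the category recorded alongside each prompt, which this sketch does not attempt.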

Collaborator

This detector requires users to modify hard-coded values before it can run. This project is distributed via PyPI, and a default installation should not expect users to have permission to modify source code.

Also, would refuse_to_answer fit as a mitigation detector? I get that the act of refusal is not a response based on a specific known mitigation string; however, mitigation.refuse_to_answer or mitigation.refusal would seem to be in line with what is being detected. This could still use a model as a judge, similar to how misleading.MustRefuteClaimModel or misleading.MustContradictNLI use a model to detect.

model = transformers.AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Infer the device map for multi-GPU
device_map = infer_auto_device_map(model, max_memory={0: "24GiB"}, no_split_module_classes=["LlamaDecoderLayer"])
Collaborator

Do not use hard-coded GPU device expectations.

Usage of device_map seems like something that could be shared in HFCompatible to enable more complex auto-detection of resources; however, this must be configurable by the user without code changes to be a viable use case.
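
One way to keep device placement out of the source is to build the model-loading keyword arguments from user-supplied parameters. The sketch below is an assumption about how that might look, not existing garak or HFCompatible code: `DEFAULT_LOAD_PARAMS` and `build_load_kwargs` are hypothetical names. It relies only on `device_map` and `max_memory` being keyword arguments that `from_pretrained` accepts when accelerate is installed.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults; any of these could be overridden through configuration.
DEFAULT_LOAD_PARAMS: Dict[str, Any] = {
    "device_map": "auto",  # let the loader place layers on whatever devices exist
    "max_memory": None,    # e.g. {0: "24GiB"} only when a user asks for a cap
}


def build_load_kwargs(user_params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user overrides over the defaults and drop unset (None) entries."""
    merged = {**DEFAULT_LOAD_PARAMS, **(user_params or {})}
    return {key: value for key, value in merged.items() if value is not None}


# Intended use (not executed here):
#   model = transformers.AutoModelForCausalLM.from_pretrained(
#       model_id, **build_load_kwargs(params)
#   )
```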

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Ensure correct checkpoint path (you may need to download and specify the correct checkpoint file)
checkpoint_path = "/home/tianhao.li/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f" # Modify this with the actual path to your checkpoint files
Collaborator

Cannot rely on code manipulation for configuration. The detector class implements Configurable and can accept configuration values via DEFAULT_PARAMS.

The singleton model class could be retained if this detector is expected to be used for multiple probes. However, it would need to provide a registry of loaded models to ensure the instance's model is the one required for the probe/detector combination, or at least instance-level validation that the model object is configured as expected when accessed.
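
A registry of loaded models, as described above, might look something like the following. It is a minimal sketch under stated assumptions: `ModelRegistry` is a hypothetical class, the `loader` callable stands in for whatever actually loads the checkpoint, and models are keyed by the configuration values that produced them.

```python
from typing import Any, Callable, Dict, Tuple


class ModelRegistry:
    """Cache loaded models, keyed by the configuration that produced them."""

    _models: Dict[Tuple[str, str], Any] = {}

    @classmethod
    def get(
        cls,
        model_id: str,
        checkpoint_path: str,
        loader: Callable[[str, str], Any],
    ) -> Any:
        key = (model_id, checkpoint_path)
        if key not in cls._models:
            # Load once per distinct configuration; later callers reuse it.
            cls._models[key] = loader(model_id, checkpoint_path)
        return cls._models[key]
```

Because the key includes the configuration, two detectors configured with different models never receive each other's instance, while detectors sharing a configuration load the model only once.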

Comment on lines +144 to +145
print(f"\n\n[Response]: {response}")
print(f"\n[Evaluation]: {r}")
Collaborator

Detectors should not print; prefer logging.debug() or remove these.
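
For illustration, the two print calls could be replaced with debug-level logging along these lines. `report_evaluation` is a hypothetical helper name used only for this sketch.

```python
import logging

logger = logging.getLogger(__name__)


def report_evaluation(response: str, evaluation: str) -> None:
    """Emit the same information as the print calls, at debug level only."""
    logger.debug("[Response]: %s", response)
    logger.debug("[Evaluation]: %s", evaluation)
```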

Comment on lines +167 to +169
with open("detection_log.json", "a") as log_file:
    json.dump(log_entry, log_file)
    log_file.write("\n")
Collaborator

Detectors should not write their own log files. As written, this would write a log to the current working directory with no context for why it exists. The parent detector class already logs results into report.jsonl and hitlog.jsonl for the specific run when detectors are evaluated.

data_keys = ['name', 'smiles/selfies']
use_smiles = True

class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction, SciSafeEval, Probe):
Collaborator

Suggested change
- class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction, SciSafeEval, Probe):
+ class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction):

The parent already provides the other mixin classes. The same applies to all *Tiny versions.

Comment on lines +131 to +134
hf_raw_filename = "chemistry_molecule-generation.jsonl"
placeholders = ['<name>', '<smiles/selfies>']
data_keys = ['name', 'smiles/selfies']
use_smiles = True
Collaborator

These can be exposed as instance-specific configurable values using the plugin Configurable pattern.

Suggested change
- hf_raw_filename = "chemistry_molecule-generation.jsonl"
- placeholders = ['<name>', '<smiles/selfies>']
- data_keys = ['name', 'smiles/selfies']
- use_smiles = True
+ DEFAULT_PARAMS = Probe.DEFAULT_PARAMS | {
+     "filename": "chemistry_molecule-generation.jsonl",
+     "placeholders": ['<name>', '<smiles/selfies>'],
+     "data_keys": ['name', 'smiles/selfies'],
+     "use_smiles": True,
+ }

This would result in the default values being projected onto the object as:

self.filename
self.placeholders
self.data_keys
self.use_smiles
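
To illustrate what "projected onto the object" means here, the sketch below mimics the behaviour with a tiny stand-in base class. It is not garak's Configurable implementation; `ConfigurableSketch` and `MoleculeGenerationSketch` are hypothetical names, and the real loading logic may differ.

```python
from typing import Any, Dict, Optional


class ConfigurableSketch:
    """Stand-in base class: copies DEFAULT_PARAMS (plus overrides) onto self."""

    DEFAULT_PARAMS: Dict[str, Any] = {}

    def __init__(self, overrides: Optional[Dict[str, Any]] = None) -> None:
        params = dict(self.DEFAULT_PARAMS)
        params.update(overrides or {})  # user configuration wins over defaults
        for name, value in params.items():
            setattr(self, name, value)


class MoleculeGenerationSketch(ConfigurableSketch):
    # dict union with | needs Python 3.9+
    DEFAULT_PARAMS = ConfigurableSketch.DEFAULT_PARAMS | {
        "filename": "chemistry_molecule-generation.jsonl",
        "placeholders": ["<name>", "<smiles/selfies>"],
        "data_keys": ["name", "smiles/selfies"],
        "use_smiles": True,
    }


probe = MoleculeGenerationSketch({"use_smiles": False})
print(probe.filename, probe.use_smiles)
```

With this shape, a user can switch between SMILES and SELFIES through configuration rather than by editing a class attribute.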

Comment on lines +141 to +142
if retries >= max_retries:
    raise Exception("Failed to get a valid annotation after 5 retries.")
Collaborator

Raising a general exception will bring down the whole run. This should log the error and continue, returning a value that will be interpreted as unable to detect.
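
A retry loop that logs and falls through instead of raising might be sketched as below. The names are hypothetical, and the sketch assumes that None is the value the caller treats as "unable to detect"; the actual sentinel would depend on how the detector's results are consumed.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def annotate_with_retries(
    annotate: Callable[[str], Optional[float]],
    response: str,
    max_retries: int = 5,
) -> Optional[float]:
    """Try to score a response; return None instead of raising on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            result = annotate(response)
        except Exception as exc:  # keep the run alive whatever the judge raises
            logger.warning("Annotation attempt %d failed: %s", attempt, exc)
            continue
        if result is not None:
            return result
    logger.error("No valid annotation after %d attempts", max_retries)
    return None  # assumption: the caller interprets None as "could not detect"
```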

import garak.attempt
from garak.detectors.base import Detector

class Llama31AsJudge(Detector):
Collaborator

@jmartin-tech jmartin-tech Oct 14, 2024

There is currently work in progress to define a generic ModelAsAJudge detector in #419; creating a detector that is coupled to a specific model by class name will likely be problematic. We can either wait for the generic detector or rename this to something that would allow for other models based on the same inference stack, something like refusalModel or maybe refusalCheckpoint.

If this detector is not moved under a different package as suggested in other comments, it might also be valuable to provide a refusalKW detector that can evaluate based on English string detection. While less resilient, this would offer tooling for extracting results in more resource-constrained execution environments.
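
A keyword-based refusal check of the kind mentioned above could be sketched like this. The phrase list is an illustrative assumption, not a vetted set, and the scoring convention (0.0 for a refusal, 1.0 otherwise) is assumed rather than taken from garak; `is_refusal` and `score_outputs` are hypothetical names.

```python
from typing import Iterable, List, Sequence

# Illustrative English refusal markers; a real list would need curation.
REFUSAL_PHRASES: Sequence[str] = (
    "i cannot",
    "i can't",
    "i'm sorry",
    "i am unable",
    "i won't",
)


def is_refusal(output: str, phrases: Sequence[str] = REFUSAL_PHRASES) -> bool:
    """Return True if any refusal phrase occurs in the output, ignoring case."""
    lowered = output.lower()
    return any(phrase in lowered for phrase in phrases)


def score_outputs(outputs: Iterable[str]) -> List[float]:
    """Assumed convention: 0.0 when the model refused, 1.0 when it did not."""
    return [0.0 if is_refusal(output) else 1.0 for output in outputs]
```

Substring matching of this kind misses paraphrased or non-English refusals, which is the resilience trade-off noted above.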

Comment on lines +58 to +98
def read_sci_safe_eval(self, file_path, combine=True):
    data = []
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            for line in file:
                json_obj = json.loads(line.strip())
                if combine:
                    if self.placeholders and self.data_keys:
                        prompt = json_obj['instruction']
                        for placeholder, data_key in zip(self.placeholders, self.data_keys):
                            if data_key == 'smiles/selfies':
                                data_value = json_obj['smiles'] if self.use_smiles else json_obj['selfies']
                            else:
                                data_value = json_obj.get(data_key, '')
                            prompt = prompt.replace(placeholder, data_value)
                    else:
                        prompt = json_obj['instruction']
                    data.append({
                        "idx": json_obj['idx'],
                        "prompt": prompt,
                        "few-shot": json_obj.get('few-shot', []),
                        "cot": json_obj.get('cot', None),
                        "jailbreak": json_obj.get('jailbreak', None),
                        "tags": json_obj.get('tags', [])
                    })
                else:
                    data.append({
                        "idx": json_obj['idx'],
                        "instruction": json_obj['instruction'],
                        "few-shot": json_obj.get('few-shot', []),
                        "cot": json_obj.get('cot', None),
                        "jailbreak": json_obj.get('jailbreak', None),
                        "tags": json_obj.get('tags', [])
                    })
        return data
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None
    except json.JSONDecodeError:
        print(f"Error decoding JSON in file: {file_path}")
        return None
Collaborator

I suspect this code is here because the dataset files do not meet the criteria for Hugging Face dataset parsing.

This error is seen when attempting to explore the data on Hugging Face:

Error code:   DatasetGenerationCastError
Exception:    DatasetGenerationCastError
Message:      An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 2 new columns ({'smiles', 'selfies'}) and 1 missing columns ({'sequence'}).

This happened while the json dataset builder was generating data using

hf://datasets/Tianhao0x01/SciSafeEval/chemistry_molecule-generation.jsonl (at revision 1751327df6dfc640571fa2d24cdae31522eb1bfe)

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
                  writer.write_table(table)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 580, in write_table
                  pa_table = table_cast(pa_table, self._schema)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2292, in table_cast
                  return cast_table_to_schema(table, schema)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
                  raise CastError(
              datasets.table.CastError: Couldn't cast
              idx: int64
              instruction: string
              name: string
              smiles: string
              selfies: string
              tags: list<item: string>
                child 0, item: string
              jailbreak: string
              to
              {'idx': Value(dtype='int64', id=None), 'instruction': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'sequence': Value(dtype='string', id=None), 'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'jailbreak': Value(dtype='string', id=None)}
              because column names don't match
              
              During handling of the above exception, another exception occurred:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1392, in compute_config_parquet_and_info_response
                  parquet_operations = convert_to_parquet(builder)
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1041, in convert_to_parquet
                  builder.download_and_prepare(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 924, in download_and_prepare
                  self._download_and_prepare(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 999, in _download_and_prepare
                  self._prepare_split(split_generator, **prepare_split_kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1740, in _prepare_split
                  for job_id, done, content in self._prepare_split_single(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1871, in _prepare_split_single
                  raise DatasetGenerationCastError.from_cast_error(
              datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset
              
              All the data files must have the same columns, but at some point there are 2 new columns ({'smiles', 'selfies'}) and 1 missing columns ({'sequence'}).
              
              This happened while the json dataset builder was generating data using

Can the dataset be updated to conform to the required format? This would allow processing using huggingface's datasets package.
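
As a rough illustration of the "matching columns" option named in the error message, records from the different files could be padded to a shared set of columns before being written back out. This is only a sketch of one possible approach, not the dataset authors' method; `harmonize_columns` is a hypothetical helper, and the alternative of defining separate configurations per file is not shown.

```python
from typing import Any, Dict, List


def harmonize_columns(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Give every record the same keys, filling absent columns with None."""
    columns: List[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    return [{column: record.get(column) for column in columns} for record in records]


rows = harmonize_columns(
    [
        {"idx": 1, "instruction": "...", "sequence": "MKV"},
        {"idx": 2, "instruction": "...", "smiles": "CCO", "selfies": "[C][C][O]"},
    ]
)
print(sorted(rows[0]))
```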

@leondz
Collaborator

leondz commented Oct 16, 2024

The non-commercial license on the dataset (https://huggingface.co/datasets/Tianhao0x01/SciSafeEval) may restrict who can use this probe. We prefer garak to be open and use open datasets and models. Can you discuss your position on the license and how this probe can be made available to all garak users?

@DavidLee528
Contributor Author

Hi Leon - Thanks for the comment! We are aware that the licence on Tianhao0x01/SciSafeEval may conflict with garak. How about this: we create a new Hugging Face repo, Tianhao0x01/SciSafeEval-mini, containing a subset of SciSafeEval under an MIT licence, and then refer to the mini version? @leondz (btw, sorry for the delay due to heavy workload here...😅)

@leondz
Collaborator

leondz commented Nov 12, 2024

> The non-commercial license on the dataset (https://huggingface.co/datasets/Tianhao0x01/SciSafeEval) may restrict who can use this probe. We prefer garak to be open and use open datasets and models. Can you discuss your position on the license and how this probe can be made available to all garak users?
>
> How about this way: we create a new huggingface repo Tianhao0x01/SciSafeEval-mini contain a subset of SciSafeEval with a MIT licence, then let's refer to the mini version? @leondz

This sounds amazing, yes please, that would work great!

> (btw, sorry for the delay due to heavy workload here...😅)

I am glad they're keeping you busy :) Hope things are well!

@DavidLee528
Contributor Author

We will work on this soon, all the best🫡

@leondz
Collaborator

leondz commented Jan 7, 2025

Is there any news? Can this license be updated?
