
Adding threshold to Transformers pipeline #14

Open
joshpopelka20 opened this issue Nov 21, 2023 · 6 comments

@joshpopelka20

I'm using this code to run inference:


# Use a pipeline as a high-level helper
from transformers import pipeline

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification


tokenizer = AutoTokenizer.from_pretrained("obi/deid_bert_i2b2")

pipe = pipeline("token-classification", tokenizer=tokenizer, model="obi/deid_bert_i2b2",
                aggregation_strategy="first")

I'm trying to increase the threshold, but can't find a config for it. Is it possible with my setup?

@prajwal967
Collaborator

Hi, sorry for the late response.

Unfortunately, the threshold can't be set through the HuggingFace pipelines.
What you could do is check whether you can get the raw logits out of the pipeline; if you can, then you can process the raw logit values using the code given here: Threshold max or Threshold sum
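To make the idea concrete, here is a minimal NumPy sketch of a threshold-max rule (an illustration only, with made-up labels and logits; not the repo's exact ThresholdProcessMax implementation):

```python
import numpy as np

def softmax(logits):
    # Convert raw logits to probabilities along the label axis
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def threshold_max(logits, label_list, threshold, fallback="O"):
    # For each token, take the argmax label, but fall back to "O"
    # when the max probability is below the threshold
    probs = softmax(np.asarray(logits, dtype=float))
    preds = []
    for token_probs in probs:
        best = int(token_probs.argmax())
        if token_probs[best] >= threshold:
            preds.append(label_list[best])
        else:
            preds.append(fallback)
    return preds

# Toy example: 3 tokens, 3 labels (hypothetical values)
label_list = ["O", "B-DATE", "B-PATIENT"]
logits = [[4.0, 0.1, 0.1],   # confident "O"
          [0.2, 2.5, 0.3],   # confident "B-DATE"
          [0.4, 0.5, 0.45]]  # low confidence -> falls back to "O"
print(threshold_max(logits, label_list, threshold=0.8))
# -> ['O', 'B-DATE', 'O']
```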

Let us know if you have any other questions!

@joshpopelka20
Author

joshpopelka20 commented Dec 20, 2023

Not sure if I'm doing this right, but this is the code I have so far:

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits
print(PostProcessPicker.get_threshold_max(predictions, 1.8982457699258832e-06))

I'm getting this error message when I run the code:

/usr/local/lib/python3.10/dist-packages/robust_deid/sequence_tagging/post_process/model_outputs/post_process_picker.py in get_threshold_max(self, threshold)
56 (ThresholdProcessMax): Return Threshold Max post processor
57 """
---> 58 return ThresholdProcessMax(self._label_list, threshold=threshold)
59
60 def get_threshold_sum(self, threshold) -> ThresholdProcessSum:

AttributeError: 'Tensor' object has no attribute '_label_list'

@prajwal967
Collaborator

Hi,

Could you add the following lines of code:

# Import the respective classes from the respective locations

# Initialize labels
ner_labels = NERLabels(notation='BIO', ner_types=["PATIENT", "STAFF", "AGE", "DATE", "PHONE", "ID", "EMAIL", "PATORG", "LOC", "HOSP", "OTHERPHI"])
label_list = ner_labels.get_label_list()

# Get the post processing object
picker = PostProcessPicker(label_list=label_list)
# This creates an object of the threshold max class which you can use to process the predictions with the threshold
threshold_max = picker.get_threshold_max(threshold=1.8982457699258832e-06)

# Get the model predictions
outputs = model(**inputs)
predictions = outputs.logits

# There are two ways to process the predictions -
# Case 1: Get the predictions - no additional filtering
# The label list converts ids of labels back to string form
final_preds = [[label_list[threshold_max.process_prediction(p)] for p in prediction] for prediction in predictions]

# Case 2: Get the predictions - where we also pass a labels list (that can be used to ignore predictions at certain positions etc.)
# Use the pre-defined function
final_preds, final_labels = threshold_max.decode(predictions, labels)

Let us know if this piece of code did not work!

@joshpopelka20
Author

I'm not understanding this piece of code:

   # Case 2: Get the predictions - where we also pass a labels list(that can be used to ignore predictions at certain positions etc.)
   # Use the pre-defined function
   final_preds, final_labels = threshold_max.decode(predictions, labels)

What is the "labels" list supposed to be?

@prajwal967
Collaborator

We have an option to ignore the predictions for certain tokens, which can be specified via the labels argument. If we pass [O, NA, O, O, NA] as labels (assuming we have 5 tokens as input), the function ignores the predictions at positions 1 & 4 and returns [P0, P2, P3] (the predictions at the other three positions).

You don't need to use it; it's optional.
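In plain Python, the filtering behaves roughly like this (a sketch of the semantics described above, not the library's actual decode implementation; the prediction and label values are made up):

```python
# Hypothetical per-token predictions and a labels list where "NA"
# marks positions whose predictions should be ignored
predictions = ["P0", "P1", "P2", "P3", "P4"]
labels = ["O", "NA", "O", "O", "NA"]

# Keep only the predictions at positions not marked "NA"
kept = [p for p, l in zip(predictions, labels) if l != "NA"]
print(kept)  # -> ['P0', 'P2', 'P3']
```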

@joshpopelka20
Author

The decode method seems to require the labels list. I've tried to create a labels list with the same shape as the predictions tensor, but I'm getting a different error.

Code:

tensor_shape = torch.Size([1, 105, 45])
labels = [["O"] * tensor_shape[2] for _ in range(tensor_shape[1])]
final_preds, final_labels = threshold_max.decode(predictions, labels)

Error message:

/usr/local/lib/python3.10/dist-packages/numpy/ma/core.py in new(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
2904 msg = "Mask and data not compatible: data size is %i, " +
2905 "mask size is %i."
-> 2906 raise MaskError(msg % (nd, nm))
2907 copy = True
2908 # Set the mask to the new value

MaskError: Mask and data not compatible: data size is 45, mask size is 23.
