
Code Refactoring #7

Closed
prabha-git wants to merge 18 commits

Conversation

prabha-git (Contributor)

  1. Separation of code by model provider. This allows adding more providers without a complex if-else chain.

  2. Added simple evaluation scoring based on substring matching. With the default question and answer, I check for the words 'sandwich' and 'dolores' in the response; the score is either 0 or 1. The default evaluation method is still GPT-4, and this substring method can be used to reduce API cost.
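
A minimal sketch of the substring-match scoring idea (function and parameter names here are illustrative, not necessarily the ones used in the PR):

def evaluate_response_substring_match(response, substr_validation_words=('dolores', 'sandwich')):
    # Score 1 if every expected word appears in the model response, else 0.
    response_lower = response.lower()
    return int(all(word.lower() in response_lower for word in substr_validation_words))

# e.g. evaluate_response_substring_match("Eat a sandwich in Dolores Park on a sunny day") -> 1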

gkamradt (Owner) commented Jan 4, 2024

@prabha-git this is awesome - seriously so cool to see your support on this. Thank you!

Two things I wanted to check in on:

  1. What do you think about a warning if the user changes the default needle and selects a non-GPT-4 evaluation method? Should we make it an enum (rather than comparing against "GPT-4" or anything else)?
    8938e86#diff-05a09c80927a7bafbb8521c43d34eb8398759f066eb0e1a75a609873aaa82116R332

I ask because 'sandwich' and 'dolores' are hard-coded, which will be awkward if they switch eval methods.

  2. I ran this for OpenAI and it ran fine, then I tried Anthropic and got an error from a .pop() typo
    8938e86#diff-d2cfe243e36bf6df56a149432841389e7c611f2960763812f6c7b2b2ac154dffR31

I fixed that but then got another error for Anthropic

TypeError: LLMNeedleHaystackTester.__init__() got an unexpected keyword argument 'model_provider'

Do you mind double checking the anthropic use case or adding a test we can run through quickly?

btw once this is merged I'd love to buy you coffee or lunch if you're in SF

prabha-git (Contributor, Author) commented Jan 9, 2024

@gkamradt - I would love to meet for a coffee or lunch, but I am based in Dallas. Let me know if you are in town 😄 ☕

Regarding the code updates:

  1. I've introduced a new parameter called substr_validation_words, which defaults to ['dolores', 'sandwich']. Additionally, I've implemented a check that raises an exception if any of the words in substr_validation_words are missing from the needle (a rough sketch of this check follows this list).

  2. I've addressed the error you pointed out in the Anthropic code. Unfortunately, I haven't been able to run a full test myself as I'm still awaiting an API key. Could you run it on your end to confirm everything is functioning correctly?
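
The check described in point 1 amounts to something like the following sketch (names are illustrative rather than the PR's exact code):

def validate_needle_contains_words(needle, substr_validation_words):
    # Raise early if the substring-match evaluation could never score 1 for this needle.
    missing = [w for w in substr_validation_words if w.lower() not in needle.lower()]
    if missing:
        raise ValueError(
            f"substr_validation_words {missing} do not appear in the needle; "
            "substring-match evaluation would always score 0."
        )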

I'm open to any further comments or thoughts you might have.

@prabha-git prabha-git marked this pull request as draft January 17, 2024 22:31
@prabha-git prabha-git marked this pull request as ready for review January 17, 2024 22:57
geronimi73

hey everyone, will this PR be merged?

justHungryMan

This is an essential option.

pavelkraleu (Collaborator) left a comment

I think your changes are very good and will move this project forward.
Consider my comments as ideas for future improvements 🙂

return self.tokenizer.decode(encoded_context)

def get_prompt(self, context):
    with open('Anthropic_prompt.txt', 'r') as file:
pavelkraleu (Collaborator)

I noticed OpenAIEvaluator has values like this hardcoded. To maintain consistency you may consider also having those values in AnthropicEvaluator and removing Anthropic_prompt.txt

prabha-git (Author)

Good point, I have updated the code to add the prompt directly under the get_prompt method.

The Text Completions API in Anthropic is deprecated; their recommendation is to use the Messages API. I have updated the code to use the Messages API following their migration guide.
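
For reference, a Messages API call has roughly this shape (a hedged sketch, not the PR's exact code; the model name and prompt are placeholders):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = "Here is a document: <context>...</context> What is the best thing to do in San Francisco?"

message = client.messages.create(
    model="claude-2.1",  # placeholder model name
    max_tokens=300,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)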

}
]
@abstractmethod
def get_prompt(self, context):
pavelkraleu (Collaborator)

Perhaps you could consider merging get_prompt() and get_response_from_model() into a single function?

I understand the need for precise timing measurements, but I believe get_prompt() is so fast that it doesn't matter. 🙂

prabha-git (Author)

Thank you for the suggestion to merge get_prompt() and get_response_from_model() into a single function. However, prompt engineering is a critical part of eliciting high-quality responses from language models, and prompt construction can vary significantly across models.

Keeping a dedicated get_prompt function gives us a clear, centralized place for prompt development and refinement. This separation keeps the codebase organized and makes future prompt-specific updates and customizations easier.

I hope this clarifies the rationale behind the current design. I'm open to further discussion as the needs of the project evolve.

print (f"Depth: {depth_percent}%")
print (f"Score: {score}")
print (f"Response: {response}\n")
print(f"-- Test Summary -- ")
pavelkraleu (Collaborator)

Consider removing the f prefix; this string has no placeholders.

prabha-git (Author)

Fixed.

seconds_to_sleep_between_completions = None,
print_ongoing_status = True):
"""
substr_validation_words=['dolores', 'sandwich'],
pavelkraleu (Collaborator)

Do we need all those arguments? Are they ever used by anyone? 🙂

Consider moving them to constants.

prabha-git (Author)

Thanks for the review. The parameters you mentioned are part of the original framework, intended to offer runtime flexibility for various testing scenarios. This design choice supports dynamic adjustments without code changes, which is essential for testing strategy. While moving to constants could streamline the code, it would reduce this flexibility. I'll defer to @gkamradt for a final decision, given his foundational work on these features.

# Tons of defaults set, check out the LLMNeedleHaystackTester's init for more info
ht = OpenAIEvaluator(model_name='gpt-4-1106-preview', evaluation_method='gpt4')

ht.start_test()
pavelkraleu (Collaborator)

Consider using pre-commit with end-of-file-fixer to fix warnings like this one.

prabha-git (Author)

Thanks for the suggestion; added trailing-whitespace and end-of-file-fixer to pre-commit.



if self.save_contexts:
results['file_name'] : context_file_location
results['file_name']: context_file_location

# Save the context to file for retesting
if not os.path.exists('contexts'):
pavelkraleu (Collaborator)

Consider using pathlib, code below could be rewritten to something like

contexts_dir = Path('contexts')
contexts_dir.mkdir(parents=True, exist_ok=True)

context_file_path = contexts_dir / f'{context_file_location}_context.txt'
context_file_path.write_text(context)

prabha-git (Author)

Updated the code to use pathlib instead of os.path in the latest pull request.

self.model_to_test_description = model_name
self.evaluation_model = ChatOpenAI(model="gpt-4", temperature=0, openai_api_key = self.openai_api_key)

if evaluation_method == 'gpt4':
pavelkraleu (Collaborator)

I see 'gpt4' very often; maybe we can consider having a variable, for example OpenAIEvaluator.MODEL_NAME = 'gpt4'.
This way we could easily add support for more models like GPT-4 Turbo by subclassing OpenAIEvaluator and setting a few variables.
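
Roughly, the suggestion is something like this (class and attribute names are illustrative, not from the PR):

class OpenAIEvaluator:
    MODEL_NAME = "gpt-4"  # evaluation model as a class-level constant

    def __init__(self, **kwargs):
        self.evaluation_model_name = kwargs.get("evaluation_model", self.MODEL_NAME)


class OpenAIGPT4TurboEvaluator(OpenAIEvaluator):
    MODEL_NAME = "gpt-4-1106-preview"  # subclass overrides one constant to switch models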

prabha-git (Author)

Thank you for the feedback on the model specification. You're correct that the "Needle in Haystack" tester is designed to be versatile and accepts a range of models as runtime parameters. Users have the flexibility to test with various models such as GPT-3.5, GPT-4, or GPT-4 Turbo, catering to diverse testing needs without any hardcoded constraints.

Regarding model response evaluation, we currently support two methods:

  - substring_match: a straightforward and effective method for simpler cases.
  - GPT-4: leveraging GPT-4's advanced capabilities for more complex evaluations.

I concur with the suggestion to avoid hardcoding the model name in the code. Defining the model as a configurable constant would enhance modularity and maintainability, and aligns with our aim for greater flexibility and adaptability in the testing framework.

Created issue #10 to track this enhancement.

pass

@abstractmethod
def get_decoding(self, encoded_context):
pavelkraleu (Collaborator)

I'm afraid this approach may not work for Gemini models, because we do not have a public tokenizer for them.

prabha-git (Author)

Although Gemini's tokenizer isn't public, our use case for tokenization (identifying the depth at which the fact is placed) is served effectively by public tokenizers like tiktoken. This preserves the integrity of the test without depending on Google's tokenizer.
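
For illustration, depth placement only needs a consistent way to count and split tokens, which a public tokenizer such as tiktoken provides (a sketch under my own naming assumptions, not the repository's actual helper):

import tiktoken

def insert_needle_at_depth(context, needle, depth_percent, encoding_name="cl100k_base"):
    # Insert the needle so it sits roughly depth_percent% into the context by token count.
    enc = tiktoken.get_encoding(encoding_name)
    context_tokens = enc.encode(context)
    insertion_point = int(len(context_tokens) * depth_percent / 100)
    new_tokens = context_tokens[:insertion_point] + enc.encode(needle) + context_tokens[insertion_point:]
    return enc.decode(new_tokens)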


if 'model_name' not in kwargs:
    raise ValueError("model_name must be supplied with init")
elif "claude" not in kwargs['model_name']:
pavelkraleu (Collaborator)

Consider using either " or ' consistently. We may also consider using Black to fix those inconsistencies automatically.

prabha-git (Author)

Fixed the inconsistencies.


prabha-git (Contributor, Author)

> I think your changes are very good and will move this project forward. Consider my comments as ideas for future improvements 🙂

Thank you for your valuable feedback and for considering my changes positively! I'm eager to implement these ideas. Would you prefer I incorporate these improvements into the current pull request, or should I address them in a subsequent one? @pavelkraleu

pavelkraleu (Collaborator)

> Would you prefer I incorporate these improvements into the current pull request, or should I address them in a subsequent one?

It's up to @gkamradt to decide; I'm nobody here 🙂

Personally, I would:

  1. Merge the PR as it is, assuming it does not break anything (I have not run the code)
  2. Create Issues for my comments, if you agree with them
  3. Go through Issues one by one

gkamradt (Owner)

Hey crew! Just wanted to give an update from my end - I'm so stoked about the energy around this PR.

I recently posted a tweet for a co-maintainer on this project so we can take it to the next level.

Expect a note from me within the next week on next steps.

Progress will be made!

pavelkraleu (Collaborator)

Hi @prabha-git, we had an internal discussion regarding #8 and this PR. Both PRs were taking conflicting approaches to solve the issue of separation of model interactions.

While you had chosen inheritance, #8 opted for composition, which we think is better suited for future project needs.

Since you have done a lot of work on this PR, we would appreciate it if you could address some of the issues that are essentially tasks you have already completed in this PR:

Sorry for the confusion. We are looking forward to your contributions to the issues I mentioned above.

prabha-git (Contributor, Author)

Ok, np, that sounds good. I will start with #13. Let's close this pull request.

pavelkraleu (Collaborator)

Thank you for looking into that Issue.
Sorry about those complications. I appreciate your efforts in this PR.

pavelkraleu closed this Mar 5, 2024