Code Refactoring #7
Conversation
Refactor models into separate classes
Updated the README
@prabha-git this is awesome - seriously so cool to see your support on this. Thank you! Two things I wanted to check in on:
I ask because 'sandwich' and 'dolores' are hard-coded, which will be awkward if they switch eval methods.
I fixed that but then got another error for Anthropic.
Do you mind double checking the Anthropic use case or adding a test we can run through quickly? btw once this is merged I'd love to buy you coffee or lunch if you're in SF
Updating the code based on feedback
@gkamradt - I would love to meet for a coffee or lunch, but I am based in Dallas. Let me know if you are in town 😄 ☕ Regarding the code updates:
I'm open to any further comments or thoughts you might have.
Added a check to ensure substring validation words are in the needle
Hey everyone, will this PR be merged?
This is an essential option.
I think your changes are very good and will move this project forward.
Consider my comments as ideas for future improvements 🙂
AnthropicEvaluator.py (outdated)
        return self.tokenizer.decode(encoded_context)

    def get_prompt(self, context):
        with open('Anthropic_prompt.txt', 'r') as file:
I noticed OpenAIEvaluator has values like this hardcoded. To maintain consistency you may consider also having those values in AnthropicEvaluator and removing Anthropic_prompt.txt.
Good point, I have updated the code to add the prompt directly under the get_prompt method.
Anthropic's Text Completion API is deprecated; their recommendation is to use the Messages API. I have updated the code to use the Messages API following their migration guide.
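For reference, a minimal sketch of what a Messages API call looks like with the official anthropic Python client (the model name, token limit, and environment-variable handling here are my assumptions, not necessarily what this PR ships):

```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def get_response_from_model(prompt: str) -> str:
    # Messages API call; the model name and max_tokens are illustrative defaults
    message = client.messages.create(
        model="claude-2.1",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```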
}
]
@abstractmethod
def get_prompt(self, context):
Perhaps you could consider merging get_prompt() and get_response_from_model() into a single function? I understand the need for precise timing measurements, but I believe get_prompt() is so fast that it doesn't matter. 🙂
Thank you for your suggestion regarding the merging of get_prompt() and get_response_from_model() into a single function. I appreciate the consideration given to the efficiency of the code. However, prompt engineering is a critical aspect of eliciting high-quality responses from Language Models, and the crafting of prompts can vary significantly across different models.
Maintaining a dedicated get_prompt function provides a clear and centralized location for prompt development and refinement. This separation not only ensures clarity in our codebase but also offers an organized approach for future updates and customizations specific to prompt formulation. For the time being, I believe that keeping get_prompt as an independent function is beneficial for both our current workflow and potential modifications that may arise as we continue to enhance our system's capabilities.
I hope this clarifies the rationale behind the current design choice. I'm open to further discussions on this matter as we progress and as the needs of our project evolve.
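To make the separation concrete, here is a minimal sketch of the split being defended above (the base-class name and exact signatures are illustrative, not the PR's exact code):

```python
from abc import ABC, abstractmethod


class ModelEvaluator(ABC):
    @abstractmethod
    def get_prompt(self, context: str):
        """Craft the provider-specific prompt around the given context."""

    @abstractmethod
    def get_response_from_model(self, prompt) -> str:
        """Send the prompt to the model and return the raw text response."""
```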
LLMNeedleHaystackTester.py (outdated)
print (f"Depth: {depth_percent}%")
print (f"Score: {score}")
print (f"Response: {response}\n")
print(f"-- Test Summary -- ")
Consider removing the f prefix here, since this string has no placeholders.
fixed.
LLMNeedleHaystackTester.py (outdated)
substr_validation_words=['dolores', 'sandwich'],
seconds_to_sleep_between_completions = None,
print_ongoing_status = True):
    """
Do we need all those arguments? Are they ever used by someone? 🙂 Consider moving them to constants.
Thanks for the review. The parameters you mentioned are part of the original framework, intended to offer runtime flexibility for various testing scenarios. This design choice supports dynamic adjustments without code changes, which is essential for testing strategy. While moving to constants could streamline the code, it would reduce this flexibility. I'll defer to @gkamradt for a final decision, given his foundational work on these features.
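For illustration, this is the kind of runtime override those constructor arguments allow (a hypothetical call assuming the class is importable from LLMNeedleHaystackTester; the alternative values are made up):

```python
from LLMNeedleHaystackTester import LLMNeedleHaystackTester

# Hypothetical run overriding the defaults at call time instead of editing constants
tester = LLMNeedleHaystackTester(
    substr_validation_words=['tokyo', 'ramen'],  # instead of ['dolores', 'sandwich']
    seconds_to_sleep_between_completions=2,
    print_ongoing_status=False,
)
```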
OpenAIEvaluator.py (outdated)
# Tons of defaults set, check out the LLMNeedleHaystackTester's init for more info
ht = OpenAIEvaluator(model_name='gpt-4-1106-preview', evaluation_method='gpt4')

ht.start_test()
Consider using pre-commit with end-of-file-fixer to fix warnings like this one.
LLMNeedleHaystackTester.py (outdated)
if self.save_contexts:
    results['file_name']: context_file_location

    # Save the context to file for retesting
    if not os.path.exists('contexts'):
Consider using pathlib; the code below could be rewritten to something like

from pathlib import Path

contexts_dir = Path('contexts')
contexts_dir.mkdir(parents=True, exist_ok=True)
context_file_path = contexts_dir / f'{context_file_location}_context.txt'
context_file_path.write_text(context)
Updated the code to use pathlib instead of os.path in the latest pull request.
LLMNeedleHaystackTester.py (outdated)
self.model_to_test_description = model_name
self.evaluation_model = ChatOpenAI(model="gpt-4", temperature=0, openai_api_key = self.openai_api_key)

if evaluation_method == 'gpt4':
I see 'gpt4' very often; maybe we can consider having a variable, for example OpenAIEvaluator.MODEL_NAME = 'gpt4'. This way we could easily add support for more models like GPT-4 Turbo by subclassing OpenAIEvaluator and setting a few variables.
Thank you for the feedback on the model specification. You're correct that the "Needle in Haystack" tester is designed to be versatile and accepts a range of models as runtime parameters. Users have the flexibility to test with various models such as GPT-3.5, GPT-4, or GPT-4 Turbo, catering to diverse testing needs without any hardcoded constraints.
Regarding the model response evaluation, we currently support two methods:
substring_match: A straightforward and effective method for simpler cases.
Utilizing GPT-4: Leveraging GPT-4's advanced capabilities for more complex evaluations.
I concur with the suggestion to avoid hardcoding the model name in the code. Moving towards defining the model as a configurable constant would indeed enhance code modularity and maintainability. This change will also align with our aim for greater flexibility and adaptability in the testing framework.
Created issue #10 to track this enhancement.
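A rough sketch of the subclassing idea tracked there (class and attribute names are illustrative only):

```python
class OpenAIEvaluator:
    # Single, configurable home for the evaluator model name instead of scattered 'gpt4' literals
    EVALUATION_MODEL_NAME = "gpt-4"


class GPT4TurboEvaluator(OpenAIEvaluator):
    # Supporting a newer model only requires overriding a class-level constant
    EVALUATION_MODEL_NAME = "gpt-4-1106-preview"
```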
pass

@abstractmethod
def get_decoding(self, encoded_context):
I'm afraid this approach may not work for Gemini models, because we do not have a public tokenizer for them.
Although Gemini's tokenizer isn't public, our use case for tokenization (identifying the fact placement depth) is effectively served by public tokenizers like tiktoken. This preserves our test integrity without a dependency on Google's tokenizer.
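A minimal sketch of that depth-placement use of a public tokenizer (the function name and signature are assumptions; the real tester may handle extra details such as sentence boundaries):

```python
import tiktoken

def insert_needle_at_depth(context: str, needle: str, depth_percent: float) -> str:
    """Place the needle at roughly depth_percent of the context, measured in tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # public tokenizer, independent of the model under test
    context_tokens = enc.encode(context)
    insertion_point = int(len(context_tokens) * depth_percent / 100)
    new_tokens = (
        context_tokens[:insertion_point]
        + enc.encode(needle)
        + context_tokens[insertion_point:]
    )
    return enc.decode(new_tokens)
```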
AnthropicEvaluator.py (outdated)
if 'model_name' not in kwargs:
    raise ValueError("model_name must be supplied with init")
elif "claude" not in kwargs['model_name']:
Consider consistently using either " or '. We may also consider using Black to fix those inconsistencies automatically.
Thank you for your valuable feedback and for considering my changes positively! I'm eager to implement these ideas. Would you prefer I incorporate these improvements into the current pull request, or should I address them in a subsequent one?
It's up to @gkamradt to decide, I'm nobody here 🙂 Personally, I would:
Hey crew! Just wanted to give an update from my end - I'm so stoked about the energy around this PR. I recently posted a tweet for a co-maintainer on this project so we can take it to the next level. Expect a note from me within the next week on next steps. Progress will be made!
Hi @prabha-git, we had an internal discussion regarding #8 and this PR. Both PRs were taking conflicting approaches to solving the issue of separating model interactions. While you had chosen inheritance, #8 opted for composition, which we think is better suited for future project needs. Since you have done a lot of work on this PR, we would appreciate it if you could address some of the issues that are essentially tasks you have already completed in this PR:
Sorry for the confusion. We are looking forward to your contributions to the issues I mentioned above.
Ok, np, that sounds good. I will start with #13. Let's close this pull request.
Thank you for looking into that issue.
Separation of code by model provider. This allows adding more providers without a complex if-else condition check.
Added simple evaluation scoring based on substring match. With the default simple question and answer, I am checking for the words 'sandwich' and 'dolores' in the response. Scores would be 0 or 1. The default is still gpt4; we can use this method to reduce API cost.
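A minimal sketch of what that substring-based scoring could look like (the function name and exact matching rules are assumptions, not the PR's code verbatim):

```python
def substring_match_score(response: str, validation_words: list[str]) -> int:
    """Return 1 if every validation word appears in the response (case-insensitive), else 0."""
    response_lower = response.lower()
    return int(all(word.lower() in response_lower for word in validation_words))

# With the defaults described above:
# substring_match_score("I sat in Dolores Park eating a sandwich", ["dolores", "sandwich"]) -> 1
```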