Code Refactoring #7
Conversation
Refactor models into separate classes
Updated the README
@prabha-git this is awesome - seriously so cool to see your support on this. Thank you! Two things I wanted to check in on:
I ask because 'sandwich' and 'dolores' are hard-coded, which will be awkward if they switch eval methods.
I fixed that but then got another error for Anthropic.
Do you mind double checking the Anthropic use case or adding a test we can run through quickly? btw once this is merged I'd love to buy you coffee or lunch if you're in SF
Updating the code based on feedback
@gkamradt - I would love to meet for a coffee or lunch, but I am based in Dallas. Let me know if you are in town 😄 ☕ Regarding the code updates:
I'm open to any further comments or thoughts you might have.
Added a check to ensure substring validation words are in the needle
Hey everyone, will this PR be merged?
This is an essential option.
I think your changes are very good and will move this project forward.
Consider my comments as ideas for future improvements 🙂
AnthropicEvaluator.py (outdated)
        return self.tokenizer.decode(encoded_context)

    def get_prompt(self, context):
        with open('Anthropic_prompt.txt', 'r') as file:
I noticed OpenAIEvaluator has values like this hardcoded. To maintain consistency you may consider also having those values in AnthropicEvaluator and removing Anthropic_prompt.txt.
Good point, I have updated the code to add the prompt directly under the get_prompt method.
Anthropic's Text Completion API is deprecated; their recommendation is to use the Messages API. I have updated the code to use the Messages API following their migration guide.
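For reference, a minimal sketch of what a Messages API call looks like with the official anthropic Python client (the model name, token limit, and environment-variable handling here are my assumptions, not necessarily what this PR ships):

```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def get_response_from_model(prompt: str) -> str:
    # Messages API call; the model name and max_tokens are illustrative defaults
    message = client.messages.create(
        model="claude-2.1",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```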
}
]
@abstractmethod
def get_prompt(self, context):
Perhaps you could consider merging get_prompt() and get_response_from_model() into a single function? I understand the need for precise timing measurements, but I believe get_prompt() is so fast that it doesn't matter. 🙂
Thank you for your suggestion regarding the merging of get_prompt() and get_response_from_model() into a single function. I appreciate the consideration given to the efficiency of the code. However, prompt engineering is a critical aspect of eliciting high-quality responses from Language Models, and the crafting of prompts can vary significantly across different models.
Maintaining a dedicated get_prompt function provides a clear and centralized location for prompt development and refinement. This separation not only ensures clarity in our codebase but also offers an organized approach for future updates and customizations specific to prompt formulation. For the time being, I believe that keeping get_prompt as an independent function is beneficial for both our current workflow and potential modifications that may arise as we continue to enhance our system's capabilities.
I hope this clarifies the rationale behind the current design choice. I'm open to further discussions on this matter as we progress and as the needs of our project evolve.
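To make the separation concrete, here is a minimal sketch of the split being defended above (the base-class name and exact signatures are illustrative, not the PR's exact code):

```python
from abc import ABC, abstractmethod


class ModelEvaluator(ABC):
    @abstractmethod
    def get_prompt(self, context: str):
        """Craft the provider-specific prompt around the given context."""

    @abstractmethod
    def get_response_from_model(self, prompt) -> str:
        """Send the prompt to the model and return the raw text response."""
```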
LLMNeedleHaystackTester.py (outdated)
print (f"Depth: {depth_percent}%")
print (f"Score: {score}")
print (f"Response: {response}\n")
print(f"-- Test Summary -- ")
Consider removing the f prefix here, since this string has no placeholders.
fixed.
LLMNeedleHaystackTester.py (outdated)
substr_validation_words=['dolores', 'sandwich'],
seconds_to_sleep_between_completions = None,
print_ongoing_status = True):
    """
Do we need all those arguments? Are they ever used by someone? 🙂 Consider moving them to constants.
Thanks for the review. The parameters you mentioned are part of the original framework, intended to offer runtime flexibility for various testing scenarios. This design choice supports dynamic adjustments without code changes, which is essential for testing strategy. While moving to constants could streamline the code, it would reduce this flexibility. I'll defer to @gkamradt for a final decision, given his foundational work on these features.
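For illustration, this is the kind of runtime override those constructor arguments allow (a hypothetical call assuming the class is importable from LLMNeedleHaystackTester; the alternative values are made up):

```python
from LLMNeedleHaystackTester import LLMNeedleHaystackTester

# Hypothetical run overriding the defaults at call time instead of editing constants
tester = LLMNeedleHaystackTester(
    substr_validation_words=['tokyo', 'ramen'],  # instead of ['dolores', 'sandwich']
    seconds_to_sleep_between_completions=2,
    print_ongoing_status=False,
)
```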
OpenAIEvaluator.py (outdated)
# Tons of defaults set, check out the LLMNeedleHaystackTester's init for more info
ht = OpenAIEvaluator(model_name='gpt-4-1106-preview', evaluation_method='gpt4')

ht.start_test()
Consider using pre-commit with end-of-file-fixer to fix warnings like this one.
LLMNeedleHaystackTester.py (outdated)
if self.save_contexts:
    results['file_name']: context_file_location

    # Save the context to file for retesting
    if not os.path.exists('contexts'):
Consider using pathlib; the code below could be rewritten to something like

from pathlib import Path

contexts_dir = Path('contexts')
contexts_dir.mkdir(parents=True, exist_ok=True)
context_file_path = contexts_dir / f'{context_file_location}_context.txt'
context_file_path.write_text(context)
Updated the code to use pathlib instead of os.path in the latest pull request.
LLMNeedleHaystackTester.py (outdated)
self.model_to_test_description = model_name
self.evaluation_model = ChatOpenAI(model="gpt-4", temperature=0, openai_api_key = self.openai_api_key)

if evaluation_method == 'gpt4':
I see 'gpt4' very often; maybe we can consider having a variable, for example OpenAIEvaluator.MODEL_NAME = 'gpt4'. This way we could easily add support for more models like GPT-4 Turbo by subclassing OpenAIEvaluator and setting a few variables.
Thank you for the feedback on the model specification. You're correct that the "Needle in Haystack" tester is designed to be versatile and accepts a range of models as runtime parameters. Users have the flexibility to test with various models such as GPT-3.5, GPT-4, or GPT-4 Turbo, catering to diverse testing needs without any hardcoded constraints.
Regarding the model response evaluation, we currently support two methods:
substring_match: A straightforward and effective method for simpler cases.
Utilizing GPT-4: Leveraging GPT-4's advanced capabilities for more complex evaluations.
I concur with the suggestion to avoid hardcoding the model name in the code. Moving towards defining the model as a configurable constant would indeed enhance code modularity and maintainability. This change will also align with our aim for greater flexibility and adaptability in the testing framework.
Created issue #10 to track this enhancement.
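A rough sketch of the subclassing idea tracked there (class and attribute names are illustrative only):

```python
class OpenAIEvaluator:
    # Single, configurable home for the evaluator model name instead of scattered 'gpt4' literals
    EVALUATION_MODEL_NAME = "gpt-4"


class GPT4TurboEvaluator(OpenAIEvaluator):
    # Supporting a newer model only requires overriding a class-level constant
    EVALUATION_MODEL_NAME = "gpt-4-1106-preview"
```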
pass

@abstractmethod
def get_decoding(self, encoded_context):
I'm afraid this approach may not work for Gemini models, because we do not have a public tokenizer for them.
Although Gemini's tokenizer isn't public, our use case for tokenization (identifying the fact placement depth) is effectively served by public tokenizers like tiktoken. This preserves our test integrity without a dependency on Google's tokenizer.
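A minimal sketch of that depth-placement use of a public tokenizer (the function name and signature are assumptions; the real tester may handle extra details such as sentence boundaries):

```python
import tiktoken

def insert_needle_at_depth(context: str, needle: str, depth_percent: float) -> str:
    """Place the needle at roughly depth_percent of the context, measured in tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # public tokenizer, independent of the model under test
    context_tokens = enc.encode(context)
    insertion_point = int(len(context_tokens) * depth_percent / 100)
    new_tokens = (
        context_tokens[:insertion_point]
        + enc.encode(needle)
        + context_tokens[insertion_point:]
    )
    return enc.decode(new_tokens)
```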
AnthropicEvaluator.py (outdated)
if 'model_name' not in kwargs:
    raise ValueError("model_name must be supplied with init")
elif "claude" not in kwargs['model_name']:
Consider consistently using either " or '. We may also consider using Black to fix those inconsistencies automatically.
Thank you for your valuable feedback and for considering my changes positively! I'm eager to implement these ideas. Would you prefer I incorporate these improvements into the current pull request, or should I address them in a subsequent one?
It's up to @gkamradt to decide, I'm nobody here 🙂 Personally, I would:
Hey crew! Just wanted to give an update from my end - I'm so stoked about the energy around this PR. I recently posted a tweet for a co-maintainer on this project so we can take it to the next level. Expect a note from me within the next week on next steps. Progress will be made!
Hi @prabha-git, we had an internal discussion regarding #8 and this PR. Both PRs were taking conflicting approaches to solving the issue of separating model interactions. While you had chosen inheritance, #8 opted for composition, which we think is better suited for future project needs. Since you have done a lot of work on this PR, we would appreciate it if you could address some of the issues that are essentially tasks you have already completed in this PR:
Sorry for the confusion. We are looking forward to your contributions to the issues I mentioned above.
Ok, np, that sounds good. I will start with #13. Let's close this pull request.
Thank you for looking into that issue.
Separation of code by model provider. This allows adding more providers without a complex if-else condition check.
Added simple evaluation scoring based on substring match. With the default simple question and answer, I am checking for the words 'sandwich' and 'dolores' in the response. Scores would be 0 or 1. The default is still gpt4; we can use this method to reduce API cost.
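A minimal sketch of what that substring-based scoring could look like (the function name and exact matching rules are assumptions, not the PR's code verbatim):

```python
def substring_match_score(response: str, validation_words: list[str]) -> int:
    """Return 1 if every validation word appears in the response (case-insensitive), else 0."""
    response_lower = response.lower()
    return int(all(word.lower() in response_lower for word in validation_words))

# With the defaults described above:
# substring_match_score("I sat in Dolores Park eating a sandwich", ["dolores", "sandwich"]) -> 1
```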