-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable LLM-Driven Data Exploration with Presets and 3W Integration #55
Open
zRafaF
wants to merge
14
commits into
petrobras:main
Choose a base branch
from
SorveteGalera:llm
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
* fixed relative imports * fixed relative imports * removed debbug prints
- Refactor the LLM class to improve readability and maintainability. - Add support for JSON tool schema in the `chat_completion_json_tool` method. - Update the default value of `max_tokens` parameter in the `chat_completion` method. - Add a new method `parse_chat_completion` to parse the chat completion response. Fix relative imports in LLM module - Fix relative imports in the `__init__.py` file of the LLM module. Add functionality to find linked column in three_w module - Add a new function `three_w_find_linked_column` in the `presets.py` file of the LLM module. - Modify the function to accept a JSON dataset and find the column linked to the error. Add functionality to format dataset for LLM prediction - Add a new function `format_for_llm_prediction` in the `tools.py` file of the three_w module. - Modify the function to format the dataset for LLM prediction based on the provided configuration file. Update formatting and documentation in tools.py - Update formatting and documentation in the `tools.py` file of the three_w module.
zRafaF
changed the title
Adding LLM client support to BibMon, implemented 3W integration for finding columns of interest
Enable LLM-Driven Data Exploration with Presets and 3W Integration
Oct 18, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
This PR lays the groundwork for integrating Large Language Models (LLMs) into BibMon. It introduces a client that allows users to interact with endpoints for data processing and exploration. As a starting point, we have implemented access to the 3W dataset, enabling the model to infer the most relevant column based on the provided data.
Data tailoring is handled through what we call presets, which are located in
bibmon/llm/presets
. These presets allow users to customize how data is structured before being sent to the model.Limitations
This feature acts purely as a client, meaning it requires an external endpoint for model interaction. For instance, you can use OpenAI's API or self-host an alternative model, as we have done.
Direct LLM inference within BibMon could be achieved using tools like llama-cpp-python or similar. However, this approach was avoided to prevent unnecessary complexity and bloat in the library.
While fine-tuning the LLM and creating more precise instructions for a dataset is possible, it requires a detailed data annotation process. Additional information on this can be found in our auxiliary notebook.
Note: This PR is dependent on #50.
Implementation Details (3W Dataset Integration)
Data Preset
The data sent to the model follows this structure:
Model Response Format
The model will respond with the following structure:
Usage Example
Additional Resources
For further information and examples, please refer to our detailed notebook showcasing this feature.