
Stuck on reading large datasets #987

Open
dothe-Best opened this issue Nov 16, 2024 · 4 comments

Comments

@dothe-Best


When I use GPT Researcher in hybrid mode to read a large local dataset of several hundred MB, it starts reading. Activity Monitor on my Mac shows the python3.13 process using more than 60 GB of memory, so it is clearly processing the dataset. After about twenty minutes the memory is released, which I take to mean the read has finished, but the UI stays stuck on the "reading the dataset" step and never advances, no matter how many hours I wait. What could be causing this?

@danieldekay
Contributor

Can you share a bit more of what you are doing and what you want to achieve?

@dothe-Best
Author

I want to build a network from the dataset for network analysis. I'd like gptr to read the dataset correctly and give me some ideas based on it.


@danieldekay
Contributor

@assafelovic - I was out for a while. Do we expect GPTR to deal with numerical datasets like this?

@ElishaKay
Collaborator

ElishaKay commented Nov 20, 2024

Welcome @dothe-Best

It sounds like you'll want to set up a separate process for data ingestion.

GPTR uses LangChain Documents and LangChain VectorStores under the hood.

The flow would be:

Step 1: transform your content into Langchain Documents

Step 2: Insert your Langchain Documents into your Langchain VectorStore

Step 3: Pass your Langchain Vectorstore into your GPTR report run (more examples here and below)

Note: if your embedding model is hitting API rate limits, or the database backing your LangChain VectorStore needs to pace itself, you can handle that in your Python code.

In the example below, we're splitting the documents list into chunks of 100 documents and inserting one chunk at a time into the vector store (a paced variant is sketched right after the Step 2 code).

Code samples below:

Assuming your .env variables are like so:

OPENAI_API_KEY={Your OpenAI API Key here}
TAVILY_API_KEY={Your Tavily API Key here}

PGVECTOR_CONNECTION_STRING=postgresql://username:password...
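
The snippets below read these values with os.environ. A minimal sketch for loading them from the .env file (this assumes the python-dotenv package; any other way of exporting the variables works too):

import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY, TAVILY_API_KEY, PGVECTOR_CONNECTION_STRING from .env
load_dotenv()

assert "PGVECTOR_CONNECTION_STRING" in os.environ, "connection string missing from .env"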

Step 1:

import base64
import os
from datetime import datetime
from uuid import uuid4

from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Excerpted from a class that holds a GitHub repo handle (self.repo) and branch name (self.branch_name)
async def transform_to_langchain_docs(self, directory_structure):
    documents = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
    run_timestamp = datetime.utcnow().strftime('%Y%m%d%H%M%S')

    for file_name in directory_structure:
        if not file_name.endswith('/'):
            try:
                content = self.repo.get_contents(file_name, ref=self.branch_name)
                try:
                    decoded_content = base64.b64decode(content.content).decode()
                except Exception as e:
                    print(f"Error decoding content: {e}")
                    print("the problematic file_name is", file_name)
                    continue
                print("file_name", file_name)
                print("content", decoded_content)

                # Split each document into smaller chunks
                chunks = splitter.split_text(decoded_content)

                # Attach metadata to each chunk
                for index, chunk in enumerate(chunks):
                    metadata = {
                        "id": f"{run_timestamp}_{uuid4()}",  # unique ID for each chunk
                        "source": file_name,
                        "title": file_name,
                        "extension": os.path.splitext(file_name)[1],
                        "file_path": file_name
                    }
                    document = Document(
                        page_content=chunk,
                        metadata=metadata
                    )
                    documents.append(document)

            except Exception as e:
                print(f"Error processing {file_name}: {e}")
                return None

    await self.save_to_vector_store(documents)
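
The snippet above pulls files from a GitHub repo. Since your dataset is local, a roughly equivalent sketch could walk a directory on disk instead; the directory path, encoding handling, and function name here are assumptions, not part of GPTR:

import os
from datetime import datetime
from uuid import uuid4

from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hypothetical local-files variant of Step 1
async def transform_local_files_to_langchain_docs(dataset_dir: str):
    documents = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
    run_timestamp = datetime.utcnow().strftime('%Y%m%d%H%M%S')

    for root, _, files in os.walk(dataset_dir):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            try:
                with open(file_path, encoding="utf-8", errors="ignore") as f:
                    content = f.read()
            except OSError as e:
                print(f"Error reading {file_path}: {e}")
                continue

            # Same chunking and metadata scheme as the GitHub-based version above
            for chunk in splitter.split_text(content):
                documents.append(Document(
                    page_content=chunk,
                    metadata={
                        "id": f"{run_timestamp}_{uuid4()}",
                        "source": file_path,
                        "title": file_name,
                        "extension": os.path.splitext(file_name)[1],
                        "file_path": file_path,
                    },
                ))

    return documents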

Step 2:

import os

from langchain_postgres import PGVector
from sqlalchemy.ext.asyncio import create_async_engine  # used in Step 3

from langchain_community.embeddings import OpenAIEmbeddings  # newer LangChain versions: from langchain_openai import OpenAIEmbeddings

async def save_to_vector_store(self, documents):
    # The documents are already Document objects, so we don't need to convert them
    embeddings = OpenAIEmbeddings()
    # self.vector_store = FAISS.from_documents(documents, embeddings)
    pgvector_connection_string = os.environ["PGVECTOR_CONNECTION_STRING"]

    collection_name = "my_docs"

    vector_store = PGVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=pgvector_connection_string,
        use_jsonb=True
    )

    # for faiss
    # self.vector_store = vector_store.add_documents(documents, ids=[doc.metadata["id"] for doc in documents])

    # Split the documents list into chunks of 100
    for i in range(0, len(documents), 100):
        chunk = documents[i:i+100]
        # Insert the chunk into the vector store
        vector_store.add_documents(chunk, ids=[doc.metadata["id"] for doc in chunk])
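
If the embedding API or the database needs pacing (per the note above), a minimal sketch is to sleep between batches. The batch size and delay here are arbitrary assumptions; tune them for your provider:

import asyncio

# Paced variant of the insert loop above
async def save_to_vector_store_paced(vector_store, documents, batch_size=100, delay_seconds=1.0):
    for i in range(0, len(documents), batch_size):
        chunk = documents[i:i + batch_size]
        vector_store.add_documents(chunk, ids=[doc.metadata["id"] for doc in chunk])
        await asyncio.sleep(delay_seconds)  # give the embedding API / DB time between batches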

Step 3:

# Continues from Step 2: embeddings, collection_name, and pgvector_connection_string are reused here
from gpt_researcher import GPTResearcher

# Swap in the async psycopg3 driver for the connection string
async_connection_string = pgvector_connection_string.replace("postgresql://", "postgresql+psycopg://")

# Initialize the async engine with the psycopg3 driver
async_engine = create_async_engine(
    async_connection_string,
    echo=True
)

async_vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=async_engine,
    use_jsonb=True
)


researcher = GPTResearcher(
    query=query,  # your research question
    report_type="research_report",
    report_source="langchain_vectorstore",
    vector_store=async_vector_store,
)
await researcher.conduct_research()
report = await researcher.write_report()
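
Since the snippets above use await, they need to run inside an event loop. A minimal driver, assuming you adapt the Step 1/Step 2 methods into standalone functions and move the Step 3 block into main() (the names below are just the ones from this thread):

import asyncio

async def main():
    # Step 1 already calls save_to_vector_store at the end
    await transform_to_langchain_docs(directory_structure)  # directory_structure: your file listing
    # ... Step 3: build async_vector_store, run GPTResearcher, print the report ...

asyncio.run(main())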
