Stuck on reading large datasets #987
Comments
Can you share a bit more about what you are doing and what you want to achieve?

I want to build a network from the dataset for network analysis. I want GPTR to read the dataset correctly and give me some ideas based on it.

@assafelovic - I was out for a while. Do we expect GPTR to deal with numerical datasets like this?

Welcome @dothe-Best

It sounds like you'll want to set up a separate process for data ingestion. GPTR uses Langchain Documents and Langchain VectorStores under the hood. The flow would be:

Step 1: Transform your content into Langchain Documents.
Step 2: Insert your Langchain Documents into your Langchain VectorStore.
Step 3: Pass your Langchain VectorStore into your GPTR report run (more examples here and below).

Note: if your embedding model is running into API limits, or the DB behind your Langchain VectorStore needs to pace itself, you can handle that in your Python code. In the example below, we split the documents list into chunks of 100 and then insert one chunk at a time into the vector store.

Code samples below, assuming your .env variables look like this:

OPENAI_API_KEY={Your OpenAI API Key here}
TAVILY_API_KEY={Your Tavily API Key here}
PGVECTOR_CONNECTION_STRING=postgresql://username:password...

Step 1:

import base64
import os
from datetime import datetime
from uuid import uuid4

from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

async def transform_to_langchain_docs(self, directory_structure):
    documents = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
    run_timestamp = datetime.utcnow().strftime('%Y%m%d%H%M%S')

    for file_name in directory_structure:
        # Entries ending in '/' are directories, so skip them
        if not file_name.endswith('/'):
            try:
                content = self.repo.get_contents(file_name, ref=self.branch_name)
                try:
                    decoded_content = base64.b64decode(content.content).decode()
                except Exception as e:
                    print(f"Error decoding content: {e}")
                    print("the problematic file_name is", file_name)
                    continue
                print("file_name", file_name)
                print("content", decoded_content)

                # Split each document into smaller chunks
                chunks = splitter.split_text(decoded_content)

                # Extract metadata for each chunk
                for index, chunk in enumerate(chunks):
                    metadata = {
                        "id": f"{run_timestamp}_{uuid4()}",  # Generate a unique UUID for each document
                        "source": file_name,
                        "title": file_name,
                        "extension": os.path.splitext(file_name)[1],
                        "file_path": file_name
                    }
                    document = Document(
                        page_content=chunk,
                        metadata=metadata
                    )
                    documents.append(document)
            except Exception as e:
                # Nothing has been saved yet at this point; the failure is in fetching or splitting
                print(f"Error processing {file_name}: {e}")
                return None

    # save_to_vector_store is defined with self in Step 2, so call it as a method
    await self.save_to_vector_store(documents)

Step 2:

from langchain_postgres import PGVector
import os
from sqlalchemy.ext.asyncio import create_async_engine
from langchain_community.embeddings import OpenAIEmbeddings

async def save_to_vector_store(self, documents):
    # The documents are already Document objects, so we don't need to convert them
    embeddings = OpenAIEmbeddings()
    # self.vector_store = FAISS.from_documents(documents, embeddings)
    pgvector_connection_string = os.environ["PGVECTOR_CONNECTION_STRING"]
    collection_name = "my_docs"

    vector_store = PGVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=pgvector_connection_string,
        use_jsonb=True
    )

    # for faiss
    # self.vector_store = vector_store.add_documents(documents, ids=[doc.metadata["id"] for doc in documents])

    # Split the documents list into chunks of 100
    for i in range(0, len(documents), 100):
        chunk = documents[i:i+100]
        # Insert the chunk into the vector store
        vector_store.add_documents(chunk, ids=[doc.metadata["id"] for doc in chunk])

Step 3:

from gpt_researcher import GPTResearcher

# Reuses pgvector_connection_string, embeddings and collection_name from Step 2
async_connection_string = pgvector_connection_string.replace("postgresql://", "postgresql+psycopg://")
# Initialize the async engine with the psycopg3 driver
async_engine = create_async_engine(
    async_connection_string,
    echo=True
)

async_vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=async_engine,
    use_jsonb=True
)

researcher = GPTResearcher(
    query=query,
    report_type="research_report",
    report_source="langchain_vectorstore",
    vector_store=async_vector_store,
)

await researcher.conduct_research()
report = await researcher.write_report()

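
The pacing mentioned in the note above (insert one chunk at a time and let the embedding API or database catch up) could be sketched roughly as follows. This is only an illustration: `batched`, `insert_paced`, `insert_batch` and `pause_seconds` are hypothetical names that do not come from GPTR or Langchain, and the pause length is an arbitrary assumption that would need tuning against your own rate limits.

```python
import asyncio
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def insert_paced(
    items: Sequence[T],
    insert_batch: Callable[[Sequence[T]], object],
    batch_size: int = 100,
    pause_seconds: float = 1.0,
) -> int:
    """Insert items batch by batch, sleeping between batches.

    insert_batch is a hypothetical callback supplied by the caller,
    for example a small wrapper around vector_store.add_documents.
    Returns the number of items handed to insert_batch.
    """
    inserted = 0
    for batch in batched(items, batch_size):
        insert_batch(batch)
        inserted += len(batch)
        # Pause only if more batches remain, so the last batch returns promptly
        if inserted < len(items):
            await asyncio.sleep(pause_seconds)
    return inserted
```

With the Step 2 code, the callback might look like `lambda chunk: vector_store.add_documents(chunk, ids=[doc.metadata["id"] for doc in chunk])`. The callback is called synchronously here, as in the original loop, so a slow insert still blocks the event loop while it runs.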
When I use GPT Researcher to read a large local dataset of several hundred MB in hybrid mode, it starts reading. In Activity Monitor on my Mac, I can see that the memory usage of python3.13 exceeds 60 GB, which indicates that it is indeed reading the dataset. After about twenty minutes, this memory is released. As I understand it, the dataset has been read completely at that point, but the interface stays stuck on the dataset-reading step. No matter how many hours I wait, there is no further progress. What could be the reason for this?
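
If ingestion is moved into a separate process, as suggested in the comments, one way to keep memory bounded for a file of this size might be to read it lazily in blocks of lines instead of loading it whole. The sketch below is not GPT Researcher's own loader and does not explain the hang itself; `iter_line_blocks` and `lines_per_block` are hypothetical names, and it assumes the dataset is a line-oriented text file such as a CSV.

```python
from typing import Iterator, List, Tuple


def iter_line_blocks(
    path: str,
    lines_per_block: int = 1000,
    encoding: str = "utf-8",
) -> Iterator[Tuple[int, str]]:
    """Yield (first_line_number, text) blocks without loading the whole file.

    Only one block of lines is held in memory at a time.
    """
    if lines_per_block <= 0:
        raise ValueError("lines_per_block must be positive")
    block: List[str] = []
    first_line = 1
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not block:
                # Remember where this block starts so it can be traced back later
                first_line = line_number
            block.append(line)
            if len(block) == lines_per_block:
                yield first_line, "".join(block)
                block = []
    # Emit the final, possibly shorter, block
    if block:
        yield first_line, "".join(block)
```

Each yielded block could then be wrapped in a Langchain `Document` (for example with the file path and `first_line` as metadata) and inserted into the vector store in batches, so the full dataset never has to sit in memory at once.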