This repository has been archived by the owner on Nov 23, 2023. It is now read-only.

IndexError: index 0 is out of bounds for axis 0 with size 0 #8

Open
skuma307 opened this issue May 5, 2023 · 8 comments

Comments

@skuma307

skuma307 commented May 5, 2023

Hi, thanks for the great work in the open-source space. I am facing the below error:
```
index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0
```

The faiss index is empty. There are no embeddings?

Can you help me debug this? I really appreciate any help you can provide.
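For context: the failing line indexes the first embedding to get the vector dimension, so an empty embeddings list raises exactly this IndexError. A minimal guard (function name hypothetical, not part of the repo) that fails with a clearer message might look like:

```python
def embedding_dim(embeddings):
    """Return the vector dimension FAISS needs, with a clear failure mode.

    Mirrors the failing line `faiss.IndexFlatL2(len(embeddings[0]))`:
    if the loader produced no documents, `embeddings` is empty and
    indexing [0] raises the IndexError reported above.
    """
    if len(embeddings) == 0:
        raise ValueError(
            "No embeddings were produced - check that the document "
            "loader actually found files."
        )
    return len(embeddings[0])
```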

@kamil-kaczmarek
Collaborator

Hi @skuma307 thanks for reaching out.

Let me ask two quick questions:

  1. Can you point me to code in the example?
  2. Please paste the full stack trace for better context.

@skuma307
Author

skuma307 commented May 8, 2023

Thanks for your reply @kamil-kaczmarek! I am using the code below:
```python
import time

import numpy as np
import ray
from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from embeddings import LocalHuggingFaceEmbeddings

# To download the files locally for processing, here's the command line:
# wget -e robots=off --recursive --no-clobber --page-requisites --html-extension \
#     --convert-links --restrict-file-names=windows \
#     --domains docs.ray.io --no-parent https://docs.ray.io/en/master/

FAISS_INDEX_PATH = "faiss_index_fast"
db_shards = 8

loader = ReadTheDocsLoader("docs.ray.io/en/master/")

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=300,
    chunk_overlap=20,
    length_function=len,
)

@ray.remote(num_gpus=1)
def process_shard(shard):
    print(f"Starting process_shard of {len(shard)} chunks.")
    st = time.time()
    embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
    result = FAISS.from_documents(shard, embeddings)
    et = time.time() - st
    print(f"Shard completed in {et} seconds.")
    return result

# Stage one: read all the docs, split them into chunks.
st = time.time()
print("Loading documents ...")
docs = loader.load()

# Theoretically, we could use Ray to accelerate this, but it's fast enough as is.
chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs], metadatas=[doc.metadata for doc in docs]
)
et = time.time() - st
print(f"Time taken: {et} seconds. {len(chunks)} chunks generated")

# Stage two: embed the docs.
print(f"Loading chunks into vector store ... using {db_shards} shards")
st = time.time()
shards = np.array_split(chunks, db_shards)
futures = [process_shard.remote(shards[i]) for i in range(db_shards)]
results = ray.get(futures)
et = time.time() - st
print(f"Shard processing complete. Time taken: {et} seconds.")

st = time.time()
print("Merging shards ...")
# Straight serial merge of others into results[0]
db = results[0]
for i in range(1, db_shards):
    db.merge_from(results[i])
et = time.time() - st
print(f"Merged in {et} seconds.")

st = time.time()
print("Saving faiss index")
db.save_local(FAISS_INDEX_PATH)
et = time.time() - st
print(f"Saved in: {et} seconds.")
```

I have created a virtual env on Python 3.9 on Windows.

@kamil-kaczmarek
Collaborator

Hi, you need to make sure that you build the DB first. Have a look at this script: https://github.com/ray-project/langchain-ray/blob/main/open_source_LLM_retrieval_qa/build_vector_store.py

@skuma307
Author

skuma307 commented May 8, 2023

Thanks for your reply, but isn't that the same code I pasted above? Am I missing anything? I would appreciate your help. @kamil-kaczmarek

@kamil-kaczmarek
Collaborator

@skuma307 you need to create embeddings store first. Please check these instructions for more details.
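As a sketch of that ordering requirement (the path constant matches the script above; the helper name is hypothetical), the QA step could fail fast when the index directory was never written:

```python
import os

FAISS_INDEX_PATH = "faiss_index_fast"  # same constant as in build_vector_store.py

def ensure_index_built(path=FAISS_INDEX_PATH):
    """Fail early with an actionable message if the vector store is missing."""
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"'{path}' not found - run build_vector_store.py to create the "
            "embeddings store before running the QA script."
        )
    return path
```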

@noperator

noperator commented May 11, 2023

@kamil-kaczmarek , when I run python build_vector_store.py as part of the step "Building the vector store index," I get the same error described above:

Traceback (most recent call last):
  File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 64, in <module>
    results = ray.get(futures)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2521, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::process_shard()
  File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 42, in process_shard
    result = FAISS.from_documents(shard, embeddings)
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/base.py", line 272, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 385, in from_texts
    return cls.__from(
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 347, in __from
    index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0
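This trace is consistent with the loader finding zero documents: `np.array_split` happily splits an empty chunk list into 8 empty shards, so every `process_shard` task hands FAISS zero texts. A quick reproduction of the sharding step:

```python
import numpy as np

chunks = []  # what you get when ReadTheDocsLoader finds no files on disk
shards = np.array_split(chunks, 8)

# All 8 shards are empty, so FAISS.from_documents receives no texts and
# faiss.IndexFlatL2(len(embeddings[0])) raises the IndexError above.
print([len(s) for s in shards])  # [0, 0, 0, 0, 0, 0, 0, 0]
```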

@noperator

Following the guidance in langchain-ai/chat-langchain#26 (comment), I fixed this error by:

  • changing the loader to UnstructuredURLLoader
  • installing the libmagic-dev package
  • prepending https:// to the docs URL
diff --git a/open_source_LLM_retrieval_qa/build_vector_store.py b/open_source_LLM_retrieval_qa/build_vector_store.py
index e530b54..9a519a8 100644
--- a/open_source_LLM_retrieval_qa/build_vector_store.py
+++ b/open_source_LLM_retrieval_qa/build_vector_store.py
@@ -4,7 +4,7 @@ from typing import List

 import numpy as np
 import ray
-from langchain.document_loaders import ReadTheDocsLoader
+from langchain.document_loaders import UnstructuredURLLoader
 from langchain.embeddings.base import Embeddings
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import FAISS
@@ -21,7 +21,7 @@ FAISS_INDEX_PATH = "faiss_index_fast"
 db_shards = 8
 ray.init()

-loader = ReadTheDocsLoader("docs.ray.io/en/master/")
+loader = UnstructuredURLLoader(urls=["https://docs.ray.io/en/master/"])

 text_splitter = RecursiveCharacterTextSplitter(
     # Set a really small chunk size, just to show.
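Regardless of which loader is used, a small guard after `loader.load()` (wrapper name hypothetical) would surface an empty crawl at the load step instead of deep inside FAISS:

```python
def load_or_fail(loader):
    """Wrap loader.load() so an empty crawl fails with a clear message."""
    docs = loader.load()
    if not docs:
        raise RuntimeError(
            "Loader returned 0 documents; FAISS.from_documents would later "
            "fail with 'index 0 is out of bounds for axis 0 with size 0'."
        )
    return docs
```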

@bharaniabhishek123

I see a lot of users following the tutorial getting the same error:
IndexError: index 0 is out of bounds for axis 0 with size 0
The solution does require creating the embeddings store first. Please check these instructions for more details.

It would be better to move the requirements.txt file from langchain-ray/open_source_LLM_retrieval_qa/requirements.txt up one level and put the setup instructions at the top of the repo rather than inside retrieval_qa; a lot of new users will hit the same issue.

Also, please include documentation links on how to spin up a Ray cluster on each cloud platform, whether via cluster.yaml or some other way. Stating only that it's a hefty setup does not guide a user on how to do it:
This demo requires a bit of a hefty setup. It requires one machine with a 24GB GPU (e.g. an AWS g5.xlarge), or a machine with 2 GPUs (minimum 16GB each), or a Ray cluster with at least 2 GPUs available.
