IndexError: index 0 is out of bounds for axis 0 with size 0 #8

skuma307 · 2023-05-05T05:39:34Z

Hi, thanks for the great work in the open-source space. I am facing the error below:
index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0

It looks like the FAISS index is empty. Does that mean no embeddings were created?

Can you help me debug this? I really appreciate any help you can provide.
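
For readers landing on this issue: a minimal sketch of how this exact message can arise, assuming the embedding step handed back an empty NumPy array because it received zero texts. That cause is an assumption for illustration, not something confirmed in this thread.

```python
import numpy as np

# Assumption: the embedding step returned an empty array (zero input texts).
embeddings = np.array([])

message = None
try:
    # Same indexing expression as in the failing line, minus the faiss call.
    dim = len(embeddings[0])
except IndexError as exc:
    message = str(exc)

print(message)
```

The message is raised by NumPy when indexing an empty array, so it points at missing input to the index rather than at faiss itself.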

kamil-kaczmarek · 2023-05-06T18:46:39Z

Hi @skuma307 thanks for reaching out.

Let me ask two quick questions:

1. Can you point me to the code in the example?
2. Can you paste the full stack trace for better context?

skuma307 · 2023-05-08T04:26:03Z

Thanks for your reply @kamil-kaczmarek! I am using the code below:
`import time

import numpy as np
import ray
from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from embeddings import LocalHuggingFaceEmbeddings

# To download the files locally for processing, here's the command line
# wget -e robots=off --recursive --no-clobber --page-requisites --html-extension \
# --convert-links --restrict-file-names=windows \
# --domains docs.ray.io --no-parent https://docs.ray.io/en/master/

FAISS_INDEX_PATH = "faiss_index_fast"
db_shards = 8

loader = ReadTheDocsLoader("docs.ray.io/en/master/")

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=300,
    chunk_overlap=20,
    length_function=len,
)


@ray.remote(num_gpus=1)
def process_shard(shard):
    print(f"Starting process_shard of {len(shard)} chunks.")
    st = time.time()
    embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
    result = FAISS.from_documents(shard, embeddings)
    et = time.time() - st
    print(f"Shard completed in {et} seconds.")
    return result


# Stage one: read all the docs, split them into chunks.
st = time.time()
print("Loading documents ...")
docs = loader.load()
# Theoretically, we could use Ray to accelerate this, but it's fast enough as is.
chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs], metadatas=[doc.metadata for doc in docs]
)
et = time.time() - st
print(f"Time taken: {et} seconds. {len(chunks)} chunks generated")

# Stage two: embed the docs.
print(f"Loading chunks into vector store ... using {db_shards} shards")
st = time.time()
shards = np.array_split(chunks, db_shards)
futures = [process_shard.remote(shards[i]) for i in range(db_shards)]
results = ray.get(futures)
et = time.time() - st
print(f"Shard processing complete. Time taken: {et} seconds.")

st = time.time()
print("Merging shards ...")
# Straight serial merge of others into results[0]
db = results[0]
for i in range(1, db_shards):
    db.merge_from(results[i])
et = time.time() - st
print(f"Merged in {et} seconds.")

st = time.time()
print("Saving faiss index")
db.save_local(FAISS_INDEX_PATH)
et = time.time() - st
print(f"Saved in: {et} seconds.")`

I have created a virtual env on Python 3.9 on Windows.
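
The script above reads from the relative path docs.ray.io/en/master/, which only exists once the wget mirror has been downloaded into the current working directory. A minimal pre-flight sketch, assuming the mirror contains .html files (the wget command passes --html-extension); `count_html_files` is a hypothetical helper, not part of LangChain or this repository.

```python
from pathlib import Path


def count_html_files(root):
    """Count .html files under root; return 0 if root is not a directory."""
    root_path = Path(root)
    if not root_path.is_dir():
        return 0
    return sum(1 for p in root_path.rglob("*.html") if p.is_file())


# Path taken from the script above; adjust if the mirror lives elsewhere.
docs_dir = "docs.ray.io/en/master/"
n_files = count_html_files(docs_dir)
if n_files == 0:
    print(f"No HTML files under {docs_dir!r}; the loader may have nothing to read.")
else:
    print(f"Found {n_files} HTML files under {docs_dir!r}.")
```

If this reports zero files, the loader may return no documents, in which case the later stages would have nothing to embed.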

kamil-kaczmarek · 2023-05-08T05:32:08Z

Hi, you need to make sure that you build the DB first. Have a look at this script: https://github.com/ray-project/langchain-ray/blob/main/open_source_LLM_retrieval_qa/build_vector_store.py

skuma307 · 2023-05-08T06:36:47Z

Thanks for your reply, but I am already using that same code, which I pasted above. Am I missing anything? I would appreciate your help. @kamil-kaczmarek

kamil-kaczmarek · 2023-05-08T20:28:59Z

@skuma307 you need to create the embeddings store first. Please check these instructions for more details.

noperator · 2023-05-11T20:48:48Z

@kamil-kaczmarek, when I run `python build_vector_store.py` as part of the step "Building the vector store index," I get the same error described above:

Traceback (most recent call last):
  File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 64, in <module>
    results = ray.get(futures)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2521, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::process_shard()
  File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 42, in process_shard
    result = FAISS.from_documents(shard, embeddings)
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/base.py", line 272, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 385, in from_texts
    return cls.__from(
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 347, in __from
    index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0
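
The traceback fails inside process_shard, which suggests a shard reached FAISS.from_documents with nothing to embed. A minimal sketch of a guard placed before sharding, assuming the chunk list came back empty (or shorter than the shard count); `split_into_shards` is a hypothetical helper and not part of the repository script.

```python
import numpy as np


def split_into_shards(chunks, n_shards):
    """Split chunks into at most n_shards non-empty shards.

    Raises ValueError up front when there is nothing to embed, instead of
    letting each worker fail later with an IndexError.
    """
    if len(chunks) == 0:
        raise ValueError("No chunks to embed; check that the loader returned documents.")
    shards = np.array_split(np.asarray(chunks, dtype=object), n_shards)
    # Drop empty shards, which appear when there are fewer chunks than shards.
    return [shard for shard in shards if len(shard) > 0]


# np.array_split does not complain about empty input: it returns empty shards.
empty_shards = np.array_split(np.array([], dtype=object), 8)
print([len(s) for s in empty_shards])

# With three chunks and eight shards, only the three non-empty shards remain.
print(len(split_into_shards(["a", "b", "c"], 8)))
```

Failing early in the driver keeps the real cause (no documents loaded) visible, rather than surfacing it as a RayTaskError from a remote worker.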

noperator · 2023-05-11T21:21:41Z

Following the guidance in langchain-ai/chat-langchain#26 (comment), I fixed this error by:

- changing the loader to UnstructuredURLLoader
- installing the libmagic-dev package
- prepending https:// to the docs URL

diff --git a/open_source_LLM_retrieval_qa/build_vector_store.py b/open_source_LLM_retrieval_qa/build_vector_store.py
index e530b54..9a519a8 100644
--- a/open_source_LLM_retrieval_qa/build_vector_store.py
+++ b/open_source_LLM_retrieval_qa/build_vector_store.py
@@ -4,7 +4,7 @@ from typing import List

 import numpy as np
 import ray
-from langchain.document_loaders import ReadTheDocsLoader
+from langchain.document_loaders import UnstructuredURLLoader
 from langchain.embeddings.base import Embeddings
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import FAISS
@@ -21,7 +21,7 @@ FAISS_INDEX_PATH = "faiss_index_fast"
 db_shards = 8
 ray.init()

-loader = ReadTheDocsLoader("docs.ray.io/en/master/")
+loader = UnstructuredURLLoader(urls=["https://docs.ray.io/en/master/"])

 text_splitter = RecursiveCharacterTextSplitter(
     # Set a really small chunk size, just to show.

bharaniabhishek123 · 2023-07-15T04:28:04Z

I see that a lot of users following the tutorial are getting the same error:
IndexError: index 0 is out of bounds for axis 0 with size 0
The solution does require creating the embeddings store first. Please check these instructions for more details.

It would be better to move the requirements.txt file from /langchain-ray/open_source_LLM_retrieval_qa/requirements.txt one level up, and to add the instructions at the top level of the repo rather than inside retrieval_qa; otherwise a lot of new users will face the same issue.

Also, please include documentation links on how to spin up a Ray cluster on all cloud platforms, whether through cluster.yaml or any other way. Saying that it's a hefty setup does not tell a user how to do it:
"This demo requires a bit of a hefty setup. It requires one machine with a 24GB GPU (eg. an AWS g5.xlarge) or a machine with 2 GPUs (minimum 16GB each) or a Ray cluster with at least 2 GPUs available."
