DR-629 index refactor #2567

mungitoperrito · 2024-09-09T19:36:13Z

issue
staging - concepts
staging - configure

weaviate-git-bot · 2024-09-10T10:01:21Z

Great to see you again! Thanks for the contribution.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

…dex-refactor-howto-section

databyjp · 2024-09-24T12:46:11Z

developers/weaviate/concepts/indexing/inverted-indexes.md

+
+## indexFilterable
+
+`indexFilterable` is enabled by default. This index is not required for filtering of BM25 search. However, this index is much faster for filtering than the `indexSearchable` index.


This index is not required for filtering of BM25 search.
I would remove this. Filtering and BM25 are independent, but linked, processes.
Pls see https://weaviate.io/developers/weaviate/concepts/search

If someone wanted to perform "filtering of BM25 search", that sounds to me like combining a filter and a BM25 search. In which case, you would typically want indexFilterable to speed up the filtering.

Filtering can be performed without indexFilterable. But it would be slower.

databyjp · 2024-09-24T12:56:48Z

developers/weaviate/concepts/indexing/inverted-indexes.md

+| Less than | `indexRangeFilters` | `indexFilterable` | `indexRangeFilters` |
+| Less than equal | `indexRangeFilters` | `indexFilterable` | `indexRangeFilters` |
+
+## Collection level settings


Another way to look at it is metadata filtering imo.

To clarify, I believe these options only apply to filters, not to searches. So I think they fit naturally under some kind of filtering discussion.

databyjp · 2024-09-24T13:01:28Z

developers/weaviate/concepts/indexing/vector-indexes.md

+# tags: ['vector index plugins']
+---
+
+Weaviate is a vector database. Most objects in Weaviate collections have one or more vectors. Individual vectors can have thousands of dimensions. Collections can have millions of objects. The resulting vector space can be exceedingly large.


I believe the vector space size is independent of the number of vectors. A vector space is the precision of each dimension x number of dimensions. In this case floats x n_dimensions. The number of actual vectors in the DB does not change the size of the vector space.

databyjp · 2024-09-24T13:03:31Z

developers/weaviate/concepts/indexing/vector-indexes.md

+
+[Vector embeddings](https://weaviate.io/blog/vector-embeddings-explained) are arrays of elements that can capture meaning. The original data can come from text, images, videos, or other content types. A model transforms the underlying data into an embedding. The elements in the embedding are called dimensions. High dimension vectors, with thousands of elements, capture more information, but they are harder to work with.
+
+Vector databases make it easier to work with high dimensional vector embeddings. Embeddings that capture similar meanings are closer to each other than embeddings that capture different meanings. To find objects that have similar semantics, vector databases must efficiently calculate the "distance" between the objects' embeddings.


I think this could be improved by explicitly mentioning the problem of "search". Why does it need to compare meaning, and why at speed?

Adding to this - the biggest reason for having a vector index is to scale vector search. So I think the page should set out the need to have performant, scalable search, and how an index solves it.

databyjp · 2024-09-24T13:05:29Z

developers/weaviate/concepts/indexing/vector-indexes.md

+
+In a real example, the embeddings would have hundreds or thousands of elements. The vector space is difficult to visualize, but the concept is the same. Similar embeddings capture similar meanings and are closer to each other than to embeddings that capture different meanings.
+
+For more details on this representation, see the [GloVe model](https://github.com/stanfordnlp/GloVe) from Stanford or our [vector embeddings blog post](https://weaviate.io/blog/vector-embeddings-explained#what-exactly-are-vector-embeddings).


GloVe is very very old now :D. I would remove this.

databyjp · 2024-09-24T13:06:03Z

developers/weaviate/concepts/indexing/vector-indexes.md

+
+For more details on this representation, see the [GloVe model](https://github.com/stanfordnlp/GloVe) from Stanford or our [vector embeddings blog post](https://weaviate.io/blog/vector-embeddings-explained#what-exactly-are-vector-embeddings).
+
+### Example - supermarket layout


Is this more of an analogy than example?

databyjp · 2024-09-24T13:07:55Z

_includes/indexes/hnsw-how.mdx

@@ -0,0 +1,3 @@
+HNSW indexes build a multi-layered object graph. The graph structure and HNSW algorithm result in efficient, approximate nearest neighbor [(ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) searches.
+
+The index and graph structure are stored in RAM memory. This makes HNSW indexes fast, but RAM is an expensive resource. Consider using [compression](/developers/weaviate/starter-guides/managing-resources/compression) to reduce the size of for your HNSW indexes.


I think this should mention that the actual vectors are loaded into RAM for speed, and that's the biggest reason for the memory footprint.

databyjp · 2024-09-24T13:13:43Z

_includes/indexes/flat-intro.mdx

+
+Flat indexes do brute-force vector searches. The search latency increases linearly with the number of objects. For that reason, flat indexes work best with small collections, less than 10,000 objects.
+
+Flat indexes are best suited for collections that have relatively small object counts. If you expect the object count to grow significantly, consider using a [dynamic index](#dynamic-indexes).


The first sentence basically repeats the previous imo.

databyjp · 2024-09-24T13:18:27Z

_includes/indexes/hnsw-intro.mdx

@@ -0,0 +1 @@
+[Hierarchical Navigable Small World (HNSW) indexes](/developers/weaviate/concepts/indexing/hnsw-indexes) are high-performance, in-memory indexes. HNSW indexes scale well; vector searches are fast, even for very large data sets.


Might be worth noting how they scale. (I believe logarithmically, in O(log n) - using CS terms.)

databyjp · 2024-09-24T13:19:00Z