DR-629 index refactor #2567

Draft · wants to merge 80 commits into main

Conversation

@mungitoperrito (Contributor) commented Sep 9, 2024

@mungitoperrito mungitoperrito changed the title framework DR-629 index refactor - HowTo section Sep 9, 2024
@weaviate-git-bot

Great to see you again! Thanks for the contribution.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?


## indexFilterable

`indexFilterable` is enabled by default. This index is not required for filtering of BM25 search. However, this index is much faster for filtering than the `indexSearchable` index.
@databyjp (Contributor) commented Sep 24, 2024:

This index is not required for filtering of BM25 search.
I would remove this. Filtering and BM25 are independent, but linked, processes.
Pls see https://weaviate.io/developers/weaviate/concepts/search

If someone wanted to perform "filtering of BM25 search", that sounds to me like combining a filter and a BM25 search. In which case, you would typically want indexFilterable to speed up the filtering.

Filtering can be performed without indexFilterable. But it would be slower.

| Filter operator | `indexRangeFilters` only | `indexFilterable` only | Both indexes |
| --- | --- | --- | --- |
| Less than | `indexRangeFilters` | `indexFilterable` | `indexRangeFilters` |
| Less than or equal | `indexRangeFilters` | `indexFilterable` | `indexRangeFilters` |
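
To illustrate these property-level settings, here is a minimal sketch assuming the Python client v4 API; the collection name, property names, and values are made up, and `index_range_filters` in particular is an assumption about the client parameter name.

```python
# Sketch (assuming the Python client v4 API): enable filterable and
# range-filter indexes on a numeric property, then run a "less than" filter.
import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()

client.collections.create(
    name="Product",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(name="name", data_type=DataType.TEXT),
        Property(
            name="price",
            data_type=DataType.NUMBER,
            index_filterable=True,     # default; backs equality-style filters
            index_range_filters=True,  # assumption: enables the range index
        ),
    ],
)

products = client.collections.get("Product")
cheap = products.query.fetch_objects(
    filters=Filter.by_property("price").less_than(10.0),
    limit=5,
)
client.close()
```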

## Collection level settings
@databyjp (Contributor) commented Sep 24, 2024:

Another way to look at it is metadata filtering imo.

To clarify, I believe these options only apply to filters, not to searches. So I think they fit naturally under some kind of filtering discussion.

# tags: ['vector index plugins']
---

Weaviate is a vector database. Most objects in Weaviate collections have one or more vectors. Individual vectors can have thousands of dimensions. Collections can have millions of objects. The resulting vector space can be exceedingly large.
A contributor commented:

I believe the vector space size is independent of the number of vectors. A vector space is the precision of each dimension x number of dimensions. In this case floats x n_dimensions. The number of actual vectors in the DB does not change the size of the vector space.


[Vector embeddings](https://weaviate.io/blog/vector-embeddings-explained) are arrays of elements that can capture meaning. The original data can come from text, images, videos, or other content types. A model transforms the underlying data into an embedding. The elements in the embedding are called dimensions. High-dimensional vectors, with thousands of elements, capture more information, but they are harder to work with.

Vector databases make it easier to work with high dimensional vector embeddings. Embeddings that capture similar meanings are closer to each other than embeddings that capture different meanings. To find objects that have similar semantics, vector databases must efficiently calculate the "distance" between the objects' embeddings.
A contributor commented:

I think this could be improved by explicitly mentioning the problem of "search". Why does it need to compare meaning, and why at speed?

A contributor commented:

Adding to this - the biggest reason for having a vector index is to scale vector search. So I think the page should set out the need to have performant, scalable search, and how an index solves it.


In a real example, the embeddings would have hundreds or thousands of elements. The vector space is difficult to visualize, but the concept is the same. Similar embeddings capture similar meanings and are closer to each other than to embeddings that capture different meanings.
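
As a rough illustration of why distance calculations need to be efficient at scale, here is a toy brute-force search in plain numpy (not Weaviate code); the collection size and dimensionality are made up.

```python
import numpy as np

# Toy example: 1,000,000 objects with 768-dimensional embeddings.
embeddings = np.random.rand(1_000_000, 768).astype(np.float32)
query = np.random.rand(768).astype(np.float32)

# Brute-force search: compute the cosine distance from the query to every
# stored embedding, then keep the 4 closest objects. The cost grows linearly
# with the number of objects, which is why large collections need a vector
# index instead of a full scan.
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
cosine_distance = 1 - (embeddings @ query) / norms
top_4 = np.argsort(cosine_distance)[:4]
```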

For more details on this representation, see the [GloVe model](https://github.com/stanfordnlp/GloVe) from Stanford or our [vector embeddings blog post](https://weaviate.io/blog/vector-embeddings-explained#what-exactly-are-vector-embeddings).
A contributor commented:

GloVe is very very old now :D. I would remove this.



### Example - supermarket layout
A contributor commented:

Is this more of an analogy than example?

@@ -0,0 +1,3 @@
HNSW indexes build a multi-layered object graph. The graph structure and HNSW algorithm result in efficient, approximate nearest neighbor [(ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) searches.

The index and graph structure are stored in RAM. This makes HNSW indexes fast, but RAM is an expensive resource. Consider using [compression](/developers/weaviate/starter-guides/managing-resources/compression) to reduce the size of your HNSW indexes.
A contributor commented:

I think this should mention that the actual vectors are loaded into RAM for speed, and that's the biggest reason for the memory footprint.
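
For reference, a minimal sketch of an HNSW index with a quantizer, which is the compression option the paragraph above points to. It assumes the Python client v4 API; the collection name is illustrative.

```python
# Sketch (assuming the Python client v4 API): HNSW index with binary
# quantization to reduce the in-memory footprint of the cached vectors.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.hnsw(
        quantizer=Configure.VectorIndex.Quantizer.bq(),  # compress stored vectors
    ),
)
client.close()
```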


Flat indexes do brute-force vector searches. The search latency increases linearly with the number of objects. For that reason, flat indexes work best with small collections, less than 10,000 objects.

Flat indexes are best suited for collections that have relatively small object counts. If you expect the object count to grow significantly, consider using a [dynamic index](#dynamic-indexes).
A contributor commented:

The first sentence basically repeats the previous imo.
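
A minimal sketch of configuring a flat index for a small collection, again assuming the Python client v4 API; the collection name is made up.

```python
# Sketch: flat (brute-force) index, suited to collections of roughly
# 10,000 objects or fewer. Assumes the Python client v4 API.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

client.collections.create(
    name="SmallCollection",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.flat(),
)
client.close()
```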

@@ -0,0 +1 @@
[Hierarchical Navigable Small World (HNSW) indexes](/developers/weaviate/concepts/indexing/hnsw-indexes) are high-performance, in-memory indexes. HNSW indexes scale well; vector searches are fast, even for very large data sets.
A contributor commented:

Might be worth noting how they scale. (I believe logarithmically, in O(log n) - using CS terms.)
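
A configuration sketch showing the main HNSW graph parameters, assuming the Python client v4 API; the values are illustrative, not recommendations.

```python
# Sketch: create a collection with an HNSW vector index and tune its
# graph parameters. Assumes the Python client v4 API; values are examples.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.hnsw(
        max_connections=32,   # connections per node in the graph
        ef_construction=128,  # candidate list size while building the index
    ),
)
client.close()
```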


![HNSW layers](../img/hnsw-layers.svg "HNSW layers")

Layer zero is the lowest layer. Layer zero contains every object in the database, and the objects are well connected to each other.
@databyjp (Contributor) commented Sep 24, 2024:

Layer zero contains every object in the database

Layer zero contains every object in the index

A contributor commented:

objects are well connected to each other

Not sure what this means?

A contributor commented:

Do you mean "all connected to each other"? That's partly true - each object will have a maximum of maxConnections * 2 connections on the bottom layer, to their nearest neighbors.



Some of the objects are also represented in the layers above layer zero. Each layer above layer zero has fewer objects and fewer connections.
A contributor commented:

Each layer above layer zero has fewer objects and fewer connections.

Above layer 0, I think each object should theoretically have the same maxConnections number.



When HNSW searches the graph, it starts at the highest layer. The algorithm finds the closest matching data points in the highest layer. Then, HNSW goes one layer deeper, and finds the closest matching data points in the lower layer that correspond to the objects in the higher layer. These are the nearest neighbors.
A contributor commented:

Then, HNSW goes one layer deeper, and finds the closest matching data points in the lower layer that correspond to the objects in the higher layer. These are the nearest neighbors.

I believe it traverses downwards in each graph, and builds a "candidate list" - determined by ef or dynamicEf factors.


The HNSW algorithm searches the lower layer and creates a list of nearest neighbors. The nearest neighbors list is the starting point for a similar search on the next layer down. The process repeats until the search reaches the lowest (deepest) layer. Finally, the HNSW algorithm returns the data objects that are closest to the search query.
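
To make the traversal concrete, here is a heavily simplified sketch of the layered descent in plain Python. It illustrates the general HNSW idea, not Weaviate's actual implementation, and the candidate-list handling is reduced to the bare minimum.

```python
# Simplified sketch of the layered HNSW descent described above.
# Not Weaviate's implementation; just the general idea.
import heapq
import numpy as np

def hnsw_search(query, entry_point, layers, vectors, ef=8, k=4):
    # layers[0] is layer zero (every object); layers[-1] is the top layer.
    # Each layer is a dict: object id -> list of neighbor ids.
    # vectors maps object id -> embedding (np.ndarray).
    def dist(obj):
        return float(np.linalg.norm(vectors[obj] - query))

    current = entry_point
    # Upper layers: greedily hop to the closest neighbor, then drop a layer.
    for layer in reversed(layers[1:]):
        improved = True
        while improved:
            improved = False
            for neighbor in layer.get(current, []):
                if dist(neighbor) < dist(current):
                    current, improved = neighbor, True

    # Layer zero: best-first search that keeps a candidate list of size ef.
    visited = {current}
    candidates = [(dist(current), current)]
    best = [(dist(current), current)]
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > best[-1][0]:
            break  # nothing closer left to explore
        for neighbor in layers[0].get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(candidates, (dist(neighbor), neighbor))
                best = sorted(best + [(dist(neighbor), neighbor)])[:ef]
    return [obj for _, obj in best[:k]]
```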

Since there are relatively few data objects on the higher layers, HNSW 'jumps' over large amounts of data that it doesn't need to search. In contrast, when a data store has only one layer, the search algorithm can't skip unrelated objects. Flat hierarchies mean the search engine has to scan significantly more data objects even though those objects are unlikely to match the search criteria.
A contributor commented:

when a data store has only one layer,

Perhaps might be better to say "when a vector index ..."



Weaviate's HNSW implementation is a very fast, memory efficient, approach to similarity search. The memory cache only stores the highest layer of the index instead of storing all of the data objects from the lowest layer. As a search moves from a higher layer to a lower one, HNSW only adds the data objects that are closest to the search query. This means HNSW uses a relatively small amount of memory compared to other search algorithms.
@databyjp (Contributor) commented Sep 24, 2024:

Weaviate's HNSW implementation is a very fast, memory efficient, approach to similarity search. The memory cache only stores the highest layer of the index instead of storing all of the data objects from the lowest layer.

Is this true? That's not how I understood what happens.


### Configure dynamic ef

The `ef` parameter controls the size of the ANN list at query time. You can configure a specific list size or else let Weaviate configure the list dynamically. If you choose dynamic `ef`, Weaviate provides several options to control the size of the ANN list.
A contributor commented:

I think it could be clearer that ef needs to be set to -1 for dynamic ef factors to take effect.



The length of the list is determined by the query response limit that you set in your query. Weaviate uses the query limit as an anchor and modifies the size of the ANN list according to the values you set for the `dynamicEf` parameters.
A contributor commented:

The length of the list is determined by the query response limit that you set in your query.

What happens if a limit is not specified? I think it would use the default limit.
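
A configuration sketch that sets `ef` to -1 so the dynamic `ef` parameters take effect, as the comment above suggests making explicit. It assumes the Python client v4 API; the values are examples only.

```python
# Sketch: dynamic ef. Setting ef to -1 lets Weaviate size the candidate
# list from the query limit and the dynamicEf bounds below.
# Assumes the Python client v4 API; values are illustrative.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.hnsw(
        ef=-1,                # -1 enables dynamic ef
        dynamic_ef_min=100,   # lower bound for the candidate list
        dynamic_ef_max=500,   # upper bound for the candidate list
        dynamic_ef_factor=8,  # candidate list target = factor * query limit
    ),
)
client.close()
```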


### Dynamic ef example

Consider this GraphQL query that sets a limit of 4.
A contributor commented:

Not sure I would lead with a GQL example 😄. Maybe a client example?

[GraphQL query truncated in this excerpt]

The resulting search list has these characteristics.
A contributor commented:

search list

I see the doc uses a few different versions of this, like "nearest neighbor list", "ANN list" and "search list". I think this could be confusing. The typical term imo is "candidate list", as it gets updated throughout the search process, and the top n results are returned once the algorithm terminates.
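
Following the suggestion above to show a client example rather than GraphQL, here is a minimal sketch of a query with a limit of 4. It assumes the Python client v4 API and a hypothetical "Article" collection with a text vectorizer configured.

```python
# Sketch: the same "limit of 4" query via the Python client v4.
# Assumes a collection named "Article" with a text vectorizer configured.
import weaviate

client = weaviate.connect_to_local()
articles = client.collections.get("Article")

response = articles.query.near_text(
    query="supermarket layout",  # hypothetical query text
    limit=4,                     # anchors the dynamic ef candidate list size
)
for obj in response.objects:
    print(obj.properties)
client.close()
```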


## Other considerations

HNSW indexes enable very fast queries, but they are not as fast at import time. Rebuilding the index when you add new vectors can be resource intensive. If you use HNSW, consider [enabling asynchronous indexing](/developers/weaviate/configuration/indexing-vector/dynamic-indexes#asynchronous-indexing) to improve system response during imports.
A contributor commented:

HNSW indexes enable very fast queries, but they are not as fast at import time.

Maybe we could say they incur "overhead to build the index" or similar. The first part of the sentence is absolute ("very fast queries") but the second is relative - but it's not super clear what it is being compared to. It almost reads as though imports are slower than queries, whereas I think we mean that building an HNSW index is slower than building a flat index.


## Overview

Dynamic indexes are flat indexed collections that Weaviate converts to HNSW indexed collections when the collection reaches a certain size. Flat indexes work well for collections with less than 10,000 objects. At that size, flat indexes have low memory overhead and good latency. But, search latency increases as the number of objects in a collection increases. When the collection grows to about 10,000 objects, an HNSW index usually has better latency than a flat index.
A contributor commented:

Dynamic indexes are flat indexed collections that Weaviate converts to HNSW indexed collections

I would frame this around an "index" rather than collection, because:

  • This applies to tenants as well as collections
  • Each collection/tenant can have multiple vector indexes (with named vectors)

A contributor commented:

When the collection grows to about 10,000 objects, an HNSW index usually has better latency than a flat index.

Oh I didn't know this. Is there a crossover point?



The dynamic index helps to balance resource costs against search latency times. Flat indexes are disc-based. They are responsive at low object counts, but get slower as object counts grow. HNSW indexes reside in RAM. They are very fast, but RAM is expensive. Disk storage is orders of magnitude cheaper than RAM memory, so hosting an index on disc is significantly cheaper than hosting it in RAM.
A contributor commented:

I think you might have used "disk" (not "disc") on the other pages.



If your collection size grows over time, or if you have a mix of smaller and larger tenants, dynamic indexes let you take advantage of lower cost flat indexes while object counts and search latency times are low. When the object count increases and latencies grow larger, converting the flat index to an HNSW index preserves low search latencies at the expense of increased RAM costs.
A contributor commented:

If your collection size grows over time, or if you have a mix of smaller and larger tenants,

This kind of comes out suddenly without discussing multi-tenancy, and might be a bit jarring for the reader.
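
A minimal sketch of configuring a dynamic index. It assumes the Python client v4 API, and the `threshold`, `hnsw`, and `flat` parameter names are my assumptions about how the switchover and per-index settings are expressed; the values are illustrative.

```python
# Sketch: dynamic vector index that starts flat and converts to HNSW once
# the index grows past a threshold. Assumes the Python client v4 API;
# parameter names are assumptions and values are illustrative.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

client.collections.create(
    name="GrowingCollection",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.dynamic(
        threshold=10_000,                   # switch to HNSW around this count
        hnsw=Configure.VectorIndex.hnsw(),  # settings used after the switch
        flat=Configure.VectorIndex.flat(),  # settings used before the switch
    ),
)
client.close()
```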


The `indexFilterable` index improves [filtering](/developers/weaviate/search/filters). This index is enabled by default.

If you don't anticipate searching on a property field, disable this index to save disk space and import time. The property is still filterable.
A contributor commented:

If you don't anticipate searching on a property field

Should it be "filtering on a property ..."?


Weaviate uses [inverted indexes](/developers/weaviate/concepts/indexing#inverted-indexes), also known as keyword indexes, to make textual and numeric searches more efficient. Weaviate provides different kinds of inverted indexes so you can better match the index to your data.

These indexes are normally configured on a property level:
A contributor commented:

Not sure I agree with "normally" configured on a property level. Some aspects like whether to enable or disable a particular inverted index, are configurable on a property level - but others, like whether to index metadata, are configurable on a collection level.


## indexSearchable

The `indexSearchable` index improves property search times. This index is enabled by default. [Keyword search](/developers/weaviate/search/bm25) and [hybrid search](/developers/weaviate/search/hybrid) use this index.
A contributor commented:

The indexSearchable index improves property search times.

Just tried disabling this; and I think BM25 searches do not work at all without it.
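
For context, a minimal sketch of a property with `indexSearchable` enabled and a BM25 query that relies on it, assuming the Python client v4 API; the collection and property names are made up.

```python
# Sketch: indexSearchable backs keyword (BM25) and hybrid search on a property.
# Assumes the Python client v4 API.
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            index_searchable=True,  # default; required for BM25 on this property
            index_filterable=True,
        ),
    ],
)

articles = client.collections.get("Article")
results = articles.query.bm25(query="vector index", limit=5)
client.close()
```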


## Collection level properties

These properties are configured on the collection level.
A contributor commented:

I think the text could briefly state what these options do (enable/disable metadata filtering). They could be comments in code or in the prose here.

The same with the header. I think "Enable metadata filters" or something like that could really help the users find what this section does, rather than where (collection level) it is configured.
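
Picking up the suggestion above, here is a sketch that enables metadata indexing at the collection level and filters on it, with comments stating what each option does. It assumes the Python client v4 API, including `Filter.by_creation_time`; names and values are illustrative.

```python
# Sketch: collection-level inverted index options that enable metadata
# filtering (timestamps, null state, property length).
# Assumes the Python client v4 API.
from datetime import datetime, timedelta, timezone

import weaviate
from weaviate.classes.config import Configure
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()

client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.none(),
    inverted_index_config=Configure.inverted_index(
        index_timestamps=True,       # filter by creation/update time
        index_null_state=True,       # filter on whether a property is null
        index_property_length=True,  # filter by property length
    ),
)

articles = client.collections.get("Article")
recent = articles.query.fetch_objects(
    filters=Filter.by_creation_time().greater_than(
        datetime.now(timezone.utc) - timedelta(days=7)
    ),
    limit=10,
)
client.close()
```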

import TSCodeV3 from '!!raw-loader!/_includes/code/howto/indexes/indexes-v3.ts';
import TSCodeV2 from '!!raw-loader!/_includes/code/howto/indexes/indexes-v2.ts';

Items in a collection can have multiple named vectors. Each named vectors has it's own vector index. These vector indexes can be configured independently.
A contributor commented:

it's

Should be its :).
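
To illustrate the point about per-vector configuration, a sketch of two named vectors with independently configured indexes, assuming the Python client v4 API; the vector names and index choices are illustrative.

```python
# Sketch: two named vectors, each with its own vector index configuration.
# Assumes the Python client v4 API; vectors are provided by the application
# (no vectorizer), so the names and settings here are illustrative.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

client.collections.create(
    name="Article",
    vectorizer_config=[
        Configure.NamedVectors.none(
            name="title_vector",
            vector_index_config=Configure.VectorIndex.hnsw(),  # in-memory, fast
        ),
        Configure.NamedVectors.none(
            name="summary_vector",
            vector_index_config=Configure.VectorIndex.flat(),  # disk-based, small
        ),
    ],
)
client.close()
```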
