Vector databases

August 12, 2024

approximation
feature construction
geometry
high d
language
linear algebra
machine learning
metrics
neural nets
NLP
Figure 1

Databases for proximity search over vectors. They matter mostly because of vector embeddings, which underpin classic search, recommendation, and AI search systems.
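To make "proximity search" concrete, here is a minimal brute-force sketch in plain Python. The toy three-dimensional vectors and the function names are my own illustration. Real vector databases avoid this exhaustive scan by using approximate indexes, but the query they answer is the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to `query`."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings"; real ones typically have hundreds of dimensions.
embeddings = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "tax law": [0.0, 0.1, 0.9],
}

top = nearest([1.0, 0.0, 0.0], embeddings, k=2)
print(top)
```

The scan costs one similarity computation per stored vector per query, which is why dedicated indexes become attractive once the collection grows.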

1 Tools

1.1 ChromaDB

ChromaDB is a vector database with a focus on search and retrieval. I used it to store vector embeddings and surface “similar posts” on this site, and I can report it was incredibly easy for my use case. No fancy extra DB servers required.

1.2 Pinecone

TBD

1.3 Milvus

Milvus:

Built on top of popular vector search libraries including Faiss, Annoy, HNSW, and more, Milvus was designed for similarity search on dense vector datasets containing millions, billions, or even trillions of vectors. Before proceeding, familiarize yourself with the basic principles of embedding retrieval.

Milvus also supports data sharding, data persistence, streaming data ingestion, hybrid search between vector and scalar data, time travel, and many other advanced functions. The platform offers performance on demand and can be optimized to suit any embedding retrieval scenario. We recommend deploying Milvus using Kubernetes for optimal availability and elasticity.

Milvus adopts a shared-storage architecture featuring storage and computing disaggregation and horizontal scalability for its computing nodes. Following the principle of data plane and control plane disaggregation, Milvus comprises four layers: access layer, coordinator service, worker node, and storage. These layers are mutually independent when it comes to scaling or disaster recovery.
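As a rough illustration of the "hybrid search between vector and scalar data" mentioned above: filter on an ordinary field, then rank what survives by vector distance. This is a brute-force sketch in plain Python with made-up records and a hypothetical `hybrid_search` helper. It is not the Milvus API, and Milvus presumably does something far more efficient internally.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_search(records, query_vector, predicate, k=2):
    """Keep records passing the scalar predicate, then rank them by distance."""
    candidates = [r for r in records if predicate(r)]
    candidates.sort(key=lambda r: euclidean(r["vector"], query_vector))
    return [r["id"] for r in candidates[:k]]


# Made-up records: a scalar field ("year") alongside a toy 2-d vector.
records = [
    {"id": 1, "year": 2021, "vector": [0.1, 0.9]},
    {"id": 2, "year": 2024, "vector": [0.2, 0.8]},
    {"id": 3, "year": 2024, "vector": [0.9, 0.1]},
    {"id": 4, "year": 2023, "vector": [0.15, 0.85]},
]

# Record 1 is the exact match by vector but fails the scalar filter.
hits = hybrid_search(records, [0.1, 0.9], lambda r: r["year"] >= 2023, k=2)
print(hits)
```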

Milvus Lite is a simplified alternative to Milvus that offers many advantages and benefits.

  • You can integrate it into your Python application without adding extra weight.
  • It is self-contained and does not require any other dependencies, thanks to the standalone Milvus’ ability to work with embedded Etcd and local storage.
  • You can import it as a Python library and use it as a command-line interface (CLI)-based standalone server.
  • It works smoothly with Google Colab and Jupyter Notebook.
  • You can safely migrate your work and write code to other Milvus instances (standalone, clustered, and fully-managed versions) without any risk of losing data.

1.4 Voyager

Voyager Python Documentation:

Voyager features bindings to both Python and Java, with feature parity and index compatibility between both languages. It uses the HNSW algorithm, based on the open-source hnswlib package, with numerous features added for convenience and speed. Voyager is used extensively in production at Spotify and is queried hundreds of millions of times per day to power numerous user-facing features.

Think of Voyager like Sparkey, but for vector/embedding data; or like Annoy, but with much higher recall. It got its name because it searches through (embedding) space(s), much like the Voyager interstellar probes launched by NASA in 1977.
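"Recall" in that comparison is usually taken to mean the fraction of the true nearest neighbours that an approximate index actually returns. A small sketch of that measurement, with made-up neighbour ids and a hypothetical `recall_at_k` helper that is not part of Voyager or Annoy:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbours that the approximate search found."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)


# Made-up ids: the exact top 5 from a brute-force scan versus an index's answer.
exact_top5 = [3, 17, 42, 8, 99]
approx_top5 = [3, 42, 17, 56, 8]

score = recall_at_k(approx_top5, exact_top5)
print(score)
```

Order within the top k does not matter here, only membership, so an index can shuffle the ranking and still score full recall.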