Vector search and storage

As a data type, a vector is just an array of floating-point numbers. Those numbers are generated by submitting some resource — a word, a string, a document, an image, audio, etc. — to an embedding model,¹ which converts the resource to a vector. A vector database stores those vector embeddings alongside other relational data. In practice, that might look like a table with columns for ID (a primary key), content (as a BLOB or VARCHAR), and a vector. Then it becomes possible to perform queries that find content similar to a search vector, like so:

SELECT id
  FROM t1
  ORDER BY DISTANCE(TO_VECTOR('[1.2, 3.4, 5.6]'), embedding, 'L2_squared')
  LIMIT 10;

Possible applications include recommendation engines that show products similar to a user’s purchase history, or search engines that find documents or other resources based on natural-language queries. Read our applications of vector databases docs to learn more about how vector databases can be applied in the real world. PlanetScale has added support for vector columns, vector distance functions, and vector indexes, as described below. If at any point you experience issues with vectors, we highly encourage you to get in touch. Your feedback is extremely valuable as we continue to refine vectors support, so don’t hesitate to reach out. You can submit a support ticket to relay any feedback or issues. We also have a vectors channel in our Discord where you can ask questions and share feedback. [¹]: PlanetScale does not currently provide an embedding service. You can find several good cloud-based options like OpenAI or AWS Titan, or local options like Python sentence_transformers.

Enabling PlanetScale vectors

PlanetScale has a custom version of MySQL that has been extended with vector support. Vector support can be enabled on a per-branch basis. The branch will be updated to the vectors-capable version of MySQL. Branches with vectors enabled will repurpose half of the memory in the InnoDB buffer pool for use with vectors. To enable the vector support on a branch:

In the database where you want to use vectors, navigate to the “Branches” tab, and click on the branch where you would like to enable vector support.

Click on the small gear icon underneath the “Connect” button on the right.

Click the toggle next to “Enable vectors”.

Click “Save branch settings”.

The branch will upgrade asynchronously to the correct version of MySQL, which may take 30-60 minutes. While this happens, the database dashboard will show an “Enabling vectors” badge, which changes to a “Vector-enabled” badge when the upgrade is complete.

Adding vector columns

As a first step, create one or more columns with the VECTOR type and then insert some vectors. Here’s an example:

CREATE TABLE t1(id INT PRIMARY KEY auto_increment, embedding VECTOR(4));
INSERT INTO t1(embedding) VALUES (TO_VECTOR('[1, 2, 3, 4]')),
                                 (TO_VECTOR('[5, 6, 7, 8]'));

At some point, you’ll want to build an index on this vector column to facilitate fast similarity searches. Here’s how to create one:

CREATE /*vt+ QUERY_TIMEOUT_MS=0 */
  VECTOR INDEX embedding_index ON t1(embedding);

The QUERY_TIMEOUT_MS comment is optional, but important for tables over 100,000 rows. Vector indexes take a significant amount of time to build, and the comment allows the CREATE statement to bypass Vitess’s usual timeout. Then, you can perform similarity searches like so:

SELECT id, DISTANCE(TO_VECTOR('[3, 3, 3, 3]'), embedding, 'L2_SQUARED') AS d
  FROM t1
  ORDER BY d
  LIMIT 10;

Use an EXPLAIN query to confirm that the query uses the new index. This query actually won’t use the index until the table has around 50 rows in it. Vector indexes provide approximate results. An unindexed query with LIMIT 100 returns exactly the 100 rows closest to the reference vector, after performing a full table scan and a sort. An indexed query returns, on average, about 100 of the top 105 (around 95%) of the rows closest to the reference vector, but much faster than a full table scan. This is expected, because all efficient vector indexes, including PlanetScale’s vector indexes, perform approximate nearest neighbor (ANN) searches. If you are adding vectors to your database from an application, you may want to use prepared statements, although we do not recommend it. TO_VECTOR works in that setting, but serializing the vectors on the client side and uploading them as binary is faster. The serialized format is IEEE-754 32-bit floats, which you can serialize with code like this:

Python: struct.pack(f'{len(float_array)}f', *float_array)
Ruby: float_array.pack(“f*”)
Rust: float_array.map(|f| f.to_ne_bytes()).flatten().collect()

You can use the resulting blob (which will be 4 bytes times the number of dimensions in the vector) in an INSERT statement like this:

INSERT INTO t1(embedding) VALUES
  (CAST(? AS CHAR CHARACTER SET binary));

Vector index parameters

Statements that create a vector index take optional parameters, which can be specified as JSON key-value pairs, like so:

CREATE VECTOR INDEX embedding_index ON t1(embedding)
  SECONDARY_ENGINE_ATTRIBUTE='{"type":"spann", "distance":"cosine"}';

The type attribute specifies the algorithm used to build and query the vector index. It is optional, and the only valid value is spann, which is the default. The distance attribute specifies the distance metric that queries will use, and can be any of the following:

dot for the dot product
cosine for the cosine of the angle between the two vectors.
l2 or euclidean for the length of a line between the ends of the vectors
l2_squared or euclidean_squared for the square of the Euclidean distance. This is the default.

The distance metric specified at index creation time must match the distance metric used at query time, or the index cannot be used, and MySQL will perform a full-table scan instead. Other possible options include fixed_quantization and product_quantization, which configure vector compression in the index, resulting in lower storage requirements and faster query times. See the Quantization section for more information.

Filtered vector queries

Vector data is stored in tables alongside any other relational data, and sometimes applications need to query, join, or filter based on that relational data. In particular, queries with WHERE clauses work, and MySQL chooses the index or indexes that allow the query to complete the fastest. For example, in a table of products for sale containing indexed columns for price and seller ID, here are how two queries might be executed:

SELECT id,price,seller_id
  FROM products
  WHERE price < 20.0
  ORDER BY DISTANCE(TO_VECTOR('[1.2, 3.4, 5.6]'), embedding, 'L2_squared')
  LIMIT 10;

This query selects the ten products with a price under

20 closest to some reference vector. If a non-trivial fraction of products cost less than

20, then MySQL will use the vector index to produce more than ten results, and filter them down to exactly ten results that meet the price constraint.

SELECT id,price,seller_id
  FROM products
  WHERE seller_id=789
  ORDER BY DISTANCE(TO_VECTOR('[1.2, 3.4, 5.6]'), embedding, 'L2_squared')
  LIMIT 10;

This query selects the ten products from a given seller that are closest to some reference vector. If that seller’s products are a small enough fraction of the rows in the table, then MySQL will use the index on seller_id to find all products from that seller, and will sort those products by vector distance and return the top ten. The MySQL query planner chooses whether to use the vector index or some other index automatically based on the query and based on the contents of the table, to maximize query performance. Use EXPLAIN on any given query to see how it will execute. As we continue to refine vectors support, we’re looking for feedback on how well MySQL plans vector queries. If you believe you’ve hit an edge case or something looks wrong, please open a support ticket and let us know.

Resource requirements

A vector index with default parameters requires 3-4x as much disk space as the underlying data in the vector column, and it requires around 15% as much memory as the underlying data. For example, a vector column with 1536 dimensions and a million rows takes up 1536 * 1000000 * 4 = 5.7GiB of space in the table. The corresponding index will be 18-24GiB, and it will require around 800MiB of memory to build and query. Using quantization settings other than the default will increase or decrease the required disk space and memory proportional to how much less or more each vector gets compressed relative to the 16-bit default. PlanetScale scales storage space automatically for any branch backed by EBS. On Metal, make sure the instance size you choose has enough storage for the vector indexes you plan to create. By default, vector indexes reserve half of the buffer-pool memory allocated to each instance, equal to about one third of overall instance memory. This tradeoff can be changed in the cluster configuration for any branch with vectors enabled.

Usage recommendations

Incremental indexes (indexes that begin empty and update as new vectors are added) function correctly, but are significantly slower to build compared to a one-shot index (an index built on an existing set of vectors). Disk usage is much higher due to potentially very high InnoDB blob fragmentation issues, so it’s much easier to run out of disk space.
Once you opt a branch into the vectors feature, that branch must continue to run a vectors-enabled version of MySQL. You can remove your vector columns/tables, but you cannot downgrade that branch to its prior version of MySQL.

Feedback

We want to make our vectors offering as reliable, fast, and feature-rich as possible. Feedback from our early users will help make this possible. If you encounter any issues, crashes, unexpected errors or poor performance, please submit a support ticket. You are also welcome to reach out with general feedback and suggestions. We also have a Discord channel for the vectors feature where you can ask questions, share feedback, and discuss what you’re working on.

Need help?

Get help from the PlanetScale Support team, or join our GitHub discussion board to see how others are using PlanetScale.

Platform

Plans and billing

Integrations

Guides

Vector search and storage

Enabling PlanetScale vectors

Adding vector columns

Vector index parameters

Filtered vector queries

Resource requirements

Usage recommendations

Feedback

Need help?

Platform

Plans and billing

Integrations

Guides

​Enabling PlanetScale vectors

​Adding vector columns

​Vector index parameters

​Filtered vector queries

​Resource requirements

​Usage recommendations

​Feedback

​Need help?

Enabling PlanetScale vectors

Adding vector columns

Vector index parameters

Filtered vector queries

Resource requirements

Usage recommendations

Feedback

Need help?