Introducing ScaNN in BigQuery vector search for large query batches

2024-08-20

Google Cloud

Google Cloud announced the preview of the TreeAH vector index, bringing core pieces from Google’s research and innovation in approximate nearest neighbor algorithms to BigQuery. This new index type uses the same underlying technology that powers some of Google’s most popular services. For certain workloads, notably large query batches, it delivers significant latency and cost reductions compared to the first index implemented in BigQuery, the inverted file index (IVF).

One of the key advantages of the TreeAH index is its use of asymmetric hashing (the “AH” in TreeAH), which uses product quantization to compress embeddings. Coupled with a CPU-optimized distance computation algorithm, vector search using TreeAH can be orders of magnitude faster and more cost-efficient than IVF. Index generation can also be 10x faster and cheaper, and the index has a smaller memory footprint because only the compressed embeddings are stored.
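
To make the idea of asymmetric hashing more concrete, here is a minimal sketch of product quantization with asymmetric distance computation in plain Python. It is an illustration under stated assumptions, not BigQuery's or ScaNN's actual implementation: all function names (`train_codebooks`, `encode`, `asymmetric_distances`) are hypothetical, the codebooks are trained with a tiny k-means chosen for brevity, and none of the CPU-level optimizations mentioned above are shown. The "asymmetric" part is that stored vectors are compressed to small integer codes while the query stays uncompressed and is compared against the codebook centroids through a lookup table.

```python
import random

def split(vec, m):
    """Split a vector into m equal-length subvectors (len(vec) must divide by m)."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means, used here only to build an illustrative codebook."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_codebooks(vectors, m, k):
    """Train one codebook of k centroids for each of the m subspaces."""
    subs = [split(v, m) for v in vectors]
    return [kmeans([s[i] for s in subs], k, seed=i) for i in range(m)]

def encode(vec, codebooks):
    """Compress a vector to m centroid indices, one per subspace."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def asymmetric_distances(query, codes, codebooks):
    """Approximate squared distances from an uncompressed query to compressed codes."""
    # table[i][c] = squared distance from the query's i-th subvector to centroid c
    table = [[sq_dist(sub, centroid) for centroid in cb]
             for sub, cb in zip(split(query, len(codebooks)), codebooks)]
    # Each stored vector costs only m table lookups and additions.
    return [sum(table[i][c] for i, c in enumerate(code)) for code in codes]
```

In this sketch, each stored embedding is reduced to `m` small integers, which mirrors why only compressed embeddings need to be kept in the index. The lookup table is built once per query and reused for every stored code, so the per-vector work is a handful of additions rather than a full distance computation.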

Benchmarks conducted by Google’s engineering team showed that TreeAH significantly outperforms IVF when the query batch size is large. For example, for query batches with 10,000 vectors, TreeAH was up to 23x faster and 95% cheaper than IVF. TreeAH index training was also significantly faster and cheaper than IVF in most cases.

However, TreeAH is still under active development and has some current limitations. For example, the base table can have at most 200 million rows, and stored columns and pre-filtering are not supported for the TreeAH index.

Overall, TreeAH is a valuable addition to BigQuery, offering significant performance and cost benefits for certain types of vector search workloads. This is expected to enable more use cases for vector search in BigQuery, such as semantic search and LLM-based retrieval-augmented generation (RAG).
