How to Store and Query Embeddings in ClickHouse for Scalable Vector Search

fuzzy
2026-02-24
11 min read

Practical guide (2026) to storing embeddings, building ANN indices, and running hybrid fuzzy + vector search in ClickHouse for scalable recommendations.

Stop losing users to misspellings and runaway vector costs — store and query embeddings in ClickHouse

If your product requires recommendations, semantic search, or fuzzy matching at OLAP scale, you know the pain: dozens of false negatives in search results, unpredictable latency when traffic spikes, and a messy stack that splits analytics and retrieval across separate stores. In 2026, many engineering teams are consolidating vector search into their analytics tier. ClickHouse's 2025–26 momentum (including a high-profile funding round) and its OLAP-first architecture make it a compelling place to host embeddings and run hybrid fuzzy + vector queries that are both scalable and cost-efficient.

Why this matters: ClickHouse raised a $400M round led by Dragoneer and was valued at $15B in late 2025 — that capital accelerated investments into native vector features and operational tooling for production vector search.

— Dina Bass / Bloomberg (Jan 2026)

What you'll get from this guide

  • Practical ingestion pipelines for embeddings (batch and streaming)
  • How to store vectors in ClickHouse (two safe schemas)
  • Step-by-step instructions for building and tuning ANN indexes (HNSW / IVF patterns)
  • Hybrid query patterns combining fuzzy lexical filters and ANN for precision and speed
  • Operational advice: scaling, monitoring, reindexing, and cost tradeoffs

The 2026 context: ClickHouse, vectors, and OLAP + retrieval convergence

By early 2026, two trends are clear:

  1. Vector consolidation: Enterprises want fewer moving parts. Putting vectors into the analytics store reduces data movement and simplifies ownership.
  2. OLAP meets ANN: ClickHouse’s architecture (columnar storage, efficient merges, and compact on-disk formats) is well-suited for large-scale ANN because it can host billions of vectors with good IO patterns and predictable cost.

These trends are backed by product and community investments following ClickHouse’s late-2025 funding. Expect continued improvements to native vector types, ANN index primitives, and cloud-managed vector features through 2026.

High-level architecture: Where ClickHouse fits in your stack

Recommended role for ClickHouse:

  • Primary storage for embeddings and associated metadata (product text, user ids, timestamps)
  • Host for hybrid queries: apply cheap lexical filters first, then run ANN to refine candidates, and finally rerank with business signals
  • Analytics for retraining models: run large-scale aggregations and build monitoring views without moving data

What NOT to do

  • Don’t treat ClickHouse as a single-node vector index for microsecond-scale, per-request nearest-neighbour lookups — it is optimized for high-throughput, low-cost OLAP workloads.
  • Don’t store embeddings without normalization or metadata — you’ll lose the ability to tune similarity and filter candidates.

Step 1 — Choose a storage schema (two safe patterns)

ClickHouse supports multiple ways to store vectors. Use whichever matches your ClickHouse version and operational needs.

Option A — Portable and battle-tested: Array(Float32)

Use this when you want compatibility across ClickHouse releases. Array(Float32) is simple and works with built-in functions.

CREATE TABLE product_embeddings (
  id UUID,
  product_id UInt64,
  category_id UInt32,
  title String,
  description String,
  embedding Array(Float32), -- e.g. 768 or 1536 dims
  embedding_norm Float32,
  created_at DateTime
) ENGINE = MergeTree()
ORDER BY (product_id);

Store an embedding_norm (L2 norm) at write time to speed up cosine distance computations and avoid recomputing during queries.
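
For example, if vectors are stored pre-normalized, cosine similarity reduces to a plain dot product at query time. A minimal sketch with clickhouse_connect, assuming query_vec is a hypothetical variable holding your normalized query embedding as a list of floats:

import clickhouse_connect

client = clickhouse_connect.get_client(host='ch-cluster.local', username='default', password='')

# Both sides are unit length, so dotProduct(...) equals cosine similarity
# and no norms need recomputing per row.
result = client.query(
    "SELECT product_id, title, dotProduct(embedding, {q:Array(Float32)}) AS sim "
    "FROM product_embeddings ORDER BY sim DESC LIMIT 10",
    parameters={'q': query_vec},
)
for product_id, title, sim in result.result_rows:
    print(product_id, title, round(sim, 4))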

Option B — Native vector type (if your ClickHouse version has it)

Recent ClickHouse releases in 2025–26 introduced a vector data type in many distributions and added ANN-specific index primitives. If available, it can be more compact and faster.

CREATE TABLE product_embeddings_v (
  id UUID,
  product_id UInt64,
  title String,
  embedding Vector(Float32, 1536),
  created_at DateTime
) ENGINE = MergeTree()
ORDER BY (product_id);

If you use the native vector type, you can rely on built-in distance functions (cosineDistance, dotProduct) and, in many releases, native ANN indexes.

Step 2 — Ingestion: embedding generation and best practices

Embedding ingestion has two patterns: batch (reindex) and streaming (near-real-time updates). Both share the same best practices.

Embedding generation

  • Use a consistent model (for reproducibility). Record model metadata (name, version, hyperparams) in a separate table (a sketch follows this list).
  • Normalize embeddings at write time. For cosine similarity: divide by L2 norm and save the norm for fallback checks.
  • Binarize or quantize only when you have a measured memory/latency reason and baseline recall tests.
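
A minimal sketch of such a metadata table, reusing the client from the query example above; the embedding_models name and columns are assumptions to adapt, not a fixed convention:

# Hypothetical companion table for model provenance
client.command("""
CREATE TABLE IF NOT EXISTS embedding_models (
  model_name String,
  model_version String,
  dims UInt16,
  normalized UInt8,   -- 1 if vectors are L2-normalized at write time
  hyperparams String, -- JSON blob of encoding settings
  created_at DateTime DEFAULT now()
) ENGINE = MergeTree()
ORDER BY (model_name, model_version)
""")

client.insert(
    'embedding_models',
    [('all-MiniLM-L6-v2', '2026-01', 384, 1, '{"pooling": "mean"}')],
    column_names=['model_name', 'model_version', 'dims', 'normalized', 'hyperparams'],
)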

Example Python ingestion (batch)

import clickhouse_connect
import numpy as np
from datetime import datetime, timezone

client = clickhouse_connect.get_client(host='ch-cluster.local', username='default', password='')

# Example embed call (replace `model` with your embedding model client)
def embed(text):
    v = model.encode(text).astype('float32')
    norm = np.linalg.norm(v)
    if norm > 0:
        v = v / norm  # store unit-length vectors for cosine search
    return v.tolist(), float(norm)

rows = []
for product in products:  # `products` is your source iterable of dicts
    # product['id'] should be a uuid.UUID for the UUID column
    vec, norm = embed(product['title'] + ' ' + product['description'])
    rows.append((product['id'], product['product_id'], product['title'],
                 product['description'], vec, norm, datetime.now(timezone.utc)))

client.insert(
    'product_embeddings', rows,
    column_names=['id', 'product_id', 'title', 'description',
                  'embedding', 'embedding_norm', 'created_at'],
)
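
For the streaming path, the same normalize-then-insert logic applies; the main difference is buffering. A minimal sketch that flushes small batches, where on_event is a hypothetical hook standing in for your Kafka or queue consumer, reusing embed and client from the batch example:

BATCH_SIZE = 500
buffer = []

def on_event(product):
    # Called by your stream consumer for each created/updated product.
    vec, norm = embed(product['title'] + ' ' + product['description'])
    buffer.append((product['id'], product['product_id'], product['title'],
                   product['description'], vec, norm, datetime.now(timezone.utc)))
    if len(buffer) >= BATCH_SIZE:  # small batches keep MergeTree part counts sane
        client.insert(
            'product_embeddings', buffer,
            column_names=['id', 'product_id', 'title', 'description',
                          'embedding', 'embedding_norm', 'created_at'],
        )
        buffer.clear()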

Step 3 — Build an ANN index and tuning knobs

Two patterns work well in production with ClickHouse:

  1. Native ANN index (if available): use ClickHouse’s built-in ANN index or index metadata introduced in 2025–26 for HNSW and IVF.
  2. External ANN + ClickHouse metadata: use FAISS / HNSWLIB / Milvus or Redis for the index and store metadata and embeddings in ClickHouse. Use a join step to enrich results.

When ClickHouse supports an ANN index type, you typically add an index on the vector column. Example (syntax varies by release; recent releases expose this as a vector_similarity index, with optional extra arguments for dimensions, quantization, M, and efConstruction):

ALTER TABLE product_embeddings_v
  ADD INDEX ann_idx embedding TYPE vector_similarity('hnsw', 'cosineDistance');

Key knobs (HNSW):

  • M (connectivity): controls graph degree. Higher M → better recall, more memory. Typical start: 12–32.
  • efConstruction: index build quality and build time. Start 100–200 for production quality.
  • efSearch: query-time recall/latency trade-off. Start 50–200; increase until recall stabilizes.

IVF / quantization (memory efficient)

For very large corpora (hundreds of millions to billions of vectors), IVF + PQ or scalar quantization reduces RAM footprint at the cost of recall. Use it if your budget limits RAM and you accept a small recall reduction.

External ANN (when you want specialized control)

Run FAISS/HNSWLIB/Milvus for the index and keep ClickHouse as the source of truth. Use a two-step search (sketched after this list):

  1. Query the ANN engine for top-K IDs.
  2. Batch-retrieve those IDs' metadata from ClickHouse for final scoring.
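
A minimal sketch of the two-step pattern with FAISS (hnswlib or Milvus would be analogous); ids and vectors (a parallel Python list and float32 NumPy matrix) are assumed to have been exported from ClickHouse, and the HNSW knobs mirror the starting points above:

import faiss
import numpy as np

dims, M = 1536, 16
index = faiss.IndexHNSWFlat(dims, M)  # HNSW graph over raw float32 vectors
index.hnsw.efConstruction = 200
index.add(vectors)                    # shape (n, dims); row i corresponds to ids[i]

# 1) ANN engine returns top-K row positions
index.hnsw.efSearch = 100
_, neighbors = index.search(np.asarray([query_vec], dtype='float32'), 50)
top_ids = [ids[i] for i in neighbors[0] if i != -1]

# 2) Batch-retrieve metadata from ClickHouse for final scoring
rows = client.query(
    "SELECT product_id, title FROM product_embeddings WHERE product_id IN {ids:Array(UInt64)}",
    parameters={'ids': top_ids},
).result_rows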

Step 4 — Hybrid search: combine fuzzy lexical filters + ANN

Hybrid search is a core pattern: use a cheap lexical filter to reduce candidates, then run ANN or distance computation for semantic ranking. This saves CPU and memory and often improves precision.

Why hybrid works

  • Fuzzy lexical filtering removes gross mismatches (e.g., different product categories).
  • ANN handles paraphrase and intent mismatch that lexical can’t catch.
  • Combining both yields faster, more precise top-K results — fewer vector distance computations and less memory pressure.

Lexical filters you can use in ClickHouse

  • Full-text-ish token filters: use ngrambf_v1 or tokenbf_v1 skip indexes (n-gram and token Bloom filters) for high-selectivity filters; an example follows this list.
  • Simple predicates: category_id = X, availability = 1
  • Fuzzy string matching: use trigram or edit-distance functions where available, but prefer n-gram indexes for scale.
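
For instance, an n-gram Bloom filter skip index on title makes substring and ILIKE-style predicates cheap to evaluate. A sketch with starting-point parameters (3-grams, a 10240-byte filter, 3 hash functions, seed 0; tune these on your data):

# ngrambf_v1(n, bloom_filter_bytes, hash_functions, seed)
client.command("""
ALTER TABLE product_embeddings
  ADD INDEX title_ngram_idx title TYPE ngrambf_v1(3, 10240, 3, 0) GRANULARITY 4
""")
# Build the index for existing parts, not just for new inserts
client.command("ALTER TABLE product_embeddings MATERIALIZE INDEX title_ngram_idx")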

Hybrid query pattern (example)

Two-step query: filter by tokens → ANN search on the reduced set → final rerank combining lexical score and cosine similarity.

-- 1) Lexical filter to narrow to candidates (fast)
SELECT product_id
FROM product_embeddings
WHERE category_id = 42
  AND (title ILIKE '%waterproof%' OR description ILIKE '%water-proof%')
LIMIT 2000;

-- 2) ANN / distance computation on candidates
SELECT p.product_id, p.title,
       1 - cosineDistance(p.embedding, {query_emb:Array(Float32)}) AS vec_sim,
       editDistance(lower(p.title), lower({q:String})) AS edit_dist
FROM product_embeddings p
WHERE p.product_id IN (/* list of IDs from step 1 */)
ORDER BY (0.7 * vec_sim + 0.3 * (1.0 - edit_dist / 20.0)) DESC
LIMIT 10;

Explanation:

  • Step 1 returns about 500–5,000 candidates — cheap and highly selective.
  • Step 2 computes vector similarity only for those candidates, then combines the semantic score with a normalized lexical score for the final rank (glue code sketched below).
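
A sketch of the glue code, assuming the client and normalized query_vec from earlier; the query text is passed once as a lexical parameter (q) and once as its embedding (query_emb):

ids = [r[0] for r in client.query(
    "SELECT product_id FROM product_embeddings "
    "WHERE category_id = {cat:UInt32} AND title ILIKE {pat:String} LIMIT 2000",
    parameters={'cat': 42, 'pat': '%waterproof%'},
).result_rows]

top10 = client.query(
    """
    SELECT product_id, title,
           1 - cosineDistance(embedding, {query_emb:Array(Float32)}) AS vec_sim,
           editDistance(lower(title), lower({q:String})) AS edit_dist
    FROM product_embeddings
    WHERE product_id IN {ids:Array(UInt64)}
    ORDER BY (0.7 * vec_sim + 0.3 * (1.0 - edit_dist / 20.0)) DESC
    LIMIT 10
    """,
    parameters={'query_emb': query_vec, 'q': 'waterproof', 'ids': ids},
).result_rows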

Tuning: recall, latency, and cost trade-offs

Measure these metrics:

  • Recall@k vs a golden dataset (a helper sketch follows this list)
  • Latency P50 / P95 for queries under expected concurrency
  • Throughput (QPS) and resource usage (CPU, memory, IO)
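
Recall@k is simple to compute once you have a golden set. A minimal sketch, where golden maps each query to the set of IDs judged relevant (by humans or by exhaustive exact search) and search_fn is whichever retrieval path you are testing:

def recall_at_k(golden, search_fn, k=10):
    # golden: {query_text: set(relevant_ids)}; search_fn(query, k) -> list of ids
    hits, total = 0, 0
    for query, relevant in golden.items():
        retrieved = set(search_fn(query, k))
        hits += len(retrieved & relevant)
        total += len(relevant)
    return hits / total if total else 0.0  # micro-averaged recall@k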

Recommended tuning loop:

  1. Start with a small sample index and realistic query set.
  2. Set ANN parameters for high recall (higher M, higher efSearch) and measure latency.
  3. Adjust hybrid filter selectivity to reduce the candidate set until latency targets are met.
  4. Consider quantization only if memory is the bottleneck and recall tests pass.

Operational guidance

Sharding and replication

  • Shard by product_id ranges or a hash of product_id to spread index memory across nodes.
  • Replicate for HA and for read-scaling. ANN indexes may be local to nodes — account for index rebuilds during node replacement.

Index rebuilds and incremental updates

ANN indexes can be expensive to rebuild. Best practices:

  • Buffer writes and run periodic batch rebuilds (nightly) for large corpora.
  • For near-real-time needs, maintain a small ‘recent’ index and periodically merge into the main index.
  • Automate health checks: track index sizes, efSearch defaults, and any recovery operations.

Monitoring and alerts

  • Track query latency by type (hybrid vs pure lexical vs pure vector) — see the sketch after this list.
  • Monitor memory used by ANN structures per node (graph/node counts, PQ memory).
  • Alert on dropped recall or sudden latency spikes after model rollout.
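
One way to get per-type latency without extra tooling is to tag queries and read system.query_log back. A sketch, assuming you pass a log_comment setting on each search query (e.g. client.query(sql, parameters=..., settings={'log_comment': 'hybrid'})):

latency = client.query("""
SELECT
    log_comment AS query_type,
    quantile(0.5)(query_duration_ms)  AS p50_ms,
    quantile(0.95)(query_duration_ms) AS p95_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
  AND log_comment IN ('hybrid', 'lexical', 'vector')
GROUP BY log_comment
""").result_rows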

Benchmarks and cost modeling (how to estimate)

Run microbenchmarks on sample shards:

  • Latency per query as efSearch varies (plot P50 / P95)
  • Memory for HNSW: roughly vector_count * (dims * 4 bytes for raw float32 vectors + ~2 * M * bytes_per_link for the graph); see the calculator after this list
  • IO pattern for cold vs warm caches (NVMe hot cache is far faster than network storage)
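
A back-of-envelope calculator under those assumptions (float32 vectors and roughly 2 * M four-byte graph links per vector, a rule of thumb rather than an exact figure):

def hnsw_ram_gb(vector_count, dims, M=16, bytes_per_link=4, headroom=1.3):
    raw = vector_count * dims * 4                   # float32 vectors
    links = vector_count * 2 * M * bytes_per_link   # rough HNSW graph overhead
    return (raw + links) * headroom / 1024**3

# e.g. 200M vectors x 1536 dims at M=24: roughly 1.5 TB before sharding
print(round(hnsw_ram_gb(200_000_000, 1536, M=24), 1))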

Cost modeling tips:

  • Compute RAM needed for your ANN config and multiply by node count + headroom.
  • Estimate cost savings vs running a separate vector DB: less data egress, a single backup path, a single ops team.
  • Model rebuild windows: how often will you reindex and what is the CPU impact?

Real-world case study (pattern)

Company: a SaaS vendor with 200M product rows and multi-tenant traffic. Goals: reduce search latency to <200ms p95 and increase conversion by surfacing correct results despite typos.

  1. Ingested embeddings (1536 dims) into ClickHouse using Array(Float32).
  2. Built a hybrid pipeline: category ngram bloom filters → ANN candidates (top 2k) → final rerank combining vector cosine and business score.
  3. Tuned HNSW parameters (M=24, efConstruction=300, efSearch=120) and sharded by product_id hash across 16 shards.
  4. Result: p95 latency dropped by 35%, recall@10 increased 18%, overall infra cost decreased by 25% vs running a separate FAISS service and duplicating data.

Common pitfalls and how to avoid them

  • Too-large candidate sets — use selective lexical filters or business filters before ANN.
  • Inconsistent normalization — always store normalized embeddings or norms to avoid ranking drift when models change.
  • Index rebuild surprises — automate and test rebuilds on a QA cluster before production runs.
  • Model drift — version your models, keep a replay log of queries and periodically validate recall with ground truth.

Advanced patterns worth testing

  • Model-aware retrieval: store additional dense features (popularity, CTR embeddings) and concatenate them into hybrid vectors when re-ranking.
  • Adaptive efSearch: dynamically increase efSearch for long-tail queries or VIP customers and use lower efSearch for common queries to save resources.
  • Query routing: route complex hybrid queries to a dedicated pool of nodes tuned for high efSearch and route lightweight filters to cheaper nodes.
  • Vector quantization on ingestion: do PQ for cold vectors and keep hot vectors unquantized.

Checklist before shipping to production

  1. Have a golden evaluation set and measure recall@k regularly.
  2. Establish latency SLOs and test under realistic concurrency.
  3. Automate index builds, backups, and restores of the vector payload.
  4. Monitor model and embedding drift; have a rollback plan for model updates.
  5. Document query behavior and ranking formula for the product and data teams.

Actionable takeaways

  • Store normalized vectors and model metadata in ClickHouse to make semantic search reproducible.
  • Use hybrid search: filter first with cheap lexical predicates, then run ANN on the smaller candidate set.
  • Tune ANN parameters (M, efConstruction, efSearch) on a realistic sample and measure recall vs latency.
  • Shard and replicate to distribute ANN memory and provide read-scaling and HA.
  • Run continuous evaluation against a golden set and automate reindexing and monitoring.

Final notes — the next 12 months

Given ClickHouse’s increased investment in 2025–26 and the broader industry push for single-store vector + analytics solutions, expect faster native vector features, improved index primitives, and richer SQL primitives for hybrid ranking throughout 2026. Teams that adopt a hybrid, OLAP-driven approach today will benefit from simpler operations and lower TCO as features mature.

Get started checklist (quick)

  1. Pick schema (Array(Float32) for compatibility or native Vector if available).
  2. Implement an ingestion job that normalizes embeddings and stores model metadata.
  3. Build a small ANN index, benchmark with a realistic query set, tune parameters.
  4. Add selective lexical filters and implement the two-step hybrid query path.
  5. Automate monitoring and schedule regular recall evaluations.

Call to action

Ready to ship hybrid fuzzy + vector search in ClickHouse? Start with a one-week spike: ingest a 100k-row sample, build an ANN index, and benchmark recall/latency. If you want a template cluster config, sample ingestion script, and a benchmark harness (pytest-based) tuned for ClickHouse 2026 releases, email hello@fuzzy.website or download our open-source starter pack linked from the article page.


Related Topics

#clickhouse #vector-search #databases

fuzzy

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
