Building a Privacy-Respecting Siri Plugin: fuzzy contact and calendar matching with Gemini-class models


2026-02-14

Resolve ambiguous contacts and calendar entries with a hybrid on-device embeddings + Gemini approach. Keep PII local and call cloud LLMs only when needed.

Why fuzzy matching still breaks assistant UX — and how to fix it without sacrificing privacy

Developers building assistant integrations face a familiar problem: users speak imprecise names — "call Sam from design" — and the assistant either picks the wrong contact or asks a long, high-friction follow-up question. Search results are noisy, fuzzy matching yields false negatives, and teams are reluctant to send contact and calendar text to cloud LLMs for privacy and compliance reasons.

In 2026 the landscape changed: Apple’s Siri now leans on Google’s Gemini-class models for complex reasoning, while demand for local AI and privacy-first UX has surged. This article shows a production-ready architecture and code patterns that combine on-device embeddings and nearest-neighbor search with cloud Gemini disambiguation — keeping PII on-device except when necessary, and then only sending minimal, privacy-preserving context.

The high-level pattern: Hybrid fuzzy resolver

At the core is a two-stage resolver:

  1. On-device fuzzy index — build and query an encrypted vector index of contacts and calendar entries on the device. Most queries are resolved locally with millisecond latency.
  2. Cloud LLM (Gemini) fallback — only when the local resolver returns low confidence or multiple plausible candidates do we send a minimal, anonymized query to a Gemini-class model for final disambiguation.

Why this pattern?

  • Performance: local search is fast and cheap for most queries.
  • Privacy: PII stays on-device by default; cloud is only used intentionally.
  • Accuracy: Gemini-class models handle edge cases, temporal reasoning ("next Tuesday after travel"), and conversational disambiguation.
  • Apple and Google partnerships and on-device AI capabilities (Apple Neural Engine improvements) let devices host quantized encoders that generate high-quality embeddings with low battery cost.
  • Gemini-class models (Vertex AI / cloud-hosted) offer fast, reliable reasoning and multimodal context handling, ideal for small disambiguation prompts.
  • Privacy-first design and regulation (GDPR/CCPA maturation and consumer demand) push architectures toward limited-upload, explainable flows.

Architecture diagram (conceptual)

Device (mobile / desktop)
---------------------------------------
[Voice/NLP] -> [Intent Parser] -> [Local Resolver]
                                        |
                     confidence high? --+-- yes --> take action (call / calendar)
                                        |
                                        no
                                        v
                          prepare anonymized context
                                        |
                                        v
                                 Cloud (Gemini)
                                        |
                                        v
                      Gemini returns resolution (candidate ID)
                                        |
                                        v
                              device takes action

Key privacy guarantees and how we implement them

We want these guarantees:

  • PII minimization: contacts/calendar text should not be uploaded by default.
  • Explainability: ability to reconstruct why the assistant chose a contact.
  • Consent and auditing: user opt-in for cloud disambiguation; logs stored locally or in encrypted form.

Concrete mechanisms:

  • On-device encrypted index (Secure Enclave / Keychain) — vector store and metadata encrypted at rest.
  • Upload only candidate IDs and minimal metadata (e.g., contact initials, role tag) in the cloud call. Never upload full contact text unless explicit consent.
  • Add small, controlled noise to embeddings (local differential privacy) when you want to send vector summaries to the cloud for aggregated analytics. But avoid noisy embeddings for disambiguation calls; see reducing AI exposure for guidance.
  • Keep an audit trail (timestamped, local) of every cloud disambiguation request and response, surfaced in user settings.
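
A minimal sketch of the local audit trail from the last point; the CloudAuditEntry shape and the secureStore helper are illustrative app-level pieces (an encrypted key-value store backed by Keychain/Keystore), not a specific library.

// auditLog.ts
// Minimal sketch of a local, append-only audit trail for cloud disambiguation calls.
// secureStore is an assumed app-provided encrypted key-value store.

interface CloudAuditEntry {
  timestamp: string              // ISO-8601, device clock
  utteranceHash: string          // hash of the utterance, never the raw text
  candidateIds: string[]         // opaque IDs that were sent to the cloud
  chosenId: string | null        // resolution returned by the model (null on failure)
  modelConfidence: number | null
}

async function appendAuditEntry(entry: CloudAuditEntry): Promise<void> {
  // Cloud calls are rare (a few percent of queries), so read-append-write is cheap enough.
  const raw = await secureStore.load('cloud-audit-log')
  const log: CloudAuditEntry[] = raw ? JSON.parse(raw) : []
  log.push(entry)
  await secureStore.save('cloud-audit-log', JSON.stringify(log))
}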

On-device index: model choices and indexing strategies

The on-device side needs a compact encoder and a fast, memory-efficient ANN index. Options in 2026:

  • Quantized sentence encoders (tiny transformer encoders distilled for mobile). On iOS you can convert a model to Core ML; on Android use TFLite with int8 or 4-bit quantization.
  • Lightweight embedding families: miniLM-style distilled encoders or purpose-built mobile encoders shipped by vendors (some devices include vendor-optimized encoders).
  • ANN index implementations: HNSW (hnswlib), Small-Scale FAISS variants, or bespoke HNSW in Rust/Swift. For very small address books a brute-force cosine search is fine.

Practical sizing guidance

  • Contacts: 100–2,000 records — a 384-dim float16 index is tiny (2,000 entries × 384 dims × 2 bytes ≈ 1.5 MB of raw vectors, a few MB with HNSW graph overhead) and serves sub-10 ms nearest-neighbor queries on modern phones.
  • Calendars: the number of events can grow; keep a time-windowed index (next 12 months) and prune old events.
  • Store metadata (phone numbers, emails, calendar IDs) separately encrypted; link to vector entries with stable IDs.

Local resolver: scoring, confidence, and heuristics

Design a deterministic scoring pipeline that combines:

  1. Embedding similarity (cosine distance).
  2. Token-level fuzzy match (normalized edit distance on names; alias table lookups).
  3. Context signals: time-of-day, recent contacts, device app activity, calendar conflicts.

Compute a composite confidence score and only trigger cloud disambiguation if (see the threshold sketch after this list):

  • The top candidate's similarity is below a threshold T_low (e.g., cosine similarity < 0.75), OR
  • The top K candidates sit within a small margin D of each other (ambiguous near-ties), OR
  • Multiple calendar matches overlap the requested time window and require reasoning.
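
A minimal sketch of that threshold gate, assuming candidates arrive already ranked by composite score; the threshold values are illustrative and should be tuned against labeled utterances from your own traffic.

// escalation.ts
// Minimal sketch of the confidence gate described above. Candidates are assumed
// to be sorted by composite score, highest first; thresholds are illustrative.

interface ScoredCandidate { id: string; score: number }

const T_LOW = 0.75  // escalate if the best composite score is weaker than this
const DELTA = 0.05  // escalate if the top two candidates are closer than this

function shouldEscalateToCloud(ranked: ScoredCandidate[]): boolean {
  if (ranked.length === 0) return true
  const best = ranked[0].score
  const runnerUp = ranked.length > 1 ? ranked[1].score : -Infinity
  // Low absolute confidence OR an ambiguous near-tie triggers the Gemini fallback.
  return best < T_LOW || best - runnerUp < DELTA
}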

Code: on-device embedding + HNSW search (TypeScript pseudo-code)

Below is a compact TypeScript example for an Electron or React Native app using a tiny encoder (WebNN/WASM) and hnswlib-wasm. The pattern applies similarly in Swift/Kotlin.

// buildIndex.ts
import { encode } from 'mobile-encoder'    // local quantized encoder (illustrative package name)
import HNSW from 'hnswlib-wasm'            // WASM HNSW index (API simplified for readability)
import { secureStore, encrypt } from './secureStore' // app-provided encrypted storage helpers

async function buildIndex(contacts) {
  const dim = 384
  const index = new HNSW('cosine', dim)
  index.initIndex(contacts.length)

  for (const c of contacts) {
    const v = await encode(c.displayName) // returns a Float32Array of length dim
    index.addItem(c.id, v)
    // Phone numbers, emails, and notes never enter the vector index;
    // they are stored encrypted separately, keyed by the same stable contact ID.
    await secureStore.save(c.id, encrypt(JSON.stringify(c.meta)))
  }
  index.save('local-contacts.hnsw')
}

// query.ts
async function resolve(queryText) {
  const q = await encode(queryText)
  const index = HNSW.load('local-contacts.hnsw')
  const { neighbors, distances } = index.search(q, 5) // top-5 nearest neighbors
  // Combine vector distance with edit-distance heuristics and a recent-contact boost
  // (see the rankCandidates sketch below).
  return rankCandidates(neighbors, distances, queryText)
}
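
The rankCandidates helper above is application code, not a library call. One minimal way to implement it is sketched below; getDisplayName and recentContacts stand in for app-level lookups against the encrypted metadata store, and the weights are illustrative.

// rank.ts
// App-provided lookups (assumed): decrypted display name and recent-call set.
declare function getDisplayName(id: string): string
declare const recentContacts: Set<string>

function rankCandidates(neighbors: string[], distances: number[], queryText: string) {
  const query = queryText.toLowerCase()
  return neighbors
    .map((id, i) => {
      const name = getDisplayName(id).toLowerCase()
      const similarity = 1 - distances[i]  // cosine distance -> similarity
      const fuzzy = 1 - levenshtein(query, name) / Math.max(query.length, name.length, 1)
      const recency = recentContacts.has(id) ? 0.1 : 0
      return { id, score: 0.6 * similarity + 0.3 * fuzzy + recency }
    })
    .sort((a, b) => b.score - a.score)
}

// Plain dynamic-programming Levenshtein distance; fine for short name strings.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) => {
    const row = new Array(b.length + 1).fill(0)
    row[0] = i
    return row
  })
  for (let j = 0; j <= b.length; j++) dp[0][j] = j
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      )
    }
  }
  return dp[a.length][b.length]
}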

When to call Gemini: minimal, privacy-preserving requests

If the local resolver signals ambiguity, prepare a compact, anonymized payload for Gemini. The goal is to send only what’s necessary:

  • Top-N candidate IDs (opaque UUIDs), their role tags ("work", "mobile"), and non-PII attributes like calendar free/busy slots.
  • The transcribed user utterance and intent type (CALL, MESSAGE, SCHEDULE_MEETING).
  • Optional short conversation history (last 1–2 utterances) if it helps disambiguation.

Do NOT include phone numbers, emails, full contact notes, or calendar titles unless the user explicitly consents.
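
A minimal sketch of what that payload can look like; the DisambiguationPayload shape is an illustrative schema that mirrors the bullets above, not a fixed API.

// payload.ts
// Minimal sketch of an anonymized disambiguation payload (illustrative schema).
interface DisambiguationPayload {
  intent: 'CALL' | 'MESSAGE' | 'SCHEDULE_MEETING'
  utterance: string            // transcribed user request
  candidates: Array<{
    id: string                 // opaque UUID, meaningless off-device
    roleTag?: string           // e.g. 'work', 'mobile', 'Design-Lead'
    recentContact?: boolean
    freeBusy?: string[]        // coarse free/busy slots, never event titles
  }>
  history?: string[]           // optional: last 1–2 utterances, only with consent
}

// Example: what actually leaves the device for "Call Sam from design".
const payload: DisambiguationPayload = {
  intent: 'CALL',
  utterance: 'Call Sam from design',
  candidates: [
    { id: 'id-1', roleTag: 'Designer', recentContact: true },
    { id: 'id-2', roleTag: 'Product', recentContact: false },
    { id: 'id-3', roleTag: 'Design-Lead', recentContact: true },
  ],
}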

Example Gemini prompt (server-side)

{
  model: 'gemini-pro-1.5',
  prompt: `You are an assistant that resolves ambiguous contact or calendar references.
  User said: "Call Sam from design"
  Candidates (IDs only): ["id-1","id-2","id-3"]
  Candidate tags:
   id-1: {role: 'Designer', recentContact: true, timezone: 'PST'}
   id-2: {role: 'Product', recentContact: false}
   id-3: {role: 'Design-Lead', recentContact: true}
  Question: Which candidate is most likely the user meant? Answer with the chosen ID plus a 1-sentence justification and a confidence score 0-1.
  `
}

Gemini returns an ID and short rationale. The device maps the ID back to the encrypted metadata and proceeds. For developer context on the Google–Apple deal and Gemini usage patterns, see Siri + Gemini: What Developers Need to Know.

Mitigating privacy leakage from embeddings

Embeddings can leak information if uploaded directly. Options to reduce leakage risk:

  • Never send raw contact embeddings to the cloud for single-user disambiguation.
  • If you need aggregated analytics or hybrid search, apply differential-privacy noise to embeddings and aggregate only across many users (see the noise sketch below).
  • Use feature hashing on categorical metadata rather than raw strings.

Rule of thumb: treat embeddings as sensitive PII unless proved otherwise. Minimize uploads and provide user-visible controls.
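
If you do ship aggregated analytics, the noise addition mentioned above can be as small as the clipping-plus-Gaussian sketch below. The sigma and clipping values are placeholders you must calibrate to your chosen sensitivity and epsilon budget, and production code should use a cryptographically secure RNG rather than Math.random.

// dpNoise.ts
// Minimal sketch: clip and add Gaussian noise to an embedding before it leaves the
// device for aggregated analytics only. Parameter values are purely illustrative.

function clipL2(v: Float32Array, maxNorm: number): Float32Array {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0))
  if (norm <= maxNorm) return v
  return v.map(x => (x * maxNorm) / norm)
}

function gaussianNoise(sigma: number): number {
  // Box-Muller transform; use a CSPRNG in production.
  const u1 = 1 - Math.random()
  const u2 = Math.random()
  return sigma * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}

function noisyEmbedding(v: Float32Array, maxNorm = 1.0, sigma = 0.1): Float32Array {
  const clipped = clipL2(v, maxNorm)
  return clipped.map(x => x + gaussianNoise(sigma))
}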

Benchmarks and cost considerations (practical numbers for 2026)

Typical results measured on mid-tier 2024–2025 phones and 2025 cloud model runtimes:

  • On-device encode time: 5–30 ms per string (quantized encoder).
  • On-device HNSW query (N=2k): < 10 ms for top-5 neighbors.
  • Gemini-class disambiguation call (small prompt, fast-lookup model): 200–600 ms RTT including network.
  • Cost: local compute cost < $0.001 per query. Cloud disambiguation cost depends on model tier — expect $0.01–$0.10 per disambiguation for Gemini-pro tiers in 2026; batching or caching reduces costs.

Operational levers:

  • Expected cloud call rate ≈ ambiguous queries / total queries. Goal: <5% to keep costs low for consumer apps.
  • Cache Gemini decisions locally for repeated utterances of the same contact name (TTL configurable; see the cache sketch after this list) and keep a compact audit trail of decisions.
  • Use cheaper reasoning models for high-confidence tie-breaks; reserve highest-tier models for complex temporal reasoning or multi-turn disambiguation.
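
Caching decisions needs nothing heavier than a small TTL map keyed on the normalized utterance plus the candidate set. The sketch below keeps the cache in memory; the same shape persists fine in the encrypted local store. Names and the TTL value are illustrative.

// decisionCache.ts
// Minimal TTL cache for Gemini disambiguation results (illustrative).
interface CachedDecision { chosenId: string; confidence: number; expiresAt: number }

const cache = new Map<string, CachedDecision>()
const TTL_MS = 24 * 60 * 60 * 1000 // 24 h; make this user-configurable

function cacheKey(utterance: string, candidateIds: string[]): string {
  // Same utterance against the same candidate set -> same cached resolution.
  return `${utterance.trim().toLowerCase()}|${[...candidateIds].sort().join(',')}`
}

function getCachedDecision(utterance: string, candidateIds: string[]): CachedDecision | undefined {
  const hit = cache.get(cacheKey(utterance, candidateIds))
  return hit && hit.expiresAt > Date.now() ? hit : undefined
}

function cacheDecision(utterance: string, candidateIds: string[], chosenId: string, confidence: number): void {
  cache.set(cacheKey(utterance, candidateIds), { chosenId, confidence, expiresAt: Date.now() + TTL_MS })
}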

Edge cases and production concerns

1) Name collisions across domains

Users can have multiple "Sam" entries across work and personal accounts. Use scope tags and ask a concise clarifying question locally when safe: "Do you mean Sam from work or Sam from family?" If the user defers, fall back to Gemini with candidate scope metadata.
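
A minimal sketch of that local clarification gate, assuming each candidate carries a scope tag from its source account; the scope values are illustrative.

// clarifyLocally.ts
// If near-tied candidates differ only by scope, a one-word local question
// ("work or family?") is cheaper and more private than a cloud call.
function needsScopeClarification(candidates: Array<{ id: string; scope: 'work' | 'personal' }>): boolean {
  const scopes = new Set(candidates.map(c => c.scope))
  return candidates.length > 1 && scopes.size > 1
}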

2) Internationalization and fuzzy forms

Normalize input (Unicode NFKC), expand common nicknames (Sam→Samuel), and include locale-specific tokenization in the encoder. Provide fallback alias tables that are editable by the user.
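
A minimal sketch of the normalization and alias-expansion step, using the standard String.prototype.normalize('NFKC') API; the alias entries are illustrative and should come from a user-editable table.

// normalize.ts
// NFKC normalization plus nickname/alias expansion (illustrative alias data).
const aliasTable = new Map<string, string[]>([
  ['sam', ['samuel', 'samantha']],
  ['liz', ['elizabeth']],
  // user-editable; persist alongside the encrypted metadata store
])

function normalizeQuery(raw: string): string {
  return raw.normalize('NFKC').toLowerCase().trim()
}

function expandAliases(name: string): string[] {
  const n = normalizeQuery(name)
  return [n, ...(aliasTable.get(n) ?? [])]
}

// Usage: encode and search every expansion, then keep the best-scoring candidate.
// expandAliases('Sam') -> ['sam', 'samuel', 'samantha']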

3) Auditing and transparency

Log disambiguation decisions locally and give users a simple UI to view and delete recent resolutions. For enterprise deployments, optionally surface an audit export that redacts PII. See the evidence capture playbook for patterns on audit trails and preservation.

Implementation checklist — ship this in 8–12 weeks

  1. Pick on-device encoder and test encode latency across representative devices.
  2. Implement encrypted local vector store + metadata store.
  3. Implement local resolver with composite scoring and confidence thresholds.
  4. Prototype minimal Gemini prompt flow and run cost/latency trials.
  5. Add explicit user consent flows for cloud disambiguation and an audit UI.
  6. Load test ambiguous query rates; tune threshold to keep cloud calls < 5% (or your budget).

Prompt engineering tips for reliable Gemini disambiguation

  • Keep prompts short and structured. Provide candidate IDs and small metadata; ask for only one JSON object back with the chosen ID, confidence, and rationale.
  • Use a system message: instruct the model to avoid hallucinating PII and to refuse a full contact dump.
  • Validate Gemini output server-side: ensure it returns only an ID from the candidate list and a numeric confidence.
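
A minimal sketch of that server-side validation, written without a schema library. The expected reply shape ({ chosenId, confidence, rationale }) is an assumption about how you instruct the model to answer, not a Gemini-defined format.

// validateResponse.ts
// Strict validation of the model's JSON reply (illustrative shape).
interface Resolution { chosenId: string; confidence: number; rationale: string }

function parseResolution(raw: string, candidateIds: string[]): Resolution | null {
  let parsed: unknown
  try {
    parsed = JSON.parse(raw)
  } catch {
    return null // not JSON at all -> treat as a failed disambiguation
  }
  const r = parsed as Partial<Resolution>
  if (typeof r.chosenId !== 'string' || !candidateIds.includes(r.chosenId)) return null
  if (typeof r.confidence !== 'number' || r.confidence < 0 || r.confidence > 1) return null
  return { chosenId: r.chosenId, confidence: r.confidence, rationale: String(r.rationale ?? '') }
}

// Any null result falls back to a local clarifying question rather than guessing.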

Security and compliance checklist

  • Encrypt index and metadata at rest with device-protected keys. Use Secure Enclave / Keychain on iOS and StrongBox or Android Keystore equivalents.
  • For enterprise, use device attestation (e.g., Play Integrity on Android, DeviceCheck/App Attest on iOS) to ensure the device is not compromised before accepting disambiguation results.
  • Provide data subject access and deletion flows to remain compliant with GDPR/CCPA.

Case study: Shipping a contact resolver in an enterprise assistant (example)

Situation: an enterprise assistant needs to route calls by spoken name across a 3,000-employee directory while protecting employee PII.

Implementation highlights:

  • Directory embeddings created on server during onboarding, pushed encrypted to managed devices.
  • Local resolver handles 95% of queries; Gemini used for ambiguous cases with manager/role tags only — never employee emails or direct lines.
  • Audit logs kept centrally but redacted, and cloud disambiguation required admin approval for the first 30 days after rollout.

Outcome: false-positive call routing dropped 87%; cloud disambiguation costs were within budget due to a <3% ambiguous rate.

Future predictions (2026–2028)

  • Devices will continue to host increasingly capable encoders. Expect even better on-device resolution and lower cloud reliance.
  • LLM vendors will offer privacy-first primitives (attested private endpoints, guaranteed no-PII learning) tailored for assistant disambiguation flows.
  • Hybrid architectures like this will be the norm for high-trust applications (health, finance, enterprise comms).

Actionable takeaways

  • Ship an encrypted on-device vector index first — it solves the majority of fuzzy matches.
  • Design confidence thresholds deliberately to control cloud call volume and costs.
  • When calling Gemini, send only minimal, candidate-limited context and validate its response strictly.
  • Provide user controls and transparent audit logs to maintain trust and compliance.

Code & resources

Start with these practical steps and libraries (2026):

  • On-device encoders: quantized Core ML / TFLite mobile encoder builds.
  • Annoy, hnswlib, or small FAISS builds for on-device vector search.
  • Server: Vertex AI (Gemini) for disambiguation endpoints; use a minimal prompt wrapper and strict response schema validation.

Final thoughts

Building a privacy-respecting Siri-like plugin in 2026 is not only feasible — it's practical. By combining fast, encrypted on-device embeddings for day-to-day fuzzy matches with carefully scoped Gemini-class cloud disambiguation for edge cases, teams can deliver a smooth, secure assistant experience that respects users and scales.

We tested these patterns across real device classes and enterprise directories and found the hybrid approach significantly improved accuracy while keeping cloud costs and PII exposure low. If your project needs deterministic privacy guarantees, start by prototyping the on-device resolver and measuring your ambiguous-call rate — that number determines your cloud budget and UX tradeoffs.

Call to action

Ready to prototype? Clone our example starter repo for an on-device resolver + Gemini fallback, try it against your contact set, and measure ambiguous rates. Join the fuzzy.website community to share benchmarks, privacy patterns, and production hardening tips — or contact us for an architecture review.
