The Art of Secure Search: How Wix Mastered PII Data in Vespa Search Engine

Personally Identifiable Information (PII) refers to any data that can uniquely identify an individual. Protecting PII is critical for maintaining user privacy and complying with regulations like GDPR. To address these requirements, Wix ensures all PII is encrypted in our storage. Vespa is a leading AI-oriented search engine, and it is used in many domains within Wix.

Problem

Vespa Search Engine does not natively support searching over encrypted data, yet this functionality is essential for handling PII securely.

Goal

Implement a full-text search over encrypted PII data in Vespa, supporting exact, prefix, and n-gram matches while maintaining ranking capabilities.

Possible Solutions

An exact and prefix search could be configured using a well-known order-preserving encryption algorithm. However, we also aimed to address misspelling tolerance, making the search experience close to a standard full-text search.

Simplified Vespa Architecture at Wix

The Vespa middle-tier application serves as an abstraction layer for Vespa engine clusters, providing additional functionalities and ensuring seamless operation. Its main responsibilities include:

Standardized API: Provides search and feeding APIs.
Managing security: Handles authentication and authorization to protect data and ensure secure access.
Traffic management: Optimizes performance by managing request timeouts, retries and rate-limiting.

PII Search Implementation

The Vespa middle-tier defines a naming convention to manage encrypted PII search functionality. Fields are suffixed as follows:

_exact_pii: Used for encrypted exact matching. Ensures only documents with content identical to the query are retrieved.
_prefix_pii: Used for encrypted prefix matching. Enables retrieval of documents where the content starts with the query string.
_ngram_pii: Used for encrypted n-gram matching. Facilitates fuzzy matching, handling spelling variations or partial matches.
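As a rough illustration of how such a naming convention could be applied, the sketch below maps a field name to a matching mode by its suffix. The suffixes come from the list above; the function name `pii_mode` and the mode labels are hypothetical and not part of the Wix middle-tier API.

```python
from typing import Optional

# Suffixes are taken from the naming convention above.
# The mode labels on the right are illustrative only.
PII_SUFFIXES = {
    "_exact_pii": "exact",
    "_prefix_pii": "prefix",
    "_ngram_pii": "ngram",
}


def pii_mode(field_name: str) -> Optional[str]:
    """Return the PII matching mode implied by a field name, or None for plain fields."""
    for suffix, mode in PII_SUFFIXES.items():
        if field_name.endswith(suffix):
            return mode
    return None


document = {"name_exact_pii": "james bond", "description": "perfect agent"}
modes = {field: pii_mode(field) for field in document}
print(modes)
```

Fields without a recognized suffix, such as `description`, map to `None` and would be passed through unencrypted.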

The Vespa middle-tier performs deterministic encryption using the AES algorithm. To mitigate vulnerabilities related to long string attacks, maximum length validation is enforced on all encryptable fields.
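The key property here is determinism: the same plaintext must always map to the same ciphertext, otherwise an encrypted query could never equal an encrypted field value. The sketch below illustrates that property together with the length check. Note the substitution: Python's standard library has no AES, so a keyed HMAC-SHA256 stands in for the deterministic AES encryption described above (unlike AES, it cannot be decrypted). The key, the limit of 256 characters, and the function name are all assumptions made for the example.

```python
import base64
import hashlib
import hmac

MAX_PII_LENGTH = 256                   # hypothetical limit; the real value is not stated
SECRET_KEY = b"demo-key-do-not-reuse"  # hypothetical key, for illustration only


def deterministic_token(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a stable token. HMAC-SHA256 stands in for deterministic AES."""
    if len(value) > MAX_PII_LENGTH:
        raise ValueError(f"value exceeds {MAX_PII_LENGTH} characters")
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Keep 16 bytes so the token has the size of a single AES block.
    return base64.b64encode(digest[:16]).decode("ascii")


token_a = deterministic_token("james bond")
token_b = deterministic_token("james bond")
print(token_a == token_b)  # True: identical input yields an identical token
```

Because tokens are compared only for equality, a stand-in like this is enough to follow the matching logic in the rest of the post, but it is not a replacement for the real encryption scheme.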

Feeding Flow

Search Flow

Exact Match: The First Puzzle Piece

We started with exact matching. When a user searched for “james bond,” we needed to find documents with fields exactly matching that name.

Feeding

If a feed request contains fields with the _exact_pii suffix, the Vespa middle-tier service encrypts them before sending them to the Vespa engine. Suppose the client app sends the following document:

{
  "name_exact_pii": "james bond",
  "description": "perfect agent"
}

The middle-tier app then encrypts the PII fields and sends the following document to the Vespa engine:

{
  "name_exact_pii": "goIMlO8LzZJOzy5u9up+HQ==",
  "description": "perfect agent"
}

Searching

If encryption is enabled for a given client profile, the Vespa middle-tier service encrypts the search query and passes it to the Vespa engine as the query parameter exact_pii. In SQL-like terms, the query sent to the Vespa engine might look like this:

SELECT *
FROM documents
WHERE name_exact_pii = <exact_pii>
ORDER BY score DESC
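To make the round trip concrete, here is a small in-memory sketch of the exact-match flow: fields with the _exact_pii suffix are tokenized at feed time, the query is tokenized the same way, and matching is a plain equality check. It does not call Vespa. The `token` helper uses a keyed hash as a stand-in for the deterministic AES encryption, and all function names and the second document are hypothetical.

```python
import base64
import hashlib
import hmac
from typing import Dict, List

KEY = b"demo-key"  # hypothetical key, for illustration only


def token(value: str) -> str:
    """Deterministic stand-in for the middle-tier's AES encryption."""
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest[:16]).decode("ascii")


def feed(doc: Dict[str, str]) -> Dict[str, str]:
    """Tokenize fields with the _exact_pii suffix; pass other fields through."""
    return {f: token(v) if f.endswith("_exact_pii") else v for f, v in doc.items()}


def search_exact(index: List[Dict[str, str]], field: str, query: str) -> List[Dict[str, str]]:
    """Tokenize the query and keep documents whose stored token is identical."""
    encrypted_query = token(query)
    return [doc for doc in index if doc.get(field) == encrypted_query]


index = [
    feed({"name_exact_pii": "james bond", "description": "perfect agent"}),
    feed({"name_exact_pii": "jane bond", "description": "another agent"}),
]
hits = search_exact(index, "name_exact_pii", "james bond")
print([hit["description"] for hit in hits])  # ['perfect agent']
```

A partial query such as "james" returns nothing here, which is exactly the gap that prefix matching fills.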

Victory! Exact matching was in the bag.

Prefix Match: Venturing Further

Exact matching was good, but it wasn’t enough. Users might type “comp” and expect results like “computer” or “comprehensive.” This was the domain of prefix matching.

Feeding

When a feed request includes document fields with the _prefix_pii suffix, the Vespa middle-tier service generates a collection of weighted prefixes from the field value and encrypts them before sending them to the Vespa engine. For example, given the document:

{
  "name_prefix_pii": "james bond"
}

First, the Vespa middle-tier service converts it to the collection of weighted prefixes:

{    
"name_prefix_pii": {"james bond": 1000, "james bon": 888, "james bo": 777, "james b": 666, "james ": 555, "james": 444, "jame": 333, "jam": 222, "ja": 111}
}

It then encrypts them and sends them to the Vespa engine as a weighted set:

{
"name_prefix_pii": {"goIMlO6LzZJOzy5u9up+HQ==": 1000, "xDQ2Kda9sCYKdt6+nxg4Ew==": 888, "LMuvJ0ci+RZb2VX05Ck8Pw==": 777, "aizPtq3TB0sB9ImHC4aWOw==": 666, "M2TgCTUyAk9950n0e7RHWA==": 555, "28t2jmhxA73EyyZ3ZCPIvA==": 444, "v6RyzXoupBk53VFZZod2MQ==": 333, "HF5xctvE8Jew8MeecvqpPw==": 222, "sY/iG1P9qcAOfNYCgyTORA==": 111}
}

Weighted prefixes calculation

The list of prefixes for a given value includes all possible prefixes, from 2 characters up to the full length of the value:

max_prefix_length = value.length()
min_prefix_length = 2
prefixes_quantity = max_prefix_length - min_prefix_length + 1

The corresponding weights are spread evenly across the range (0, 1000], so longer prefixes get higher weights. In the sample above, the longest prefix "james bond" gets the highest weight of 1000, while the shortest, "ja", gets 111.
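The calculation can be sketched as follows. The exact weight formula is not given in the post; the one used here, `max_weight * k // count` for the k-th shortest prefix, is an assumption inferred from the sample values (111, 222, ..., 888, 1000), and the function name is hypothetical.

```python
from typing import Dict


def weighted_prefixes(value: str, min_len: int = 2, max_weight: int = 1000) -> Dict[str, int]:
    """Return every prefix of `value` from `min_len` characters up, with a weight.

    The k-th shortest prefix gets max_weight * k // count (formula inferred
    from the sample weights, not confirmed by the source).
    """
    count = len(value) - min_len + 1  # prefixes_quantity in the text above
    if count <= 0:
        return {}
    weights = {}
    for k in range(1, count + 1):
        prefix = value[: min_len + k - 1]
        weights[prefix] = max_weight * k // count
    return weights


for prefix, weight in weighted_prefixes("james bond").items():
    print(repr(prefix), weight)
```

For "james bond" this yields 9 prefixes whose weights match the sample document shown earlier. Values shorter than 2 characters produce no prefixes at all.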

Searching

If encryption is enabled for a given client profile, the Vespa middle-tier service encrypts the query and passes it to the Vespa engine as the query parameter prefix_pii with a weight of 1. For example, if the search query is “james”, prefix_pii will look like this:

"prefix_pii": {"28t2jmhxA73EyyZ3ZCPIvA==": 1}

The Vespa engine then uses the dot product function to find matching docs. The result for the "james bond" document above looks like this:

prefix_pii * name_prefix_pii = prefix_pii['28t2jmhxA73EyyZ3ZCPIvA=='] * name_prefix_pii['28t2jmhxA73EyyZ3ZCPIvA=='] = 1 * 444 = 444

As a result, the document is returned with a score of 444 (in fact, it is normalized to 444/1000 = 0.444). If prefix_pii and name_prefix_pii had no entries in common, the score would be 0 and the document would not be returned. Note that the longer the document prefix matched by the search query, the higher the score. More suitable docs are therefore listed first, so this approach preserves proper ordering by rank score.
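The scoring step amounts to a sparse dot product over the keys that the query and the document field have in common. The sketch below reproduces the arithmetic in plain Python rather than in Vespa, and uses unencrypted keys for readability. The "jameson" document and its weights are hypothetical, added only to show the ordering effect; the function names are illustrative.

```python
from typing import Dict, List, Tuple

MAX_WEIGHT = 1000  # scores are normalized by the maximum prefix weight


def dot_product(query: Dict[str, int], field: Dict[str, int]) -> int:
    """Sum weight products over the keys present in both weighted sets."""
    return sum(weight * field[key] for key, weight in query.items() if key in field)


def rank(query: Dict[str, int], docs: Dict[str, Dict[str, int]]) -> List[Tuple[str, float]]:
    """Return (doc_id, normalized_score) for matching docs, best first."""
    scored = []
    for doc_id, field in docs.items():
        score = dot_product(query, field)
        if score > 0:  # no shared keys means the document is not returned
            scored.append((doc_id, score / MAX_WEIGHT))
    return sorted(scored, key=lambda item: item[1], reverse=True)


docs = {
    "james bond": {"james bond": 1000, "james bon": 888, "james bo": 777,
                   "james b": 666, "james ": 555, "james": 444,
                   "jame": 333, "jam": 222, "ja": 111},
    # Hypothetical second document, weights assumed to follow the same scheme.
    "jameson": {"jameson": 1000, "jameso": 833, "james": 666,
                "jame": 500, "jam": 333, "ja": 166},
}

print(rank({"james": 1}, docs))  # [('jameson', 0.666), ('james bond', 0.444)]
print(rank({"bond": 1}, docs))   # []
```

The query "james" covers a larger share of "jameson" than of "james bond", so "jameson" ranks first. The query "bond" is not a prefix of either value, so nothing is returned.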

N-gram Match: The Final Frontier

Now comes the hardest part: fuzzy matching. Users make typos. They search for “jams bond” instead of “james bond.” We couldn’t let such small mistakes derail their search experience. This called for n-grams. N-gram search handles misspellings by breaking words into smaller units (grams) and comparing those units: the more grams two strings share, the more similar they are. For example, if the search query is "jams", its n-grams are "ja", "am", and "ms". The first two also occur in "james", so the document still matches despite the typo.
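A minimal sketch of the idea, using 2-character grams as in the "jams" example. The helper name `char_ngrams` is illustrative.

```python
from typing import List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return all contiguous substrings of length n, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


query_grams = set(char_ngrams("jams", 2))    # {'ja', 'am', 'ms'}
target_grams = set(char_ngrams("james", 2))  # {'ja', 'am', 'me', 'es'}
shared = query_grams & target_grams

print(sorted(shared))  # ['am', 'ja']
```

Two of the three query grams survive the typo, which is enough for the misspelled query to overlap with the correctly spelled value.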

Feeding

If a feed request contains fields with the _ngram_pii suffix, the Vespa middle-tier splits their values into n-grams and encrypts them before sending them to the Vespa engine. The gram sizes are configurable. In this sample the gram size is 3, but to ensure that more suitable documents get a higher score, we also generate an additional set of grams of size "+1" (here, 4). For the same "james bond" document, the non-encrypted document would look like this:

{
  "name_ngram_pii": {"jam": 1, "jame": 1, "ame": 1, "ames": 1, "mes": 1, "bon": 1, "bond": 1, "ond": 1}
}

Unlike prefix matches, where longer prefixes have higher weights, n-grams use a constant weight of 1 to ensure equal treatment for all matching fragments. The values are then encrypted in the same way and sent to the Vespa engine.
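One way to generate such a field is sketched below. It assumes that the value is split on whitespace and that grams never span two words, which is inferred from the sample (no gram contains a space) rather than stated in the post. The function names are hypothetical.

```python
from typing import Dict, List


def word_ngrams(text: str, size: int = 3) -> List[str]:
    """Return grams of length `size` and `size + 1` for each whitespace-separated word."""
    grams = []
    for word in text.split():
        for start in range(len(word)):
            for n in (size, size + 1):
                if start + n <= len(word):
                    grams.append(word[start:start + n])
    return grams


def ngram_feed_field(text: str, size: int = 3) -> Dict[str, int]:
    """Build the weighted set for feeding: every distinct gram gets weight 1."""
    return {gram: 1 for gram in word_ngrams(text, size)}


print(ngram_feed_field("james bond"))
```

Words shorter than the gram size produce no grams under this scheme, so very short values would not be searchable through the n-gram field alone.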

Searching

If encryption is enabled for a given client profile, the middle-tier creates n-grams based on the search query, encrypts them, and passes them to the Vespa engine as the query parameter ngram_pii. For example, if the search query is “james”, the non-encrypted ngram_pii will look like this:

"ngram_pii": {"jam": 200, "jame": 200, "ame": 200, "ames": 200, "mes": 200}

The corresponding weights are calculated as:

weight = 1000 / number_of_ngrams

There are 5 n-grams in the given example, so weight = 1000 / 5 = 200.
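In code, the weighting is a one-liner once the query's n-grams are known. The sketch below takes the grams as input and assumes integer division, since the post does not say how non-divisible cases (for example, 6 grams) are rounded. The function name is hypothetical.

```python
from typing import Dict, List


def query_weights(ngrams: List[str], total: int = 1000) -> Dict[str, int]:
    """Split `total` equally across the distinct n-grams of a query."""
    unique = list(dict.fromkeys(ngrams))  # drop duplicates, keep order
    if not unique:
        return {}
    weight = total // len(unique)  # integer division is an assumption
    return {gram: weight for gram in unique}


print(query_weights(["jam", "jame", "ame", "ames", "mes"]))
```

Dividing by the number of grams means that a query whose grams all match a document always sums to the same maximum, regardless of query length.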

The Vespa engine then uses the dot product function to find matching docs. The result for the "james bond" document above looks like this:

ngram_pii * name_ngram_pii =
ngram_pii['jam'] * name_ngram_pii['jam'] +
ngram_pii['jame'] * name_ngram_pii['jame'] +
ngram_pii['ame'] * name_ngram_pii['ame'] +
ngram_pii['ames'] * name_ngram_pii['ames'] +
ngram_pii['mes'] * name_ngram_pii['mes']
= 1000

As a result, the given document is returned with a normalized score of 1 (1000/1000). Similarly, more suitable documents are listed first in the response.
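To see how a typo affects the score, the sketch below applies the same dot product to a misspelled query. The document field and the "james" query are taken from the samples above. The "jams bond" query is a hypothetical example built by the same rules (grams of size 3 and 4 per word, weight 1000 // 6 = 166, with integer division assumed). Keys are left unencrypted for readability.

```python
from typing import Dict

MAX_SCORE = 1000


def normalized_score(query: Dict[str, int], field: Dict[str, int]) -> float:
    """Dot product of two weighted sets, divided by the maximum score."""
    raw = sum(weight * field.get(gram, 0) for gram, weight in query.items())
    return raw / MAX_SCORE


doc_field = {"jam": 1, "jame": 1, "ame": 1, "ames": 1, "mes": 1,
             "bon": 1, "bond": 1, "ond": 1}

exact_query = {"jam": 200, "jame": 200, "ame": 200, "ames": 200, "mes": 200}
# Hypothetical misspelled query "jams bond": 6 grams, 1000 // 6 = 166 each.
typo_query = {"jam": 166, "jams": 166, "ams": 166,
              "bon": 166, "bond": 166, "ond": 166}

print(normalized_score(exact_query, doc_field))  # 1.0
print(normalized_score(typo_query, doc_field))   # 0.664
```

Four of the six grams in the misspelled query ("jam", "bon", "bond", "ond") still match, so the document is returned with a lower but non-zero score.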

The outcome? Even with typos, relevant documents bubbled to the top.

Summary

Balancing security and functionality is a key challenge when handling PII data. By following this approach, we successfully implemented a full-text search capability over encrypted data in Vespa, supporting exact, prefix, and n-gram matches, without significant performance overhead.

This post was written by Anton Kolhun and Gregory Bondar

