Version: Current (1.x)

Privacy-preserving record linkage

Privacy preserving record linkage (PPRL) refers to record linkage conducted on encoded or otherwise obfuscated identifiers. It is used to improve the privacy of the record linkage process. It allows linkage to occur without any personally identifying information being transmitted to the linkage unit.

LinXmart provides facilities for privacy preserving linkage. It uses a published method known as privacy preserving linkage using Bloom filters. This method has been shown to achieve linkage quality equal to that of un-encoded probabilistic linkage. Each field in a record (first name, surname etc.) is encoded separately. Typical probabilistic linkage then occurs using these encoded fields. While all values are encoded, the Bloom filter encoding process allows approximate comparisons to still occur on the encoded values.
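To illustrate why approximate comparison survives encoding, here is a minimal sketch of a field-level Bloom filter encoder. It is not LinXmart's implementation: the bigram tokenisation, the HMAC-SHA256 hashing, the 512-bit filter length, the 10 hash functions and the `secret` key are all assumptions chosen for illustration, and `bigrams` and `bloom_encode` are hypothetical names.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased string into overlapping 2-character tokens."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, secret: bytes, size: int = 512, num_hashes: int = 10) -> int:
    """Encode one field value as a Bloom filter, held as an int with one bit per position.

    Assumed parameters: the filter length, hash count and keyed hash are
    illustrative only and are not taken from LinXmart.
    """
    bits = 0
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Derive the k-th keyed hash of this bigram and map it to a bit position.
            digest = hmac.new(secret, f"{k}:{gram}".encode("utf-8"), hashlib.sha256).digest()
            position = int.from_bytes(digest[:8], "big") % size
            bits |= 1 << position
    return bits


def shared_bits(a: int, b: int) -> int:
    """Count the positions that are set in both filters."""
    return bin(a & b).count("1")


key = b"example-shared-secret"  # hypothetical key agreed between data providers
smith = bloom_encode("Smith", key)
smyth = bloom_encode("Smyth", key)
jones = bloom_encode("Jones", key)

# Similar names share most bigrams, so their filters overlap far more
# than the filters of unrelated names.
print(shared_bits(smith, smyth), shared_bits(smith, jones))
```

Because "Smith" and "Smyth" share four of their six padded bigrams, most of their set bits coincide, whereas "Smith" and "Jones" overlap only through chance hash collisions.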

To ensure privacy, PPRL datasets should be encoded at source (that is, by the data provider or custodian) before being sent to the linkage unit operating LinXmart. The linkage unit then loads the encoded data into LinXmart and carries out record linkage as usual.

Figure: Privacy preserving linkage data flows

There are thus two steps to privacy preserving linkage:

  1. Encode the data into a privacy preserving format. In LinXmart, this occurs through the Simple Envelope Builder.
  2. Carry out the linkage on the encoded data using LinXmart.

Note: Privacy-preserved (encoded) fields and clear text fields can both be used in the same Linkage Project. However, an encoded field cannot be matched to a clear text field.

Using LinXmart for PPRL

LinXmart does not have a specific PPRL "module"; rather, privacy preserving linkage is carried out using the standard functionality of LinXmart discussed throughout this help.

Preparing LinXmart for PPRL

As with all linkage carried out using LinXmart, Data Providers, Event Types and Linkage Projects must be registered in the system. As privacy-preserving data cannot be directly linked to un-encoded (clear text) data, you will need to configure your Linkage Projects appropriately for linkage. The encoded data will also require a separate Event Type to be created for it, with its own Data Source and Import Format.

Creating an Import Format

When creating an Import Format for PPRL, individual encoded fields cannot be stored under the predefined named fields ('Given Name', 'Surname' etc.) and should instead be stored in Binary Fields 1-10. It is useful at this stage to give each field a meaningful name by changing its Name field to something suitable.

Modifying the match configuration for PPRL

Similarly, the Match Configuration of the created PPRL linkage project must be edited to use the Binary Fields rather than the default clear text fields. Field comparisons that would normally be carried out using the Jaro-Winkler string comparison function should use one of the available Bloom field comparisons; the Dice coefficient is most commonly used. The weight curve here should also be updated to match the chosen comparator function.
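For reference, the Dice coefficient of two Bloom filters is twice the number of bit positions set in both, divided by the total number of bits set across the two filters. The sketch below assumes each filter is held as a Python integer with one bit per position; `dice_coefficient` is a hypothetical helper, not a LinXmart function, and LinXmart's own comparator may differ in detail.

```python
def dice_coefficient(a: int, b: int) -> float:
    """Dice similarity of two Bloom filters held as integers (one bit per position)."""
    bits_a = bin(a).count("1")
    bits_b = bin(b).count("1")
    if bits_a + bits_b == 0:
        return 0.0  # two empty filters: treat as no evidence of similarity
    common = bin(a & b).count("1")
    return 2 * common / (bits_a + bits_b)


filter_a = 0b11110000
filter_b = 0b01111000
# 3 positions are set in both filters; each filter has 4 set bits: 2 * 3 / 8
print(dice_coefficient(filter_a, filter_b))  # 0.75
```

The result ranges from 0.0 (no shared bits) to 1.0 (identical filters), so it plays the same role for encoded fields that a string similarity score plays for clear text fields.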

Encoding data for privacy preserving linkage

Personal identifiers can be encoded for privacy preserving linkage using the Simple Envelope Builder. Once the above metadata has been established in the system (i.e. the steps above have been followed), the Project Definition File for each Event Type can be downloaded from the Linkage Project details page.

When loaded into the Simple Envelope Builder, this Project Definition File will now convert the appropriate input fields into their encoded formats for linkage. Fields that are defined in the project's match configuration to be approximately matched will be encoded into Bloom filters. Fields that are defined in the project's match configuration to be 'exact matched' will be cryptographically hashed.
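As a rough illustration of the exact-match case, the sketch below hashes a normalised field value with a keyed hash. The choice of HMAC-SHA256, the normalisation rules and the `secret` key are assumptions made for this example; the hashing scheme the Simple Envelope Builder actually uses is not described here, and `hash_exact_field` is a hypothetical name.

```python
import hashlib
import hmac


def hash_exact_field(value: str, secret: bytes) -> str:
    """Return a keyed hash of a field value for exact matching.

    Assumption: values are trimmed and upper-cased first so that trivial
    formatting differences do not produce different hashes.
    """
    normalised = value.strip().upper()
    return hmac.new(secret, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-shared-secret"  # hypothetical key agreed between data providers
print(hash_exact_field("1980-01-31", key) == hash_exact_field(" 1980-01-31 ", key))  # True
print(hash_exact_field("1980-01-31", key) == hash_exact_field("1980-01-30", key))    # False
```

Unlike a Bloom filter, a cryptographic hash gives no partial credit: values that differ by a single character produce unrelated hashes, which is why this encoding suits only fields that are compared exactly.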

Challenges of privacy preserving record linkage

Privacy preserving record linkage presents the linker with a number of challenges because the individual identifiers cannot be viewed. Thresholds cannot be estimated by viewing samples of record pairs, as is common with un-encoded record linkage, but thresholds and weights can both still be calculated on encoded data using the probability estimation process. Standard manual quality review methods are not necessarily transferable to encoded data, nor are any other quality assurance methods that rely on manual inspection of records. Due to these factors, a slight drop in linkage quality can be expected when using privacy preserving linkage. However, previous results suggest this method will generally achieve very high quality across a wide range of datasets.
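Weights can be derived without viewing identifiers because they depend only on agreement probabilities, not on the underlying values. The sketch below uses the standard Fellegi-Sunter formulation of probabilistic linkage, where m is the probability that a field agrees for true matches and u the probability that it agrees for non-matches. Whether LinXmart's probability estimation process computes weights in exactly this form is an assumption, and `field_weights` is a hypothetical helper.

```python
import math


def field_weights(m: float, u: float) -> tuple[float, float]:
    """Return (agreement_weight, disagreement_weight) for one field.

    m: probability the field agrees given the records are a true match.
    u: probability the field agrees given the records are not a match.
    """
    if not (0 < m < 1 and 0 < u < 1):
        raise ValueError("m and u must lie strictly between 0 and 1")
    agreement = math.log2(m / u)
    disagreement = math.log2((1 - m) / (1 - u))
    return agreement, disagreement


# Illustrative probabilities only; real values come from the estimation process.
agree, disagree = field_weights(m=0.9, u=0.1)
print(agree, disagree)  # positive weight for agreement, negative for disagreement
```

A record pair's total score is then the sum of the relevant weight for each compared field, and the threshold is the score above which a pair is accepted as a match.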