Probabilistic record linkage
Probabilistic linkage, so called because it uses conditional probabilities to compute likelihoods, is one of the most common methods used for record linkage. In this approach, records are compared on a pairwise basis. A comparison of two records involves comparing all the individual fields. Each field comparison results in a score based on specific weights assigned to that field. These scores are summed up for the pair comparison, and if this summed score is over a specific threshold, the two records are designated a match.
An example of a record-pair comparison in probabilistic linkage.
Probabilistic linkage is often compared to a rule-based approach, sometimes refers to as deterministic linkage. One of the key advantages of probabilistic linkage over this approach is that it removes the difficulties of creating large sets of rules and determining the validity of each rule. However, the process still requires a number of parameters to be accurately estimated to ensure high quality linkage.
Blocking
In probabilistic linkage and many other approaches, the comparison of every record in one dataset to every record in another quickly becomes infeasible as dataset sizes increase. For example, a relatively small linkage of 1 million records to a second dataset of 1 million records would involve half a trillion record pair comparisons. The vast majority of these potential comparisons would involve records that belong to different people. Techniques known as blocking are used to dramatically reduce this comparison space, while keeping all (or very nearly all) comparisons which have a possibility of belonging to the same individual. The standard method of blocking, which is the one employed in LinXmart, is to choose a set of fields (blocking fields); only record-pairs which have exactly the same values of these fields are compared further.
Blocking fields should reduce the number of comparisons sufficiently so that the matching process does not take too much time, while ensuring that as few as possible true matches are missed due to not having the same blocking field values. For instance, using ‘gender’ by itself as a blocking field is generally not much use, as it will only reduce the number of comparisons by about half. On the other hand, using ‘address’ as a blocking field would reduce comparisons to a very small number, but many correct matches would be missed (i.e. people who have changed address, or have different address formats, spelling, punctuation etc.).
Typically, multiple blocks are used, with each block allowed to have any number of blocking fields. Records need to have the same values for all blocking fields in a block to be compared further, but only need to do so for one of the blocks.
Matching Fields
The comparison process involves comparing any number of individual fields. Typically. all available linkage fields are compared here, although fields which are highly correlated with another field can be left out. Each linkage field value in the records will be compared in accordance with the Matching Fields section of the project’s Match Configuration, which includes:
- The Comparison method to be used
- The Match and non-match probability ( and probabilities)
- Agreement and disagreement weights
Comparison methods
The Comparison method outlines how the two values are to be compared. Certain methods only apply to certain variables. The two most common methods for comparison are an exact match and a string similarity metric, such as the Jaro-Winkler algorithm. When two fields are compared with an exact match, the disagreement weight will be given for any difference between the two values. The Jaro-Winkler algorithm is often used for alphabetic fields such as names or address, and will give a score between the agreement and disagreement weight if the two values are not exactly the same but similar. A standard approach may be to use Jaro-Winker comparisons for alphabetic fields and Exact matching for other fields.
Match and non-match probabilities ( and probabilities)
When two field values are compared, a score is returned. This score will be equal to either the Agreement or Disagreement Weight (or somewhere between with the Jaro-Winkler method). These are calculated from the Match () and Non-Match () probabilities assigned to that linkage variable.
The and probabilities define the discriminating power (aka predictive value) of the linkage variable. Different linkage variables have different and probs. For instance, two records which have the same surname are far more likely to belong to the same person than two records which have the same gender.
An -probability is the probability that the two values being compared will match if the two records belong to the same person. For instance, because a person’s gender rarely changes and is not usually entered incorrectly, the chance of two records that belong to the same person having the same gender values (i.e. the -probability) would be very high. On the other hand, a person’s address can change a number of times and data entry errors may occur, so it’s -prob might be somewhere around 0.5.
A -probability is the estimated probability that the two values being compared will match if the two records belong to different people. For instance, the chance of 2 records belonging to different people having the same gender value is about a half (0.5) in most populations. However, the chance of two different people having the same address would generally be very small (around 0.0001). Of course, for certain study groups, this might not be the case e.g. a study of family groups or household occupants.
Agreement and disagreement weights
The agreement and disagreement weights are the actual scores given when two fields agree or disagree. They are derived from the and probabilities using the following formulas. The formulas are:
An example of and probabilities (with their agreement and disagreement weights) are shown in the table below.
Variable | Agreement weight | Disagreement weight | ||
---|---|---|---|---|
Gender | 0.9995 | 0.5 | 0.693 | -6.908 |
Address | 0.5 | 0.0001 | 8.517 | -0.693 |
Given Name | 0.8 | 0.001 | 6.685 | -1.608 |
Family Name | 0.85 | 0.0002 | 8.355 | -1.897 |
Date of Birth | 0.97 | 0.012 | 4.392 | -3.494 |
Thresholds
The score for each of the compared linkage variables in a record pair are summed, and if the summed score is over a specific threshold, then the records are designated a matching pair. Comparisons below this threshold are ignored. By lowering the threshold, more pairs will be accepted as pairs. However, even as this increases the number of true matches, it also potentially increases the number of false matches (i.e. pairs of records that do not represent the same person). Determining the correct threshold is often a balance between true matches and false matches.
The traditional method for determining an appropriate threshold has been to manually examine handfuls of record-pairs at different threshold scores to attempt to determine the appropriate score.