Estimating probabilities
LinXmart uses a probabilistic linkage engine that requires a number of parameters for operation. A set of default parameter values is created automatically when a new Linkage Project is created. To maximise linkage quality, it may be necessary to modify these values to suit the data in question.
LinXmart provides a method to estimate Match (m) and Non-Match (u) probabilities, as well as the overall threshold, for a particular linkage. This method uses the Expectation-Maximisation algorithm to estimate these values.
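As an illustration of how these parameters are typically used, the sketch below assumes the conventional Fellegi-Sunter formulation of probabilistic linkage, in which each field contributes a log-ratio weight and the summed score is compared with the threshold. The field names, probability values, threshold and function names are hypothetical and are not taken from LinXmart, whose internal scoring may differ.

```python
import math

def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (usually negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))

def pair_score(agreements, params):
    """Sum the field weights for one record pair.

    agreements maps field name -> True/False (did the field agree?)
    params maps field name -> (m, u) probabilities for that field
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total

# Hypothetical m and u values and threshold, for illustration only.
params = {"surname": (0.95, 0.01), "birth_year": (0.98, 0.05)}
threshold = 5.0

score = pair_score({"surname": True, "birth_year": True}, params)
is_match = score >= threshold
```

A field with a high m and a low u contributes a large positive weight on agreement, which is why well-estimated probabilities matter for linkage quality.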
Values are estimated for a particular linkage; that is, for an incoming dataset being both de-duplicated and linked to all other datasets already in the Linkage Project. Both the incoming and pre-existing datasets are included in the analysis to determine the estimated values.
A LinXmart operator carries out this estimation process by loading a Probability Estimation Envelope into the system. The data for linkage must be ingested into the system for the estimation to run; however, it is not linked or maintained within the system. The estimation process produces estimated probabilities, which are published and can be downloaded through the web UI. The operator can then update the relevant match configuration with these probabilities and load the same data file again, this time for linkage.
Ingesting data into LinXmart for estimating probabilities
Probability Estimation Envelopes can be created using the LinXmart Simple Envelope Builder. The output of the Build must be set to Probability Estimation. Once created, the Envelope can be loaded into LinXmart using the standard Envelope ingestion method. The system determines how to treat the Envelope from the request type defined in the Envelope's manifest.
Probability Estimation Jobs
After the usual validation checks have passed (that the Linkage Project and Event Type exist, and that the Event Type is attached to the Linkage Project), the Load Probability Calculation Request job begins.
Load Probability Calculation Request Job
This job parses the data file and loads valid records into the database.
The parsing is the same as that performed for a Load Linkage Request. If more than 5% of the records fail parsing, the entire data file is rejected and marked as failed.
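The rejection rule can be sketched as follows. This is a minimal illustration only: the function name and the handling of an empty file are assumptions, not LinXmart behaviour.

```python
def should_reject_file(total_records, failed_records):
    """Return True when more than 5% of records failed parsing.

    Integer arithmetic is used so that a file with exactly 5% failures
    is accepted, matching the "greater than 5%" wording.
    """
    if total_records == 0:
        # Assumption: an empty file has nothing to reject on this rule.
        return False
    return failed_records * 100 > total_records * 5

# 50 failures in 1,000 records is exactly 5%, so the file is kept;
# 51 failures pushes it over the limit.
kept = not should_reject_file(1000, 50)
rejected = should_reject_file(1000, 51)
```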
Calculate Probabilities Job
This job runs the Expectation-Maximisation (EM) algorithm. This process can be time-consuming, depending on the size of the data involved. The algorithm is iterative; that is, it runs repeatedly, stopping only when it cannot improve the estimates further.
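To show what such an iteration looks like, here is a minimal sketch of EM applied to binary agreement patterns, assuming the standard two-class model with conditionally independent fields. It is not LinXmart's implementation; the function name, starting values, tolerance and toy data are all hypothetical.

```python
def estimate_m_u(patterns, max_iter=200, tol=1e-6, eps=1e-6):
    """Estimate m, u and the match proportion p from agreement patterns.

    patterns is a list of tuples of 0/1, one tuple per compared record
    pair and one entry per field (1 = the field agreed).
    """
    n_fields = len(patterns[0])
    clamp = lambda x: min(max(x, eps), 1 - eps)  # keep values inside (0, 1)
    p, m, u = 0.1, [0.9] * n_fields, [0.1] * n_fields  # assumed starting values

    for _ in range(max_iter):
        # E-step: probability that each pair is a match under current estimates.
        g = []
        for gamma in patterns:
            pm, pu = p, 1 - p
            for i, agrees in enumerate(gamma):
                pm *= m[i] if agrees else 1 - m[i]
                pu *= u[i] if agrees else 1 - u[i]
            g.append(pm / (pm + pu))

        # M-step: re-estimate the parameters from the weighted counts.
        match_total = sum(g)
        non_match_total = len(patterns) - match_total
        new_m = [clamp(sum(gj * gam[i] for gj, gam in zip(g, patterns)) / match_total)
                 for i in range(n_fields)]
        new_u = [clamp(sum((1 - gj) * gam[i] for gj, gam in zip(g, patterns)) / non_match_total)
                 for i in range(n_fields)]
        new_p = clamp(match_total / len(patterns))

        # Stop once no estimate moves by more than the tolerance.
        change = max(abs(a - b) for a, b in zip(new_m + new_u + [new_p], m + u + [p]))
        m, u, p = new_m, new_u, new_p
        if change < tol:
            break
    return m, u, p

# Toy data: 100 compared pairs over three fields.
patterns = ([(1, 1, 1)] * 18 + [(1, 1, 0)] * 2 +
            [(0, 0, 0)] * 70 + [(0, 0, 1)] * 10)
m, u, p = estimate_m_u(patterns)
```

Each pass costs time proportional to the number of compared pairs multiplied by the number of fields, which is consistent with the job taking longer on large datasets.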
Publish Prob Calcs Envelope
This job publishes the results of the EM process to an Envelope, available to download from the Linkage Project.