Version: Current (1.x)

Linkage in LinXmart

With the appropriate metadata established in the system including a defined Linkage Project and the data ready for linkage, the next step is to ingest the data into LinXmart. This section describes the processing that occurs for linkage in LinXmart. For information on how to input data into LinXmart, see ingesting data.

LinXmart has been designed to process data in an automated fashion. Once the data is ingested, LinXmart will process the data to determine which records within and between datasets belong to the same person – no further operator input is required. An operator’s main role is to validate data before linkage, and to evaluate and monitor the results and improve these as required.

Each ingested dataset must belong to a registered Event Type and be ingested into a particular Linkage Project (if a dataset is used in multiple Linkage Projects, it must be ingested multiple times, once for each project). Each ingested dataset is referred to as an Envelope by the system. A Linkage Project can include any number of envelopes, from any number of Data Providers. There is no limit on the number of datasets of the same Event Type that can be added to the same Linkage Project; more envelopes containing records of the same Event Type can be added at any time.

When a dataset is ingested into a Linkage Project, two linkages occur at the same time:

  • The dataset is de-duplicated (internally linked) to identify records that belong to the same person from within the dataset.
  • The dataset is linked against all other records that have been loaded into the Linkage Project, if any (excluding records which have been deleted or end-dated from the project).

Linkage On Load

Incoming records are de-duplicated as well as linked to all other records within the Linkage Project

For instance, when adding the 101st envelope into a Linkage Project, the data will be both de-duplicated and linked against all records from the previous 100 envelopes added to this project. This method of linkage allows any Linkage Project to function as an ongoing linkage map, receiving updates from core collections on a routine basis, and providing results for any number of research, administrative or business functions.
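The two linkages can be pictured as two sets of candidate comparisons. The sketch below is illustrative only: `candidate_pairs` and the record IDs are hypothetical names, not part of LinXmart, and it enumerates every possible pair for clarity rather than reflecting how LinXmart selects which pairs to compare.

```python
from itertools import combinations, product


def candidate_pairs(new_ids, existing_ids):
    """Return (internal, external) comparisons for an incoming envelope.

    internal: pairs within the new dataset (de-duplication).
    external: pairs between new records and active records already
              in the Linkage Project.
    """
    internal = list(combinations(new_ids, 2))
    external = list(product(new_ids, existing_ids))
    return internal, external


internal, external = candidate_pairs(["n1", "n2", "n3"], ["e1", "e2"])
print(len(internal), "internal comparisons")
print(len(external), "external comparisons")
```

Records already in the project are never compared with each other again here; only the incoming records generate new comparisons.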

Ingesting data for linkage

Data can be loaded into LinXmart for linkage in a few ways:

The loading of data into the system for linkage triggers a number of jobs that occur in sequential order. Jobs are tasks performed by the system which require processing time. LinXmart employs a queue system that allows multiple envelopes to be ingested with processing begun when resources are available.
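As a rough mental model of the queue, the sketch below runs the jobs described in this section in order for each queued envelope. The job names come from this page; the `process_queue` function, the envelope names, and the assumption that one envelope is fully processed before the next begins are illustrative and may not match how LinXmart schedules work internally.

```python
from collections import deque

# Job names as described in this section, in the order they are presented.
JOB_SEQUENCE = [
    "Load Linkage Request",
    "Match Recorded Events",
    "Group Pairs and Events",
    "Execute Grouping Rules",
    "Run Linkage Request Report",
]


def process_queue(envelopes):
    """Process queued envelopes first-in, first-out (an assumption)."""
    queue = deque(envelopes)
    log = []
    while queue:
        envelope = queue.popleft()
        for job in JOB_SEQUENCE:
            # A real system would do the work here; the sketch only logs it.
            log.append((envelope, job))
    return log


for envelope, job in process_queue(["envelope-1", "envelope-2"]):
    print(envelope, "->", job)
```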

The progress of the linkage can be viewed through the JOBS tab, which has two sub-tabs. The All Envelopes sub-tab lists the envelopes processed by the system; envelopes that are queued or currently being processed are shown above the main list. Clicking the plus icon shows all jobs related to an envelope. The All Jobs sub-tab shows all the jobs processed by LinXmart, with active and queued jobs shown in a panel above the main list.

Validation of data ingest

Upon receipt of data, LinXmart first performs envelope validation to ensure this is a valid linkage request. Upon receiving an envelope, LinXmart checks to ensure that the listed Data Provider exists in the system, the listed Event Type exists and belongs to the Data Provider, the Linkage Project exists and is active, and that the Event Type is attached to the Linkage Project. If the envelope is valid, a new Load Linkage Request job is queued in the system.


Envelope Failures

If errors are determined, processing will end and the envelope will be listed as Failed in the All Envelopes tab of the JOBS screen. The particular reason for the failed envelope will be available by clicking the Notifications icon next to the failed envelope.
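The envelope checks listed above can be sketched as a single function that returns the reasons an envelope would fail. Everything here is hypothetical (the `Registry` structure, the dictionary keys, and the error wording); it only mirrors the checks described in the text and is not LinXmart's API.

```python
from dataclasses import dataclass


@dataclass
class Registry:
    """Hypothetical stand-in for the metadata held by the system."""
    providers: dict  # Data Provider name -> set of its Event Types
    projects: dict   # project name -> {"active": bool, "event_types": set}


def validate_envelope(envelope, registry):
    """Return a list of failure reasons; an empty list means valid."""
    errors = []
    provider = envelope["data_provider"]
    event_type = envelope["event_type"]
    project = registry.projects.get(envelope["linkage_project"])

    if provider not in registry.providers:
        errors.append("Data Provider does not exist")
    elif event_type not in registry.providers[provider]:
        errors.append("Event Type does not belong to the Data Provider")

    if project is None or not project["active"]:
        errors.append("Linkage Project does not exist or is not active")
    elif event_type not in project["event_types"]:
        errors.append("Event Type is not attached to the Linkage Project")
    return errors


registry = Registry(
    providers={"Hospital A": {"admissions"}},
    projects={"Core Map": {"active": True, "event_types": {"admissions"}}},
)
envelope = {
    "data_provider": "Hospital A",
    "event_type": "admissions",
    "linkage_project": "Core Map",
}
print(validate_envelope(envelope, registry) or "valid: queue Load Linkage Request")
```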

Load linkage request

The Load Linkage Request job loads the data fields into the LinXmart database. It does this using the Import Format specified for the event type in question.

Record validation

Before loading the data, LinXmart validates the file line by line. Records can fail validation. If more than a set percentage of records fail validation (by default, 5%), the entire envelope will be failed.

info

The percentage of records that are allowed to fail validation is a system-wide setting. Your administrator can help you set this threshold for the whole system.
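The threshold check amounts to simple arithmetic. In this minimal sketch the function name and the handling of an empty file are assumptions; the 5% default comes from the text above.

```python
def envelope_passes(total_records, failed_records, max_failure_pct=5.0):
    """Return True unless more than max_failure_pct of records failed."""
    if total_records == 0:
        return True  # assumption: an empty file has nothing to fail
    failure_pct = 100.0 * failed_records / total_records
    return failure_pct <= max_failure_pct


print(envelope_passes(1000, 50))  # exactly 5% failed
print(envelope_passes(1000, 51))  # just over 5% failed
```

With the default setting, a 1,000-record file can contain up to 50 invalid records before the whole envelope is failed.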

Record validation failures include:

  • The file is in delimited format and the number of columns is not as expected.
  • The file is in fixed-width format and the line is shorter than expected.
  • The date of birth is not a valid date, e.g. the 56th of January.
  • The sex value is invalid. Valid entries are defined at a system level.
  • The state value is invalid. Valid entries are defined at a system level.
  • A linkage field's value (after trimming by the system) is longer than its default maximum length. Each field’s maximum length is defined in the system. For instance, name fields have a maximum of 255 characters, while address has a maximum of 1000 characters. Auxiliary fields must be less than 250 characters long.
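Several of the checks in the list above can be sketched for a single delimited line. This is illustrative only: the column names, the `YYYY-MM-DD` date format, the set of valid sex codes, and the treatment of blank values as acceptable are all assumptions, since the real values are defined in the Import Format and at system level. Only the 255 and 1000 character limits come from the text.

```python
from datetime import datetime

# Maximum lengths quoted in the text; the field names are hypothetical.
MAX_LENGTHS = {"first_name": 255, "last_name": 255, "address": 1000}
# Assumed code set; the real valid values are defined at system level.
VALID_SEX = {"M", "F", "U"}


def validate_line(line, columns, delimiter=","):
    """Return a list of validation failure reasons for one delimited line."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(columns):
        return ["unexpected number of columns"]

    # Trim each value before checking it.
    record = {name: value.strip() for name, value in zip(columns, values)}
    reasons = []

    dob = record.get("date_of_birth", "")
    if dob:
        try:
            datetime.strptime(dob, "%Y-%m-%d")  # assumed date format
        except ValueError:
            reasons.append("invalid date of birth")

    sex = record.get("sex", "")
    if sex and sex not in VALID_SEX:
        reasons.append("invalid sex value")

    for name, limit in MAX_LENGTHS.items():
        if len(record.get(name, "")) > limit:
            reasons.append(f"{name} longer than {limit} characters")
    return reasons


columns = ["first_name", "last_name", "date_of_birth", "sex"]
print(validate_line("Ann,Lee,1980-01-56,F", columns))
```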

Information on which records failed validation, along with the reasons for the failure can be found through the web interface. From the JOBS tab, identify the envelope in question and select the notification icon. Alternatively, from the Linkage Project page, identify the envelope in the Data Load History. The notifications will detail the number of records successfully parsed and the total number of records. If records failed validation, a report on these can be downloaded by clicking on the download button.

Envelope Notification

Information on records which failed validation can be found from the Notifications pane.

The Data Load History on the Linkage Project page also summarises the record counts for each envelope.

Data Load History

The data load history indicates the number of records in the datasource, the number of records that failed validation (invalid), the number of records added and the number of records deleted.

Identifying new, amended and previously seen records

As well as being validated, each incoming record is also checked to see if it already exists in the system. If its unique id is found in an existing record of this Event Type in this Linkage Project, then the incoming record is checked to see if it is an amended record.

An amended record is one in which one or more of the personal identifiers have changed. For instance, a Data Provider might have updated the address field value (i.e. the individual may have changed address or the field may previously have been blank). If an amended version of an existing record is received, the system assumes it is a more recent version.

Record Processing

Overview of record processing during the Load Linkage Request job.

Receipt of an amended record will cause the previous version of the record to be end-dated, together with all pairs to which it belongs. The status of the group to which the record belongs will also be changed. The new version of the record is then re-linked, and new pairs and groups are formed as required.

An incoming record that is identical to an existing record is simply ignored. Such records are not considered to have failed validation.
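The decision described in this section can be summarised as a three-way classification. The sketch below assumes a hypothetical in-memory lookup of existing records keyed by unique ID (scoped to one Event Type in one Linkage Project); the function and field names are illustrative, not LinXmart's.

```python
def classify_incoming(record, existing):
    """Classify an incoming record as 'new', 'amended' or 'unchanged'.

    existing maps unique ID -> identifiers of the current version of
    that record for this Event Type in this Linkage Project.
    """
    current = existing.get(record["unique_id"])
    if current is None:
        return "new"        # not seen before: load and link
    if current == record["identifiers"]:
        return "unchanged"  # identical: ignored, not a validation failure
    return "amended"        # identifiers changed: end-date old, re-link new


existing = {"rec-1": {"first_name": "Ann", "address": ""}}
incoming = {"unique_id": "rec-1",
            "identifiers": {"first_name": "Ann", "address": "1 High St"}}
print(classify_incoming(incoming, existing))
```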

Match recorded events

The Match Recorded Events job carries out the record level matching configured for the Linkage Project and reads the results (in the form of record-pairs) into the database.

During a Match Recorded Events job for a single Linkage Project, LinXmart will link all new records coming into the system (including amended versions of existing records):

  • Internally to the new data set (deduplication)
  • Externally to all active records already in the Linkage Project

Each Match Recorded Events job belongs to a single Linkage Project, and only links records for that Linkage Project.

In LinXmart's record linkage process, individual pairs of records are compared. Comparison occurs by examining the similarity of each set of identifiers (i.e. how similar are the first names? How similar are the last names? Is the date of birth the same?). Each comparison results in a total score. Those pairs of records that cross a threshold of similarity are thought to belong to the same person, and are designated a match.

The output of the matching process is the set of matched pairs, along with their scores. All matched pairs that score over the thresholds defined in the configuration are stored in the database.
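To make the idea of a total score and a threshold concrete, here is a minimal sketch of scoring one pair of records. The weights, the threshold, the field names, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration; the actual comparisons and thresholds are whatever is configured for the Linkage Project.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, for illustration only.
WEIGHTS = {"first_name": 4.0, "last_name": 5.0, "date_of_birth": 6.0}
MATCH_THRESHOLD = 12.0


def similarity(a, b):
    """Return a 0..1 string similarity (a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(rec_a, rec_b):
    """Sum the weighted field comparisons into a total score."""
    score = 0.0
    for name in ("first_name", "last_name"):
        score += WEIGHTS[name] * similarity(rec_a[name], rec_b[name])
    if rec_a["date_of_birth"] == rec_b["date_of_birth"]:
        score += WEIGHTS["date_of_birth"]
    return score


def is_match(rec_a, rec_b):
    """A pair is a match when its total score reaches the threshold."""
    return score_pair(rec_a, rec_b) >= MATCH_THRESHOLD


a = {"first_name": "Jon", "last_name": "Smith", "date_of_birth": "1980-01-31"}
b = {"first_name": "John", "last_name": "Smith", "date_of_birth": "1980-01-31"}
c = {"first_name": "Mary", "last_name": "Jones", "date_of_birth": "1975-06-02"}
print(is_match(a, b), is_match(a, c))
```

Partial agreement (such as "Jon" versus "John") still contributes to the score, which is what lets near-identical records cross the threshold.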

Group pairs and events

This process uses the matched pairs found in the previous Match Recorded Events job to update the linkage map (i.e. to determine which records belong to the same person). The two types of grouping provided by LinXmart are:

How records are processed

As well as processing new records, the grouping process also manages the deletion of records. When a record is deleted, the group this record is part of is also removed. All the other records in the deleted record's group are now re-grouped.

For amended records, the grouping process treats this as a combination of a deleted record and a new record. First, the old version of the record is deleted, its associated group is deleted, and the group’s remaining members are re-grouped as if the original record did not exist. The new version of the record is then treated as a previously unseen incoming record, which may or may not end up in the same group as before.
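The effect of a deletion on a group can be illustrated with a small sketch. It assumes the simplest possible grouping, in which any chain of matched pairs places records in the same group; LinXmart's own grouping methods may differ. The function names are hypothetical, and for brevity the sketch re-groups every record rather than only the members of the affected group.

```python
def group_records(record_ids, pairs):
    """Group records so that any chain of pairs ends up in one group."""
    parent = {rec: rec for rec in record_ids}

    def find(rec):
        # Follow parent links to the group's representative record.
        while parent[rec] != rec:
            parent[rec] = parent[parent[rec]]
            rec = parent[rec]
        return rec

    for a, b in pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for rec in record_ids:
        groups.setdefault(find(rec), set()).add(rec)
    return sorted(groups.values(), key=sorted)


def delete_record(record_ids, pairs, deleted):
    """Remove a record and its pairs, then re-group what remains."""
    remaining = [rec for rec in record_ids if rec != deleted]
    live_pairs = [(a, b) for a, b in pairs if deleted not in (a, b)]
    return group_records(remaining, live_pairs)


records = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("B", "C")]
print(group_records(records, pairs))       # A, B and C share a group
print(delete_record(records, pairs, "B"))  # A and C no longer linked
```

Because B was the only link between A and C, deleting it splits the group. An amended record follows the same path, with the new version then added back as an unseen record.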

Group IDs changing over time

LinXmart assigns groups of records an internal person identifier called a group ID. Due to the possibility of groups of records splitting through the deletion of records or later quality assurance changes, assigned group IDs are not always enduring. Groups of records may change their group IDs as data changes over time.

Singletons

All records in a Linkage Project, including those which do not form pairs with any other record, are allocated their own group ID (i.e. a person identifier) by LinXmart.

Quality review changes

Groups of records which have been split or joined together through manual quality review or through batch quality reviews have quality pairs created which list the positive and negative associations between records. These quality pairs are used in the grouping process similarly to pairs found through matching. A quality pair that is created to negate a matching pair will take precedence.
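One way to picture the precedence rule is as set arithmetic over pairs before grouping runs. In this sketch the `effective_pairs` function and the `(record, record, same_person)` tuple layout are hypothetical; the only behaviour taken from the text is that positive quality pairs act like matched pairs and a negative quality pair overrides a matching pair.

```python
def effective_pairs(matched_pairs, quality_pairs):
    """Combine matched pairs with quality pairs.

    matched_pairs: iterable of (rec_a, rec_b)
    quality_pairs: iterable of (rec_a, rec_b, same_person)
    Returns the set of unordered pairs that should link records.
    """
    links = {frozenset(pair) for pair in matched_pairs}
    # Positive quality pairs link records just like matched pairs.
    links |= {frozenset((a, b)) for a, b, same in quality_pairs if same}
    # Negative quality pairs take precedence and remove the link.
    negated = {frozenset((a, b)) for a, b, same in quality_pairs if not same}
    return links - negated


matched = [("r1", "r2"), ("r2", "r3")]
quality = [("r2", "r1", False), ("r3", "r4", True)]
for pair in sorted(sorted(p) for p in effective_pairs(matched, quality)):
    print(pair)
```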

Full history of groups maintained

LinXmart stores the full history of groupings (including changes made through quality review) for each Linkage Project, so a linkage map can be extracted as a snapshot of the groupings at a given point in time.
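A point-in-time snapshot can be thought of as a filter over dated group memberships. The sketch below assumes a hypothetical history layout in which each membership has a start time and an optional end time; how LinXmart actually stores its history is not described here.

```python
from datetime import datetime


def snapshot(memberships, at):
    """Return {record_id: group_id} as it stood at the given time.

    memberships: list of (record_id, group_id, valid_from, valid_to),
    where valid_to is None for memberships that are still current.
    """
    return {
        record_id: group_id
        for record_id, group_id, valid_from, valid_to in memberships
        if valid_from <= at and (valid_to is None or at < valid_to)
    }


history = [
    ("r1", "g1", datetime(2023, 1, 1), datetime(2023, 6, 1)),  # end-dated
    ("r1", "g2", datetime(2023, 6, 1), None),                  # current
    ("r2", "g1", datetime(2023, 1, 1), None),
]
print(snapshot(history, datetime(2023, 3, 1)))
print(snapshot(history, datetime(2023, 7, 1)))
```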

Execute grouping rules

After grouping is complete, the Execute Grouping Rules stage checks each recently modified group against a set of group consistency rules. Groups which fail these rules are marked as such, and are listed on the QUALITY tab of the web UI.

LinXmart has two grouping consistency rules – authoritative events and marked pairs.

Authoritative events

Certain Event Types can be marked as authoritative. An authoritative event type is one for which a person should have only one record e.g. birth or death records. Non-authoritative event types are those for which an individual can have many records e.g. hospital stays.

For an authoritative Event Type, any group that contains two or more records of that Event Type will be flagged for manual review.

Marked pairs

Two records can be marked as belonging to the same person (group), or to different persons. However, if two records are marked as belonging to the same person but are in different groups, then those groups will be marked as containing an error. If two records are marked as belonging to different people but are in the same group, then this group will also be marked as containing an error.
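Both consistency rules can be sketched as one pass over the groups. The data structures and the `check_group_rules` name are hypothetical, and the authoritative rule is read here as "two or more records of the same authoritative Event Type in one group", which is an interpretation of the description above.

```python
def check_group_rules(groups, event_type_of, authoritative_types, marked_pairs):
    """Return {group_id: set of rule names the group fails}.

    groups: {group_id: set of record IDs}
    event_type_of: {record_id: Event Type name}
    marked_pairs: list of (rec_a, rec_b, same_person)
    """
    group_of = {rec: gid for gid, members in groups.items() for rec in members}
    flagged = {}

    def flag(gid, rule):
        flagged.setdefault(gid, set()).add(rule)

    # Rule 1: at most one record per authoritative Event Type per group.
    for gid, members in groups.items():
        for etype in authoritative_types:
            if sum(1 for rec in members if event_type_of[rec] == etype) >= 2:
                flag(gid, "authoritative events")

    # Rule 2: groups must agree with manually marked pairs.
    for rec_a, rec_b, same_person in marked_pairs:
        gid_a, gid_b = group_of[rec_a], group_of[rec_b]
        if same_person and gid_a != gid_b:
            flag(gid_a, "marked pairs")
            flag(gid_b, "marked pairs")
        elif not same_person and gid_a == gid_b:
            flag(gid_a, "marked pairs")
    return flagged


groups = {"g1": {"b1", "b2", "h1"}, "g2": {"h2"}}
event_type_of = {"b1": "birth", "b2": "birth", "h1": "hospital", "h2": "hospital"}
print(check_group_rules(groups, event_type_of, {"birth"}, [("h1", "h2", True)]))
```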

Run Linkage Request Report

The final job that occurs after data has been ingested for linkage runs a report on the previous linkage. It includes information on the configuration and parameters of the linkage, and produces a summary of the results, including information on the records, pairs and groups formed from the linkage.