Version: Next

Data pre-processing

Once the appropriate metadata have been added regarding Data Providers, Event Types, Data Sources, Import Formats and Linkage Projects, LinXmart is ready to accept data for linkage.

An important step prior to linkage is to pre-process all datasets. Pre-processing can take a number of forms. Standardising individual fields into a common format across datasets is key to ensuring these fields can be accurately compared during linkage; for example, sex/gender should be coded the same way across all datasets. Data cleaning techniques transform particular fields into a format that will improve linkage accuracy, for example by removing punctuation marks or phonetically encoding fields.
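As a rough illustration of the two ideas above, the following Python sketch standardises differently coded sex values to a common code and strips punctuation from a name field. The mapping table and function names are assumptions made for this example only; they are not part of LinXmart.

```python
import string

# Hypothetical mapping: each source dataset may code sex differently,
# so every known variant is mapped to one common code.
SEX_MAP = {
    "1": "M", "m": "M", "male": "M",
    "2": "F", "f": "F", "female": "F",
}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def standardise_sex(value):
    """Return the common sex code, or None when the value is unrecognised."""
    if value is None:
        return None
    return SEX_MAP.get(str(value).strip().lower())


def remove_punctuation(value):
    """Remove ASCII punctuation marks from a text field."""
    return value.translate(_PUNCT_TABLE)


if __name__ == "__main__":
    print(standardise_sex(" Male "))
    print(remove_punctuation("O'Brien-Smith"))
```

Returning None for unrecognised values, rather than passing them through, keeps unexpected codes from being compared as if they were valid.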

LinXmart includes a number of pre-processing techniques. Some of these occur automatically upon data ingest, while others can be manually specified using the Simple Envelope Builder tool.

Linkage field transforms

Cleaning and standardisation transforms can be applied to individual linkage fields.

These transforms are applied when data is ingested into the system as a single file, through a direct data source (such as a database table) or via a remote Data Client.

Automatic pre-processing

In addition to the cleaning and standardisation transforms, the following pre-processing occurs automatically:

  • Dates of birth can be parsed using a defined date format, specified as part of the Import Format.
  • Sex values are standardised on load using a system-level setting. By default, 1 and M are accepted for male, and 2 and F for female. The system-level setting can define any number of sexes, each with its own accepted values.
  • The State field is standardised in a similar way to Sex: a system-level setting defines the accepted values for each state. The default supports full names of Australian states (e.g. New South Wales), abbreviations (e.g. NSW) and standard Australian codes (e.g. 1).
  • The Soundex and NYSIIS phonetic encodings of Given Name and Surname are automatically computed and stored.
  • The first initial of given name is stored as a separate field.
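Operators who want to reproduce similar steps outside LinXmart, for instance to preview how their data might look after ingest, could start from a sketch like the one below. It is an approximation under stated assumptions: the date format, the state lookup table and all function names are hypothetical, and the Soundex shown is the standard American variant, which may differ in detail from the encoding LinXmart computes. NYSIIS is omitted for brevity.

```python
from datetime import datetime

# Illustrative accepted values per state (assumed for this example only).
STATE_VALUES = {
    "NSW": {"new south wales", "nsw", "1"},
    "VIC": {"victoria", "vic", "2"},
}
_STATE_LOOKUP = {v: code for code, values in STATE_VALUES.items() for v in values}

# Standard American Soundex letter groups.
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def parse_dob(raw, fmt="%d/%m/%Y"):
    """Parse a date of birth using a defined format; None if it does not match."""
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        return None


def standardise_state(raw):
    """Map a full name, abbreviation or code to one state code, else None."""
    return _STATE_LOOKUP.get(raw.strip().lower())


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = [letters[0]]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return "".join(encoded)[:4].ljust(4, "0")


def first_initial(given_name):
    """Return the upper-cased first character of a given name ('' if empty)."""
    return given_name.strip()[:1].upper()
```

For example, soundex("Robert") and soundex("Rupert") both produce R163, which is why phonetic codes help match names that are spelled differently but sound alike.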

Pre-processing using the Envelope Builder

The Envelope Builder can be used to convert datasets into Envelopes that are accepted by LinXmart for processing. It takes as input the raw data file along with a Project Definition File downloaded from a Linkage Project. The output is an Envelope ready to be ingested into LinXmart.

The Envelope Builder provides a range of data manipulation options, including:

  • trim the length of fields
  • remove leading and trailing whitespace
  • change the case of fields
  • filter out specific character sets such as spaces, numbers or special characters
  • match a regular expression pattern
  • convert fields that contain specific user-provided values to missing
  • replace one character with another
  • replace one word with another
  • phonetically encode a field, using either the Soundex or NYSIIS algorithms
  • select a substring from a field
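The Envelope Builder applies these options itself, but operators who pre-process with their own tooling may find it helpful to see how such transforms chain together. The Python sketch below is only an analogy: the function names and the pipeline structure are assumptions for this example and do not reflect how the Envelope Builder is implemented or configured.

```python
import re


def replace_word(value, old, new):
    """Replace whole-word occurrences of `old` with `new`."""
    return re.sub(rf"\b{re.escape(old)}\b", lambda m: new, value)


def apply_transforms(value, transforms):
    """Apply each transform in order; stop once the value becomes missing."""
    for transform in transforms:
        if value is None:
            break
        value = transform(value)
    return value


# Hypothetical cleaning pipeline for a surname field.
SURNAME_TRANSFORMS = [
    str.strip,                                      # remove leading/trailing whitespace
    str.upper,                                      # change case
    lambda v: re.sub(r"[^A-Z]", "", v),             # filter out non-letter characters
    lambda v: None if v in {"", "UNKNOWN"} else v,  # convert listed values to missing
    lambda v: v[:20],                               # trim the length of the field
]

# Hypothetical pipeline for a street address field.
ADDRESS_TRANSFORMS = [
    str.strip,
    lambda v: replace_word(v, "St", "Street"),      # replace one word with another
    lambda v: v.replace("/", " "),                  # replace one character with another
]

if __name__ == "__main__":
    print(apply_transforms("  o'brien ", SURNAME_TRANSFORMS))
    print(apply_transforms(" unknown ", SURNAME_TRANSFORMS))
    print(apply_transforms("3/12 Main St", ADDRESS_TRANSFORMS))
```

The order of transforms matters: here the case change runs before the character filter, so the filter only needs to recognise upper-case letters.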

The Envelope Builder is also used to encode datasets for privacy-preserving record linkage (PPRL).

Operators should note that while data cleaning is optional, ensuring data fields are in a common format (standardisation) is highly recommended.

The use of the Envelope Builder for data pre-processing is also optional for clear-text linkages. For PPRL linkage, it is essential. Some operators may find that their datasets do not require any pre-processing, or that the pre-processing routines that run on ingestion into LinXmart are sufficient. Others may choose to carry out their pre-processing using third-party tools, such as statistical programming languages they already use.

Project Definition Files

The Project Definition File contains information on the linkage fields in the dataset, the transforms to apply to each field and how to map each linkage field to a field in the data source. It is used as an input into the Envelope Builder and is downloaded directly from LinXmart via the user interface.

Project Definition Files can be downloaded from the Project Details screen (select the Linkage Project of interest from the Projects tab to access this screen). Select Data Sources from the menu in the top right-hand corner. This will bring up the Event Type Data Sources page, which shows all Event Types attached to this Linkage Project. In the options column, a download button appears next to each compatible data source of every Event Type in the project.

Event Type Data Source