
Data pre-processing

Once the appropriate metadata have been added regarding Data Providers, Event Types, Data Sources, Import Formats and Linkage Projects, LinXmart is ready to accept data for linkage.

An important step prior to linkage is to pre-process all datasets. Pre-processing can take a number of forms. Standardising individual fields into a common format across datasets is key to ensuring these fields can be accurately compared during linkage. This could include, for example, making sure that sex/gender is coded the same way across all datasets. Data cleaning techniques aim to transform particular fields into a format that will improve linkage accuracy, for example by removing punctuation marks or phonetically encoding fields.

LinXmart includes a number of pre-processing techniques. Some of these occur automatically upon data ingest, while others can be manually specified using the Simple Envelope Builder tool.

Automated pre-processing

When data is loaded into LinXmart, the following pre-processing occurs.

  • Dates of birth can be parsed using a defined date format, specified as part of the Import Format.
  • Sex values are standardised on load using a system-level setting. By default, the values 1 and M are accepted for male, and 2 and F for female. The system-level setting can define any number of sexes, each with its own accepted values.
  • The State field is standardised in a similar way to Sex. A system-level setting defines accepted value combinations. The default supports full names of Australian states (e.g. New South Wales), abbreviations (e.g. NSW) and standard Australian codes (e.g. 1).
  • Blank characters (whitespace) at the start and end of all fields are removed.
  • The Soundex and NYSIIS phonetic encodings of Given Name and Surname are automatically computed and stored.
  • The first initial of given name is stored as a separate field.
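
The automated steps listed above can be pictured with a short Python sketch. This is only an illustration, not LinXmart's implementation: the lookup tables, the record field names, the `%d/%m/%Y` date format and all function names are hypothetical. The Soundex shown is the common American variant, which may differ in detail from the one LinXmart uses, and NYSIIS is omitted for brevity.

```python
from datetime import date, datetime

# Hypothetical lookup tables mirroring the defaults described above.
# In LinXmart the accepted values come from system-level settings.
SEX_CODES = {"1": "M", "M": "M", "2": "F", "F": "F"}
STATE_CODES = {"NEW SOUTH WALES": "NSW", "NSW": "NSW", "1": "NSW"}

def standardise(value, lookup):
    """Trim and upper-case a raw value, then map it; None if unrecognised."""
    return lookup.get(value.strip().upper())

def parse_dob(value, fmt="%d/%m/%Y"):
    """Parse a date of birth with an explicit format; None if blank or invalid."""
    value = value.strip()
    if not value:
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]

def preprocess_record(raw):
    """Apply the load-time steps to one record (a dict of raw strings)."""
    given = raw["given_name"].strip()
    surname = raw["surname"].strip()
    return {
        "given_name": given,
        "surname": surname,
        "given_initial": given[:1].upper() or None,
        "given_soundex": soundex(given),
        "surname_soundex": soundex(surname),
        "sex": standardise(raw["sex"], SEX_CODES),
        "state": standardise(raw["state"], STATE_CODES),
        "dob": parse_dob(raw["dob"]),
    }
```

Unrecognised sex or state values and unparseable dates are returned as `None` in this sketch; how LinXmart itself treats such values is not covered here.
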
Tip: In addition to trimming of whitespace, you can modify a system-level setting to automatically perform defined sets of cleaning/standardisation for different linkage fields in every dataset.

Pre-processing using the Simple Envelope Builder

The Simple Envelope Builder is a stand-alone desktop application used to convert datasets into Envelopes that are accepted by LinXmart for processing. It takes as input the raw data file along with a Project Definition File downloaded from a Linkage Project. The output is an envelope ready to be ingested into LinXmart.

The Simple Envelope Builder provides a range of additional pre-processing options. These include:

  • trim the length of fields
  • remove leading and trailing whitespace
  • change the case of fields
  • filter out specific character sets such as spaces, numbers or special characters
  • match a regular expression pattern
  • set fields to missing when they contain specific values provided by the user
  • replace one character with another
  • replace one word with another
  • phonetically encode a field, using either the Soundex or NYSIIS algorithms
  • select a substring from a field
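
To make these options concrete, here is a minimal Python sketch of several of them chained together on a single field. It does not reproduce the Simple Envelope Builder: the function names, the order of the steps, the missing-value list, the `SAINT` to `ST` replacement and the four-digit pattern are all assumptions chosen for illustration.

```python
import re

def clean_field(value, max_length=None, missing_values=("UNKNOWN", "N/A")):
    """Apply an illustrative sequence of cleaning steps; None means missing."""
    value = value.strip()                       # remove leading/trailing whitespace
    value = value.upper()                       # change the case
    if value == "" or value in missing_values:  # user-supplied values become missing
        return None
    value = value.replace("-", " ")             # replace one character with another
    value = re.sub(r"\bSAINT\b", "ST", value)   # replace one word with another
    value = re.sub(r"[^A-Z ]", "", value)       # filter out digits and special characters
    value = re.sub(r" +", " ", value).strip()   # collapse spaces left by the filters
    if max_length is not None:
        value = value[:max_length]              # trim the length of the field
    return value or None

def matches_pattern(value, pattern):
    """Return True when the whole value matches a regular expression."""
    return re.fullmatch(pattern, value) is not None
```

For example, `clean_field("  O'Brien-Smith 3 ")` gives `"OBRIEN SMITH"`, and `matches_pattern("6000", r"\d{4}")` is true for a four-digit value.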

The Simple Envelope Builder is also used to encode datasets for privacy-preserving linkage.

Operators should note that while data cleaning is optional, ensuring data fields are in a common format (standardisation) is highly recommended.

The use of the Simple Envelope Builder for data pre-processing is also optional. Some operators may find their datasets do not require any pre-processing, or that the pre-processing routines that run on ingestion into LinXmart are sufficient. Others may choose to carry out their pre-processing using third-party tools, such as statistical programming languages they already use.
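
For operators who prefer their own tooling, pre-processing outside LinXmart might look something like the following Python sketch, which standardises every field of a CSV file before it is handed on. The function name, the missing-value list and the choice of CSV are assumptions; how the cleaned file is then packaged for LinXmart is outside the scope of this example.

```python
import csv
import io

def clean_csv(source, target, missing=("", "UNKNOWN", "N/A")):
    """Copy a CSV from source to target, trimming and upper-casing each field.

    Values listed in `missing` (compared after cleaning) are written as blanks.
    Both arguments are open text file objects.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        cleaned = {}
        for field in reader.fieldnames:
            value = (row.get(field) or "").strip().upper()
            cleaned[field] = "" if value in missing else value
        writer.writerow(cleaned)

# Usage with in-memory text; real files opened with open(..., newline="") work the same way.
raw = io.StringIO("surname,sex\n smith ,f\nunknown,M\n")
out = io.StringIO()
clean_csv(raw, out)
```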

Project Definition Files

The Project Definition File contains information on the fields in the dataset and the metadata relating to the linkage. It is used as input into the Simple Envelope Builder. It is supplied by LinXmart, and can be downloaded via the user interface.

Project Definition Files can be downloaded from the Project Details screen (to access this screen, select the Linkage Project of interest from the Projects tab). Select Data Sources from the menu in the top right-hand corner. This brings up the Event Type Data Sources page, which shows all Event Types attached to this Linkage Project. In the options column, a download button appears next to each compatible data source of every Event Type in the project.

Event Type Data Source