Welcome to the third installment of a five part tutorial series on the recordlinkage python package. Each tutorial will cover a specific stage of the data integration workflow. The topics and links for each tutorial are included below:
1. Data Pre-processing
3. Record Comparison
4. Classification (Coming Soon…)
5. Data Fusion (Coming Soon…)
By the end of this tutorial you should be comfortable using recordlinkage’s built-in methods and custom functions for record comparison
Record comparison is an important part of the data integration workflow, as it provides the basis for classifying matches and the eventual fusion of data. Both of these steps will be covered in subsequent tutorials.
In this tutorial, we will compare two data sets. The first includes bibliographic data, which was retrieved using the metaknowledge python package. This data includes information on affiliations, papers, authors, etc. The second dataset is a collection of institutions from GRID. Both data sets have been filtered to only include institutions from Canada. Five rows from each of these datasets have been included below.
|7||university of alberta 2 51 south academic building edmonton ab canada t6g2g7||Canada|
|1||departement de microbiologie infectiologie et immunologieuniversite de montreal montreal quebec h3c 3j7 canada electronic addressroxannecollinumontrealca||Canada|
|10||getting to know cancer room 229a 36 arthur st truro nova scotia b2n 1x5canada||Canada|
|17||michael smith laboratories university of british columbia vancouver britishcolumbia canada||Canada|
|13||gastrointestinal research group and inflammation research network department of physiology and pharmacology calvin joan and phoebe snyder institute for chronicdiseases cumming school of medicine university of calgary calgary albertacanada||Canada|
|2457||holstein association of canada||Brantford||Ontario||Canada|
|2499||yellow island aquaculture||Victoria||British Columbia||Canada|
|1030||toronto general hospital||Toronto||Ontario||Canada|
There is a bit of work we need to do before we can compare records in our data sets. First, we need to load and pre-process the data. Then we will need to index the datasets to give us our list of candidate links. Each of these topics has been covered in detailed in previous tutorials (links at the top of the page). The script below contains the code used to prepare the data for comparisons.
import recordlinkage as rl import pandas as pd # *************************** # Load Bib Data # *************************** bib = pd.read_csv("/path/to/canada_bib.csv") # Read bibliometric data bib["Affiliation"] = clean(bib["Affiliation"]) # Clean Affiliation Data # *************************** # Load GRID Data # *************************** grid = pd.read_csv("/path/to/grid.csv") # Read GRID data grid["name"] = clean(grid["name"]) # Clean GRID data # *************************** # Index with Full Index # *************************** indexer = rl.FullIndex() candidate_links = indexer.index(bib, grid)
Set up the Compare Object
In recordlinkage you must initiate a
Compare object prior to performing any comparison functionality between records. This object stores both dataframes, the candidate links, and a vector containing comparison results. Further, the
Compare object contains the methods for performing comparisons. The code block below initializes the comparison object.
compare = rl.Compare(candidate_links, bib, grid)
Built-in Comparison Methods
Now that we have a initiated a comparison method we can go ahead and start comparing records. The simplest way to do comparisons is to use comparison methods that have been built-in to recordlinkage. Currently there are five specific comparison methods within recordlinkage:
Compare.date(). These are all well documented in the documentation. For this example we will be comparing the
Affiliation column from the bibliometric dataset and the
name column from the grid dataset.
We will start by using the
Compare.exact() comparison method. This method is simple, if two values are an exact match a comparison score of 1 is returned, otherwise 0 is retured. Since the affiliation data we are trying to match is quite messy, this method returns very low scores. In fact, looking at the describe table included below the code block, you can see that no exact matches were found between the two columns.
Next, we will use the
Compare.string() comparison method. This method is a bit more complicated and generates a score based on well known string-comparison algorithms. For this example, we will use the Levenshtein method. However, as shown in the describe table below, we still aren’t getting very high comparison scores, with the highest only being 0.52 (of a maximum 1.0).
# Use built-in comparison functions compare.exact("Affiliation", "name", name="exact") compare.string("Affiliation", "name", name="string") # Print description print(compare.vectors.describe())
Custom Comparison Methods
Based on the lack of high comparison scores, it doesn’t seem as though recordlinkage’s built-in methods are going to work for our current use case. Thankfully, there is a general comparison method built-in to recordlinkage called
Compare.compare(). This method takes in a comparison function, two columns to compare, and the name of the column the score should be stored in. So, all we have to do is define and call a custom comparison function.
To demonstrate how the
Compare.compare() method works, we will use a couple of custom comparison function that has already been defined in NetLab’s
labutils package. We will use the
normed_lcss computes a comparison score based on the length of the longest common substring between two strings. This score is normalized based on the length of the longest possible substring (the lesser of the two string lengths).
normed_fuzzy_lcss() computes a score based on the prescence of similar substrings within the two strings. Once again this is normalized based on the length of the longest possible substring. For more information on this function see the documentation.
The code block below shows how each of these functions can be implemented using the generalized
Compare.compare() method in recordlinkage.
# Import custom functions from labutils import normed_lcss, normed_fuzzy_lcss # Perform the comparison compare.compare(normed_lcss, "Affiliation", "name", name="lcss") compare.compare(normed_fuzzy_lcss, "Affiliation", "name", name="fuzzy_lcss") # Print new description, including the 99th and 99.99th percentiles compare.vectors.describe([0.99, 0.9999])
As we can see in the table above, both custom methods are resulting in higher comparison scores, but are still spread out between 0 and 1. The occurence of rare high scores helps to identify candidate links which likely correspond to true matches. The table below shows the results of comparisons on four pairs of records, two having low scores and two having high scores.
|(26, 2507)||institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9||steel structures education foundation||0||0.202381||0.0810811||0.0810811|
|(22, 2225)||transgenic core facility clinical research institute of montreal montreal qc canada||grand river conservation authority||0||0.228916||0.0882353||0.0882353|
|(7, 8)||university of alberta 2 51 south academic building edmonton ab canada t6g2g7||ualberta||0||0.105263||0.875||0.875|
|(17, 11)||michael smith laboratories university of british columbia vancouver britishcolumbia canada||university of british columbia||0||0.333333||1||1|
Defining Custom Comparison Methods
Once you know how to use custom comparison methods, the next important step is knowing how to create a custom method. There is additional information in the recordlinkage documentation, but I hope this example can provide some helpful additional information.
The code block below shows a template that can be used for create a custom comparison method. A custom comparison method takes in two
Pandas.Series (columns) as well as any additional arguments that that specific function might need. To use the template below all you need to do is add to the
inner_apply() method, writing the functionality to compare two values and produce a comparison score.
def my_compare(s1, s2): # This combines the columns you are comparing into a single DataFrame concat = pd.concat([s1, s2], axis=1, ignore_index=True) def inner_apply(x): """ This is where your custom algorithm is housed. Create a function to be applied to each pair in the DataFrame. """ val1 = x val2 = x # Do something to produce the result of comparing val1 and val2 # return the result return concat.apply(inner_apply, axis=1)
As an example, we will use this template to create a new comparison method. This method might not be terribly useful, but it should demonstrate how this template can be customixed for your own purposes. This method will first tokenize both strings (splitting at spaces). Then it will check that the first token from both strings appears in the other string. If both appear in the other a score of 1 is given. If neither appear a score of 0 is given. Finally, if only one appears in the other, a score of 0.5 is given.
def first_token(s1, s2): # This combines the columns you are comparing into a single DataFrame concat = pd.concat([s1, s2], axis=1, ignore_index=True) def apply_first_token(x): """ This is where your custom algorithm is housed. Create a function to be applied to each pair in the DataFrame. """ val1 = x val2 = x # Do something to produce the result of comparing val1 and val2 tkn1 = val1.split() tkn2 = val2.split() score = 0 if tkn1 in tkn2: score += 0.5 if tkn2 in tkn1: score += 0.5 # return the result return score return concat.apply(apply_first_token, axis=1)
The table below shows the resuls of applying this function to our dataset, showing the affiliation and name being compared as well as the resulting score. While the method itself isn’t very useful, it provides a good demonstration of how to define and use your own custom comparison function.
|(17, 1425)||michael smith laboratories university of british columbia vancouver britishcolumbia canada||phenomenome discoveries||0|
|(14, 2691)||department of surgery the university of british columbia vancouver bc canada||strathcona community hospital||0|
|(26, 39)||institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9||university of victoria||0.5|
|(21, 343)||institut de recherches cliniques de montreal montreal qc canada||canada council||0.5|
|(7, 798)||university of alberta 2 51 south academic building edmonton ab canada t6g2g7||university of northern british columbia||1|
|(7, 737)||university of alberta 2 51 south academic building edmonton ab canada t6g2g7||university of ottawa||1|
After you have compared attributes between a set of candidate links, you will want to determine which should be considered matches. This is done through classification. For an introduction to classification, check out the next tutorial in this series (Coming Soon…).
We highly recommend that you check out the recordlinkage documentation section on comparing records. For an in-depth look at each comparison method, take a look at the comparing page of the recordlinkage API Reference.
For more information on types of comparison metrics, we suggest reading Chapter 5 of Peter Christen’s book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection Paperback. A shorter and more general overview of the process is described in Section 3.5 in Foster et al.’s book Big Data and Social Science: A Practical Guide to Methods and Tools.