Welcome to the third installment of a five part tutorial series on the recordlinkage python package. Each tutorial will cover a specific stage of the data integration workflow. The topics and links for each tutorial are included below:
1. Data Pre-processing
2. Indexing
3. Record Comparison
4. Classification
5. Data Fusion (Coming Soon…)

Goal

By the end of this tutorial you should be comfortable using both threshold and learning based methods to classify candidate record pairs as matching/non-matching.

Classification is an important step in the data integration processes, as it is how we determine which record pairs correspond to matches and non-matches (and sometimes possible matches). Threshold-based methods offer a simple and easy to understand approach to classification. Alternatively, supervised or unsupervised learning models offer a more sophisticated way to address the task of classification. Regardless of the approach we use, the records which are determined to be matches will fused in the next step of data integration.

In this tutorial, we will be examining two product data sets, one from Amazon and one from Google. Our task is to identify the products which appear in both datasets. You can download both datasets here. Five rows from each data set have been included below.

Google

id name description manufacturer price
http://www.google.com/base/feeds/snippets/11125907881740407428 learning quickbooks 2007 learning quickbooks 2007 intuit 38.99
http://www.google.com/base/feeds/snippets/11538923464407758599 superstart! fun with reading & writing! fun with reading & writing! is designed to help kids learn to read and write better through exercises puzzle-solving creative writing decoding and more! 8.49
http://www.google.com/base/feeds/snippets/11343515411965421256 qb pos 6.0 basic software qb pos 6.0 basic retail mngmt software. for retailers who need basic inventory sales and customer tracking. intuit 637.99
http://www.google.com/base/feeds/snippets/12049235575237146821 math missions: the amazing arcade adventure (grades 3-5) save spectacle city by disrupting randall underling's plan to drive all the stores out of business and take over the city. solve real-world math challenges in the uniquely entertaining stores and make theme successful again. 12.95
http://www.google.com/base/feeds/snippets/12244614697089679523 production prem cs3 mac upgrad adobe cs3 production premium mac upgrade from production studio premium or standard adobe software 805.99

Amazon

id title description manufacturer price
b000jz4hqo clickart 950 000 - premier image pack (dvd-rom) broderbund 0.00
b0006zf55o ca international - arcserve lap/desktop oem 30pk oem arcserve backup v11.1 win 30u for laptops ... computer associates 0.00
b00004tkvy noah\'s ark activity center (jewel case ages 3-8) victory multimedia 0.00
b000g80lqo peachtree by sage premium accounting for nonpr... peachtree premium accounting for nonprofits 20... sage software 599.99
b0006se5bq singing coach unlimited singing coach unlimited - electronic learning ... carry-a-tune technologies 99.99

Preliminary Work

The code below tackles the steps of pre-processing, indexing, and record comparison, each of which were introduced in previous blog posts. Since we working with a fair sized data set, this code may take a few minutes to run.

import recordlinkage as rl
from recordlinkage.preprocessing import clean
import pandas as pd

# ********************************
# Load Data
# ********************************
goog = pd.read_parquet('amazon-google/Google')
amzn = pd.read_parquet('amazon-google/Amazon')

# ********************************
# Pre-processing
# ********************************
goog['name'] = clean(goog['name'])
goog['description'] = clean(goog['description'])
goog['manufacturer'] = clean(goog['manufacturer'])
goog['price'] = pd.to_numeric(goog['price'], errors='coerce')
goog.columns = ['idGoogle', 'name', 'description', 'manufacturer', 'price']
goog = goog.set_index('idGoogle')

amzn['title'] = clean(amzn['title'])
amzn['description'] = clean(amzn['description'])
amzn['manufacturer'] = clean(amzn['manufacturer'])
amzn.columns = ['idAmazon', 'name', 'description', 'manufacturer', 'price']
amzn = amzn.set_index('idAmazon')

# ********************************
# Indexing 
# ********************************
cl = rl.SortedNeighbourhoodIndex('name', window=251)
cl = cl.index(goog, amzn)

# ********************************
# Record Comparison
# ********************************
c = rl.Compare()

c.string('name', 'name', label='cmp_name')
c.string('description', 'description', method='jaro_winkler', label='cmp_description')
c.string('manufacturer', 'manufacturer', label='cmp_manufacturer')
c.numeric('price', 'price', method='linear', scale=1, 
          offset=10, label='cmp_price')

feature_vectors = c.compute(cl, goog, amzn)

Classification

We will cover three ways to classify our candidate record pairs as matches or non-matches.

  1. Threshold Based Methods
  2. Supervised Learning Methods
  3. Unsupervised Learning Methods

Once we classify each of the record pairs, it is important to evaluate the classification. We will use three commonly used metrics: precision, recall, and f-measure.

  1. Precision is the portion of record pairs classified as matching which are actually matching.
  2. Recall is the portion of true matches which are identified as matching.
  3. F-Measure combines precision and recall to balance the two in a single metric.

1. Threshold-Based

We will start with implementing a threshold based approach to classification. This works by finding some metric, setting a threshold, and then using that as the boundary to define our two classes.

An example of threshold-based classification could be determining whether an animal is a cat or a dog. Our metric will be weight and 10 lbs(4.5 kgs) will be our threshold. If an animal weighs more than 10 lbs it is classified as a dog, and if it weighs less it is classified as a cat.

This is a very straightforward approach, it is easy to understand and implement. However, its also error-prone. For example, most Chihuahuas are going to be classified as cats and some well-fed cats will be classified as dogs.

For our recordlinkage problem, our metric will be the sum of all comparisons scores for each candidate record pair. We will use a threshold of 2.5 (chosen arbitrarily).

predictions = feature_vectors[feature_vectors.sum(axis=1) > 2.5]

print("Threshold-Based: {} matches".format(len(predictions)))

Evaluate the Predictions

Now, we should evaluate how this threshold based approach performed. First, we will read in the true matches from our dataset. Then, we will use the three the three different metrics which were introduced above.

As a refresher, precision is the proportion of record pairs classified as matching which are actual matches. Recall is the proportion of all the actual matches which are identified as matching. Finally, F-measure can be thought of as metric which balances precision and recall.

# Load True Matches
mapping = pd.read_parquet('amazon-google/Amazon_Google_perfectMapping') 

# Convert dataframe to index required by recordlinkage
matches = mapping.set_index(['idGoogle', 'idAmazon']).index 

# Get the confusion matrix. This is just the table with the numbers of True/False Postives and True/False Negatives.
confusion_matrix = rl.confusion_matrix(matches, predictions, len(feature_vectors))

# Print Metrics
print("Precision:", rl.precision(confusion_matrix))
print("Recall:", rl.recall(confusion_matrix))
print("F-Measure:", rl.fscore(confusion_matrix))

Precision: 0.28607594936708863
Recall: 0.08692307692307692
F-Measure: 0.13333333333333333

We want all three of these numbers to be as close to 1 as possible. This is a pretty good indication that something in our process needs to change. Perhaps we should revisit earlier steps (garbage in, garbage out). But we can also consider use more sophisticated methods of classification.

2. Supervised Learning

Next, we will use a supervised-learning approach to classification. Supervised learning models are trained based on data provided by a user. This data includes a set of features as well as the proper classification.

You can think of a parent teaching a young child about animals as supervised learning. The parent will point to an animal and say something like “Look, a dog!” or “Watch out! There is a goose!”. Soon the child will be able to identify dogs from geese – they will have developed a successful classification model!

We will use a logistic regression model for classification. There are many other supervised learning models, some of which are also available in recordlinkage.

Load True Matches

First we will load the data containing all of the true matches in our data set.

# Load True Matches
mapping = pd.read_parquet('amazon-google/Amazon_Google_perfectMapping') 

# Convert dataframe to index required by recordlinkage
match = mapping.set_index(['idGoogle', 'idAmazon']).index 

Create Training & Testing Datasets

Next we will split the our dataframe of feature vectors generated during the prelimary stage into our training and testing set. The training set will be used to train the logistic regression model. Then, the testing set will be used to evaluate how the model performs. It is very important to know that you should never train and test your model on the same subset of your data.

To get a sense for why this shouldn’t be done, we can think about a professor giving their students a 10 question sample exam, along with all the solutions. Then, come the final exam they pass out an exam made up of the exact same 10 questions. Upon grading the exam the professor discovers every student got 1010 and concludes that the entire class has an amazing grasp of all course content.

However, this is probably not the case at all. We can imagine that some of the students likely just memorized the 10 solutions and were able to ace the test regardless of whether they studied any other content at all!

This is an example of what is called overfitting. Essentially, the model has memorized the provided examples but has failed to learn a more general model that is successful beyond these examples.

Once again, DO NOT evaluate your model on the data it was trained on!

from sklearn.model_selection import train_test_split

# Create a training and test set
train, test = train_test_split(feature_vectors, test_size=0.25)

# Get the true pairs for each set
train_matches_index = train.index & match
test_matches_index = test.index & match

Build the Model

Finally, we can build our classification model and train it.

# Logistic Regression
# ***********************

# Initialize the classifier
classifier = rl.LogisticRegressionClassifier()

# Train the classifier
classifier.learn(train, train_matches_index)

Evaluate the Model

Now, we should evaluate how the model performs on the test set. First, we use the model to make predictions on the test set. Then we will use three the three different metrics which were introduced above.

As a refresher, precision is the proportion of record pairs classified as matching which are actual matches. Recall is the proportion of all the actual matches which are identified as matching. Finally, F-measure can be thought of as metric which balances precision and recall.

# Make Predictions on a test set
predictions = classifier.predict(test)

# Get the confusion matrix. This is just the table with the numbers of True/False Postives and True/False Negatives.
confusion_matrix = rl.confusion_matrix(test_matches_index, predictions, len(test))

# Print Metrics
print("Precision:", rl.precision(confusion_matrix))
print("Recall:", rl.recall(confusion_matrix))
print("F-Measure:", rl.fscore(confusion_matrix))

Precision: 0.5
Recall: 0.23170731707317074
F-Measure: 0.3166666666666667

We want all three of these numbers to be as close to 1 as possible. As we can see, the model is not performing miracles here. However, it is a pretty good improvement over our threhold-based classification.

We will likely want to revisit earlier parts of our process to see if we can improve these measures. Improvements to the model could be achieved in a variety of ways, from adding new features to model (perhaps comparing product names using jaccard similarity) to changing model hyper-parameters.

Its important to realize that these metrics are evaluating how the classification model performs on the data it has been given. It does not consider the fact that during our preliminary work we actually exclude a whole bunch of possible record pairs when indexing.

3. Unsupervised Learning

Finally, we will use unsupervised learning to classify our candidate record pairs. Unsupervised models train themselves without the user having to provide training data to the model.

Once again, we can consider a child learning about animals. Unsupervised learning would have the child learning about the world without the help of her parents. Instead, she might be told that there are two types of animals she is likely to encounter. She doesn’t know anything else about these animals, not even their names. But by examining lots of different examples of animals she eventually groups them together. Animals that have tails, chase sticks, and enjoy being approached are one type, while those which have wings, walk aimlessly across the street, and are scary when you get near them are another. At this point, the child will have developed a successful method of distinguishing the two types of animals.

We will use a K Means Classifier for our example.

Create Training & Testing Datasets

Just as we did for unsupervised learning, we will start with splitting our data into two parts. By excluding some of our data from training, we will be able to evaluate how the model might perform if given data it has never seen before.

from sklearn.model_selection import train_test_split

# Create a training and test set
train, test = train_test_split(feature_vectors, test_size=0.25)

# Get the true pairs for the test set (Used for Evaluation)
test_matches_index = test.index & match

Build the Model

# K-means Classifier
# *******************
# Build The Classifier
kmeans = rl.KMeansClassifier()

# Train the Model

result_kmeans = kmeans.learn(train)

Evaluate the Model

Once again, we will make predictions on the test set and then use precision, recall, and f-measure to evaluate the performance.

# Make Predictions on a test set
predictions = kmeans.predict(test)

# Get the confusion matrix. This is just the table with the numbers of True/False Postives and True/False Negatives.
confusion_matrix = rl.confusion_matrix(test_matches_index, predictions, len(test))

# Print Metrics
print("Precision:", rl.precision(confusion_matrix))
print("Recall:", rl.recall(confusion_matrix))
print("F-Measure:", rl.fscore(confusion_matrix))

Precision: 0.009804922888366867
Recall: 0.5853658536585366
F-Measure: 0.019286790557508787

Remember, we want all three of these numbers to be as close to 1 as possible. Once again, this has offered an improvement over our original threshold-based approach. However, it does have a worse f-measure than our supervised approach.

Once again, these poor scores offer an indication that we should revisit earlier steps in our process.

What’s next?

Now that you have your set of matching record pairs, you will likely want to combine the records into a single dataset for easy use. This task is called data fusion. For an introduction to fusion check out the next tutorial in this series (Coming Soon…).

Further Reading

While recordlinkage provides some good basic classification algorithms, it does not offer a lot of opportunity to customize the algorithms. To do this, I suggest using other pacakges built for classification, such as scikit learn.

Its important to remember that your goal is to classify these record pairs as matching or non-matching. There is nothing saying that you can’t deviate from the traditional methods. Often, these are good places to start, but be creative in how you design a classification system. For example, you might want to find ways to combine multiple model types, using something like stacking or a committee-based approaches.