This tutorial is a supplement to our tutorial on indexing candidate pairs with the recordlinkage Python package for data integration. This tutorial takes a closer look at the sorted neighbourhood indexing method, which has more complex behaviour than other indexing methods. To get the most out of this tutorial, you should already be familiar with indexing candidate pairs with recordlinkage.
The Interleaved Neighbourhood
The following code creates a set of candidate links using the sorted neighbourhood indexing method:
import pandas as pd import recordlinkage as rl # Import data names_1 = ['alfred', 'bob', 'calvin', 'hobbes', 'rusty'] names_2 = ['alfred', 'danny', 'callum', 'hobie', 'rusty'] df_a = pd.DataFrame(pd.Series(names_1, name='names')) df_b = pd.DataFrame(pd.Series(names_2, name='names')) indexer = rl.SortedNeighbourhoodIndex(on='names', window=3) candidate_links = indexer.index(df_a, df_b)
Here, we have candidate links between names which would be adjacent in a sorted list. This method is excellent if you want to reduce the number of candidate links considered, but cannot guarantee exact matches between potential matches. However, it’s important to understand that this method has some quirks which should be kept in mind while using it.
SortedNeighbourhoodIndex does the following to determine candidate links:
- It places the data from both data frames into one series.
- It sorts the series.
- For each element, it looks within the specified
window. If two elements from different data frames are within the window, they are considered a candidate link.
Let’s look at this process step-by-step. First, we start with two sets of names:
names_1 = ['alfred', 'bob', 'calvin', 'hobbes', 'rusty'] names_2 = ['alfred', 'danny', 'callum', 'hobie', 'rusty']
These are combined into a single sorted series (seen here as a sorted list):
['alfred', 'alfred', 'bob', 'callum', 'calvin', 'danny', 'hobbes', 'hobie', 'rusty', 'rusty']
window = 3, the
SortedNeighbourhoodIndex looks at for candidate links for each element by looking at three elements total, centered on the element in question. In other words, a
window of 3 means that only elements that are adjacent to the target (i.e. one position away) are considered as candidate links.
To visualize this, let’s look at a few windows in the list by capitalizing entries inside the window. The window for the first
'alfred' looks like this:
[ALFRED, ALFRED, 'bob', 'callum', 'calvin', 'danny', 'hobbes', 'hobie', 'rusty', 'rusty']
'alfred' is from a different data frame, a candidate link is given, as can be seen in the network above. The next window, centered on the second
'alfred', looks like this:
[ALFRED, ALFRED, BOB, 'callum', 'calvin', 'danny', 'hobbes', 'hobie', 'rusty', 'rusty']
And since the second
'alfred' is from a different data frame than
'bob', another candidate link is created. The process continues:
['alfred', ALFRED, BOB, CALLUM, 'calvin', 'danny', 'hobbes', 'hobie', 'rusty', 'rusty'] ['alfred', 'alfred', BOB, CALLUM, CALVIN, 'danny', 'hobbes', 'hobie', 'rusty', 'rusty'] ['alfred', 'alfred', 'bob', CALLUM, CALVIN, DANNY, 'hobbes', 'hobie', 'rusty', 'rusty'] ['alfred', 'alfred', 'bob', 'callum', CALVIN, DANNY, HOBBES, 'hobie', 'rusty', 'rusty'] ['alfred', 'alfred', 'bob', 'callum', 'calvin', DANNY, HOBBES, HOBIE, 'rusty', 'rusty'] ['alfred', 'alfred', 'bob', 'callum', 'calvin', 'danny', HOBBES, HOBIE, RUSTY, 'rusty'] ['alfred', 'alfred', 'bob', 'callum', 'calvin', 'danny', 'hobbes', HOBIE, RUSTY, RUSTY] ['alfred', 'alfred', 'bob', 'callum', 'calvin', 'danny', 'hobbes', 'hobie', RUSTY, RUSTY]
As we can see in the visualization above, this process yields two candidate links for every entry, except for the first and last. However, it is important to note that this data is extremely unusual, because the elements of each data frame alternate consistently. We can see this if we capitalize all the elements from
[ALFRED, 'alfred', BOB, 'callum', CALVIN, 'danny', HOBBES, 'hobie', RUSTY, 'rusty']
What happens with different data?
The Small Neighbourhood
What if the sorted neighbourhood has very few elements in it? Here, we can see the sorted list that would occur
df_a contained only
df_b contained only
[ALFRED, ALFRED, ALFRED, ALFRED, ALFRED, 'beth', 'beth', 'beth', 'beth', 'beth']
beth are contained in the same window (i.e. are adjacent), all cases of
beth are considered adjacent. The resulting set of candidate links is equivalent to what would be created with the Full Index method:
The Divided Neighbourhood
On the other hand, consider another case where the all elements from
df_a are sorted before the elements of
df_b, but each name is a distinct value:
[AGNES, ALEXANDER, ALFRED, ALICE, AMY, 'beth', 'bill', 'bob', 'brittany', 'bruce']
Here, only a single link is created! This is because only a single pair of elements from different data frames appear within the same window (
The Random Neighbourhood
Just like our original test data showed unusually predictable behaviour, the small neighbourhood and divided neighbourhood examples are extreme cases which highlight the idiosyncrasies of the sorted neighbourhood method. However, if we want to get a better sense of how sorted neighbour indexing will perform in the real world, let’s look at a random selection of these names:
[ALEXANDER, ALEXANDER, 'alice', 'amy', 'beth', 'beth', BOB, 'brittany', BRUCE, BRUCE]
Here, our randomly selected names un unevenly interspersed, and so we get a less regular pattern.
Choosing a Window Size
All of these examples maintain the default
window size of 3. However, you may find it useful widen the
window. Widening the
window will cause a greater number of candidate links to be included. For example, this is what our random neighbourhood example looks like with the
window size increased to 5:
While a larger set of candidate keys will require more computational resources, it may be worth the tradeoff. A larger window is less susceptible to odd orderings, will reduce the chance of excluding a true match during the indexing phase. It is important to remember that the goal of indexing is not to find final matches — it is to discard obvious mismatches.
The sorted neighbourhood indexing method is a great tool for increasing efficiency in record linkage, even when exact match blocking is not possible. However, this method has some unusual behaviour that, with unusual data, may create significantly more candidate links or significantly fewer candidate links than expected. The sorted neighbourhood method will perform best when there are not systematic differences in the input data, which could cause unintended patterns when the data is combined and sorted.