US patents on data mining
A computer-assisted process for determining linkages between data records comprising:
constructing a predictive model based at least in part on a product divided by a sum of products;
training said predictive model with record pair linkage data, including the step of applying at least one machine learning method on a corpus of record pairs presented so as to indicate decisions made by at least one human decision maker as to whether said record pairs should be linked; and
using said trained predictive model to automatically identify records that have a predetermined type of similarity to other data.
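The phrase "a product divided by a sum of products" can be read as a normalized score: one product of feature weights for the "link" decision, divided by the sum of such products over all possible decisions. The following is a minimal sketch of that reading only. The function name, the two decision labels, and the hand-picked weights are all hypothetical, and the quoted claim does not say that this exact form is the one used; in the claimed process the weights would come from training on human-labeled record pairs rather than being written by hand.

```python
from math import prod

def link_probability(features, weights):
    """Return the 'link' score as a product divided by a sum of products.

    features: list of 0/1 indicators, one per feature (hypothetical).
    weights:  dict mapping each decision to a list of positive weights,
              one per feature (hypothetical, hand-picked for illustration).
    """
    # One product per decision: a weight contributes only when its feature fires.
    scores = {
        decision: prod(w ** f for w, f in zip(ws, features))
        for decision, ws in weights.items()
    }
    # Normalize the 'link' product by the sum of all decision products.
    return scores["link"] / sum(scores.values())

# Three illustrative features, e.g. last-name match, birthday match, zip mismatch.
weights = {"link": [4.0, 3.0, 0.5], "no_link": [1.0, 1.0, 2.0]}

p_match = link_probability([1, 1, 0], weights)     # first two features fire
p_mismatch = link_probability([0, 0, 1], weights)  # only the third fires
```

With these made-up weights, a pair where the first two features fire scores 12 / (12 + 1), while a pair where only the third fires scores 0.5 / (0.5 + 2).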
The patent sets out to resolve the problem of duplicate records in databases:
Duplicate records create an especially troublesome problem. Suppose for example that when a customer named "Joseph Smith" first starts doing business with an organization, his name is initially inputted into the computer database as "Joe Smith". The next time he places an order, however, the sales clerk fails to notice or recognize that he is the same "Joe Smith" who is already in the database, and creates a new record under the name "Joseph Smith". A still further transaction might result in a still further record under the name "J. Smith." When the company sends out a mass mailing to all of its customers, Mr. Smith will receive three copies--one to "Joe Smith", another addressed to "Joseph Smith", and a third to "J. Smith." Mr. Smith may be annoyed at receiving several duplicate copies of the mailing, and the business has wasted money by needlessly printing and mailing duplicate copies.
The patent teaches comparing the other fields of two records and scoring each comparison as a match or a mismatch:
The functions that can serve as features depend on the nature of the data items being analyzed (and in some cases, on peculiarities in the particular database). In the context of a children's health insurance database, for example, features may include:
match/mismatch of child's birthday/mother's birthday
match/mismatch of house number, telephone number, zip code
match/mismatch of Medicaid number and/or medical record number
presence of multiple birth indicator on one of the records
match/mismatch of child's first and middle names (after filtering out generic names like "Baby Boy")
match/mismatch of last name
match/mismatch of mother's/father's name
approximate matches of any of the name fields where the names are compared using a technique such as the "Soundex" or "Edit Distance" techniques
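A few of the features listed above can be sketched as simple functions over a pair of records. This is an illustrative sketch, not the patent's implementation: the field names, the generic-name list, and the `max_edits` threshold are assumptions, and only the edit-distance form of approximate matching is shown (Soundex is omitted).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical list of placeholder names to filter out before comparing.
GENERIC_NAMES = {"baby boy", "baby girl"}

def pair_features(r1: dict, r2: dict, max_edits: int = 2) -> dict:
    """Compute match/mismatch features for two records (field names assumed)."""
    def same(field: str) -> bool:
        # A missing value never counts as a match.
        return r1.get(field) is not None and r1.get(field) == r2.get(field)

    first1 = r1.get("first_name", "").lower()
    first2 = r2.get("first_name", "").lower()
    # Only compare first names that are present and not generic placeholders.
    usable = bool(first1) and bool(first2) and not (
        first1 in GENERIC_NAMES or first2 in GENERIC_NAMES
    )
    return {
        "birthday_match": same("birthday"),
        "zip_match": same("zip"),
        "last_name_match": same("last_name"),
        "first_name_match": usable and first1 == first2,
        "first_name_approx": usable and edit_distance(first1, first2) <= max_edits,
    }

rec_a = {"first_name": "Jon", "last_name": "Smith", "birthday": "2001-05-04", "zip": "10001"}
rec_b = {"first_name": "John", "last_name": "Smith", "birthday": "2001-05-04", "zip": "10001"}
feats = pair_features(rec_a, rec_b)
```

Here "Jon" and "John" fail the exact first-name comparison but fall within one edit of each other, so the approximate-match feature fires while the exact-match feature does not.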
On priority, the patent states: Priority is claimed from my U.S. provisional application No. 60/155,063 filed Sep. 21, 1999 entitled "A Probabalistic Record Linkage Model Derived from Training Data", the entirety of which is incorporated herein by reference.
The '019 patent has been cited by one US patent, US 6,675,164 (to Kamath of UC/Berkeley), which includes the text:
Data mining is a process that uses specific techniques to find patterns in data, allowing a user to conduct a relatively broad search of large databases for relevant information that may not be explicitly stored in the databases. Typically, a user initially specifies a search phrase or strategy and the system then extracts patterns and relations corresponding to that strategy from the stored data. These extracted patterns and relations can be: (1) used by the user, or data analyst, to form a prediction model; (2) used to refine an existing model; and/or (3) organized into a summary of the target database. Such a search system permits searching across multiple databases. There are two existing forms of data mining: top-down; and bottom-up. Both forms are separately available on existing systems. Top-down systems are also referred to as "pattern validation," "verification-driven data mining" and "confirmatory analysis." This is a type of analysis that allows an analyst to express a piece of knowledge, validate or invalidate that knowledge, and obtain the reasons for the validation or invalidation. The validation step in a top-down analysis requires that data refuting the knowledge as well as data supporting the knowledge be considered. Bottom-up systems are also referred to as "data exploration." Bottom-up systems discover knowledge, generally in the form of patterns, in data.
The '164 patent reviews previous patents in data mining.