Regression v3 matcher #176
Open: jbothma wants to merge 20 commits into main from reg-v3-base-fix-odd-coefficients
either one goes negative, or they all hover around 0
Comparing regression_v1 and regression_v3
Common subdirectories: nomenklatura/matching/regression_v1/__pycache__ and nomenklatura/matching/regression_v3/__pycache__
diff -u nomenklatura/matching/regression_v1/misc.py nomenklatura/matching/regression_v3/misc.py
--- nomenklatura/matching/regression_v1/misc.py 2024-09-09 11:14:58
+++ nomenklatura/matching/regression_v3/misc.py 2024-09-15 09:15:13
@@ -1,8 +1,9 @@
from followthemoney.proxy import E
from followthemoney.types import registry
+import numpy as np
from nomenklatura.matching.regression_v1.util import tokenize_pair, compare_levenshtein
-from nomenklatura.matching.compare.util import has_overlap, extract_numbers
+from nomenklatura.matching.compare.util import has_overlap, extract_numbers, is_disjoint
from nomenklatura.matching.util import props_pair, type_pair
from nomenklatura.matching.util import max_in_sets, has_schema
from nomenklatura.util import normalize_name
@@ -18,6 +19,8 @@
def address_match(query: E, result: E) -> float:
"""Text similarity between addresses."""
lv, rv = type_pair(query, result, registry.address)
+ if not (lv and rv):
+ return np.nan
lvn = [normalize_name(v) for v in lv]
rvn = [normalize_name(v) for v in rv]
return max_in_sets(lvn, rvn, compare_levenshtein)
@@ -61,3 +64,19 @@
return 0.0
lv, rv = type_pair(query, result, registry.identifier)
return 1.0 if has_overlap(lv, rv) else 0.0
+
+
+def position_country_mismatch(query: E, result: E) -> float:
+ """Whether positions have the same country or not"""
+ if not has_schema(query, result, "Position"):
+ return 0.0
+ lv, rv = type_pair(query, result, registry.country)
+ return 1.0 if is_disjoint(lv, rv) else 0.0
+
+
+def security_isin_mismatch(query: E, result: E) -> float:
+ """Both entities are linked to different ISIN codes."""
+ if not has_schema(query, result, "Security"):
+ return 0.0
+ qv, rv = props_pair(query, result, ["isin"])
+ return 1.0 if is_disjoint(qv, rv) else 0.0
diff -u nomenklatura/matching/regression_v1/model.py nomenklatura/matching/regression_v3/model.py
--- nomenklatura/matching/regression_v1/model.py 2024-02-13 13:25:35
+++ nomenklatura/matching/regression_v3/model.py 2024-09-15 09:29:11
@@ -5,48 +5,48 @@
from sklearn.pipeline import Pipeline # type: ignore
from followthemoney.proxy import E
-from nomenklatura.matching.regression_v1.names import first_name_match
-from nomenklatura.matching.regression_v1.names import family_name_match
-from nomenklatura.matching.regression_v1.names import name_levenshtein, name_match
-from nomenklatura.matching.regression_v1.names import name_token_overlap, name_numbers
-from nomenklatura.matching.regression_v1.misc import phone_match, email_match
-from nomenklatura.matching.regression_v1.misc import address_match, address_numbers
-from nomenklatura.matching.regression_v1.misc import identifier_match, birth_place
-from nomenklatura.matching.regression_v1.misc import org_identifier_match
-from nomenklatura.matching.compare.countries import country_mismatch
+
+from nomenklatura.matching.regression_v3.names import first_name_match, name_similarity
+from nomenklatura.matching.regression_v3.names import family_name_match
+from nomenklatura.matching.regression_v3.names import name_levenshtein, name_match
+from nomenklatura.matching.regression_v3.names import name_token_overlap, name_numbers
+from nomenklatura.matching.regression_v3.misc import phone_match, email_match, position_country_mismatch
+from nomenklatura.matching.regression_v3.misc import address_match, address_numbers
+from nomenklatura.matching.regression_v3.misc import identifier_match, birth_place
+from nomenklatura.matching.regression_v3.misc import org_identifier_match
+from nomenklatura.matching.regression_v3.misc import security_isin_mismatch
from nomenklatura.matching.compare.gender import gender_mismatch
from nomenklatura.matching.compare.dates import dob_matches, dob_year_matches
-from nomenklatura.matching.compare.dates import dob_year_disjoint
+from nomenklatura.matching.compare.dates import dob_year_disjoint, dob_similarity
+from nomenklatura.matching.compare.countries import country_match
from nomenklatura.matching.types import FeatureDocs, FeatureDoc, MatchingResult
from nomenklatura.matching.types import CompareFunction, Encoded, ScoringAlgorithm
from nomenklatura.matching.util import make_github_url
from nomenklatura.util import DATA_PATH
-class RegressionV1(ScoringAlgorithm):
+class RegressionV3(ScoringAlgorithm):
"""A simple matching algorithm based on a regression model."""
- NAME = "regression-v1"
+ NAME = "regression-v3"
MODEL_PATH = DATA_PATH.joinpath(f"{NAME}.pkl")
FEATURES: List[CompareFunction] = [
- name_match,
- name_token_overlap,
name_numbers,
- name_levenshtein,
+ name_similarity,
phone_match,
email_match,
identifier_match,
- dob_matches,
- dob_year_matches,
- dob_year_disjoint,
+ dob_similarity,
first_name_match,
family_name_match,
birth_place,
gender_mismatch,
- country_mismatch,
+ country_match,
+ position_country_mismatch,
org_identifier_match,
address_match,
address_numbers,
+ security_isin_mismatch,
]
@classmethod
diff -u nomenklatura/matching/regression_v1/names.py nomenklatura/matching/regression_v3/names.py
--- nomenklatura/matching/regression_v1/names.py 2024-09-09 11:14:58
+++ nomenklatura/matching/regression_v3/names.py 2024-09-18 22:39:00
@@ -1,14 +1,24 @@
+from statistics import mean
from typing import Iterable, Set
from followthemoney.proxy import E
from followthemoney.types import registry
+import numpy as np
-from nomenklatura.matching.regression_v1.util import tokenize_pair, compare_levenshtein
+from nomenklatura.matching.regression_v3.util import tokenize_pair, compare_levenshtein
from nomenklatura.matching.compare.util import is_disjoint, has_overlap, extract_numbers
-from nomenklatura.matching.util import props_pair, type_pair
+from nomenklatura.matching.compare.names import aligned_levenshtein, name_fingerprint_levenshtein, symmetric_aligned_levenshtein
+from nomenklatura.matching.util import has_schema, props_pair, type_pair
from nomenklatura.matching.util import max_in_sets
from nomenklatura.util import fingerprint_name
+MATCH_BASE_SCORE = 0.7
+MAX_BONUS_LENGTH = 100
+LENGTH_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_LENGTH
+MAX_BONUS_QTY = 10
+QTY_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_QTY
+
+
def normalize_names(raws: Iterable[str]) -> Set[str]:
names = set()
for raw in raws:
@@ -21,43 +31,77 @@
def name_levenshtein(left: E, right: E) -> float:
"""Consider the edit distance (as a fraction of name length) between the two most
similar names linked to both entities."""
- lv, rv = type_pair(left, right, registry.name)
- lvn, rvn = normalize_names(lv), normalize_names(rv)
- return max_in_sets(lvn, rvn, compare_levenshtein)
+ if has_schema(left, right, "Person"):
+ lv, rv = type_pair(left, right, registry.name)
+ lvn, rvn = normalize_names(lv), normalize_names(rv)
+ return max_in_sets(lvn, rvn, compare_levenshtein)
+ else:
+ return name_fingerprint_levenshtein(left, right, symmetric_aligned_levenshtein)
def first_name_match(left: E, right: E) -> float:
"""Matching first/given name between the two entities."""
lv, rv = tokenize_pair(props_pair(left, right, ["firstName"]))
+ if not (lv and rv):
+ return np.nan
return 1.0 if has_overlap(lv, rv) else 0.0
def family_name_match(left: E, right: E) -> float:
"""Matching family name between the two entities."""
lv, rv = tokenize_pair(props_pair(left, right, ["lastName"]))
+ if not (lv and rv):
+ return np.nan
return 1.0 if has_overlap(lv, rv) else 0.0
def name_match(left: E, right: E) -> float:
- """Check for exact name matches between the two entities."""
+ """
+ Check for exact name matches between the two entities.
+
+ Having any completely matching name initially scores 0.7 (MATCH_BASE_SCORE).
+ A length bonus is added based on the length of the longest common name up to 100 chars.
+ A quantity bonus is added based on the number of common names up to 10.
+
+ The maximum score is 1.0.
+ No matches scores 0.0.
+ """
lv, rv = type_pair(left, right, registry.name)
lvn, rvn = normalize_names(lv), normalize_names(rv)
- common = [len(n) for n in lvn.intersection(rvn)]
- max_common = max(common, default=0)
- if max_common == 0:
+ common = sorted(lvn.intersection(rvn), key=lambda n: len(n), reverse=True)
+ if not common:
return 0.0
- return float(max_common)
+ score = MATCH_BASE_SCORE
+ longest_common = common[0]
+ length_bonus = min(len(longest_common), MAX_BONUS_LENGTH) * LENGTH_BONUS_FACTOR
+ quantity_bonus = min(len(common), MAX_BONUS_QTY) * QTY_BONUS_FACTOR
+ return score + (length_bonus + quantity_bonus) / 2
def name_token_overlap(left: E, right: E) -> float:
"""Evaluate the proportion of identical words in each name."""
- lv, rv = tokenize_pair(type_pair(left, right, registry.name))
- common = lv.intersection(rv)
- tokens = min(len(lv), len(rv))
- return float(len(common)) / float(max(2.0, tokens))
+ lvt, rvt = tokenize_pair(type_pair(left, right, registry.name))
+ common = lvt.intersection(rvt)
+ tokens = min(len(lvt), len(rvt))
+ if tokens == 0:
+ return 0.0
+ return float(len(common)) / tokens
def name_numbers(left: E, right: E) -> float:
"""Find if names contain numbers, score if the numbers are different."""
lv, rv = type_pair(left, right, registry.name)
return 1.0 if is_disjoint(extract_numbers(lv), extract_numbers(rv)) else 0.0
+
+
+def name_similarity(left: E, right: E) -> float:
+ """Compute the similarity between the names of two entities, picking the max from
+ a full string match, token overlap-based score, and levenshtein distance-based
+ score."""
+ return max(
+ [
+ name_match(left, right),
+ 0.5 * name_token_overlap(left, right),
+ name_levenshtein(left, right),
+ ]
+ )
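The exact-match scoring in the new `name_match` can be checked with a little arithmetic. The sketch below is a standalone illustration, not the nomenklatura code: it copies the constants from the diff above, and `exact_match_score` is a hypothetical helper that takes the already-normalized set of names shared by both entities.

```python
from typing import Set

# Constants copied from the regression_v3/names.py diff.
MATCH_BASE_SCORE = 0.7
MAX_BONUS_LENGTH = 100
LENGTH_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_LENGTH
MAX_BONUS_QTY = 10
QTY_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_QTY


def exact_match_score(common: Set[str]) -> float:
    """Hypothetical helper: score the set of names shared by both entities."""
    if not common:
        return 0.0
    longest = max(len(name) for name in common)
    length_bonus = min(longest, MAX_BONUS_LENGTH) * LENGTH_BONUS_FACTOR
    quantity_bonus = min(len(common), MAX_BONUS_QTY) * QTY_BONUS_FACTOR
    # The two bonuses are averaged, so together they add at most 0.3.
    return MATCH_BASE_SCORE + (length_bonus + quantity_bonus) / 2


# One shared 10-character name.
one_short = exact_match_score({"john smith"})
# Ten shared names, each longer than the 100-character cap.
many_long = exact_match_score({"x" * 100 + str(i) for i in range(10)})
```

Under these constants a single shared 10-character name scores about 0.73, and the score only approaches 1.0 when the longest shared name reaches 100 characters and at least ten names are shared.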
diff -u nomenklatura/matching/regression_v1/train.py nomenklatura/matching/regression_v3/train.py
--- nomenklatura/matching/regression_v1/train.py 2024-09-06 12:44:09
+++ nomenklatura/matching/regression_v3/train.py 2024-09-13 17:28:35
@@ -1,19 +1,20 @@
import logging
import numpy as np
import multiprocessing
-from typing import Iterable, List, Tuple
+from typing import List, Tuple
from pprint import pprint
from numpy.typing import NDArray
from sklearn.pipeline import make_pipeline # type: ignore
from sklearn.preprocessing import StandardScaler # type: ignore
-from sklearn.model_selection import train_test_split # type: ignore
+from sklearn.model_selection import GroupShuffleSplit # type: ignore
from sklearn.linear_model import LogisticRegression # type: ignore
+from sklearn.impute import SimpleImputer # type: ignore
from sklearn import metrics # type: ignore
-from concurrent.futures import ThreadPoolExecutor
+from concurrent.futures import ProcessPoolExecutor
from nomenklatura.judgement import Judgement
from nomenklatura.matching.pairs import read_pairs, JudgedPair
-from nomenklatura.matching.regression_v1.model import RegressionV1
+from nomenklatura.matching.regression_v3.model import RegressionV3
from nomenklatura.util import PathLike
log = logging.getLogger(__name__)
@@ -22,20 +23,20 @@
def pair_convert(pair: JudgedPair) -> Tuple[List[float], int]:
"""Encode a pair of training data into features and target."""
judgement = 1 if pair.judgement == Judgement.POSITIVE else 0
- features = RegressionV1.encode_pair(pair.left, pair.right)
+ features = RegressionV3.encode_pair(pair.left, pair.right)
return features, judgement
def pairs_to_arrays(
- pairs: Iterable[JudgedPair],
+ pairs: List[JudgedPair],
) -> Tuple[NDArray[np.float32], NDArray[np.float32]]:
"""Parallelize feature computation for training data"""
xrows = []
yrows = []
threads = multiprocessing.cpu_count()
log.info("Compute threads: %d", threads)
- with ThreadPoolExecutor(max_workers=threads) as excecutor:
- results = excecutor.map(pair_convert, pairs)
+ with ProcessPoolExecutor(max_workers=threads) as executor:
+ results = executor.map(pair_convert, pairs, chunksize=1000)
for idx, (x, y) in enumerate(results):
if idx > 0 and idx % 10000 == 0:
log.info("Computing features: %s....", idx)
@@ -45,42 +46,49 @@
return np.array(xrows), np.array(yrows)
-def train_matcher(pairs_file: PathLike) -> None:
+def train_matcher(pairs_file: PathLike, splits: int = 1) -> None:
pairs = []
for pair in read_pairs(pairs_file):
- # HACK: support more eventually:
- # if not pair.left.schema.is_a("LegalEntity"):
- # continue
if pair.judgement == Judgement.UNSURE:
pair.judgement = Judgement.NEGATIVE
- # randomize_entity(pair.left)
- # randomize_entity(pair.right)
pairs.append(pair)
- # random.shuffle(pairs)
- # pairs = pairs[:30000]
positive = len([p for p in pairs if p.judgement == Judgement.POSITIVE])
negative = len([p for p in pairs if p.judgement == Judgement.NEGATIVE])
log.info("Total pairs loaded: %d (%d pos/%d neg)", len(pairs), positive, negative)
+
X, y = pairs_to_arrays(pairs)
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
- # logreg = LogisticRegression(class_weight={0: 95, 1: 1})
- # logreg = LogisticRegression(penalty="l1", solver="liblinear")
- logreg = LogisticRegression(penalty="l2")
- log.info("Training model...")
- pipe = make_pipeline(StandardScaler(), logreg)
- pipe.fit(X_train, y_train)
- coef = logreg.coef_[0]
- coefficients = {n.__name__: c for n, c in zip(RegressionV1.FEATURES, coef)}
- RegressionV1.save(pipe, coefficients)
- print("Coefficients:")
- pprint(coefficients)
- y_pred = pipe.predict(X_test)
- cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
- print("Confusion matrix:\n", cnf_matrix)
- print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
- print("Precision:", metrics.precision_score(y_test, y_pred))
- print("Recall:", metrics.recall_score(y_test, y_pred))
+ groups = [p.group for p in pairs]
+ gss = GroupShuffleSplit(n_splits=splits, test_size=0.33)
+ for split, (train_indices, test_indices) in enumerate(
+ gss.split(X, y, groups=groups), 1
+ ):
+ X_train = [X[i] for i in train_indices]
+ X_test = [X[i] for i in test_indices]
+ y_train = [y[i] for i in train_indices]
+ y_test = [y[i] for i in test_indices]
- y_pred_proba = pipe.predict_proba(X_test)[::, 1]
- auc = metrics.roc_auc_score(y_test, y_pred_proba)
- print("Area under curve:", auc)
+ print()
+ log.info("Training model... (split %d)", split)
+ logreg = LogisticRegression(penalty="l2")
+ pipe = make_pipeline(
+ SimpleImputer(strategy="mean"),
+ StandardScaler(),
+ logreg,
+ )
+ pipe.fit(X_train, y_train)
+ coef = logreg.coef_[0]
+ coefficients = {n.__name__: c for n, c in zip(RegressionV3.FEATURES, coef)}
+ RegressionV3.save(pipe, coefficients)
+
+ print("Coefficients:")
+ pprint(coefficients)
+ y_pred = pipe.predict(X_test)
+ cnf_matrix = metrics.confusion_matrix(y_test, y_pred, normalize="all") * 100
+ print("Confusion matrix (% of all):\n", cnf_matrix)
+ print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
+ print("Precision:", metrics.precision_score(y_test, y_pred))
+ print("Recall:", metrics.recall_score(y_test, y_pred))
+
+ y_pred_proba = pipe.predict_proba(X_test)[::, 1]
+ auc = metrics.roc_auc_score(y_test, y_pred_proba)
+ print("Area under curve:", auc)
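The switch from `train_test_split` to `GroupShuffleSplit` keeps all pairs that share a group on the same side of the train/test boundary. The following is a stdlib-only sketch of that idea, not scikit-learn's implementation: `group_shuffle_split` is a hypothetical helper, its rounding of the test fraction is an assumption, and the group labels are made up because the diff does not show how `p.group` is assigned.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def group_shuffle_split(
    groups: Sequence[Hashable], test_size: float = 0.33, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split row indices so every group lands wholly in train or wholly in test."""
    # Sort first so the shuffle is reproducible for a given seed.
    unique = sorted(set(groups), key=str)
    random.Random(seed).shuffle(unique)
    # test_size is a fraction of groups, not of rows (assumed rounding rule).
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical group labels, one per judged pair.
groups = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
train_idx, test_idx = group_shuffle_split(groups)
```

Because whole groups are held out, the share of rows in the test set can differ from `test_size` when group sizes are uneven.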
At least an iteration of #163

- `SimpleImputer` to fill `NaN` with mean for the feature
- `name_similarity` feature takes max of `name_match`, `name_token_overlap` and `name_levenshtein`
- `address_match` is `NaN` when values aren't available.
- `name_match` component of `name_similarity` scaled to 0..1 favouring names with longer longest matching token, and more matching tokens.
- `name_fingerprint_levenshtein` is used for non-Person pairs for alignment of tokens
- `dob_matches`, `dob_year_matches`, `dob_year_disjoint` replaced with a single scoring feature
- `country_mismatch` scores positively when countries overlap, negatively when countries are disjoint, `NaN` otherwise.
- `position_country_mismatch` scores negatively when `Position:country` is disjoint
- `security_isin_mismatch` scores negatively when `Security:isin` is disjoint

Before the feature changes, with just the chronological pairs support, `name_levenshtein` has a negative coefficient. Changes to make `name_levenshtein` positive resulted in `name_token_overlap`'s coefficient becoming negative. So `name_match`, `name_token_overlap` and `name_levenshtein` have been combined into a single feature, taking the max.

TODO
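To illustrate what filling `NaN` with the mean does to a feature matrix, here is a stdlib-only sketch of mean imputation. It is not scikit-learn's `SimpleImputer`: `impute_column_means` is a hypothetical helper, the fallback of 0.0 for a column with no observed values is an assumption of this sketch, and the feature rows are invented.

```python
import math
from typing import List


def impute_column_means(rows: List[List[float]]) -> List[List[float]]:
    """Replace NaN entries with the mean of the non-NaN values in that column."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        present = [row[col] for row in rows if not math.isnan(row[col])]
        # Assumption of this sketch: an entirely missing column falls back to 0.0.
        means.append(sum(present) / len(present) if present else 0.0)
    return [
        [means[col] if math.isnan(value) else value for col, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical feature rows: [address_match, first_name_match]
features = [
    [0.9, 1.0],
    [float("nan"), 0.0],  # no address on one side
    [0.5, float("nan")],  # no first name on one side
]
imputed = impute_column_means(features)
```

Since imputing with the mean leaves the column mean unchanged, a standard scaler fitted afterwards maps each imputed value to 0, so a missing feature adds nothing to the logistic regression's linear term for that pair.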