Regression v3 matcher #176

Open
wants to merge 20 commits into
base: main


@jbothma jbothma commented Sep 17, 2024

At least an iteration of #163

  • add regression-v3 matcher copied from regression-v1
  • split train/test data based on pair group key to fully place connected entities in either train or test
  • parallelise feature generation for training
  • add SimpleImputer to fill NaN feature values with that feature's mean
  • single name_similarity feature takes max of name_match, name_token_overlap and name_levenshtein
    • this helps with name features otherwise getting negative coefficients
  • address_match is NaN when values aren't available.
    • This makes the coefficient positive
  • name_match component of name_similarity scaled to 0..1 favouring names with longer longest matching token, and more matching tokens.
  • Symmetric form of name_fingerprint_levenshtein is used for non-Person pairs for alignment of tokens
  • dob_similarity replaces dob_matches, dob_year_matches, dob_year_disjoint with a single feature scoring
    • high for high precision match,
    • lower for edits and year match,
    • and negatively for precise date mismatch beyond edit distance of 2 on precise dates.
  • country_match (replacing country_mismatch in the feature list) scores positively when countries overlap, negatively when countries are disjoint, and NaN otherwise.
  • position_country_mismatch scores negatively when Position:country is disjoint
  • security_isin_mismatch scores negatively when Security:isin is disjoint
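
A minimal sketch of why returning NaN and then imputing the mean keeps a missing value neutral, assuming a single feature column. It uses only the standard library rather than scikit-learn: `impute_mean` and `standardize` are hypothetical stand-ins for `SimpleImputer(strategy="mean")` and `StandardScaler`, and the sample values are made up for illustration.

```python
import math
from statistics import fmean, pstdev
from typing import List


def impute_mean(column: List[float]) -> List[float]:
    """Replace NaN entries with the mean of the observed entries."""
    observed = [v for v in column if not math.isnan(v)]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = fmean(observed)
    return [mean if math.isnan(v) else v for v in column]


def standardize(column: List[float]) -> List[float]:
    """Centre on the mean and divide by the population standard deviation."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # avoid dividing by zero on constant columns
    return [(v - mean) / std for v in column]


# Hypothetical address_match scores; NaN means one side had no address.
address_match = [0.9, math.nan, 0.1, math.nan, 0.5]
imputed = impute_mean(address_match)
scaled = standardize(imputed)
print(imputed)
print(scaled)
```

Rows where the address was missing land on the column mean, so after scaling they contribute roughly zero to the linear term instead of being read as a confirmed non-match (0.0).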

Before the feature changes, with only the chronological pairs support in place, name_levenshtein had a negative coefficient. Changes that made the name_levenshtein coefficient positive caused name_token_overlap's coefficient to become negative. So name_match, name_token_overlap and name_levenshtein have been combined into a single feature, name_similarity, which takes the max of the three (with name_token_overlap weighted by 0.5).
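
The combination can be sketched on plain sets of name strings rather than entities. The constants are the ones from names.py in this PR; `exact_match_score`, `token_overlap` and `string_similarity` are hypothetical simplifications, and `difflib.SequenceMatcher` stands in for the Levenshtein-based comparison, so the numbers will not match the real features.

```python
from difflib import SequenceMatcher
from typing import Set

MATCH_BASE_SCORE = 0.7
MAX_BONUS_LENGTH = 100
LENGTH_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_LENGTH
MAX_BONUS_QTY = 10
QTY_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_QTY


def exact_match_score(left: Set[str], right: Set[str]) -> float:
    """Base score for any identical name, plus length and quantity bonuses."""
    common = left & right
    if not common:
        return 0.0
    longest = max(len(name) for name in common)
    length_bonus = min(longest, MAX_BONUS_LENGTH) * LENGTH_BONUS_FACTOR
    quantity_bonus = min(len(common), MAX_BONUS_QTY) * QTY_BONUS_FACTOR
    return MATCH_BASE_SCORE + (length_bonus + quantity_bonus) / 2


def token_overlap(left: Set[str], right: Set[str]) -> float:
    """Share of the smaller token set that also appears on the other side."""
    lt = {tok for name in left for tok in name.split()}
    rt = {tok for name in right for tok in name.split()}
    tokens = min(len(lt), len(rt))
    if tokens == 0:
        return 0.0
    return len(lt & rt) / tokens


def string_similarity(left: Set[str], right: Set[str]) -> float:
    """Best pairwise ratio; a stand-in for the Levenshtein-based score."""
    return max(
        (SequenceMatcher(None, l, r).ratio() for l in left for r in right),
        default=0.0,
    )


def name_similarity(left: Set[str], right: Set[str]) -> float:
    """Single name feature: the max of the three component scores."""
    return max(
        exact_match_score(left, right),
        0.5 * token_overlap(left, right),
        string_similarity(left, right),
    )


print(name_similarity({"john smith"}, {"john a smith"}))
```

Because every component is bounded by 0..1 and only the strongest one is kept, the regression sees one name signal and cannot trade the components off against each other with opposite-signed coefficients.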

TODO

  • clear feature docstrings
  • type annotations
  • See if name_token_overlap scaling is too aggressive and can be taught to chill


jbothma commented Sep 19, 2024

Comparing regression_v1 and regression_v3

Common subdirectories: nomenklatura/matching/regression_v1/__pycache__ and nomenklatura/matching/regression_v3/__pycache__
diff -u nomenklatura/matching/regression_v1/misc.py nomenklatura/matching/regression_v3/misc.py
--- nomenklatura/matching/regression_v1/misc.py	2024-09-09 11:14:58
+++ nomenklatura/matching/regression_v3/misc.py	2024-09-15 09:15:13
@@ -1,8 +1,9 @@
 from followthemoney.proxy import E
 from followthemoney.types import registry
+import numpy as np
 
 from nomenklatura.matching.regression_v1.util import tokenize_pair, compare_levenshtein
-from nomenklatura.matching.compare.util import has_overlap, extract_numbers
+from nomenklatura.matching.compare.util import has_overlap, extract_numbers, is_disjoint
 from nomenklatura.matching.util import props_pair, type_pair
 from nomenklatura.matching.util import max_in_sets, has_schema
 from nomenklatura.util import normalize_name
@@ -18,6 +19,8 @@
 def address_match(query: E, result: E) -> float:
     """Text similarity between addresses."""
     lv, rv = type_pair(query, result, registry.address)
+    if not (lv and rv):
+        return np.nan
     lvn = [normalize_name(v) for v in lv]
     rvn = [normalize_name(v) for v in rv]
     return max_in_sets(lvn, rvn, compare_levenshtein)
@@ -61,3 +64,19 @@
         return 0.0
     lv, rv = type_pair(query, result, registry.identifier)
     return 1.0 if has_overlap(lv, rv) else 0.0
+
+
+def position_country_mismatch(query: E, result: E) -> float:
+    """Whether both entities are positions with disjoint countries."""
+    if not has_schema(query, result, "Position"):
+        return 0.0
+    lv, rv = type_pair(query, result, registry.country)
+    return 1.0 if is_disjoint(lv, rv) else 0.0
+
+
+def security_isin_mismatch(query: E, result: E) -> float:
+    """Both entities are linked to different ISIN codes."""
+    if not has_schema(query, result, "Security"):
+        return 0.0
+    qv, rv = props_pair(query, result, ["isin"])
+    return 1.0 if is_disjoint(qv, rv) else 0.0
diff -u nomenklatura/matching/regression_v1/model.py nomenklatura/matching/regression_v3/model.py
--- nomenklatura/matching/regression_v1/model.py	2024-02-13 13:25:35
+++ nomenklatura/matching/regression_v3/model.py	2024-09-15 09:29:11
@@ -5,48 +5,48 @@
 from sklearn.pipeline import Pipeline  # type: ignore
 from followthemoney.proxy import E
 
-from nomenklatura.matching.regression_v1.names import first_name_match
-from nomenklatura.matching.regression_v1.names import family_name_match
-from nomenklatura.matching.regression_v1.names import name_levenshtein, name_match
-from nomenklatura.matching.regression_v1.names import name_token_overlap, name_numbers
-from nomenklatura.matching.regression_v1.misc import phone_match, email_match
-from nomenklatura.matching.regression_v1.misc import address_match, address_numbers
-from nomenklatura.matching.regression_v1.misc import identifier_match, birth_place
-from nomenklatura.matching.regression_v1.misc import org_identifier_match
-from nomenklatura.matching.compare.countries import country_mismatch
+
+from nomenklatura.matching.regression_v3.names import first_name_match, name_similarity
+from nomenklatura.matching.regression_v3.names import family_name_match
+from nomenklatura.matching.regression_v3.names import name_levenshtein, name_match
+from nomenklatura.matching.regression_v3.names import name_token_overlap, name_numbers
+from nomenklatura.matching.regression_v3.misc import phone_match, email_match, position_country_mismatch
+from nomenklatura.matching.regression_v3.misc import address_match, address_numbers
+from nomenklatura.matching.regression_v3.misc import identifier_match, birth_place
+from nomenklatura.matching.regression_v3.misc import org_identifier_match
+from nomenklatura.matching.regression_v3.misc import security_isin_mismatch
 from nomenklatura.matching.compare.gender import gender_mismatch
 from nomenklatura.matching.compare.dates import dob_matches, dob_year_matches
-from nomenklatura.matching.compare.dates import dob_year_disjoint
+from nomenklatura.matching.compare.dates import dob_year_disjoint, dob_similarity
+from nomenklatura.matching.compare.countries import country_match
 from nomenklatura.matching.types import FeatureDocs, FeatureDoc, MatchingResult
 from nomenklatura.matching.types import CompareFunction, Encoded, ScoringAlgorithm
 from nomenklatura.matching.util import make_github_url
 from nomenklatura.util import DATA_PATH
 
 
-class RegressionV1(ScoringAlgorithm):
+class RegressionV3(ScoringAlgorithm):
     """A simple matching algorithm based on a regression model."""
 
-    NAME = "regression-v1"
+    NAME = "regression-v3"
     MODEL_PATH = DATA_PATH.joinpath(f"{NAME}.pkl")
     FEATURES: List[CompareFunction] = [
-        name_match,
-        name_token_overlap,
         name_numbers,
-        name_levenshtein,
+        name_similarity,
         phone_match,
         email_match,
         identifier_match,
-        dob_matches,
-        dob_year_matches,
-        dob_year_disjoint,
+        dob_similarity,
         first_name_match,
         family_name_match,
         birth_place,
         gender_mismatch,
-        country_mismatch,
+        country_match,
+        position_country_mismatch,
         org_identifier_match,
         address_match,
         address_numbers,
+        security_isin_mismatch,
     ]
 
     @classmethod
diff -u nomenklatura/matching/regression_v1/names.py nomenklatura/matching/regression_v3/names.py
--- nomenklatura/matching/regression_v1/names.py	2024-09-09 11:14:58
+++ nomenklatura/matching/regression_v3/names.py	2024-09-18 22:39:00
@@ -1,14 +1,24 @@
+from statistics import mean
 from typing import Iterable, Set
 from followthemoney.proxy import E
 from followthemoney.types import registry
+import numpy as np
 
-from nomenklatura.matching.regression_v1.util import tokenize_pair, compare_levenshtein
+from nomenklatura.matching.regression_v3.util import tokenize_pair, compare_levenshtein
 from nomenklatura.matching.compare.util import is_disjoint, has_overlap, extract_numbers
-from nomenklatura.matching.util import props_pair, type_pair
+from nomenklatura.matching.compare.names import aligned_levenshtein, name_fingerprint_levenshtein, symmetric_aligned_levenshtein
+from nomenklatura.matching.util import has_schema, props_pair, type_pair
 from nomenklatura.matching.util import max_in_sets
 from nomenklatura.util import fingerprint_name
 
 
+MATCH_BASE_SCORE = 0.7
+MAX_BONUS_LENGTH = 100
+LENGTH_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_LENGTH
+MAX_BONUS_QTY = 10
+QTY_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_QTY
+
+
 def normalize_names(raws: Iterable[str]) -> Set[str]:
     names = set()
     for raw in raws:
@@ -21,43 +31,77 @@
 def name_levenshtein(left: E, right: E) -> float:
     """Consider the edit distance (as a fraction of name length) between the two most
     similar names linked to both entities."""
-    lv, rv = type_pair(left, right, registry.name)
-    lvn, rvn = normalize_names(lv), normalize_names(rv)
-    return max_in_sets(lvn, rvn, compare_levenshtein)
+    if has_schema(left, right, "Person"):
+        lv, rv = type_pair(left, right, registry.name)
+        lvn, rvn = normalize_names(lv), normalize_names(rv)
+        return max_in_sets(lvn, rvn, compare_levenshtein)
+    else:
+        return name_fingerprint_levenshtein(left, right, symmetric_aligned_levenshtein)
 
 
 def first_name_match(left: E, right: E) -> float:
     """Matching first/given name between the two entities."""
     lv, rv = tokenize_pair(props_pair(left, right, ["firstName"]))
+    if not (lv and rv):
+        return np.nan
     return 1.0 if has_overlap(lv, rv) else 0.0
 
 
 def family_name_match(left: E, right: E) -> float:
     """Matching family name between the two entities."""
     lv, rv = tokenize_pair(props_pair(left, right, ["lastName"]))
+    if not (lv and rv):
+        return np.nan
     return 1.0 if has_overlap(lv, rv) else 0.0
 
 
 def name_match(left: E, right: E) -> float:
-    """Check for exact name matches between the two entities."""
+    """
+    Check for exact name matches between the two entities.
+
+    Having any completely matching name initially scores 0.7 (MATCH_BASE_SCORE).
+    A length bonus is added based on the length of the longest common name up to 100 chars.
+    A quantity bonus is added based on the number of common names up to 10.
+
+    The maximum score is 1.0.
+    No matches scores 0.0.
+    """
     lv, rv = type_pair(left, right, registry.name)
     lvn, rvn = normalize_names(lv), normalize_names(rv)
-    common = [len(n) for n in lvn.intersection(rvn)]
-    max_common = max(common, default=0)
-    if max_common == 0:
+    common = sorted(lvn.intersection(rvn), key=lambda n: len(n), reverse=True)
+    if not common:
         return 0.0
-    return float(max_common)
+    score = MATCH_BASE_SCORE
+    longest_common = common[0]
+    length_bonus = min(len(longest_common), MAX_BONUS_LENGTH) * LENGTH_BONUS_FACTOR
+    quantity_bonus = min(len(common), MAX_BONUS_QTY) * QTY_BONUS_FACTOR
+    return score + (length_bonus + quantity_bonus) / 2
 
 
 def name_token_overlap(left: E, right: E) -> float:
     """Evaluate the proportion of identical words in each name."""
-    lv, rv = tokenize_pair(type_pair(left, right, registry.name))
-    common = lv.intersection(rv)
-    tokens = min(len(lv), len(rv))
-    return float(len(common)) / float(max(2.0, tokens))
+    lvt, rvt = tokenize_pair(type_pair(left, right, registry.name))
+    common = lvt.intersection(rvt)
+    tokens = min(len(lvt), len(rvt))
+    if tokens == 0:
+        return 0.0
+    return float(len(common)) / tokens
 
 
 def name_numbers(left: E, right: E) -> float:
     """Find if names contain numbers, score if the numbers are different."""
     lv, rv = type_pair(left, right, registry.name)
     return 1.0 if is_disjoint(extract_numbers(lv), extract_numbers(rv)) else 0.0
+
+
+def name_similarity(left: E, right: E) -> float:
+    """Compute the similarity between the names of two entities, picking the max from
+    a full string match, token overlap-based score, and levenshtein distance-based
+    score."""
+    return max(
+        [
+            name_match(left, right),
+            0.5 * name_token_overlap(left, right),
+            name_levenshtein(left, right),
+        ]
+    )
diff -u nomenklatura/matching/regression_v1/train.py nomenklatura/matching/regression_v3/train.py
--- nomenklatura/matching/regression_v1/train.py	2024-09-06 12:44:09
+++ nomenklatura/matching/regression_v3/train.py	2024-09-13 17:28:35
@@ -1,19 +1,20 @@
 import logging
 import numpy as np
 import multiprocessing
-from typing import Iterable, List, Tuple
+from typing import List, Tuple
 from pprint import pprint
 from numpy.typing import NDArray
 from sklearn.pipeline import make_pipeline  # type: ignore
 from sklearn.preprocessing import StandardScaler  # type: ignore
-from sklearn.model_selection import train_test_split  # type: ignore
+from sklearn.model_selection import GroupShuffleSplit  # type: ignore
 from sklearn.linear_model import LogisticRegression  # type: ignore
+from sklearn.impute import SimpleImputer  # type: ignore
 from sklearn import metrics  # type: ignore
-from concurrent.futures import ThreadPoolExecutor
+from concurrent.futures import ProcessPoolExecutor
 
 from nomenklatura.judgement import Judgement
 from nomenklatura.matching.pairs import read_pairs, JudgedPair
-from nomenklatura.matching.regression_v1.model import RegressionV1
+from nomenklatura.matching.regression_v3.model import RegressionV3
 from nomenklatura.util import PathLike
 
 log = logging.getLogger(__name__)
@@ -22,20 +23,20 @@
 def pair_convert(pair: JudgedPair) -> Tuple[List[float], int]:
     """Encode a pair of training data into features and target."""
     judgement = 1 if pair.judgement == Judgement.POSITIVE else 0
-    features = RegressionV1.encode_pair(pair.left, pair.right)
+    features = RegressionV3.encode_pair(pair.left, pair.right)
     return features, judgement
 
 
 def pairs_to_arrays(
-    pairs: Iterable[JudgedPair],
+    pairs: List[JudgedPair],
 ) -> Tuple[NDArray[np.float32], NDArray[np.float32]]:
     """Parallelize feature computation for training data"""
     xrows = []
     yrows = []
     threads = multiprocessing.cpu_count()
     log.info("Compute threads: %d", threads)
-    with ThreadPoolExecutor(max_workers=threads) as excecutor:
-        results = excecutor.map(pair_convert, pairs)
+    with ProcessPoolExecutor(max_workers=threads) as executor:
+        results = executor.map(pair_convert, pairs, chunksize=1000)
         for idx, (x, y) in enumerate(results):
             if idx > 0 and idx % 10000 == 0:
                 log.info("Computing features: %s....", idx)
@@ -45,42 +46,49 @@
     return np.array(xrows), np.array(yrows)
 
 
-def train_matcher(pairs_file: PathLike) -> None:
+def train_matcher(pairs_file: PathLike, splits: int = 1) -> None:
     pairs = []
     for pair in read_pairs(pairs_file):
-        # HACK: support more eventually:
-        # if not pair.left.schema.is_a("LegalEntity"):
-        #     continue
         if pair.judgement == Judgement.UNSURE:
             pair.judgement = Judgement.NEGATIVE
-        # randomize_entity(pair.left)
-        # randomize_entity(pair.right)
         pairs.append(pair)
-    # random.shuffle(pairs)
-    # pairs = pairs[:30000]
     positive = len([p for p in pairs if p.judgement == Judgement.POSITIVE])
     negative = len([p for p in pairs if p.judgement == Judgement.NEGATIVE])
     log.info("Total pairs loaded: %d (%d pos/%d neg)", len(pairs), positive, negative)
+
     X, y = pairs_to_arrays(pairs)
-    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
-    # logreg = LogisticRegression(class_weight={0: 95, 1: 1})
-    # logreg = LogisticRegression(penalty="l1", solver="liblinear")
-    logreg = LogisticRegression(penalty="l2")
-    log.info("Training model...")
-    pipe = make_pipeline(StandardScaler(), logreg)
-    pipe.fit(X_train, y_train)
-    coef = logreg.coef_[0]
-    coefficients = {n.__name__: c for n, c in zip(RegressionV1.FEATURES, coef)}
-    RegressionV1.save(pipe, coefficients)
-    print("Coefficients:")
-    pprint(coefficients)
-    y_pred = pipe.predict(X_test)
-    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
-    print("Confusion matrix:\n", cnf_matrix)
-    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
-    print("Precision:", metrics.precision_score(y_test, y_pred))
-    print("Recall:", metrics.recall_score(y_test, y_pred))
+    groups = [p.group for p in pairs]
+    gss = GroupShuffleSplit(n_splits=splits, test_size=0.33)
+    for split, (train_indices, test_indices) in enumerate(
+        gss.split(X, y, groups=groups), 1
+    ):
+        X_train = [X[i] for i in train_indices]
+        X_test = [X[i] for i in test_indices]
+        y_train = [y[i] for i in train_indices]
+        y_test = [y[i] for i in test_indices]
 
-    y_pred_proba = pipe.predict_proba(X_test)[::, 1]
-    auc = metrics.roc_auc_score(y_test, y_pred_proba)
-    print("Area under curve:", auc)
+        print()
+        log.info("Training model...(split %d)", split)
+        logreg = LogisticRegression(penalty="l2")
+        pipe = make_pipeline(
+            SimpleImputer(strategy="mean"),
+            StandardScaler(),
+            logreg,
+        )
+        pipe.fit(X_train, y_train)
+        coef = logreg.coef_[0]
+        coefficients = {n.__name__: c for n, c in zip(RegressionV3.FEATURES, coef)}
+        RegressionV3.save(pipe, coefficients)
+
+        print("Coefficients:")
+        pprint(coefficients)
+        y_pred = pipe.predict(X_test)
+        cnf_matrix = metrics.confusion_matrix(y_test, y_pred, normalize="all") * 100
+        print("Confusion matrix (% of all):\n", cnf_matrix)
+        print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
+        print("Precision:", metrics.precision_score(y_test, y_pred))
+        print("Recall:", metrics.recall_score(y_test, y_pred))
+
+        y_pred_proba = pipe.predict_proba(X_test)[::, 1]
+        auc = metrics.roc_auc_score(y_test, y_pred_proba)
+        print("Area under curve:", auc)
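
The group-based split in train.py can be illustrated without scikit-learn. This is a simplified sketch, not the `GroupShuffleSplit` implementation: `group_split` is a hypothetical helper, the group keys are made up, and it assumes each pair carries one group key (as `p.group` does above) shared by all pairs of connected entities.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable], test_size: float = 0.33, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assign whole groups to either train or test and return row indices."""
    unique = sorted(set(groups), key=repr)
    if len(unique) < 2:
        raise ValueError("need at least two groups to split")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # test_size is a fraction of groups, not of rows; keep one group for train.
    n_test = min(len(unique) - 1, max(1, round(len(unique) * test_size)))
    test_groups = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


# Hypothetical group keys, one per judged pair.
groups = ["a", "a", "b", "c", "c", "c", "d"]
train_idx, test_idx = group_split(groups)
print(train_idx, test_idx)
```

No group key appears on both sides, so pairs involving the same connected entities cannot leak from the training set into the test set.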

@jbothma changed the title from "Reg v3 base fix odd coefficients" to "Regression v3 matcher" on Oct 31, 2024
@jbothma marked this pull request as ready for review on December 10, 2024 15:42