
QUORA QUESTION PAIR SIMILARITY PROJECT NOTES

BASICS

  • train.csv has the data
  • features - id(int), qid1(int), qid2(int), question1(string), question2(string)
  • label - is_duplicate(int)
  • this is a binary classification problem
  • metrics used are log loss and the binary confusion matrix

EDA

  • bar chart of number of duplicate and non duplicate question pairs
  • bar chart of number of questions which are repeated and not repeated in the entire dataset
  • checking for duplicate data points in the dataset
  • log-scale histogram of question frequency (i.e. how many times a single question repeats in the dataset)
  • check for null-valued data points and fill the nulls with the empty string, since the missing values are in the string columns
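A minimal EDA sketch along these lines (assuming the standard Kaggle train.csv with the columns listed above):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Class balance: duplicate vs non-duplicate pairs
df["is_duplicate"].value_counts().plot(kind="bar", title="duplicate vs non-duplicate")
plt.show()

# How often each question id appears anywhere in the dataset
qid_counts = pd.concat([df["qid1"], df["qid2"]]).value_counts()
print("questions appearing more than once:", (qid_counts > 1).sum())

# Log-scale histogram of question frequency
plt.hist(qid_counts.values, bins=160)
plt.yscale("log")
plt.xlabel("question frequency")
plt.ylabel("count (log scale)")
plt.show()

# Nulls occur only in the string columns, so fill them with the empty string
print(df.isnull().sum())
df = df.fillna("")
```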

FEATURE EXTRACTION

  • freq_qid1 = Frequency of qid1's
  • freq_qid2 = Frequency of qid2's
  • q1len = Length of q1
  • q2len = Length of q2
  • q1_n_words = Number of words in Question 1
  • q2_n_words = Number of words in Question 2
  • word_Common = number of common unique words in Question 1 and Question 2
  • word_Total = total number of words in Question 1 + total number of words in Question 2
  • word_share = word_Common / word_Total (roughly an intersection-over-union ratio)
  • freq_q1+freq_q2 = sum of the frequencies of qid1 and qid2
  • freq_q1-freq_q2 = absolute difference of the frequencies of qid1 and qid2
  • The average word_share and number of common words are higher when the pair is a duplicate (similar)
  • The distributions of the word_Common feature for similar and non-similar pairs overlap heavily
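A sketch of how these basic features can be computed (a minimal version; df is the frame from the EDA sketch above, and the helper below mirrors the feature names):

```python
def word_share_features(row):
    # Unique lowercase words in each question
    w1 = set(str(row["question1"]).lower().split())
    w2 = set(str(row["question2"]).lower().split())
    common = len(w1 & w2)                     # word_Common
    total = len(w1) + len(w2)                 # word_Total
    share = common / total if total else 0.0  # word_share
    return pd.Series([common, total, share])

qid_counts = pd.concat([df["qid1"], df["qid2"]]).value_counts()
df["freq_qid1"] = df["qid1"].map(qid_counts)
df["freq_qid2"] = df["qid2"].map(qid_counts)
df["q1len"] = df["question1"].str.len()
df["q2len"] = df["question2"].str.len()
df["q1_n_words"] = df["question1"].apply(lambda q: len(str(q).split()))
df["q2_n_words"] = df["question2"].apply(lambda q: len(str(q).split()))
df[["word_Common", "word_Total", "word_share"]] = df.apply(word_share_features, axis=1)
df["freq_q1+freq_q2"] = df["freq_qid1"] + df["freq_qid2"]
df["freq_q1-freq_q2"] = (df["freq_qid1"] - df["freq_qid2"]).abs()
```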

PREPROCESSING OF TEXT

  • remove html tags
  • remove punctuations
  • perform stemming
  • remove stopwords
  • expand contractions (e.g. wasn't -> was not)
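A minimal preprocessing sketch (assuming BeautifulSoup and NLTK; the contraction map is truncated for illustration, and contractions are expanded before punctuation removal so the apostrophes are still present):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
CONTRACTIONS = {"wasn't": "was not", "can't": "can not", "won't": "will not"}  # truncated

def preprocess(text):
    text = str(text).lower()
    text = BeautifulSoup(text, "html.parser").get_text()      # remove HTML tags
    for contraction, expanded in CONTRACTIONS.items():        # expand contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"[^\w\s]", " ", text)                      # remove punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]  # remove stopwords
    return " ".join(STEMMER.stem(w) for w in words)           # perform stemming
```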

    ADVANCED FEATURE EXTRACTION

    Definitions
  • Token: obtained by splitting the sentence on spaces
  • Stop_Word: a stop word as per NLTK
  • Word: a token that is not a stop word
    Features
  • cwc_min : ratio of common_word_count to the minimum word count of Q1 and Q2
  • cwc_min = common_word_count / min(len(q1_words), len(q2_words))
  • cwc_max : ratio of common_word_count to the maximum word count of Q1 and Q2
  • cwc_max = common_word_count / max(len(q1_words), len(q2_words))
  • csc_min : ratio of common_stop_count to the minimum stop-word count of Q1 and Q2
  • csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
  • csc_max : ratio of common_stop_count to the maximum stop-word count of Q1 and Q2
  • csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
  • ctc_min : ratio of common_token_count to the minimum token count of Q1 and Q2
  • ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
  • ctc_max : ratio of common_token_count to the maximum token count of Q1 and Q2
  • ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
  • last_word_eq : check whether the last word of both questions is equal
  • last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
  • first_word_eq : check whether the first word of both questions is equal
  • first_word_eq = int(q1_tokens[0] == q2_tokens[0])
  • abs_len_diff : Abs. length difference
  • abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
  • mean_len : Average Token Length of both Questions
  • mean_len = (len(q1_tokens) + len(q2_tokens))/2
  • fuzz_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • fuzz_partial_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • token_sort_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • token_set_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • longest_substr_ratio : ratio of the length of the longest common substring to the minimum token count of Q1 and Q2
  • longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
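A sketch of these token and fuzzy features (assuming fuzzywuzzy; SAFE_DIV is a small constant guarding against empty questions, an implementation detail assumed here):

```python
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz

SAFE_DIV = 1e-4  # guards against division by zero on empty questions (assumed detail)

def token_features(q1, q2, stop_words):
    q1_tokens, q2_tokens = q1.split(), q2.split()
    if not q1_tokens or not q2_tokens:
        return [0.0] * 8
    q1_words = {t for t in q1_tokens if t not in stop_words}
    q2_words = {t for t in q2_tokens if t not in stop_words}
    q1_stops = {t for t in q1_tokens if t in stop_words}
    q2_stops = {t for t in q2_tokens if t in stop_words}
    cwc = len(q1_words & q2_words)              # common_word_count
    csc = len(q1_stops & q2_stops)              # common_stop_count
    ctc = len(set(q1_tokens) & set(q2_tokens))  # common_token_count
    return [
        cwc / (min(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_min
        cwc / (max(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_max
        csc / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_min
        csc / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_max
        ctc / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),  # ctc_min
        ctc / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),  # ctc_max
        int(q1_tokens[-1] == q2_tokens[-1]),                     # last_word_eq
        int(q1_tokens[0] == q2_tokens[0]),                       # first_word_eq
    ]

def fuzzy_features(q1, q2):
    return [fuzz.ratio(q1, q2), fuzz.partial_ratio(q1, q2),
            fuzz.token_sort_ratio(q1, q2), fuzz.token_set_ratio(q1, q2)]

def longest_substr_ratio(q1, q2):
    match = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
    min_tokens = min(len(q1.split()), len(q2.split()))
    return match.size / (min_tokens + SAFE_DIV)
```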

    ANALYSIS OF EXTRACTED FEATURES

  • word cloud of duplicate question pairs is plotted
  • word cloud of non duplicate question pairs is plotted
  • a total of 15 NLP features have now been created
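A minimal word-cloud sketch (assuming the wordcloud package and df from the EDA sketch above; repeat with is_duplicate == 0 for the non-duplicate cloud):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud over the duplicate pairs
dup_text = " ".join(df.loc[df["is_duplicate"] == 1, "question1"].astype(str))
cloud = WordCloud(width=800, height=400, background_color="white").generate(dup_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```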

    VISUALIZATION

  • t-SNE and PCA are used to visualize the 15-d data in 3-d
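A projection sketch (assuming nlp_features holds the 15 NLP feature columns; perplexity=30 is just a typical starting point):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(nlp_features)  # the 15 NLP features, scaled to [0, 1]
X_pca = PCA(n_components=3).fit_transform(X)
X_tsne = TSNE(n_components=3, perplexity=30, random_state=42).fit_transform(X)
```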

    VECTORIZATION

  • Used TF-IDF for vectorization and built a dictionary (key: word, value: tf-idf score)
  • Used TF-IDF weighted word2vec, with the GloVe model instead of Google's Word2Vec model
  • This is done separately for all question1's and then for all question2's
  • The resulting text vectors have 96 dimensions each (for both question1 and question2)
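A sketch of the TF-IDF weighted averaging (the spaCy en_core_web_sm model is an assumption here; its token vectors happen to be 96-dimensional, matching the dimensionality above, and the idf scores serve as the word weights):

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")  # assumed model; its token vectors are 96-d

tfidf = TfidfVectorizer()
tfidf.fit(list(df["question1"]) + list(df["question2"]))
word2tfidf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # key: word, value: score

def tfidf_weighted_vector(question):
    # Weighted average of token vectors, weights from the tf-idf dictionary
    doc = nlp(str(question))
    vec = np.zeros(96)
    weight_sum = 0.0
    for tok in doc:
        weight = word2tfidf.get(str(tok).lower(), 0.0)
        vec += tok.vector * weight
        weight_sum += weight
    return vec / weight_sum if weight_sum else vec
```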

    FINAL DATAFRAME PREPARATION

    Finally, these four dataframes are merged:
  • Dataframe on which simple preprocessing is done
  • Dataframe which has the NLP features
  • Dataframe which has the vectors for all question1's
  • Dataframe which has the vectors for all question2's
  • The final DataFrame has 218 features.
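A merge sketch (the frame names below are illustrative; each frame is assumed to carry the pair id column):

```python
# df_basic: simple preprocessing + basic features
# df_nlp: the 15 NLP features
# df_q1_vec / df_q2_vec: 96-d vectors for question1 / question2
final_df = (df_basic.merge(df_nlp, on="id")
                    .merge(df_q1_vec, on="id")
                    .merge(df_q2_vec, on="id"))
print(final_df.shape)  # expect 218 feature columns plus id and label
```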

    TRAIN TEST SPLIT

  • Training data -> 70%
  • Testing data -> 30%
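A minimal split sketch (stratifying on the label is an assumption, but it preserves the class balance in both splits):

```python
from sklearn.model_selection import train_test_split

y = final_df["is_duplicate"]
X = final_df.drop(columns=["id", "is_duplicate"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```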

    MODELS

  • A random model is built, which gives a log loss of 0.89; this is the worst-case baseline (see the combined sketch after this list)
  • A Logistic Regression model is used; the hyperparameter tuned is alpha, L2 regularization is applied, and calibrated classifier cross-validation is used for probability calibration.
    Confusion matrix is plotted
    Best log loss = 0.43
  • An SVM classifier is used; the hyperparameter tuned is alpha, L1 regularization is applied, and calibrated classifier cross-validation is used for probability calibration.
    Confusion matrix is plotted
    Best log loss = 0.44
  • An XGBoost model is used
    Confusion matrix is plotted
    Best log loss = 0.36 (without much hyperparameter tuning)
    With XGBoost v1, log loss = 0.33
  • A Decision Tree model is tuned using RandomizedSearchCV over some parameters
    Confusion matrix is plotted
    Best log loss = 0.41
  • A Random Forest model is tuned using RandomizedSearchCV over some parameters
    Confusion matrix is plotted
    Best log loss = 0.43
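A combined sketch of the baseline and the calibrated models referenced above (SGDClassifier with log loss approximates logistic regression; swapping loss="hinge" and penalty="l1" gives the SVM variant; the alpha grid and XGBoost parameters are illustrative, not the tuned values):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, log_loss
from xgboost import XGBClassifier

# Random baseline: random class probabilities give the worst-case log loss
rng = np.random.default_rng(42)
p = rng.random(len(y_test))
print("random model log loss:", log_loss(y_test, np.column_stack([1 - p, p])))

# Logistic regression (log loss, L2), alpha tuned, probabilities calibrated
for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]:
    base = SGDClassifier(loss="log_loss", penalty="l2",  # loss="log" in older sklearn
                         alpha=alpha, random_state=42)
    clf = CalibratedClassifierCV(base, method="sigmoid")
    clf.fit(X_train, y_train)
    print(f"alpha={alpha}: log loss={log_loss(y_test, clf.predict_proba(X_test)):.3f}")
# SVM variant: SGDClassifier(loss="hinge", penalty="l1", ...) in the same loop

# XGBoost with light tuning
xgb = XGBClassifier(n_estimators=400, max_depth=6, eval_metric="logloss")
xgb.fit(X_train, y_train)
print("xgboost log loss:", log_loss(y_test, xgb.predict_proba(X_test)))
print(confusion_matrix(y_test, xgb.predict(X_test)))
```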