
QUORA QUESTION PAIR SIMILARITY PROJECT NOTES

BASICS

  • train.csv has the data
  • features - id(int), qid1(int), qid2(int), question1(string), question2(string)
  • label - is_duplicate(int)
  • this is a binary classification problem
  • metrics used are log loss and the binary confusion matrix

EDA

  • bar chart of number of duplicate and non duplicate question pairs
  • bar chart of number of questions which are repeated and not repeated in the entire dataset
  • checking for duplicate data points in the dataset
  • log-scale histogram of question frequency (i.e. how many times a single question repeats in the dataset)
  • check for null-valued data points and fill the nulls with the empty string, since the missing values are in the string columns
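A minimal EDA sketch along these lines (assuming the standard Kaggle train.csv with the columns listed above):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Class balance: duplicate vs non-duplicate pairs
df["is_duplicate"].value_counts().plot(kind="bar", title="duplicate vs non-duplicate")
plt.show()

# How often each question id appears anywhere in the dataset
qid_counts = pd.concat([df["qid1"], df["qid2"]]).value_counts()
print("questions appearing more than once:", (qid_counts > 1).sum())

# Log-scale histogram of question frequency
plt.hist(qid_counts.values, bins=160)
plt.yscale("log")
plt.xlabel("question frequency")
plt.ylabel("count (log scale)")
plt.show()

# Nulls occur only in the string columns, so fill them with the empty string
print(df.isnull().sum())
df = df.fillna("")
```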

FEATURE EXTRACTION

  • freq_qid1 = Frequency of qid1's
  • freq_qid2 = Frequency of qid2's
  • q1len = Length of q1
  • q2len = Length of q2
  • q1_n_words = Number of words in Question 1
  • q2_n_words = Number of words in Question 2
  • word_Common = number of common unique words in Question 1 and Question 2
  • word_Total = total number of words in Question 1 + total number of words in Question 2
  • word_share = word_Common / word_Total (roughly an intersection-over-union ratio)
  • freq_q1+freq_q2 = sum of the frequencies of qid1 and qid2
  • freq_q1-freq_q2 = absolute difference of the frequencies of qid1 and qid2
  • The average word_share and number of common words are higher when the pair is a duplicate (similar)
  • The distributions of the word_Common feature for similar and non-similar pairs overlap heavily
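A sketch of how these basic features can be computed (a minimal version; df is the frame from the EDA sketch above, and the helper below mirrors the feature names):

```python
def word_share_features(row):
    # Unique lowercase words in each question
    w1 = set(str(row["question1"]).lower().split())
    w2 = set(str(row["question2"]).lower().split())
    common = len(w1 & w2)                     # word_Common
    total = len(w1) + len(w2)                 # word_Total
    share = common / total if total else 0.0  # word_share
    return pd.Series([common, total, share])

qid_counts = pd.concat([df["qid1"], df["qid2"]]).value_counts()
df["freq_qid1"] = df["qid1"].map(qid_counts)
df["freq_qid2"] = df["qid2"].map(qid_counts)
df["q1len"] = df["question1"].str.len()
df["q2len"] = df["question2"].str.len()
df["q1_n_words"] = df["question1"].apply(lambda q: len(str(q).split()))
df["q2_n_words"] = df["question2"].apply(lambda q: len(str(q).split()))
df[["word_Common", "word_Total", "word_share"]] = df.apply(word_share_features, axis=1)
df["freq_q1+freq_q2"] = df["freq_qid1"] + df["freq_qid2"]
df["freq_q1-freq_q2"] = (df["freq_qid1"] - df["freq_qid2"]).abs()
```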

PREPROCESSING OF TEXT

  • remove html tags
  • remove punctuations
  • perform stemming
  • remove stopwords
  • expand contractions (e.g. wasn't -> was not)
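A minimal preprocessing sketch (assuming BeautifulSoup and NLTK; the contraction map is truncated for illustration, and contractions are expanded before punctuation removal so the apostrophes are still present):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
CONTRACTIONS = {"wasn't": "was not", "can't": "can not", "won't": "will not"}  # truncated

def preprocess(text):
    text = str(text).lower()
    text = BeautifulSoup(text, "html.parser").get_text()      # remove HTML tags
    for contraction, expanded in CONTRACTIONS.items():        # expand contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"[^\w\s]", " ", text)                      # remove punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]  # remove stopwords
    return " ".join(STEMMER.stem(w) for w in words)           # perform stemming
```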

    ADVANCED FEATURE EXTRACTION

    Definitions
  • Token: obtained by splitting the sentence on spaces
  • Stop_Word: a stop word as per NLTK
  • Word: a token that is not a stop word
    Features
  • cwc_min : ratio of common_word_count to the minimum word count of Q1 and Q2
  • cwc_min = common_word_count / min(len(q1_words), len(q2_words))
  • cwc_max : ratio of common_word_count to the maximum word count of Q1 and Q2
  • cwc_max = common_word_count / max(len(q1_words), len(q2_words))
  • csc_min : ratio of common_stop_count to the minimum stop-word count of Q1 and Q2
  • csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
  • csc_max : ratio of common_stop_count to the maximum stop-word count of Q1 and Q2
  • csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
  • ctc_min : ratio of common_token_count to the minimum token count of Q1 and Q2
  • ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
  • ctc_max : ratio of common_token_count to the maximum token count of Q1 and Q2
  • ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
  • last_word_eq : check whether the last word of both questions is equal
  • last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
  • first_word_eq : check whether the first word of both questions is equal
  • first_word_eq = int(q1_tokens[0] == q2_tokens[0])
  • abs_len_diff : Abs. length difference
  • abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
  • mean_len : Average Token Length of both Questions
  • mean_len = (len(q1_tokens) + len(q2_tokens))/2
  • fuzz_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • fuzz_partial_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • token_sort_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • token_set_ratio : http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  • longest_substr_ratio : ratio of the length of the longest common substring to the minimum token count of Q1 and Q2
  • longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
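A sketch of these token and fuzzy features (assuming fuzzywuzzy; SAFE_DIV is a small constant guarding against empty questions, an implementation detail assumed here):

```python
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz

SAFE_DIV = 1e-4  # guards against division by zero on empty questions (assumed detail)

def token_features(q1, q2, stop_words):
    q1_tokens, q2_tokens = q1.split(), q2.split()
    if not q1_tokens or not q2_tokens:
        return [0.0] * 8
    q1_words = {t for t in q1_tokens if t not in stop_words}
    q2_words = {t for t in q2_tokens if t not in stop_words}
    q1_stops = {t for t in q1_tokens if t in stop_words}
    q2_stops = {t for t in q2_tokens if t in stop_words}
    cwc = len(q1_words & q2_words)              # common_word_count
    csc = len(q1_stops & q2_stops)              # common_stop_count
    ctc = len(set(q1_tokens) & set(q2_tokens))  # common_token_count
    return [
        cwc / (min(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_min
        cwc / (max(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_max
        csc / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_min
        csc / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_max
        ctc / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),  # ctc_min
        ctc / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),  # ctc_max
        int(q1_tokens[-1] == q2_tokens[-1]),                     # last_word_eq
        int(q1_tokens[0] == q2_tokens[0]),                       # first_word_eq
    ]

def fuzzy_features(q1, q2):
    return [fuzz.ratio(q1, q2), fuzz.partial_ratio(q1, q2),
            fuzz.token_sort_ratio(q1, q2), fuzz.token_set_ratio(q1, q2)]

def longest_substr_ratio(q1, q2):
    match = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
    min_tokens = min(len(q1.split()), len(q2.split()))
    return match.size / (min_tokens + SAFE_DIV)
```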

    ANALYSIS OF EXTRACTED FEATURES

  • word cloud of duplicate question pairs is plotted
  • word cloud of non duplicate question pairs is plotted
  • a total of 15 NLP features have now been created
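A minimal word-cloud sketch (assuming the wordcloud package and df from the EDA sketch above; repeat with is_duplicate == 0 for the non-duplicate cloud):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud over the duplicate pairs
dup_text = " ".join(df.loc[df["is_duplicate"] == 1, "question1"].astype(str))
cloud = WordCloud(width=800, height=400, background_color="white").generate(dup_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```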

    VISUALIZATION

  • t-SNE and PCA are used to visualize the 15-d data in 3-d
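A projection sketch (assuming nlp_features holds the 15 NLP feature columns; perplexity=30 is just a typical starting point):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(nlp_features)  # the 15 NLP features, scaled to [0, 1]
X_pca = PCA(n_components=3).fit_transform(X)
X_tsne = TSNE(n_components=3, perplexity=30, random_state=42).fit_transform(X)
```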

    VECTORIZATION

  • Used TF-IDF for vectorization and built a dictionary (key: word, value: tf-idf score)
  • Used TF-IDF weighted word2vec, with the GloVe model instead of Google's Word2Vec model
  • This is done separately for all question1's and then for all question2's
  • The resulting text vectors have 96 dimensions each (for both question1 and question2)
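A sketch of the TF-IDF weighted averaging (the spaCy en_core_web_sm model is an assumption here; its token vectors happen to be 96-dimensional, matching the dimensionality above, and the idf scores serve as the word weights):

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")  # assumed model; its token vectors are 96-d

tfidf = TfidfVectorizer()
tfidf.fit(list(df["question1"]) + list(df["question2"]))
word2tfidf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # key: word, value: score

def tfidf_weighted_vector(question):
    # Weighted average of token vectors, weights from the tf-idf dictionary
    doc = nlp(str(question))
    vec = np.zeros(96)
    weight_sum = 0.0
    for tok in doc:
        weight = word2tfidf.get(str(tok).lower(), 0.0)
        vec += tok.vector * weight
        weight_sum += weight
    return vec / weight_sum if weight_sum else vec
```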

    FINAL DATAFRAME PREPARATION

    Finally, these four dataframes are merged:
  • Dataframe on which simple preprocessing is done
  • Dataframe which has the NLP features
  • Dataframe which has the vectors for all question1's
  • Dataframe which has the vectors for all question2's
  • The final DataFrame has 218 features.
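A merge sketch (the frame names below are illustrative; each frame is assumed to carry the pair id column):

```python
# df_basic: simple preprocessing + basic features
# df_nlp: the 15 NLP features
# df_q1_vec / df_q2_vec: 96-d vectors for question1 / question2
final_df = (df_basic.merge(df_nlp, on="id")
                    .merge(df_q1_vec, on="id")
                    .merge(df_q2_vec, on="id"))
print(final_df.shape)  # expect 218 feature columns plus id and label
```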

    TRAIN TEST SPLIT

  • Training data -> 70%
  • Testing data -> 30%
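A minimal split sketch (stratifying on the label is an assumption, but it preserves the class balance in both splits):

```python
from sklearn.model_selection import train_test_split

y = final_df["is_duplicate"]
X = final_df.drop(columns=["id", "is_duplicate"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```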

    MODELS

  • A random model is built, which gives a log loss of 0.89; this is the worst-case baseline (see the combined sketch after this list)
  • A Logistic Regression model is used; the hyperparameter tuned is alpha, L2 regularization is applied, and calibrated classifier cross-validation is used for probability calibration.
    Confusion matrix is plotted
    Best log loss = 0.43
  • An SVM classifier is used; the hyperparameter tuned is alpha, L1 regularization is applied, and calibrated classifier cross-validation is used for probability calibration.
    Confusion matrix is plotted
    Best log loss = 0.44
  • An XGBoost model is used
    Confusion matrix is plotted
    Best log loss = 0.36 (without much hyperparameter tuning)
    With XGBoost v1, log loss = 0.33
  • A Decision Tree model is tuned using RandomizedSearchCV over some parameters
    Confusion matrix is plotted
    Best log loss = 0.41
  • A Random Forest model is tuned using RandomizedSearchCV over some parameters
    Confusion matrix is plotted
    Best log loss = 0.43
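A combined sketch of the baseline and the calibrated models referenced above (SGDClassifier with log loss approximates logistic regression; swapping loss="hinge" and penalty="l1" gives the SVM variant; the alpha grid and XGBoost parameters are illustrative, not the tuned values):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, log_loss
from xgboost import XGBClassifier

# Random baseline: random class probabilities give the worst-case log loss
rng = np.random.default_rng(42)
p = rng.random(len(y_test))
print("random model log loss:", log_loss(y_test, np.column_stack([1 - p, p])))

# Logistic regression (log loss, L2), alpha tuned, probabilities calibrated
for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]:
    base = SGDClassifier(loss="log_loss", penalty="l2",  # loss="log" in older sklearn
                         alpha=alpha, random_state=42)
    clf = CalibratedClassifierCV(base, method="sigmoid")
    clf.fit(X_train, y_train)
    print(f"alpha={alpha}: log loss={log_loss(y_test, clf.predict_proba(X_test)):.3f}")
# SVM variant: SGDClassifier(loss="hinge", penalty="l1", ...) in the same loop

# XGBoost with light tuning
xgb = XGBClassifier(n_estimators=400, max_depth=6, eval_metric="logloss")
xgb.fit(X_train, y_train)
print("xgboost log loss:", log_loss(y_test, xgb.predict_proba(X_test)))
print(confusion_matrix(y_test, xgb.predict(X_test)))
```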