This repository collects my NLP-related work.
Each directory should contain a README file to help guide you through the code. So far, this is what I have included:
-
text_classification_DL_battle
Amazon Reviews classification (score prediction) using Hierarchical Attention Networks (Zichao Yang et al., 2016), BERT models from Hugging Face's transformers library and the fastai text API.
-
text_classification_HAN
Amazon Reviews classification (score prediction) using Hierarchical Attention Networks (Zichao Yang et al., 2016). I have also used a number of dropout mechanisms from the paper Regularizing and Optimizing LSTM Language Models (Stephen Merity, Nitish Shirish Keskar and Richard Socher, 2017). The companion Medium post can be found here.
-
text_classification_without_DL
Predicting the review score for Amazon reviews (Shoes, Clothes and Jewelry) using tf-idf, LDA and EnsembleTopics, along with lightGBM and hyperopt for the final classification and hyper-parameter optimization. I placed special emphasis on the text preprocessing.
-
text_classification_EDA
Amazon Reviews classification using tf-idf and EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (Jason Wei and Kai Zou, 2019), along with lightGBM and hyperopt for the final classification and hyper-parameter optimization. Following the philosophy of the previous exercise, I placed some emphasis on the text preprocessing, in particular on the use of certain tokenizers.
-
rnn_character_tagging
Tagging at the character level using RNNs, with the aim of distinguishing, for example, between different programming languages or writing styles. The code here is based on a post by Nadbor.
-
textrank
A very simple text summarization approach using the PageRank algorithm via the networkx package, comparing the results with a proper TextRank implementation (Variations of the Similarity Function of TextRank for Automated Summarization, Federico Barrios et al., 2016).
-
text_classification_CNN_with_tf
This directory contains very old Tensorflow code using the 20_newsgroup dataset. My aim back then was simply to illustrate 3 different ways of building a convolutional neural network for text classification using Tensorflow. The last time I checked (October 2019) the code still ran, but if you run it you will get every possible warning to upgrade. This directory is mostly a way for me to keep track of the things I do.
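To give a flavour of the idea behind the textrank directory, here is a minimal sketch of PageRank-based extractive summarization. It is not the code in that directory: it uses only the Python standard library (a hand-written power iteration instead of networkx), a naive sentence splitter, and the word-overlap similarity from the original TextRank formulation. All function names (split_sentences, similarity, pagerank, summarize) are hypothetical and chosen for illustration only.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word overlap normalised by the log of the sentence lengths."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    if denom <= 0:  # both sentences contain a single distinct word
        return 0.0
    return len(wa & wb) / denom


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    Simplification: nodes without outgoing weight just lose their mass,
    whereas library implementations usually redistribute it.
    """
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(text, n_sentences=2):
    """Return the top-ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    example = (
        "Cats chase mice in the garden. "
        "Mice hide from cats in the garden shed. "
        "The stock market fell sharply today."
    )
    print(summarize(example, n_sentences=1))
```

Sentences that share many words with the rest of the text end up with the highest scores, so the off-topic sentence in the example is the one left out of the summary.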
For any comments or suggestions, please email jrzaurin@gmail.com or, even better, open an issue.