Cryptocurrency Return Prediction Using Investor Sentiment Extracted by BERT-Based Classifiers From News Articles, Reddit Posts and Tweets
Master's thesis project for the program of M.Sc. Economics and Management Science at Humboldt University of Berlin
--by Duygu Ider https://www.linkedin.com/in/duyguider/
Please find the paper here: https://arxiv.org/abs/2204.05781
Outline of the project and what each script/notebook does:
PART 1: BERT-Based Sentiment Classification
- price_data_scrape.ipynb - Scrape price data for Bitcoin and Ethereum
- news_scraper_final.ipynb, reddit_scraper_final.ipynb, twitter_scraper_final.py - Scrape news articles, Reddit posts and tweets
- weak_labels_approach.py - Take the Financial PhraseBank data (Malo et al., 2014), label it with pseudo-labels predicted by the BART zero-shot classifier, fit a BERT-based classifier on those labels, and evaluate model performance in the weak-label setting
- combine_text_data_zsc_finbert.py - Combine the price and text data into a single dataset, then predict sentiment using the zero-shot classifier (BART) and FinBERT to assign weak labels
- bert_crypto_hyperparam_optimal_and_zsc.ipynb - Run a grid search over the hyperparameters used to fine-tune the BERT-based classifiers. The implemented models are BERT-Unfrozen, BERT-Frozen and BERT-Context
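The grid search step in the last item can be illustrated with a minimal, library-free sketch. It is not the notebook's code: the hyperparameter names, the grid values and the `toy_evaluate` function are assumptions made up for illustration. In the real pipeline, the evaluation step would fine-tune a BERT-based classifier and return a validation metric.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Optional, Tuple


def grid_search(
    grid: Dict[str, List[Any]],
    evaluate: Callable[[Dict[str, Any]], float],
) -> Tuple[Optional[Dict[str, Any]], float]:
    """Try every combination in the grid and keep the highest-scoring one."""
    names = list(grid)
    best_params: Optional[Dict[str, Any]] = None
    best_score = float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params: Dict[str, Any]) -> float:
    """Hypothetical stand-in for 'fine-tune the model, return validation score'.

    The score peaks at learning_rate=3e-5 and batch_size=32 by construction.
    """
    lr_penalty = abs(params["learning_rate"] - 3e-5) * 1e4
    batch_penalty = abs(params["batch_size"] - 32) / 100
    return -(lr_penalty + batch_penalty)


# Assumed example grid; the values actually searched in the notebook may differ.
grid = {"learning_rate": [2e-5, 3e-5, 5e-5], "batch_size": [16, 32]}
best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params)
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added hyperparameter. With fine-tuning as the evaluation step, a small grid is usually the practical choice.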
PART 2: Return Prediction and Trading Simulation
- data_prep_for_financial_models.py - Prepare the combined dataset as an input for the financial models. Add price, macroeconomic, blockchain features and weekday dummies
- return_prediction_trading_simulation.ipynb - Load data, add technical analysis features to the dataset, lag the defined features by a given number of periods, plot some intermediate outputs, perform elimination by variance inflation factor to analyze sentiment feature contribution, fit all cryptocurrency return predictors using Bayesian hyperparameter optimization, perform trading simulation over multiple test periods, create a clearly defined output table of all prediction results
- return_prediction_trading_simulation(rnn_added_pipeline_implemented).ipynb - Same pipeline as the previous notebook, with RNN and LSTM added as financial forecasting models
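Two of the steps listed above, feature lagging and trading simulation, can be sketched in a few lines of plain Python. This is an illustrative sketch only. The function names `lag` and `simulate_long_or_cash` are hypothetical, and the trading rule (hold the asset when the predicted return is positive, otherwise stay in cash) is an assumed simple strategy, not necessarily the one used in the notebooks.

```python
from typing import List, Optional, Sequence


def lag(values: Sequence[float], k: int) -> List[Optional[float]]:
    """Shift a series by k steps so position t holds the value observed at t - k.

    The first k positions have no past value and are filled with None.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(values)
    return [None] * min(k, n) + list(values[: max(n - k, 0)])


def simulate_long_or_cash(
    predicted_returns: Sequence[float],
    realized_returns: Sequence[float],
) -> float:
    """Return the final wealth multiple of an assumed long-or-cash rule.

    Starting from 1.0, the position is long for a period when the model
    predicts a positive return, and in cash (zero return) otherwise.
    Transaction costs are ignored in this sketch.
    """
    wealth = 1.0
    for predicted, realized in zip(predicted_returns, realized_returns):
        if predicted > 0:
            wealth *= 1.0 + realized
    return wealth


# Made-up numbers for illustration only.
sentiment = [0.2, -0.1, 0.4]
print(lag(sentiment, 1))  # yesterday's sentiment aligned with today's row

final_wealth = simulate_long_or_cash(
    predicted_returns=[0.01, -0.02, 0.03],
    realized_returns=[0.10, -0.50, 0.10],
)
print(final_wealth)
```

Lagging ensures that a prediction for period t only uses information available before t, which keeps the simulated trades free of look-ahead bias.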