# -*- coding: utf-8 -*-
"""Text pre-processing functions.

Usage
-----
This module is used by pipeline/clean.py and pipeline/retrain.py.

Using remove_link_lemma()
-------------------------
Import it with:

    from text_preprocessing import remove_link_lemma

This gives you one new callable, remove_link_lemma, to use when
initializing a vectorizer.

(1) Indirect usage:

The repo's pre-built sentiment analysis model indirectly calls
remove_link_lemma during model.predict(). The pre-built model can be
loaded with:

    MODEL_FILEID = '1ydeM6Tiamck5sF8oMDThZIRb0xQu7Nqd'
    MODEL_URL = 'https://drive.google.com/uc?id=' + MODEL_FILEID
    MODEL_OUTPUT = 'model.pickle'

    # Download the model from Google Drive
    gdown.download(MODEL_URL, MODEL_OUTPUT, quiet=False)

    # Load the model into the session
    with open(MODEL_OUTPUT, 'rb') as infile:
        model = pickle.load(infile)

(2) Direct usage:

Initialize a new vectorizer instance with:

    tfidf = TfidfVectorizer(lowercase=False,
                            ngram_range=(2, 2),
                            preprocessor=remove_link_lemma)

or use it in a pipeline:

    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(lowercase=False,
                                  ngram_range=(2, 2),
                                  preprocessor=remove_link_lemma)),  # vectorizer
        ('nb', BernoulliNB()),  # build model
    ])
"""
import re

import nltk
from nltk.stem import WordNetLemmatizer

# Ensure the WordNet corpus is available for the lemmatizer
nltk.download('wordnet')

wnl = WordNetLemmatizer()

def lemmatizing(text):
    """Lemmatize each word in a string with nltk's WordNetLemmatizer.

    e.g. 'wolves' -> 'wolf'

    input: string
    output: string
    """
    return ' '.join(wnl.lemmatize(word) for word in text.split(' '))

def remove_link(text):
    """Remove link tokens (words starting with 'htt') from a string.

    input: string
    output: string
    """
    pattern = re.compile(r'^htt', re.IGNORECASE)
    return ' '.join(word for word in text.split(' ') if pattern.search(word) is None)

def remove_link_lemma(text):
    """Remove links from a string, then lemmatize the remaining words.

    input: string
    output: string
    """
    return remove_link(lemmatizing(text))
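The link-removal step above can be sketched standalone without the nltk dependency. This is an illustrative example, not part of the module's API: `strip_links` and the sample sentence are hypothetical names chosen for the demo, and the regex mirrors the module's approach of dropping any whitespace-separated token that starts with "htt".

```python
import re

# Mirrors remove_link's approach: drop any whitespace-separated token
# that starts with "htt" (i.e. http/https URLs).
# strip_links is a hypothetical name used only for this sketch.
LINK_PATTERN = re.compile(r'^htt', re.IGNORECASE)

def strip_links(text):
    # Keep only tokens that do not look like links
    return ' '.join(w for w in text.split(' ') if LINK_PATTERN.search(w) is None)

print(strip_links('read more at https://example.com today'))  # → 'read more at today'
```

Note that splitting on single spaces (rather than `text.split()`) preserves the module's exact tokenization, so runs of multiple spaces collapse only where a link token was removed.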