- Table of Contents
- StarWord Demo Link
- About StarWord
StarWord is an Indonesian-language-based website with machine-learning powered website, made using TensorFlow and Flask.
StarWord helps Dicoding Indonesia by validating feedback from ILT (Instructor-Led Training). Each of the feedback fed into the system will be rendered valid or not based on 3 factors: its relevancy to ILT, its sentiment, and its quantitative rating.
This project is developed by 5 developers for the company-capstone project of Bangkit Academy 2022 by Google, GoTo, Traveloka of MBKM 2022 from Machine Learning (ML) and Cloud Computing (CC) specialization path with the team ID C22-FV01.
All modern user-based services, including Dicoding, use feedback from users to improve further as metrics. It is not rare that this approach has flaws, especially in the process of data acquisition. 15% of the users give mismatched quantitative (star-based) and qualitative (essay field) feedback, making it harder to get an overview of the potential improvements and new strategies for the company to improve as the assessment is clouded by the mismatched data.
Tackling this problem, an API is developed to help service providers to validate all of their user's feedback results, thus leading to the fully-accurate data-driven improvement. Specifically, this API will take an input of a JSON dictionary pair(s) of quantitative and qualitative feedback to be classified. This dictionary will be broken down into each pair of their respective ratings. If both the quantitative and the sentiment of the qualitative feedback have a “Positive” label, then this feedback will be considered valid and vice-versa. The output of this process will be a JSON dictionary for each of the input pairs, which can be consumed through an API link to be utilized for end-users through a website, or a mobile application.
- This website aims to help Dicoding to validate the feedback collected in ILT sessions. StarWord will help Dicoding by fetching the label of whether the feedback is relevant to ILT and predicting its sentiment. This result will be compared to the star given by the user.
- StarWord has a feature to help the admin validate the feedbacks per batch, and has 2 views, which are for admin to manipulate the ILT sessions and feedback, and for users to input the feedback for ILT sessions.
This section contains the steps to run StarWord Web on local device and environment, then how to deploy it to Google Cloud Platform or Heroku. Those three steps will be explained with several short codes below.
https://github.com/Yousei-kun/StarWord-NLP-FeedbackValidator.git
pip install virtualenv
python -m venv <name of environment>
<name of environment>\Scripts\activate
pip install -r requirements.txt
if __name__ == "__main__":
app.run(debug=True)
python run.py
Now, your own StarWord website is ready to serve!
2. Enable the Cloud Run Admin API. Open the GCP Console then on the left in the sidebar menu select > APIs &
https://github.com/Yousei-kun/StarWord-NLP-FeedbackValidator.git
# Use the official lightweight Python image.
# https://hub.docker.com/_/python
FROM python:3.8
# Allow statements and log messages to immediately appear in the Knative logs
ENV PYTHONUNBUFFERED True
# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
# Install production dependencies.
RUN pip install -r requirements.txt
# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
# Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling.
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 run:app
https://github.com/Yousei-kun/StarWord-NLP-FeedbackValidator.git
$ heroku login -i
$ heroku create <name of apps>
web: gunicorn run:app
$ git init
$ heroku git:remote -a <name of apps>
$ git add .
$ git commit -am "make it better"
$ git push heroku master
-
From the web page, grab user's input such as name, email, leaning path, session name, rating, and feedback. Before going to the next step, it is crucial to make sure that the session name input is a valid name. if session name is invalid, the relust will be
Nama sesi ILT tersebut tidak tersedia
. -
Check whether the feedback is relevant or not by calling
predict_relevant
function which contains acheck_positive_or_not
function. The mentioned function will return true if the feedback is in "positive_review_indicator" list, false otherwise. If it is true, return it as a positive feedback. Elseways, call theclean_data
function to remove unwanted characters after lowering it. -
Clean the stop words from the feedback string by using the
remove_stopwords
function. This function will load the "stopword_Bahasa.csv" file which contains stop words in Indonesian and delete every stop word in the feedback string respectively. Perform manual checking once more by utilizing thecheck_positive_or_not
function. -
Tokenize the feedback by using the
tokenize_relevant
function. The function will load the "tokenizer_relevant.pickle" which contains the tokenizer object and make the feedback string into sequences by using thetexts_to_sequences
method. Finally, call thepad_sequences
function from theTensorFlow
library to make the pads sequences of the feedback the same length, then return the padded sequence. -
Load the model from the 'model_relevant.h5' file which is in the folder "packages/model/model_sentiment/model_relevant.h5" using the
load_model
function withTensorFlow
library. -
Use the model loaded previously to predict the relevance of the feedback by using the
predict
method and set the padded sequence as parameters. The result is a float with the range 0 to 1. However, to predict sentiment later, the result of relevance prediction has to be either 0 (irrelevant) or 1 (relevant). Therefore, the returned result will be rounded to 0 or 1.
-
From the web page, grab user's input such as name, email, leaning path, session name, rating, and feedback. Before validation make sure the session name input is a valid name. if session name is invalid, the relust will be
Nama sesi ILT tersebut tidak tersedia
. -
Check the sentiment by using the
predict_sentiment
function, where the parameter "relevant" will be checked whether its value is 0 or not, and returned asNone
if it is. Otherwise, the feedback will be further checked if it is in "positive_review_indicator" list by using thecheck_positive_or_not
function. It will return 1 (which indicates the feedback is positive) if returned True. Elseways, call theclean_data
function to remove unwanted characters and punctuation after lowering it. -
Clean the stop words from the feedback string by using the
remove_stopwords
function. This function will load the "stopword_Bahasa.csv" file which contains stop words in Indonesian and delete every stop word in the feedback string respectively. Perform manual checking once more by utilizing thecheck_positive_or_not
function. -
Tokenize the feedback by using the
tokenize_relevant
function. The function will load the "tokenizer_relevant.pickle" which contains the tokenizer object and make the feedback string into sequences by using thetexts_to_sequences
method. Finally, call thepad_sequences
function from theTensorFlow
library to make the pads sequences of the feedback the same length, then return the padded sequence. -
Load the model from the 'model_relevant.h5' file which is in the folder "packages/model/model_sentiment/model_sentiment.h5" using the
load_model
function withTensorFlow
library. -
Use the model loaded previously to predict the relevance of the feedback by using the
predict
method and set the padded sequence as parameters. The result returned is an array of the model's confidence of each label (array of 3 values of classes available for positive, negative, and advice). From the array, the index of maximum value is fetched usingargmax
method fromNumPy
library.
The development plan of this project can be seen in the Gantt Chart in the picture provided or for more information can click here to view
As the development has finished, these are the points conducted from the plan & realization:
-
First week
This week is for the project planning, as well as for the additional datasets since the real given datasets are insufficient and include processes in dataset cleaning, view templating, dataset labeling, and finalization. Meanwhile, the front-end is developed is constructed parallelly.
-
Second week
This week is focusing on model development including the model tuning for relevant labels and continuing to sentiment labels. Therefore, the second week is spent on the model development and its completion to get the best fit model.
-
Third week
This week is for deployment including API integration, hosting process and pipeline production for StarWord to be able to be used for the users. The deployment used Flask as the main framework and the hosting process involved Heroku as the host.
-
Fourth week
This week is for project testing and finalization. There was a mentoring session conducted within this week which resulted in wide range of improvements, including front-end, additional dataset, and re-training to develop better model.
-
The rest of the weeks
These weeks are used to make the project brief, documentation. Moreover, preparation for the presentation of the project is produced within these periods.
The StarWord Repository is divided into 3 branches (including master). Below is the explanation:
-
API Development Branch (cc-development)
The cc-development branch is the branch for back-end (dedicated for ONLY API) development. It is written in Python by using the flask and flask restful framework.
See the full documentation and the routings configuration, as well as their input and output here.
-
Machine Learning Development Branch (ml-development)
The ML branch is the branch for machine learning development. It is written in Python, and it uses most basic libraries such as Pandas, NumPy, Sci-Kit Learn, and TensorFlow. Full documentation of the Machine Learning project with steps to make the model is clearly explained in the readme.md file of the Machine Learning development branch, along with the pipeline testing of each model.
See the full documentation and the model development process from scratch here.
-
Main Branch (main)
The Main branch is used as the integration branch of CC-development and ML-development. The plan is to use the Flask framework to build and integrate the website and machine learning model as a whole website. In the main branch lies the front-end website which only uses monolithic Flask to consume the API made from the cc-development branch.
Below are the tools used for deployment, and its detail of monthly pricing.
Name | Detail | Type | Monthly Pricing Est. |
---|---|---|---|
Google Cloud SQL | Database-related functions to create, read, update, delete data | db-standard-1. 10 GiB SSD storage | $42 |
Google App Engine | Hosting the website | F2 | $133 |
Google Cloud Run | Handle the request-related from users | 0.25 GiB Memory, per 10k requests | $5 |
Bangkit 2022 | Company-Capstone Product | C22-FV01 Team Members:
- (ML) M7227F2096 - Ivan Budianto
- (ML) M2150F1654 - Jestine Siewij
- (ML) M7312H2694 - Agus Wahyudi
- (ML) M2299F2566 - Khamidah Ahmad Syauqi
- (CC) C2193F1826 - Mochamad Nahrul Hayawan