
A study about the use of a decision tree to predict the iris species of a flower

Status: Finished🚀

About | Features | How it works | Tech Stack | License

About

The decision tree is one of the oldest and most widely used methods in machine learning. It classifies an element by analyzing the relationships between its variables: the algorithm progressively subdivides the data into smaller, more specific subsets according to their attributes, until each subset is simple enough to be labeled. To do so, the model must first be trained on previously labeled data and can then be applied to new data.

The objective of this exercise is to use PySpark in the Google Colab environment to predict the iris species of a flower from its measured characteristics. The dataset covers 3 iris species with 50 samples collected for each species; the species is the output variable, and the petal and sepal dimensions are the input variables.


Features

  • Using PySpark in Google Colab
  • Decision Tree Application
  • Random Forest Application
  • Use of the Algorithm Classifier

How it works

This project was developed in Google Colab and uses the classic Iris dataset (three species, 50 samples each).

Pre-requisites

To run Spark in Colab, we first need to install all dependencies in the Colab environment: Apache Spark 2.4.4 with Hadoop 2.7, Java 8, and Findspark (which locates the Spark installation on the system). Follow the steps below to install them:

Install Dependencies in your Colab environment

```python
# Install Java 8, download Spark 2.4.4 with Hadoop 2.7, and install findspark
!sudo apt update
!sudo apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

# Point the environment at the freshly installed Java and Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# Make the Spark installation importable from Python
import findspark
findspark.init("/content/spark-2.4.4-bin-hadoop2.7")
```

Importing Spark libraries

```python
from pyspark.sql import SparkSession

# Create a local Spark session for the notebook
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ArvoreDecisao") \
    .getOrCreate()

# Aliased imports for SQL functions and types
import pyspark.sql.functions as f
import pyspark.sql.types as t
```

Tech Stack

The following tools were used in the construction of the project:

PySpark Libraries

  • Vectors
  • VectorAssembler
  • StringIndexer
  • DecisionTreeClassifier
  • MulticlassClassificationEvaluator
  • RandomForestClassifier
  • Spark SQL

License

This project is under the MIT license.

Made with love by Matheus Pereira 👋🏽 Get in Touch!
