About • Features • How it works • Tech Stack • License
The decision tree is one of the oldest and most widely used methods in machine learning. It classifies an element by analyzing the relationships between its variables: the algorithm progressively splits the data into smaller, more specific subsets according to their attribute values, until each subset is simple enough to be given a label. To do this, the model must first be trained on previously labeled data before it can be applied to new data.
The objective of this exercise is to use PySpark in the Google Colab environment to predict the species of an iris flower from its measurements. The dataset contains three iris species with 50 samples each; the species is the output variable, and the petal and sepal dimensions are the input variables. A single split such as petal length ≤ 2.45 cm, for instance, is already enough to separate Iris setosa from the other two species, which illustrates how a tree narrows the data down step by step.
- Using PySpark in Google Colab
- Decision Tree Application
- Random Forest Application
- Use of classifier algorithms
This project was developed in Google Colab and uses the classic Iris dataset described above (three species, 50 samples each).
To run Spark in Colab, we first need to install all the dependencies in the Colab environment: Apache Spark 2.4.4 with Hadoop 2.7, Java 8, and findspark, which locates the Spark installation on the system. Follow the steps below to install them:
!sudo apt update
!sudo apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
import findspark
findspark.init()  # no argument needed: findspark picks up the SPARK_HOME set above
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.appName("ArvoreDecisao") \
.getOrCreate()
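If the session starts correctly, it will report the installed Spark version; a quick sanity check:

```python
# Should print "2.4.4" if the setup above succeeded
print(spark.version)
```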
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql.types import *
from pyspark.sql.functions import *
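With the session running and the imports in place, the dataset can be read into a DataFrame. Below is a minimal loading sketch; the file name `iris.csv` and the column names are assumptions, so adjust them to match the actual file uploaded to the Colab session:

```python
# Explicit schema so the four measurements are parsed as doubles, not strings
# (column names here are illustrative assumptions)
iris_schema = StructType([
    StructField("sepal_length", DoubleType(), True),
    StructField("sepal_width", DoubleType(), True),
    StructField("petal_length", DoubleType(), True),
    StructField("petal_width", DoubleType(), True),
    StructField("species", StringType(), True),
])

iris_df = spark.read.csv("iris.csv", header=True, schema=iris_schema)
iris_df.show(5)
```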
The following tools were used in the construction of the project:
- Vectors
- VectorAssembler
- StringIndexer
- DecisionTreeClassifier
- MulticlassClassificationEvaluator
- RandomForestClassifier
- Spark SQL
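To show how these pieces fit together, here is a minimal end-to-end sketch that trains and compares the two classifiers. It assumes the `iris_df` DataFrame and column names from the loading sketch above; the 70/30 split and `numTrees=10` are likewise illustrative assumptions, not the project's exact settings:

```python
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Spark ML estimators expect a single vector-valued feature column...
assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="features")

# ...and a numeric label, so the species string is indexed to 0.0/1.0/2.0
indexer = StringIndexer(inputCol="species", outputCol="label")

data = assembler.transform(iris_df)
data = indexer.fit(data).transform(data)

# Hold out a test set to measure how well the trees generalize
train, test = data.randomSplit([0.7, 0.3], seed=42)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")

for name, clf in [
    ("Decision tree", DecisionTreeClassifier(labelCol="label", featuresCol="features")),
    ("Random forest", RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)),
]:
    model = clf.fit(train)
    print(name, "accuracy:", evaluator.evaluate(model.transform(test)))
```

The `VectorAssembler` and `StringIndexer` steps are there because Spark ML estimators only accept a single vector feature column and a numeric label column; the rest is a standard train/evaluate loop.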
This project is under the MIT license.
Made with love by Matheus Pereira 👋🏽 Get in Touch!