
Within this repository, delve into a treasure trove of my personal projects spanning Machine Learning, exploratory data analysis (EDA), Python Jupyter Notebooks, and an assortment of visualizations crafted using Dataiku Platform's exported standard files.


leonism/dataiku-FIFA


FIFA Logo

Dataiku's Conundrum Challenge on FIFA Dataset.

Introduction

If there's one thing an interesting dataset and a good football match have in common, it's that both are keeping everyone safe at home during this pandemic. In all honesty, I'm neither a data scientist nor a developer. I was only recently exposed to Machine Learning and Artificial Intelligence in general, and to Dataiku in particular, a little over 3 months ago. Anyway, here's my take on this Conundrum challenge.

Installation

This repository holds my personal projects related to Machine Learning, EDA, Python Jupyter Notebooks and a couple of visualizations, all based on the Dataiku Platform's standard exported files. Most of the datasets I've been working with were downloaded from the Conundrum site. Installation is pretty straightforward: download the whole set as a single project ZIP file, then import that ZIP file into your working project. Everything has been flattened out into plain text files and no SQL dump is involved, so there shouldn't be any missing system dependency issues.

Jupyter Notebooks

Data Flow

The challenge is not to 'predict' anything, but rather to group/cluster the players' skillsets in relation to their wage rates. Here's what my current flow looks like. Don't worry about the 2 additional datasets; they're merely exported from the existing model so that I can explore them further. main-flow.png

Prepare Recipes

Here's how I went about the prepare recipes, nothing out of the ordinary: converting categorical values to numerical ones with one-hot encoding and filling the 'NaN' entries with median values, while grouping the steps for better clarity in case I ever need to go back and revise anything. recipes-prepared.jpg
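For readers who want to see the same two steps outside of Dataiku, here is a minimal plain-Python sketch of median filling followed by one-hot encoding. It is not the actual prepare recipe: the helper names (`fill_median`, `one_hot`), the sample rows and the column names are all assumptions made up for illustration.

```python
from statistics import median


def fill_median(rows, column):
    """Replace missing (None) values in `column` with the column median."""
    present = [r[column] for r in rows if r[column] is not None]
    mid = median(present)
    return [{**r, column: mid if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows; the column names are illustrative only.
players = [
    {"preferred_foot": "Left", "acceleration": 90},
    {"preferred_foot": "Right", "acceleration": None},
    {"preferred_foot": "Right", "acceleration": 70},
]

prepared = one_hot(fill_median(players, "acceleration"), "preferred_foot")
for row in prepared:
    print(row)
```

The missing value is filled before encoding so that the median is computed on the original numeric column only.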

Modeling & Training

For the modeling/training step, I chose 'Interactive Clustering', which in return delivered a sufficient score. model-score.png

Clustering Classification

As for the cluster names, I simply labeled them as grades, starting from 'Grading A' for the top-notch performers, all the way down to 'Grading E' for the least performing ones. clusttering.jpg
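The grade naming above can be sketched in a few lines of Python. This is only an illustration under one big assumption: that clusters are ranked by their mean wage. The README does not say how the ranking was actually done in Dataiku, and the cluster ids and wage figures below are made up.

```python
from statistics import mean


def grade_clusters(wages_by_cluster, labels="ABCDE"):
    """Map cluster ids to grade names, highest mean wage first.

    Assumption: ranking by mean wage; expects at most len(labels) clusters.
    """
    ranked = sorted(
        wages_by_cluster,
        key=lambda cluster: mean(wages_by_cluster[cluster]),
        reverse=True,
    )
    return {cluster: f"Grading {labels[i]}" for i, cluster in enumerate(ranked)}


# Hypothetical cluster ids and wages, for illustration only.
wages_by_cluster = {
    "cluster_0": [12, 15, 9],
    "cluster_1": [250, 310],
    "cluster_2": [40, 55, 61],
    "cluster_3": [3, 4, 2, 5],
    "cluster_4": [110, 95],
}

grades = grade_clusters(wages_by_cluster)
print(grades)
```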

Cluster Plot

And here's what my cluster plot looks like. As expected, the better the grade, the fewer players fall into it.

Acceleration x Wage scatter-plot-a.png

Sliding Tackle x Wage scateter-plot-b.png

Variables Significant Level

As expected, those who sit at the 'Grading A' level stand above the average threshold (though that's not always the case with other variables).

grading-a-variables.jpg
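The comparison against the average can be checked with a small sketch like the one below. It assumes each player row already carries a `grade` field from the clustering step; the function name, the field names and the numbers are hypothetical and only show the idea of comparing a grade's mean to the overall mean per variable.

```python
from statistics import mean


def above_average_variables(rows, grade, variables):
    """For each variable, report whether the grade's mean exceeds the overall mean."""
    in_grade = [r for r in rows if r["grade"] == grade]
    return {
        v: mean(r[v] for r in in_grade) > mean(r[v] for r in rows)
        for v in variables
    }


# Hypothetical rows; values are made up for illustration.
rows = [
    {"grade": "A", "acceleration": 92, "sliding_tackle": 30},
    {"grade": "A", "acceleration": 88, "sliding_tackle": 40},
    {"grade": "C", "acceleration": 70, "sliding_tackle": 75},
    {"grade": "E", "acceleration": 55, "sliding_tackle": 60},
]

result = above_average_variables(rows, "A", ["acceleration", "sliding_tackle"])
print(result)
```

In this made-up sample the top grade sits above average on one variable and below it on the other, which is the kind of exception the note in parentheses refers to.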

Value Proposition

Coming back to the initial question, "creating a flow that outputs a value proposition in terms of their wages": I didn't include the players' names and nationalities in my modeling for a couple of reasons. In my opinion, those two variables are just too subjective to include, in the sense that you could be a top-notch player regardless of what your name sounds like or what your nationality is.

With the DSS flow diagram done, the following is my list of 'value propositions' that contribute to being a 'Grading-A' player in the field.

Top 5 Value Propositions fig1.png

Top 5 Value Propositions By Distribution. fig3.png

Top 5 Value Propositions By Grade. fig3.png

Correlation Matrix

The very first correlation analysis consists of plotting the "Correlation matrix" for the numerical variables. For each pair of numerical variables, this computes the "strength" of the correlation (called the Pearson coefficient):

  • 1.0 means a perfect correlation
  • 0.0 means no correlation
  • -1.0 means a perfect "inverse" correlation

Since it does not really make sense to print this correlation plot for hundreds of variables, we restrict it to the first 50 numerical variables of the dataset. download-1.png download.png
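To make the Pearson coefficient and the 50-variable limit concrete, here is a small plain-Python sketch. It is not the notebook code behind the plots above; the `pearson` and `correlation_matrix` helpers and the sample columns are assumptions for illustration, and a real notebook would more likely rely on a dataframe library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns, limit=50):
    """Pairwise Pearson coefficients for the first `limit` columns."""
    names = list(columns)[:limit]
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numeric columns; values are made up for illustration.
columns = {
    "acceleration": [60, 70, 80, 90],
    "wage": [10, 20, 30, 40],
    "age": [34, 30, 26, 22],
}

matrix = correlation_matrix(columns)
print(matrix["acceleration"]["wage"])
print(matrix["acceleration"]["age"])
```

Note that a constant column has zero variance, so this sketch would divide by zero on it; such columns would need to be dropped first.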

I've certainly enjoyed exploring this dataset, and it was fun doing it. Stay safe, everyone! 😊

Disclaimer

Please remember that this is only a weekend pet project, which I'm doing purely out of personal interest.
