If there's one thing similar about an interesting dataset and a good football's match, is that they're all keeping everyone's safe at home during this time of the pandemic. And in all honesty, I'm no data-scientist nor a dev guy. I just recently got myself exposed to a Machine Learning/Artificial Intelligent in general, while Dataiku in particular -- somewhere a little over then 3 months ago, but anyway here's my take to this conundrum's challenge.
On this repository, you may find my personal projects related to Machine Learning, EDA, Python Jupyter Notebook and couple of Visualization based on the Dataiku Platform exported standard files. Most of the datasets I've been working with, downloaded from Conundrum site. Installation pretty straight forward. Simply download the whole set as a single project and as a ZIP file, everything have been flattened out with plain text files, and no SQL dump was involved, so there wouldn't be any missing system dependencies issue. Simply imported the downloaded Zip file to your working project.
- Correlations analysis on Conundrum_13_Data_prepared (admin).ipynb
- Correlations analysis on Conundrum_13_Data_prepared_scored (admin).ipynb
- High dimensional data visualization (t-SNE) on Conundrum_13_Data_prepared_scored (admin).ipynb
- PCA on Conundrum_13_Data_prepared_scored (admin).ipynb
- Statistics and tests on a single population on Conundrum_13_Data_prepared_scored (admin).ipynb
- Statistics and tests on multiple populations on Conundrum_13_Data_prepared_scored (admin).ipynb
- Topic modeling on Conundrum_13_Data_prepared_scored (admin).ipynb
And since the challenge is not to 'predict' anything, rather to group/cluster the player's skillsets in reflect to their wages rate. Here's what my current flow would look like, and don't bother much on the 2 additional datasets, as they're merely exported from the existing model, so that I may explore them further.
And here's how I go about on the prepare recipes, nothing out of the ordinary. Just converting categorical to numerical values with one-hot encoding and filling up the 'NaN' with median values, while grouping them to have better clarity, if ever I need to go back and revise anything.
While on modeling/training steps, I choose the 'Interactive Clustering' which in return, delivered me a sufficient scoring value.
On to the clustering variables name, I simply identify them in the grading manner, starting from 'Grading A', as the most top-knot performer, all the way down to the least performing one marked with 'Grading E'.
And here's how my cluster plot would look like, obviously the better the grade, the least volume of players getting included in them.
And for sure those who sits at the 'Grading A' level would stand above the average threshold (though, that's not always the case with other variables).
And coming back again to the initial question, "creating a flow that outputs a value proposition in term of their wages". I think I didn't include the players name and their nationalities in my modeling for a couple of reasons. In my opinions, those two variables are just too subjective to get included. In a sense that you could be a top-knot player, regardless of what your 'Names' would sound like, and of course your 'Nationalities'.
So I've done the DSS flow diagram, while the followings are my list of 'value proposition' that contributed of being one 'Grading-A' player in the field.
Top 5 Values Proposition By Distribution.
Top 5 Values Proposition By Grade.
The very first correlation analysis consists of plotting the "Correlation matrix" for numerical variables. For each couple of numerical variables, this computes the "strength" of the correlation (called the Pearson coefficient):
- 1.0 means a perfect correlation
- 0.0 means no correlation
- -1.0 means a perfect "inverse" correlation
Since it does not really make sense to print this correlation plot for hundred of variables, we are restricting it to the first 50 numerical variables of the dataset.
Been enjoying exploring this dataset for sure, and certainly it was fun doing it, stays safe everyone! 😊
And please remember, as this is only a weekend pet project, which I'm doing them for my personal interest only.