- Start by loading the track metadata and the track metrics compiled by The Echo Nest using pandas. The two tables are merged on the track_id column, keeping genre_top as the target label.
- Explore pairwise correlations between the features and visualize the correlation matrix as a heatmap.
- Standardize the features using StandardScaler from scikit-learn (zero mean, unit variance) so that features on larger scales do not dominate PCA.
- Apply PCA to reduce dimensionality. Inspect a scree plot of the explained variance per component to judge how many components to keep.
- Plot cumulative explained variance to determine the number of components required to explain a chosen percentage of the variance. In this case, aim for about 85% explained variance.
- Utilize the lower-dimensional PCA projection to train a decision tree classifier. Split the data into training and testing sets and evaluate the model's performance.
- Implement logistic regression as an alternative classification algorithm. Compare the performance of the decision tree and logistic regression using classification reports.
- Apply k-fold cross-validation to get a more robust evaluation of model performance. Use both decision tree and logistic regression classifiers and examine the cross-validation scores.
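
The steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual code: the two tables are replaced with small synthetic stand-ins, and the feature names, genre labels, sample size, random seed, and fold count are all assumptions made for the example. Plotting (heatmap, scree plot) is omitted so the sketch runs without a display.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(10)

# Synthetic stand-ins for the two tables (hypothetical data, not the real dataset).
n_tracks = 400
feature_names = [f"feature_{i}" for i in range(8)]
tracks = pd.DataFrame({
    "track_id": np.arange(n_tracks),
    "genre_top": rng.choice(["Rock", "Hip-Hop"], size=n_tracks),
})
metrics = pd.DataFrame(rng.normal(size=(n_tracks, 8)), columns=feature_names)
metrics.insert(0, "track_id", np.arange(n_tracks))
# Shift a few features for one genre so the classes are separable.
is_rock = (tracks["genre_top"] == "Rock").to_numpy()
metrics.loc[is_rock, feature_names[:3]] += 1.5

# Merge on track_id, carrying genre_top along as the label.
echo_tracks = metrics.merge(tracks[["track_id", "genre_top"]], on="track_id")

# Pairwise feature correlations (this matrix is what a heatmap would display).
corr_metrics = echo_tracks[feature_names].corr()

# Standardize features to zero mean and unit variance.
features = echo_tracks[feature_names]
labels = echo_tracks["genre_top"]
scaled = StandardScaler().fit_transform(features)

# Fit PCA on all components and find the smallest count reaching 85% variance.
pca_full = PCA().fit(scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.85)) + 1

# Project onto the chosen number of components.
pca = PCA(n_components=n_components, random_state=10)
projection = pca.fit_transform(scaled)

# Single train/test split, then compare the two classifiers.
X_train, X_test, y_train, y_test = train_test_split(
    projection, labels, random_state=10
)
tree = DecisionTreeClassifier(random_state=10).fit(X_train, y_train)
logreg = LogisticRegression(random_state=10).fit(X_train, y_train)
print("Decision tree:\n", classification_report(y_test, tree.predict(X_test)))
print("Logistic regression:\n", classification_report(y_test, logreg.predict(X_test)))

# k-fold cross-validation for a more robust estimate.
kf = KFold(n_splits=10, shuffle=True, random_state=10)
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=10), projection, labels, cv=kf
)
logreg_scores = cross_val_score(
    LogisticRegression(random_state=10), projection, labels, cv=kf
)
print("Decision tree CV mean:", tree_scores.mean())
print("Logistic regression CV mean:", logreg_scores.mean())
```

Note that the sketch scales and projects the full dataset before splitting, which mirrors the order of the steps listed above. Wrapping the scaler, PCA, and classifier in a scikit-learn `Pipeline` and passing that to `cross_val_score` would instead refit the preprocessing inside each fold and avoid leaking test-fold statistics into training.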