All code can be found here: https://github.com/lppier/fifa18_final16
I thought it would be interesting to build a model to predict the results of the clashes between the final 16 teams of the FIFA World Cup 2018.
Data obtained from here: https://www.kaggle.com/agostontorok/soccer-world-cup-2018-winner/data
Main Python notebook here: https://github.com/lppier/fifa18_final16/blob/master/fifa2018.ipynb
Two datasets were used:
Dataset 1: A FIFA World Ranking database of all the countries that play soccer competitively.
Dataset 2: Results of all international soccer matches since 1872.
- Because line-up changes affect the odds, I used only data from the last World Cup onwards.
- Only the 2018 rankings were used; earlier rankings were not considered.
- Only World Cup match data was used (qualifiers or otherwise); during model exploration, restricting the data to World Cup matches yielded better accuracy and overall metrics.
- The data was balanced to improve the accuracy of the model.
- The data was split into 80% training data and 20% test data. The test data was held out until the end, after 5-fold cross-validation on the training data alone had been used to validate model quality.
- I used Orange for a quick run of the scikit-learn classifier algorithms, of which Logistic Regression, Random Forest and Naive Bayes emerged as the best classifiers for this particular problem.
- A random forest model was eventually chosen, and its hyperparameters were tuned.
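The modelling steps above can be sketched roughly as follows. This is only a minimal illustration, not the code from the notebook: the data is synthetic, the `rank_diff` feature and `home_win` label are hypothetical names, the hyperparameter grid is arbitrary, and balancing by upsampling the minority class of the training portion only is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic stand-in for the match table (hypothetical columns):
# rank_diff = ranking difference between the two sides, home_win = 1 if the first side won.
n = 400
rank_diff = rng.integers(-50, 51, size=n)
home_win = (rank_diff + rng.normal(0, 25, size=n) < -10).astype(int)
matches = pd.DataFrame({"rank_diff": rank_diff, "home_win": home_win})

# 80/20 split; the test portion is set aside until the very end.
train, test = train_test_split(
    matches, test_size=0.2, stratify=matches["home_win"], random_state=0
)

# Balance the training portion by upsampling the minority class (assumed method).
counts = train["home_win"].value_counts()
minority = train[train["home_win"] == counts.idxmin()]
majority = train[train["home_win"] == counts.idxmax()]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])

X_train, y_train = balanced[["rank_diff"]], balanced["home_win"]
X_test, y_test = test[["rank_diff"]], test["home_win"]

# 5-fold cross-validation using the training data only.
cv_scores = cross_val_score(RandomForestClassifier(random_state=0), X_train, y_train, cv=5)
print("CV accuracy per fold:", cv_scores)

# Hyperparameter search over a small, arbitrary grid, again on training data only.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Only now is the held-out 20% used.
test_accuracy = search.score(X_test, y_test)
print("Held-out test accuracy:", test_accuracy)
```

Because upsampled duplicates can end up in different folds, the cross-validation scores on balanced data may look optimistic; the held-out test score is the more trustworthy number.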
Predicted outcome: Germany wins the Germany – South Korea match with 77% accuracy.
More to come soon after the final 16 are in…