
The extraction function is then applied across the full dataset:

```python
# Apply the feature extraction row by row, then append the new columns.
features = full_dataset.apply(advanced_feature_extraction, axis=1)
enhanced_dataset = pd.concat([full_dataset, features], axis=1)
```


To verify that the engineered features actually discriminate between human and random picks, correlation matrices and PCA projections are examined, confirming strong discriminatory power.

## Model Selection, Implementation, and Optimization

The binary classification problem—human versus random—lends itself to supervised learning algorithms. Christophe Bourguignat systematically evaluates candidates from linear models to ensembles.

Support Vector Machines provide a strong baseline due to their effectiveness in high-dimensional spaces:

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# RBF-kernel SVM; probability=True enables predict_proba for ROC analysis.
svm_model = SVC(kernel='rbf', C=10.0, gamma=0.1, probability=True, random_state=42)
cross_val_scores = cross_val_score(svm_model, X_train, y_train, cv=5, scoring='roc_auc')
print("SVM Cross-Validation AUC Mean:", cross_val_scores.mean())

svm_model.fit(X_train, y_train)
svm_preds = svm_model.predict(X_test)
print(classification_report(y_test, svm_preds))
```


Random Forests offer interpretability through feature importance:

```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=500, max_depth=15, random_state=42)
rf_model.fit(X_train, y_train)

# Rank the engineered features by their contribution to the forest's splits.
rf_importances = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("Top Features:\n", rf_importances.head(5))
```


Gradient boosting via XGBoost pushes performance further:

```python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=8, random_state=42)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)
print("XGBoost Accuracy:", (xgb_preds == y_test).mean())
```


Hyperparameter optimization relies on Bayesian search via scikit-optimize, which reaches good settings in far fewer evaluations than an exhaustive grid search.

## Evaluation and Interpretation

Comprehensive evaluation includes ROC curves, precision-recall plots, and calibration:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve

# Both curves use the random forest's positive-class probabilities.
probs = rf_model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr)
plt.title('ROC Curve')
plt.show()

precision, recall, _ = precision_recall_curve(y_test, probs)
plt.plot(recall, precision)
plt.title('Precision-Recall Curve')
plt.show()
```


SHAP values interpret predictions:

```python
import shap

# TreeExplainer handles tree ensembles such as the random forest directly.
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)  # one array per class for classifiers
shap.summary_plot(shap_values, X_test)
```

## Practical Deployment for Geek Use Cases

The model is deployed as a Flask API that serves verified random combinations: candidate grids are resampled until the classifier no longer flags them as human-like.

## Conclusion: Democratizing ML for Everyday Insights

This extended demonstration shows how Python and open data enable geeks to build meaningful ML applications, revealing human biases while providing practical tools.

Links: