[PyConUS 2024] Pandas + Dask DataFrame 2.0: A Leap Forward in Distributed Computing
At PyCon US 2024, Patrick Hoefler delivered an insightful presentation on the advancements in Dask DataFrame 2.0, particularly its enhanced integration with pandas and its performance compared to other big data tools like Spark, DuckDB, and Polars. As a maintainer of both pandas and Dask, Patrick, who works at Coiled, shared how recent improvements have transformed Dask into a robust and efficient solution for distributed computing, making it a compelling choice for handling large-scale datasets.
Enhanced String Handling with Arrow Integration
One of the most significant upgrades in Dask DataFrame 2.0 is its adoption of Apache Arrow for string handling, moving away from the less efficient NumPy object data type. Patrick highlighted that this shift has resulted in substantial performance gains. For instance, string operations are now two to three times faster in pandas, and in Dask, they can achieve up to tenfold improvements due to better multithreading capabilities. Additionally, memory usage has been drastically reduced—by approximately 60 to 70% in typical datasets—making Dask more suitable for memory-constrained environments. This enhancement ensures that users can process large datasets with string-heavy columns more efficiently, a critical factor in distributed workloads.
Revolutionary Shuffle Algorithm
Patrick emphasized the complete overhaul of Dask’s shuffle algorithm, which is pivotal for distributed systems where data must be communicated across multiple workers. The previous algorithm scaled poorly: its cost grew much faster than linearly as dataset sizes increased, which hindered performance on large workloads. The new peer-to-peer (P2P) shuffle algorithm, however, scales linearly, ensuring that doubling the dataset size only doubles the workload. This improvement not only boosts performance but also enhances reliability, allowing Dask to handle arbitrarily large datasets with constant memory usage by leveraging disk storage when necessary. Such advancements make Dask a more resilient choice for complex data processing tasks.
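Opting in to the P2P shuffle is a one-line configuration change; a sketch, assuming a recent Dask release with the `distributed` scheduler available:

```python
import dask

# Route all DataFrame shuffles (merges, sorts, groupby-apply) through the
# peer-to-peer implementation instead of the older task-based shuffle.
dask.config.set({"dataframe.shuffle.method": "p2p"})

assert dask.config.get("dataframe.shuffle.method") == "p2p"
```

Because P2P spills shuffle partitions to disk on each worker, memory usage stays roughly constant regardless of dataset size, which is what lets Dask shuffle datasets far larger than cluster memory.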
Query Planning: A Game-Changer
The introduction of a logical query planning layer marks a significant milestone for Dask. Historically, Dask executed operations as they were received, often leading to inefficient processing. The new query optimizer employs techniques like column projection and predicate pushdown, which significantly reduce unnecessary data reads and network transfers. For example, by identifying and prioritizing filters and projections early in the query process, Dask can minimize data movement, potentially leading to performance improvements of up to 1000x in certain scenarios. This optimization makes Dask more intuitive and efficient, aligning it closer to established systems like Spark.
Benchmarking Against the Giants
Patrick presented comprehensive benchmarks using the TPC-H dataset to compare Dask’s performance against Spark, DuckDB, and Polars. At a 100 GB scale, DuckDB often outperformed others due to its single-node optimization, but Dask held its own. At larger scales (1 TB and 10 TB), Dask’s distributed nature gave it an edge, particularly when DuckDB struggled with memory constraints on complex queries. Against Spark, Dask showed remarkable progress, outperforming it in most queries at the 1 TB scale and maintaining competitiveness at 10 TB, despite some overhead issues that Patrick noted are being addressed. These results underscore Dask’s growing capability to handle enterprise-level data processing tasks.
Hashtags: #Dask #Pandas #BigData #DistributedComputing #PyConUS2024 #PatrickHoefler #Coiled #Spark #DuckDB #Polars