Posts Tagged ‘DataScience’

[KotlinConf2025] Charts, Code, and Sails: Winning a Regatta with Kotlin Notebook

In the high-stakes world of competitive sailing, where every decision can mean the difference between victory and defeat, an extraordinary tool has emerged: Kotlin Notebook. Roman Belov, a distinguished member of the JetBrains team, shared a captivating account of leveraging this innovative technology to triumph in a 24-hour regatta. The narrative transcends a simple code demonstration, illustrating how interactive programming becomes a critical asset in a dynamic, unpredictable environment like the open sea.

This journey highlights the power of Kotlin Notebook as more than just a development tool; it’s a platform for real-time problem-solving. Although Roman is a seasoned developer, his most cherished hat is that of a yachtsman, and he uses the notebook to translate complex nautical challenges into actionable, data-driven decisions. The essence of the task is to navigate a course, which is essentially a graph, with nodes representing locations and edges representing the paths between them. However, unlike a typical graph problem, the rules of sailing introduce complex variables: the boat cannot sail directly into the wind, and its speed depends heavily on the wind angle. The effective edge weights therefore change with every wind shift, rendering static route-planning algorithms useless.

The solution required a tool that could rapidly process data, visualize outcomes, and allow for on-the-fly adjustments. This is where Kotlin Notebook excelled, providing a live, interactive environment. Roman outlined how he could use the notebook to perform crucial tasks in the middle of the race: visualizing the race course on a map, calculating the fastest path based on current wind conditions, and dynamically adjusting the route as the wind shifted. This was achieved by creating a “sailable roads” model, which re-evaluates every edge of the course graph at regular intervals and discards those that are impossible given the current wind direction. For each remaining edge, the notebook computes the achievable boat speed and the time to complete that segment, effectively modeling the race in real time.
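The talk did not walk through this model line by line, but a minimal sketch of the idea might look like the following, where the data types, the polar-speed approximation, and all numeric thresholds are illustrative assumptions rather than Roman’s actual code:

```kotlin
import kotlin.math.abs

// Hypothetical model of the course graph; names and numbers are illustrative.
data class Node(val id: String, val lat: Double, val lon: Double)
data class Edge(val from: Node, val to: Node, val bearingDeg: Double, val distanceNm: Double)

// Relative wind angle for an edge, given the true wind direction.
fun relativeWindAngle(edge: Edge, windDirDeg: Double): Double {
    val diff = abs(edge.bearingDeg - windDirDeg) % 360.0
    return if (diff > 180.0) 360.0 - diff else diff
}

// A crude stand-in for a polar diagram: no speed inside the no-go zone
// (closer than ~40 degrees to the wind), best speed on a reach.
fun boatSpeedKnots(windAngleDeg: Double, windSpeedKnots: Double): Double =
    when {
        windAngleDeg < 40.0 -> 0.0                    // cannot sail into the wind
        windAngleDeg < 90.0 -> windSpeedKnots * 0.5
        windAngleDeg < 135.0 -> windSpeedKnots * 0.7  // beam/broad reach: fastest
        else -> windSpeedKnots * 0.55                 // running downwind
    }

// "Sailable roads": drop edges that are impossible for the current wind,
// and annotate the rest with the time (in hours) needed to cross them.
fun sailableRoads(edges: List<Edge>, windDirDeg: Double, windSpeedKnots: Double): Map<Edge, Double> =
    edges.mapNotNull { edge ->
        val speed = boatSpeedKnots(relativeWindAngle(edge, windDirDeg), windSpeedKnots)
        if (speed <= 0.0) null else edge to edge.distanceNm / speed
    }.toMap()
```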

Roman then showcased the brute-force search algorithm that was used to find the optimal path. The code, written in Kotlin, was surprisingly straightforward and demonstrated the language’s elegance and readability. The algorithm, running within the notebook, would constantly iterate through the potential paths, calculating the time to finish for each one and discarding any that were slower than the best time found so far. The visual output of the notebook, which could render the different routes directly on the map, was a game-changer. It transformed abstract data and calculations into a clear, visual representation that allowed the sailors to make quick, informed decisions.
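The exact code was not published with the talk, but a brute-force search with this pruning rule could be sketched as follows, reusing the hypothetical sailable-roads model above; the function and parameter names are assumptions:

```kotlin
// Depth-first brute force over the course graph, pruning any partial route
// already slower than the best complete route found so far.
fun fastestRoute(
    start: Node,
    finish: Node,
    roads: Map<Edge, Double>  // sailable edges -> hours to cross
): Pair<List<Node>, Double>? {
    val outgoing = roads.keys.groupBy { it.from }
    var best: Pair<List<Node>, Double>? = null

    fun explore(current: Node, path: List<Node>, elapsed: Double) {
        // Prune: this branch can no longer beat the best known time.
        if (best != null && elapsed >= best!!.second) return
        if (current == finish) {
            best = path to elapsed
            return
        }
        for (edge in outgoing[current].orEmpty()) {
            if (edge.to in path) continue  // avoid revisiting nodes
            explore(edge.to, path + edge.to, elapsed + roads.getValue(edge))
        }
    }

    explore(start, listOf(start), 0.0)
    return best
}
```

Pruning against the best time found so far keeps the brute force tractable on a small course graph, and rerunning the search whenever the wind model changes keeps the recommended route current.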

The application of Kotlin Notebook in this unconventional scenario proves its versatility beyond traditional data science or development tasks. It shows how a tool designed for rapid experimentation can be applied to complex, real-world problems. The interactive nature of the notebook allowed Roman to combine data analysis, algorithm execution, and visual feedback into a single, cohesive workflow, enabling him and his crew to stay ahead of the competition and ultimately win the race. The story is a testament to the power of a modern programming language and an adaptable toolchain, turning a challenging maritime endeavor into a display of computational prowess.

[DevoxxPL2022] Successful AI-NLP Project: What You Need to Know

At Devoxx Poland 2022, Robert Wcisło and Łukasz Matug, data scientists at UBS, shared insights on ensuring the success of AI and NLP projects, drawing from their experience implementing AI solutions in a large investment bank. Their presentation highlighted critical success factors for deploying machine learning (ML) models into production, addressing common pitfalls and offering practical guidance across the project lifecycle.

Understanding the Challenges

The speakers noted that enthusiasm for AI often outpaces practical outcomes, with 2018 data indicating only 10% of ML projects reached production. While this figure may have improved, many projects still fail due to misaligned expectations or inadequate preparation. To counter this, they outlined a simplified three-phase process—Prepare, Build, and Maintain—integrating Software Development Lifecycle (SDLC) and MLOps principles, with a focus on delivering business value and user experience.

Prepare Phase: Setting the Foundation

Łukasz emphasized the importance of the Prepare phase, where clarity on business needs is critical. Many stakeholders, inspired by AI hype, expect miraculous solutions without defining specific outcomes. Key considerations include:

  • Defining the Output: Understand the business problem and desired results, such as labeling outcomes (e.g., fraud detection). Reduce ambiguity by explicitly defining what the application should achieve.
  • Evaluating ML Necessity: ML excels in areas like recommendation systems, language understanding, anomaly detection, and personalization, but it’s not a universal solution. For one-off problems, simpler analytics may suffice.
  • Red Flags: ML models rarely achieve 100% accuracy, requiring more data and testing for higher precision, which increases costs. Highly regulated industries may demand transparency, posing challenges for complex models. Data availability is also critical—without sufficient data, ML is infeasible, though workarounds like transfer learning or purchasing data exist.
  • Universal Performance Metric: Establish a metric aligned with business goals (e.g., click-through rate, precision/recall) to measure success, unify stakeholder expectations, and guide development priorities for cost efficiency; a minimal sketch of such a metric follows this list.
  • Tooling and Infrastructure: Align software and data science teams with shared tools (e.g., Git, data access, experiment logs). Ensure compliance with data restrictions (e.g., GDPR, cross-border rules) and secure access to production-like data and infrastructure (e.g., GPUs).
  • Automation Levels: Decide the role of AI—ranging from no AI (human baseline) to full automation. Partial automation, where models handle clear cases and humans review uncertain ones, is often practical. Consider ethical principles like fairness, compliance, and no-harm to avoid bias or regulatory issues.
  • Model Utilization: Plan how the model will be served—binary distribution, API service, embedded application, or self-service platform. Each approach impacts user experience, scalability, and maintenance.
  • Scalability and Reuse: Design for scalability and consider reusing datasets or models to enhance future projects and reduce costs.
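To make the universal performance metric concrete, here is a minimal, self-contained sketch (in Kotlin, to match the rest of this blog) of the kind of shared metric the speakers describe, computing precision, recall, and F1 for a binary outcome such as fraud detection; the names and structure are illustrative, not taken from the talk:

```kotlin
// A shared, business-aligned metric for a binary classifier
// (e.g., fraud / not fraud). Toy structure, not from the talk.
data class MetricReport(val precision: Double, val recall: Double, val f1: Double)

fun evaluate(predicted: List<Boolean>, actual: List<Boolean>): MetricReport {
    require(predicted.size == actual.size) { "prediction/label counts must match" }
    var tp = 0; var fp = 0; var fn = 0
    for (i in predicted.indices) {
        when {
            predicted[i] && actual[i] -> tp++   // true positive
            predicted[i] && !actual[i] -> fp++  // false positive
            !predicted[i] && actual[i] -> fn++  // false negative
        }
    }
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble() / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble() / (tp + fn)
    val f1 = if (precision + recall == 0.0) 0.0
             else 2 * precision * recall / (precision + recall)
    return MetricReport(precision, recall, f1)
}
```

Agreeing on one such number before development starts gives business and engineering a common target and makes trade-offs, such as precision versus recall, explicit from day one.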

Build Phase: Crafting the Model

Robert focused on the Build phase, offering technical tips to streamline development:

  • Data Management: Data evolves, requiring retraining to address drift. For NLP projects, cover diverse document templates, including slang or errors. Track data provenance and lineage to monitor sources and transformations, ensuring pipeline stability.
  • Data Quality: Most ML projects involve smaller datasets (hundreds to thousands of points), where quality trumps quantity. Address imbalances by collaborating with clients for better data or using simpler models. Perform sanity checks to ensure representativeness, avoiding overly curated data that misaligns with production (e.g., professional photos vs. smartphone images).
  • Metadata and Tagging: Use tags (e.g., source, date, document type) to simplify debugging and maintenance. For instance, identifying underperforming data (e.g., low-quality German PDFs) becomes easier with metadata.
  • Labeling Strategy: Noisy or ambiguous labels (e.g., “bridges” tagged as the actor Jeff Bridges rather than the structures, or drawings of bicycles labeled as if they were physical bicycles) degrade model performance. Aim for human-level performance (HLP), measured either against ground truth (e.g., biopsy results) or against inter-human agreement. A consistent labeling strategy, documented with clear examples, reduces ambiguity and improves data quality. Tools like Amazon Mechanical Turk or in-house labeling platforms can streamline this process.
  • Training Tips: Use transfer learning to leverage pre-trained models, reducing data needs. Active learning prioritizes labeling hard examples, while pseudo-labeling uses existing models to pre-annotate data, saving time if the model is reliable. Ensure determinism by fixing seeds for reproducibility during debugging. Start with lightweight models (e.g., BERT Tiny) to establish baselines before scaling to complex models.
  • Baselines: Compare against prior models, heuristic-based systems, or simple proofs-of-concept to contextualize progress toward HLP. An accuracy of 85% may be sufficient if it aligns with HLP, whereas 60% after extensive effort signals deeper issues (see the sketch after this list).
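A minimal sketch of the determinism and baseline advice, with hypothetical numbers standing in for real experiment results:

```kotlin
import kotlin.random.Random

// Sketch: fix seeds for reproducibility, then contextualize a model's score
// against a trivial baseline and human-level performance (HLP).
const val SEED = 42L

// Accuracy of always predicting the majority class: the floor any model must beat.
fun majorityClassBaseline(labels: List<Boolean>): Double {
    val positives = labels.count { it }
    return maxOf(positives, labels.size - positives).toDouble() / labels.size
}

fun main() {
    val rng = Random(SEED)                              // deterministic runs while debugging
    val labels = List(1000) { rng.nextDouble() < 0.3 }  // toy 30/70 class split

    val baseline = majorityClassBaseline(labels)  // ~0.70 here
    val hlp = 0.92                                // assumed inter-annotator agreement
    val model = 0.85                              // assumed model accuracy

    println("baseline=$baseline, model=$model, HLP=$hlp")
    // 0.85 close to an HLP of 0.92 suggests diminishing returns from tuning;
    // 0.60 after heavy effort would instead signal data or labeling issues.
}
```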

Maintain Phase: Sustaining Performance

Maintenance is critical as ML models differ from traditional software due to data drift and evolving inputs. Strategies include:

  • Deployment Techniques: Use A/B testing to compare model versions, shadow mode to evaluate models in parallel with human processes, canary deployments to test on a small traffic subset, or blue-green deployments for seamless rollbacks.
  • Monitoring: Beyond system metrics, monitor inputs (e.g., image brightness, speech volume, input length) and outputs (e.g., exact predictions, or user behavior such as query frequency). Detect data or concept drift to keep the model relevant; a minimal drift-monitoring sketch follows this list.
  • Reuse: Reuse models, data, and experiences to reduce uncertainty, lower costs, and build organizational capabilities for future projects.
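As a concrete illustration of input monitoring, here is a minimal drift-detection sketch; the windowed z-score statistic and all thresholds are illustrative choices, not something prescribed in the talk:

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

// Simple input-drift monitoring: compare a live window of input statistics
// (e.g., document token counts) against the training distribution.
class DriftMonitor(
    private val trainMean: Double,
    private val trainStd: Double,
    private val windowSize: Int = 500,
    private val zThreshold: Double = 3.0
) {
    private val window = ArrayDeque<Double>()

    // Feed one observation; returns true when the window's mean
    // drifts beyond the threshold relative to the training data.
    fun observe(value: Double): Boolean {
        window.addLast(value)
        if (window.size > windowSize) window.removeFirst()
        if (window.size < windowSize) return false  // wait for a full window
        val mean = window.sum() / window.size
        val z = abs(mean - trainMean) / (trainStd / sqrt(window.size.toDouble()))
        return z > zThreshold
    }
}
```

When such a monitor fires, a team might raise an alert, sample recent inputs for relabeling, or schedule retraining, closing the loop the speakers describe.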

Key Takeaways

The speakers stressed reusing existing resources to demystify AI, reduce costs, and enhance efficiency. By addressing business needs, data quality, and operational challenges early, teams can increase the likelihood of delivering impactful AI-NLP solutions. They invited attendees to discuss further at the UBS stand, emphasizing practical application over theoretical magic.
