Posts Tagged ‘PyConUS’
[PyConUS 2024] Pandas + Dask DataFrame 2.0: A Leap Forward in Distributed Computing
At PyCon US 2024, Patrick Hoefler delivered an insightful presentation on the advancements in Dask DataFrame 2.0, particularly its enhanced integration with pandas and its performance compared to other big data tools like Spark, DuckDB, and Polars. As a maintainer of both pandas and Dask, Patrick, who works at Coiled, shared how recent improvements have transformed Dask into a robust and efficient solution for distributed computing, making it a compelling choice for handling large-scale datasets.
Enhanced String Handling with Arrow Integration
One of the most significant upgrades in Dask DataFrame 2.0 is its adoption of Apache Arrow for string handling, moving away from the less efficient NumPy object data type. Patrick highlighted that this shift has resulted in substantial performance gains. For instance, string operations are now two to three times faster in pandas, and in Dask, they can achieve up to tenfold improvements due to better multithreading capabilities. Additionally, memory usage has been drastically reduced—by approximately 60 to 70% in typical datasets—making Dask more suitable for memory-constrained environments. This enhancement ensures that users can process large datasets with string-heavy columns more efficiently, a critical factor in distributed workloads.
Revolutionary Shuffle Algorithm
Patrick emphasized the complete overhaul of Dask's shuffle algorithm, which is pivotal for distributed systems where data must be communicated across multiple workers. The previous task-based algorithm scaled poorly: the number of tasks it generated grew super-linearly with the number of partitions, which hindered performance as dataset sizes grew. The new peer-to-peer (P2P) shuffle algorithm, however, scales linearly, ensuring that doubling the dataset size only doubles the workload. This improvement not only boosts performance but also enhances reliability, allowing Dask to handle arbitrarily large datasets with constant memory usage by leveraging disk storage when necessary. Such advancements make Dask a more resilient choice for complex data processing tasks.
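The core idea behind any shuffle is hash partitioning: each row is routed to the worker that owns its key. The toy single-process sketch below is not Dask's actual P2P implementation, but it shows why the work scales linearly — every row is touched exactly once:

```python
from collections import defaultdict

def toy_shuffle(rows, key, n_workers):
    """Route each row to a bucket based on the hash of its key.

    Rows sharing a key always land in the same bucket, which is the
    property a groupby/join needs. Total work is O(len(rows)).
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % n_workers].append(row)
    return dict(buckets)

rows = [{"id": i, "group": i % 5} for i in range(10)]
partitions = toy_shuffle(rows, "group", n_workers=3)

# every row lands in exactly one partition
assert sum(len(p) for p in partitions.values()) == len(rows)
```

In the real P2P shuffle, each bucket lives on a different worker and overflows to disk when memory runs short, which is what keeps memory usage constant regardless of dataset size.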
Query Planning: A Game-Changer
The introduction of a logical query planning layer marks a significant milestone for Dask. Historically, Dask executed operations as they were received, often leading to inefficient processing. The new query optimizer employs techniques like column projection and predicate pushdown, which significantly reduce unnecessary data reads and network transfers. For example, by identifying and prioritizing filters and projections early in the query process, Dask can minimize data movement, potentially leading to performance improvements of up to 1000x in certain scenarios. This optimization makes Dask more intuitive and efficient, aligning it closer to established systems like Spark.
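Column projection and predicate pushdown are easiest to see in miniature. The sketch below (a toy illustration, not Dask's optimizer) contrasts a naive plan that materializes everything with an optimized one that reads only the needed columns and filters while scanning:

```python
# A tiny columnar "table": dict of column name -> values
data = {
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
    "c": ["w", "x", "y", "z"],
}

def naive(table):
    # Materialize every column into rows, then filter, then project
    rows = [dict(zip(table, vals)) for vals in zip(*table.values())]
    kept = [r for r in rows if r["a"] > 2]
    return [r["b"] for r in kept]

def optimized(table):
    # Column projection: only columns "a" and "b" are ever touched.
    # Predicate pushdown: the filter runs during the scan itself,
    # so column "c" is never read and no intermediate rows exist.
    return [b for a, b in zip(table["a"], table["b"]) if a > 2]

assert naive(data) == optimized(data) == [30, 40]
```

On a three-column toy table the saving is trivial; on a wide Parquet dataset read over the network, skipping unused columns and filtering at the source is where the dramatic speedups Patrick cited come from.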
Benchmarking Against the Giants
Patrick presented comprehensive benchmarks using the TPC-H dataset to compare Dask’s performance against Spark, DuckDB, and Polars. At a 100 GB scale, DuckDB often outperformed others due to its single-node optimization, but Dask held its own. At larger scales (1 TB and 10 TB), Dask’s distributed nature gave it an edge, particularly when DuckDB struggled with memory constraints on complex queries. Against Spark, Dask showed remarkable progress, outperforming it in most queries at the 1 TB scale and maintaining competitiveness at 10 TB, despite some overhead issues that Patrick noted are being addressed. These results underscore Dask’s growing capability to handle enterprise-level data processing tasks.
Hashtags: #Dask #Pandas #BigData #DistributedComputing #PyConUS2024 #PatrickHoefler #Coiled #Spark #DuckDB #Polars
[PyConUS 2024] How Python Harnesses Rust through PyO3
David Hewitt, a key contributor to the PyO3 library, delivered a comprehensive session at PyConUS 2024, unraveling the mechanics of integrating Rust with Python. As a Python developer for over a decade and a lead maintainer of PyO3, David provided a detailed exploration of how Rust’s power enhances Python’s ecosystem, focusing on PyO3’s role in bridging the two languages. His talk traced the journey of a Python function call to Rust code, offering insights into performance, security, and concurrency, while remaining accessible to those unfamiliar with Rust.
Why Rust in Python?
David began by outlining the motivations for combining Rust with Python, emphasizing Rust’s reliability, performance, and security. Unlike Python, where exceptions can arise unexpectedly, Rust’s structured error handling via pattern matching ensures predictable behavior, reducing debugging challenges. Performance-wise, Rust’s compiled nature offers significant speedups, as seen in libraries like Pydantic, Polars, and Ruff. David highlighted Rust’s security advantages, noting its memory safety features prevent common vulnerabilities found in C or C++, making it a preferred choice for companies like Microsoft and Google. Additionally, Rust’s concurrency model avoids data races, aligning well with Python’s evolving threading capabilities, such as sub-interpreters and free-threading in Python 3.13.
PyO3: Bridging Python and Rust
Central to David’s talk was PyO3, a Rust library that facilitates seamless integration with Python. PyO3 allows developers to write Rust code that runs within a Python program or vice versa, using procedural macros to generate Python-compatible modules. David explained how tools like Maturin and setuptools-rust simplify project setup, enabling developers to compile Rust code into native libraries that Python imports like standard modules. He emphasized PyO3’s goal of maintaining a low barrier to entry, with comprehensive documentation and a developer guide to assist Python programmers venturing into Rust, ensuring a smooth transition across languages.
Tracing a Function Call
David took the audience on a technical journey, tracing a Python function call through PyO3 to Rust code. Using a simple word-counting function as an example, he showed how a Rust implementation, marked with PyO3’s `#[pyfunction]` attribute, mirrors Python’s structure while offering performance gains of 2–4x. He dissected the Python interpreter’s bytecode, revealing how the `CALL` instruction invokes `PyObject_Vectorcall`, which resolves to a Rust function pointer via PyO3’s generated code. This “trampoline” handles critical safety measures, such as preventing Rust panics from crashing the Python interpreter and managing the Global Interpreter Lock (GIL) for safe concurrency. David’s step-by-step breakdown clarified how arguments are passed and converted, ensuring seamless execution.
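For reference, the pure-Python shape of such a word-counting example (the exact function from the talk is not reproduced here, so this is an assumed minimal form) is only a few lines — the Rust version mirrors this structure while running 2–4x faster:

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words; the PyO3/Rust version
    exposes an equivalent signature to Python callers."""
    return len(text.split())

print(word_count("to be or not to be"))
```

From the caller's perspective the compiled Rust function is indistinguishable from this one: it is imported and called like any Python function, with PyO3's trampoline handling argument conversion behind the scenes.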
Future of Rust in Python’s Ecosystem
Concluding, David reflected on Rust’s growing adoption in Python, citing over 350 projects monthly uploading Rust code to PyPI, with downloads exceeding 3 billion annually. He predicted that Rust could rival C/C++ in the Python ecosystem within 2–4 years, driven by its reliability and performance. Addressing concurrency, David discussed how PyO3 could adapt to Python’s sub-interpreters and free-threading, potentially enforcing immutability to simplify multithreaded interactions. His vision for PyO3 is to enhance Python’s strengths without replacing it, fostering a symbiotic relationship that empowers developers to leverage Rust’s precision where needed.
Hashtags: #Rust #PyO3 #Python #Performance #Security #PyConUS2024 #DavidHewitt #Pydantic #Polars #Ruff
[PyConUS 2023] Fixing Legacy Code, One Pull Request at a Time
At PyCon US 2023, Guillaume Dequenne from Sonar presented a compelling workshop on modernizing legacy codebases through incremental improvements. Sponsored by Sonar, this session focused on integrating code quality tools into development workflows to enhance maintainability and sustainability, using a Flask application as a practical example. Guillaume’s approach, dubbed “Clean as You Code,” offers a scalable strategy for tackling technical debt without overwhelming developers.
The Legacy Code Conundrum
Legacy codebases often pose significant challenges, accumulating technical debt that hinders development efficiency and developer morale. Guillaume illustrated this with a vivid metaphor: analyzing a legacy project for the first time can feel like drowning in a sea of issues. Traditional approaches to fixing all issues at once are unscalable, risking functional regressions and requiring substantial resources. Instead, Sonar advocates for a pragmatic methodology that focuses on ensuring new code adheres to high-quality standards, gradually reducing technical debt over time.
Clean as You Code Methodology
The “Clean as You Code” approach hinges on two principles: ownership of new code and incremental improvement. Guillaume explained that developers naturally understand and take responsibility for code they write today, making it easier to enforce quality standards. By ensuring that each pull request introduces clean code, teams can progressively refurbish their codebase. Over time, as new code replaces outdated sections, the overall quality improves without requiring a massive upfront investment. This method aligns with continuous integration and delivery (CI/CD) practices, allowing teams to maintain high standards while delivering features systematically.
Leveraging SonarCloud for Quality Assurance
Guillaume demonstrated the practical application of this methodology using SonarCloud, a cloud-based static analysis tool. By integrating SonarCloud into a Flask application’s CI/CD pipeline, developers can automatically analyze pull requests for issues like bugs, security vulnerabilities, and code smells. He showcased how SonarCloud’s quality gates enforce standards on new code, ensuring that only clean contributions are merged. For instance, Guillaume highlighted a detected SQL injection vulnerability due to unsanitized user input, emphasizing the tool’s ability to provide contextual data flow analysis to pinpoint and resolve issues efficiently.
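The class of vulnerability Guillaume highlighted is easy to reproduce. The sketch below (a minimal illustration using the standard-library `sqlite3` module, not the code from the workshop) shows why splicing user input into SQL text is dangerous, and the parameterized form a static analyzer would steer you toward:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"

# Vulnerable: user input is spliced into the SQL text, so the
# attacker-supplied OR clause becomes part of the query
vulnerable = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'"
).fetchall()

# Safe: a parameterized query treats the input as data, not SQL
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(vulnerable)  # the injected clause matches every row
print(safe)        # no user is literally named that, so: no rows
```

Data-flow analysis of the kind SonarCloud performs traces `user_input` from its source into the f-string, which is how the tool can pinpoint the exact tainted path rather than just flagging the query line.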
Enhancing Developer Workflow with SonarLint
To catch issues early, Guillaume introduced SonarLint, an IDE extension for PyCharm and VSCode that performs real-time static analysis. This tool allows developers to address issues before committing code, streamlining the review process. He demonstrated how SonarLint highlights issues like unraised exceptions and offers quick fixes, enhancing productivity. Additionally, the connected mode between SonarLint and SonarCloud synchronizes issue statuses, ensuring consistency across development and review stages. This integration empowers developers to maintain high-quality code from the outset, reducing the burden of post-commit fixes.
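The "unraised exception" issue mentioned above is a subtle bug class worth seeing concretely. In this small illustration (an assumed example, not one from the talk), the exception object is constructed but never raised, so invalid input passes silently:

```python
def validate(age: int) -> None:
    if age < 0:
        # Bug: the exception is created but never raised,
        # so this branch does nothing at all
        ValueError("age must be non-negative")

def validate_fixed(age: int) -> None:
    if age < 0:
        raise ValueError("age must be non-negative")

validate(-5)  # silently succeeds despite invalid input
try:
    validate_fixed(-5)
except ValueError as exc:
    print(exc)
```

The missing `raise` is exactly the kind of mistake that is invisible at runtime until something downstream breaks, but trivial for in-editor static analysis to flag and quick-fix.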
Sustaining Codebase Health
The workshop underscored the long-term benefits of the “Clean as You Code” approach, illustrated by a real-world project where issue counts decreased over time as new rules were introduced. By focusing on new code and leveraging tools like SonarCloud and SonarLint, teams can achieve sustainable codebases that are maintainable, reliable, and secure. Guillaume’s presentation offered a roadmap for developers to modernize legacy systems incrementally, fostering a culture of continuous improvement.
Hashtags: #LegacyCode #CleanCode #StaticAnalysis #SonarCloud #SonarLint #Python #Flask #GuillaumeDequenne #PyConUS2023