[KotlinConf2024] DataFrame: Kotlin’s Dynamic Data Handling
At KotlinConf2024, Roman Belov, JetBrains’ Kotlin Moods group leader, showcased Kotlin DataFrame, a versatile library for managing flat and hierarchical data. Designed for general developers, not just data scientists, DataFrame handles CSV, JSON, and object subgraphs, enabling seamless data transformation and visualization. Roman demonstrated its integration with Kotlin Notebook for prototyping and a compiler plugin for dynamic type inference, using a KotlinConf app backend as an example. This talk highlighted how DataFrame empowers developers to build robust, interactive data pipelines.
DataFrame: A Versatile Data Structure
Kotlin DataFrame redefines data handling for Kotlin developers. Roman explained that, unlike traditional data classes, DataFrame supports dynamic column manipulation, akin to Excel tables. It can read, write, and transform data from formats like CSV or JSON, making it ideal for both analytics and general projects. For a KotlinConf app, DataFrame processed session data from a REST API, allowing developers to filter, sort, and pivot data effortlessly, providing a flexible alternative to rigid data class structures.
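As a rough illustration of the filter, sort, and add-column style described above, here is a minimal sketch using the library's string-based column access. It assumes the Kotlin DataFrame library is on the classpath (in Kotlin Notebook, `%use dataframe`). The column names and sample rows are made up for the example and are not taken from the talk.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample rows standing in for session data from a REST API.
fun sampleSessions(): DataFrame<*> = dataFrameOf("title", "room", "durationMinutes")(
    "Opening Keynote", "Hall A", 60,
    "Intro to DataFrame", "Hall B", 45,
    "Lightning Talks", "Hall A", 15,
)

// Keep talks of at least 30 minutes, longest first, and derive a new column.
fun longTalks(sessions: DataFrame<*>): DataFrame<*> = sessions
    .filter { "durationMinutes"<Int>() >= 30 }
    .sortByDesc("durationMinutes")
    .add("durationHours") { "durationMinutes"<Int>() / 60.0 }

fun main() {
    println(longTalks(sampleSessions()))
}
```

Columns are addressed by name and new ones can be added at any step, which is the contrast with a fixed data class that the talk draws.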
Prototyping with Kotlin Notebook
Kotlin Notebook, a plugin for IntelliJ IDEA Ultimate, enhances DataFrame’s prototyping capabilities. Roman demonstrated creating a scratch file to fetch session data via Ktor Client. The notebook’s auto-completion for dependencies, like Ktor or DataFrame, simplifies setup, downloading the latest versions from Maven Central. Interactive tables display hierarchical data, and each code fragment updates variable types, enabling rapid experimentation. This environment suits developers iterating on ideas, offering a low-friction way to test data transformations before production.
Dynamic Type Inference in Action
DataFrame’s compiler plugin, built for the K2 compiler, introduces on-the-fly type inference. Roman showed how it analyzes a DataFrame’s schema during execution, generating extension properties for columns. For example, accessing a title column in a sessions DataFrame feels like using a property, with auto-completion for column names and types. This eliminates manual schema definitions, streamlining data wrangling. Though experimental, the plugin cached schemas efficiently, ensuring performance, as seen when filtering multiplatform talk descriptions.
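To make the idea of generated extension properties more concrete, the sketch below writes two such properties by hand. This is only an approximation of what the tooling produces, not the plugin's actual output. The Session marker interface, the sample rows, and the choice to filter on titles rather than descriptions are assumptions for illustration.

```kotlin
import org.jetbrains.kotlinx.dataframe.ColumnsContainer
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical marker type for the schema; the plugin would derive it from the data.
interface Session

// Hand-written stand-ins for the generated properties: one for rows, one for the frame.
val DataRow<Session>.title: String
    get() = this["title"] as String

@Suppress("UNCHECKED_CAST")
val ColumnsContainer<Session>.title: DataColumn<String>
    get() = this["title"] as DataColumn<String>

// Made-up sample data.
fun sampleSessions(): DataFrame<*> = dataFrameOf("title", "room")(
    "Kotlin Multiplatform in Practice", "Hall A",
    "Coroutines Deep Dive", "Hall B",
)

// Inside filter the receiver is a DataRow<Session>, so `title` reads like a property.
fun multiplatformTalks(sessions: DataFrame<*>): DataFrame<Session> =
    sessions.cast<Session>().filter { title.contains("Multiplatform") }

fun main() {
    println(multiplatformTalks(sampleSessions()).title.toList())
}
```

With the compiler plugin, neither the marker type nor the properties would be written manually; they would follow the DataFrame's schema as it changes from one operation to the next.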
Handling Hierarchical Data
DataFrame excels with hierarchical structures, unlike flat data classes. Roman illustrated this with nested JSON from the KotlinConf API, converting categories into a DataFrame with grouped columns. Developers can navigate sub-DataFrames within cells, mirroring data class nesting. For instance, a category’s items array became a sub-DataFrame, accessible via intuitive APIs. This capability supports complex data like object subgraphs, enabling developers to transform and analyze nested structures without cumbersome manual mappings.
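The nesting described above can be sketched with a small JSON document. The JSON shape below is a simplified guess at a categories payload, not the real KotlinConf API response, and all values are invented. The sketch assumes the DataFrame JSON reader is available and relies on its default behavior of turning an array of objects into a nested DataFrame per cell.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.readJsonStr

// Hypothetical payload: each category carries an array of items.
val categoriesJson = """
[
  {"id": 1, "title": "Level",
   "items": [{"id": 10, "name": "Beginner"}, {"id": 11, "name": "Advanced"}]},
  {"id": 2, "title": "Format",
   "items": [{"id": 20, "name": "Talk"}]}
]
"""

fun loadCategories(): DataFrame<*> = DataFrame.readJsonStr(categoriesJson)

// Each cell of the "items" column holds its own DataFrame.
fun itemsOfFirstCategory(categories: DataFrame<*>): DataFrame<*> =
    categories["items"][0] as DataFrame<*>

// explode produces one row per (category, item) pair.
fun flatItems(categories: DataFrame<*>): DataFrame<*> = categories.explode("items")

fun main() {
    val categories = loadCategories()
    println(itemsOfFirstCategory(categories))
    println(flatItems(categories))
}
```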
Building a KotlinConf Schedule
Roman walked through a practical example: creating a daily schedule for KotlinConf. Starting with session data, he converted startsAt strings to LocalDateTime, filtered out service sessions, and joined room IDs with room names from another DataFrame. Sorting by start time and pivoting by room produced a clean schedule, with nulls replaced by empty strings. The resulting HTML table, generated directly in the notebook, showcased DataFrame’s ability to transform REST API data into user-friendly outputs, all with concise, readable code.
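The pipeline above might look roughly like the following sketch. It is a reconstruction under assumptions, not the code shown in the talk: the column names roomId, isServiceSession, id, and name, as well as all sample rows, are hypothetical. The null-replacement and HTML rendering steps are left out to keep the example short.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import java.time.LocalDateTime

// Made-up sessions; startsAt arrives as an ISO-8601 string.
fun sampleSessions(): DataFrame<*> =
    dataFrameOf("title", "startsAt", "roomId", "isServiceSession")(
        "Opening Keynote", "2024-05-23T09:00:00", 1, false,
        "Lunch", "2024-05-23T12:00:00", 1, true,
        "DataFrame Talk", "2024-05-23T10:15:00", 2, false,
        "Compose Talk", "2024-05-23T10:15:00", 1, false,
    )

// Made-up room lookup table.
fun sampleRooms(): DataFrame<*> = dataFrameOf("id", "name")(
    1, "Hall A",
    2, "Hall B",
)

fun buildSchedule(sessions: DataFrame<*>, rooms: DataFrame<*>): DataFrame<*> = sessions
    // Parse the start-time strings into LocalDateTime values.
    .convert("startsAt").with { LocalDateTime.parse(it.toString()) }
    // Drop service sessions such as lunch breaks.
    .filter { !"isServiceSession"<Boolean>() }
    // Attach the room name by matching roomId against the rooms table.
    .join(rooms) { "roomId" match "id" }
    .sortBy("startsAt")
    // One row per start time, one column per room, cells holding talk titles.
    .pivot("name").groupBy("startsAt").values("title")

fun main() {
    println(buildSchedule(sampleSessions(), sampleRooms()))
}
```

Time slots in which a room has no talk come out as nulls in the pivoted table, which is why the original example replaced them with empty strings before rendering.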
Visualizing Data with Kandy
DataFrame integrates with Kandy, JetBrains’ visualization library, to create charts. Roman demonstrated analyzing GitHub commits from the Kotlin repository, grouping them by week to plot commit counts and average message lengths. The resulting chart revealed trends, like steady growth potentially tied to CI improvements. Kandy’s simple API, paired with DataFrame’s data manipulation, makes visualization accessible. Roman encouraged exploring Kandy’s website for examples, highlighting its role in turning raw data into actionable insights.
DataFrame in Production
Moving DataFrame to production is straightforward. Roman showed copying notebook code into IntelliJ’s EAP version, importing the generated schema to access columns as properties. The compiler plugin evolves schemas dynamically, supporting operations like adding a room column and using it immediately. This approach minimizes boilerplate, as seen when serializing a schedule to JSON. The plugin is still experimental, but because it builds on the K2 compiler, column names and types are checked at compile time, making DataFrame a practical choice for backend work, from APIs to data pipelines.