Jonathan Lalou's Blog

Posts Tagged ‘Python’

Demystifying Parquet: The Power of Efficient Data Storage in the Cloud

Unlocking the Power of Apache Parquet: A Modern Standard for Data Efficiency

In today’s digital ecosystem, where data volume, velocity, and variety continue to rise, the choice of file format can dramatically impact performance, scalability, and cost. Whether you are an architect designing a cloud-native data platform or a developer managing analytics pipelines, Apache Parquet stands out as a foundational technology you should understand — and probably already rely on.

This article explores what Parquet is, why it matters, and how to work with it in practice — including real examples in Python, Java, Node.js, and Bash for converting and uploading files to Amazon S3.

What Is Apache Parquet?

Apache Parquet is a high-performance, open-source file format designed for efficient columnar data storage. Originally developed by Twitter and Cloudera and now an Apache Software Foundation project, Parquet is purpose-built for use with distributed data processing frameworks like Apache Spark, Hive, Impala, and Drill.

Unlike row-based formats such as CSV or JSON, Parquet organizes data by columns rather than rows. This enables powerful compression, faster retrieval of selected fields, and dramatic performance improvements for analytical queries.

Why Choose Parquet?

✅ Columnar Format = Faster Queries

Because Parquet stores values from the same column together, analytical engines can skip irrelevant data and process only what’s required — reducing I/O and boosting speed.

Compression and Storage Efficiency

Parquet achieves better compression ratios than row-based formats, thanks to the similarity of values in each column. This translates directly into reduced cloud storage costs.
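To make the compression argument concrete, here is a small sketch that uses only Python's standard library. It is not Parquet itself: it serializes the same synthetic table (invented for illustration) once row by row and once column by column, then compresses both with zlib. The function names are hypothetical.

```python
import zlib

# Synthetic table, invented for illustration: 10,000 rows of (id, country, status).
rows = [
    (str(i), ["FR", "US", "DE"][i // 3334], "active" if i % 2 == 0 else "inactive")
    for i in range(10_000)
]


def serialize_row_major(table):
    """Write values row by row (a CSV-like layout)."""
    return "".join(f"{value}\n" for row in table for value in row).encode()


def serialize_column_major(table):
    """Write all values of the first column, then the second, and so on."""
    columns = zip(*table)
    return "".join(f"{value}\n" for column in columns for value in column).encode()


row_bytes = serialize_row_major(rows)
col_bytes = serialize_column_major(rows)

# Both layouts hold exactly the same values, so their raw sizes are identical.
print("raw size (both layouts):", len(row_bytes))
print("row-major compressed:   ", len(zlib.compress(row_bytes)))
print("column-major compressed:", len(zlib.compress(col_bytes)))
```

On data like this, where each column holds similar values, the column-major output is typically the smaller of the two. Real Parquet goes further by applying column-specific encodings such as dictionary and run-length encoding before general-purpose compression.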

Schema Evolution

Parquet supports schema evolution, enabling your datasets to grow gracefully. New fields can be added over time without breaking existing consumers.

Interoperability

The format is compatible across multiple ecosystems and languages, including Python (Pandas, PyArrow), Java (Spark, Hadoop), and even browser-based analytics tools.

☁️ Using Parquet with Amazon S3

One of the most common modern use cases for Parquet is in conjunction with Amazon S3, where it powers data lakes, ETL pipelines, and serverless analytics via services like Amazon Athena and Redshift Spectrum.

Here’s how you can write Parquet files and upload them to S3 in different environments:

From CSV to Parquet in Practice

Python Example

import pandas as pd

# Load CSV data
df = pd.read_csv("input.csv")

# Save as Parquet
df.to_parquet("output.parquet", engine="pyarrow")

To upload to S3:

import boto3

s3 = boto3.client("s3")
s3.upload_file("output.parquet", "your-bucket", "data/output.parquet")

Node.js Example

Install the required libraries:

npm install aws-sdk

Upload file to S3:

const AWS = require('aws-sdk');
const fs = require('fs');

const s3 = new AWS.S3();
const fileContent = fs.readFileSync('output.parquet');

const params = {
    Bucket: 'your-bucket',
    Key: 'data/output.parquet',
    Body: fileContent
};

s3.upload(params, (err, data) => {
    if (err) throw err;
    console.log(`File uploaded successfully at ${data.Location}`);
});

☕ Java with Apache Spark and AWS SDK

In your pom.xml, include:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.12.2</version>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
    <version>1.12.470</version>
</dependency>

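Spark conversion (this assumes an existing SparkSession named spark and the spark-sql dependency on the classpath; note that Spark writes output.parquet as a directory of part files rather than a single file):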

Dataset<Row> df = spark.read().option("header", "true").csv("input.csv");
df.write().parquet("output.parquet");

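Upload to S3 (the access keys below are placeholders; in real code, prefer the SDK's default credentials provider chain over hard-coded keys):

AmazonS3 s3 = AmazonS3ClientBuilder.standard()
    .withRegion("us-west-2")
    .withCredentials(new AWSStaticCredentialsProvider(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
    .build();

// Spark writes output.parquet as a directory of part files, so the File passed
// here should point at a single Parquet file rather than that directory.
s3.putObject("your-bucket", "data/output.parquet", new File("output.parquet"));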

Bash with AWS CLI

aws s3 cp output.parquet s3://your-bucket/data/output.parquet

Final Thoughts

Apache Parquet has quietly become a cornerstone of the modern data stack. It powers everything from ad hoc analytics to petabyte-scale data lakes, bringing consistency and efficiency to how we store and retrieve data.

Whether you are migrating legacy pipelines, designing new AI workloads, or simply optimizing your storage bills — understanding and adopting Parquet can unlock meaningful benefits.

When used in combination with cloud platforms like AWS, the performance, scalability, and cost-efficiency of Parquet-based workflows are hard to beat.

Posted in en-US | Tags: Java, NodeJS, parquet, Python, Spark | No Comments »

Creating EPUBs from Images: A Developer’s Guide to Digital Publishing

Author: Jonathan Lalou

Ever needed to convert a collection of images into a professional EPUB file? Whether you’re working with comics, manga, or any image-based content, I’ve developed a Python script that makes this process seamless and customizable.

What is create_epub.py?

This Python script transforms a folder of images into a fully-featured EPUB file, complete with:

Proper EPUB 3.0 structure
Customizable metadata
Table of contents
Responsive image display
Cover image handling

Key Features

Smart Filename Generation: Automatically generates EPUB filenames based on metadata (e.g., “MyBook_01_1.epub”)
Comprehensive Metadata Support: Title, author, series, volume, edition, ISBN, and more
Image Optimization: Supports JPEG, PNG, and GIF formats with proper scaling
Responsive Design: CSS-based layout that works across devices
Detailed Logging: Progress tracking and debugging capabilities
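The filename rule can be illustrated with a short sketch. This is a simplified, standalone version of the idea (title, volume and edition joined by underscores); the function name default_epub_name and its default values are assumptions made for this example.

```python
def default_epub_name(title, volume="1", edition="1"):
    """Build a name such as 'MyBook_01_1.epub' from basic metadata."""
    # Drop quotes, treat hyphens as spaces, then join the capitalized words.
    cleaned = title.replace("-", " ").replace('"', "").replace("'", "")
    camel = "".join(word.capitalize() for word in cleaned.split())
    # Pad purely numeric volumes to two digits; leave other values untouched.
    if volume.isdigit():
        volume = f"{int(volume):02d}"
    return f"{camel}_{volume}_{edition}.epub"


print(default_epub_name("My Book"))  # prints "MyBook_01_1.epub"
```

Note that a free-text edition such as "First Edition" would be inserted as-is, spaces included, so a short numeric edition gives cleaner filenames.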

Usage Example

python create_epub.py image_folder \
    --title "My Book" \
    --author "Author Name" \
    --volume 1 \
    --edition "First Edition" \
    --series "My Series" \
    --publisher "My Publisher" \
    --isbn "978-3-16-148410-0"

Technical Details

The script creates a proper EPUB 3.0 structure with:

META-INF/container.xml
OEBPS/content.opf (metadata)
OEBPS/toc.ncx (table of contents)
OEBPS/nav.xhtml (navigation)
OEBPS/style.css (responsive styling)
OEBPS/images/ (image storage)
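One detail of this layout is easy to get wrong: the mimetype entry has to be the first file in the ZIP archive and must be stored uncompressed. The sketch below builds a bare skeleton in memory (only mimetype and container.xml, so it is not a complete EPUB) and checks that property. The helper names are hypothetical.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_skeleton_epub():
    """Return the bytes of a ZIP holding only the mimetype and container.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as epub:
        # The mimetype entry must come first and must not be compressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML)
    return buffer.getvalue()


def check_mimetype_entry(epub_bytes):
    """Return True if the first ZIP entry is an uncompressed 'mimetype' file."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as epub:
        first = epub.infolist()[0]
        return (
            first.filename == "mimetype"
            and first.compress_type == zipfile.ZIP_STORED
            and epub.read("mimetype") == b"application/epub+zip"
        )


print(check_mimetype_entry(build_skeleton_epub()))
```

The same check can be pointed at a generated file by reading its bytes and passing them to check_mimetype_entry.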

Best Practices Implemented

Proper XML namespaces and validation
Responsive image handling
Comprehensive metadata support
Clean, maintainable code structure
Extensive error handling and logging

Getting Started

# Install dependencies
pip install -r requirements.txt

# Basic usage
python create_epub.py /path/to/images --title "My Book"

# With debug logging
python create_epub.py /path/to/images --title "My Book" --debug

The script is designed to be both powerful and user-friendly, making it accessible to developers while providing the flexibility needed for professional publishing workflows.

Whether you’re a developer looking to automate EPUB creation or a content creator seeking to streamline your publishing process, this tool provides a robust solution for converting images into EPUB files.

The script is available on GitHub and reproduced below:
[python]
import os
import sys
import logging
import zipfile
import uuid
from datetime import datetime
import argparse
from PIL import Image
import xml.etree.ElementTree
from xml.dom import minidom

# @author Jonathan Lalou / https://github.com/JonathanLalou/

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

# Define the CSS content
CSS_CONTENT = '''
body {
    margin: 0;
    padding: 0;
    display: flex;
    justify-content: center;
    align-items: center;
    min-height: 100vh;
}
img {
    max-width: 100%;
    max-height: 100vh;
    object-fit: contain;
}
'''

def create_container_xml():
    """Create the container.xml file."""
    logger.debug("Creating container.xml")
    container = xml.etree.ElementTree.Element('container', {
        'version': '1.0',
        'xmlns': 'urn:oasis:names:tc:opendocument:xmlns:container'
    })
    rootfiles = xml.etree.ElementTree.SubElement(container, 'rootfiles')
    xml.etree.ElementTree.SubElement(rootfiles, 'rootfile', {
        'full-path': 'OEBPS/content.opf',
        'media-type': 'application/oebps-package+xml'
    })
    xml_content = prettify_xml(container)
    logger.debug("container.xml content:\n" + xml_content)
    return xml_content

def create_content_opf(metadata, spine_items, manifest_items):
    """Create the content.opf file."""
    logger.debug("Creating content.opf")
    logger.debug(f"Metadata: {metadata}")
    logger.debug(f"Spine items: {spine_items}")
    logger.debug(f"Manifest items: {manifest_items}")

    package = xml.etree.ElementTree.Element('package', {
        'xmlns': 'http://www.idpf.org/2007/opf',
        'xmlns:dc': 'http://purl.org/dc/elements/1.1/',
        'xmlns:dcterms': 'http://purl.org/dc/terms/',
        'xmlns:opf': 'http://www.idpf.org/2007/opf',
        'version': '3.0',
        'unique-identifier': 'bookid'
    })

    # Metadata
    metadata_elem = xml.etree.ElementTree.SubElement(package, 'metadata')

    # Required metadata
    book_id = str(uuid.uuid4())
    xml.etree.ElementTree.SubElement(metadata_elem, 'dc:identifier', {'id': 'bookid'}).text = book_id
    logger.debug(f"Generated book ID: {book_id}")

    xml.etree.ElementTree.SubElement(metadata_elem, 'dc:title').text = metadata.get('title', 'Untitled')
    xml.etree.ElementTree.SubElement(metadata_elem, 'dc:language').text = metadata.get('language', 'en')
    xml.etree.ElementTree.SubElement(metadata_elem, 'dc:creator').text = metadata.get('author', 'Unknown')

    # Add required dcterms:modified
    # The trailing 'Z' denotes UTC, so the timestamp is taken in UTC rather than local time
    current_time = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ')
    xml.etree.ElementTree.SubElement(metadata_elem, 'meta', {
        'property': 'dcterms:modified'
    }).text = current_time

    # Add cover metadata
    xml.etree.ElementTree.SubElement(metadata_elem, 'meta', {
        'name': 'cover',
        'content': 'cover-image'
    })

    # Add additional metadata
    if metadata.get('publisher'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'dc:publisher').text = metadata['publisher']

    if metadata.get('description'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'dc:description').text = metadata['description']

    if metadata.get('rights'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'dc:rights').text = metadata['rights']

    if metadata.get('subject'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'dc:subject').text = metadata['subject']

    if metadata.get('isbn'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'dc:identifier', {
            'opf:scheme': 'ISBN'
        }).text = metadata['isbn']

    # Series metadata
    if metadata.get('series'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'meta', {
            'property': 'belongs-to-collection'
        }).text = metadata['series']
        xml.etree.ElementTree.SubElement(metadata_elem, 'meta', {
            'property': 'group-position'
        }).text = metadata.get('volume', '1')

    # Release date
    if metadata.get('release_date'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'dc:date').text = metadata['release_date']

    # Version and edition
    if metadata.get('version'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'meta', {
            'property': 'schema:version'
        }).text = metadata['version']

    if metadata.get('edition'):
        xml.etree.ElementTree.SubElement(metadata_elem, 'meta', {
            'property': 'schema:bookEdition'
        }).text = metadata['edition']

    # Manifest
    manifest = xml.etree.ElementTree.SubElement(package, 'manifest')
    for item in manifest_items:
        xml.etree.ElementTree.SubElement(manifest, 'item', item)

    # Spine
    spine = xml.etree.ElementTree.SubElement(package, 'spine')
    for item in spine_items:
        xml.etree.ElementTree.SubElement(spine, 'itemref', {'idref': item})

    xml_content = prettify_xml(package)
    logger.debug("content.opf content:\n" + xml_content)
    return xml_content

def create_toc_ncx(metadata, nav_points):
    """Create the toc.ncx file."""
    logger.debug("Creating toc.ncx")
    logger.debug(f"Navigation points: {nav_points}")

    ncx = xml.etree.ElementTree.Element('ncx', {
        'xmlns': 'http://www.daisy.org/z3986/2005/ncx/',
        'version': '2005-1'
    })

    head = xml.etree.ElementTree.SubElement(ncx, 'head')
    book_id = str(uuid.uuid4())
    xml.etree.ElementTree.SubElement(head, 'meta', {'name': 'dtb:uid', 'content': book_id})
    logger.debug(f"Generated NCX book ID: {book_id}")

    xml.etree.ElementTree.SubElement(head, 'meta', {'name': 'dtb:depth', 'content': '1'})
    xml.etree.ElementTree.SubElement(head, 'meta', {'name': 'dtb:totalPageCount', 'content': '0'})
    xml.etree.ElementTree.SubElement(head, 'meta', {'name': 'dtb:maxPageNumber', 'content': '0'})

    doc_title = xml.etree.ElementTree.SubElement(ncx, 'docTitle')
    xml.etree.ElementTree.SubElement(doc_title, 'text').text = metadata.get('title', 'Untitled')

    nav_map = xml.etree.ElementTree.SubElement(ncx, 'navMap')
    # nav_id avoids shadowing the built-in id()
    for i, (nav_id, label, src) in enumerate(nav_points, 1):
        nav_point = xml.etree.ElementTree.SubElement(nav_map, 'navPoint', {'id': nav_id, 'playOrder': str(i)})
        nav_label = xml.etree.ElementTree.SubElement(nav_point, 'navLabel')
        xml.etree.ElementTree.SubElement(nav_label, 'text').text = label
        xml.etree.ElementTree.SubElement(nav_point, 'content', {'src': src})

    xml_content = prettify_xml(ncx)
    logger.debug("toc.ncx content:\n" + xml_content)
    return xml_content

def create_nav_xhtml(metadata, nav_points):
    """Create the nav.xhtml file."""
    logger.debug("Creating nav.xhtml")

    html = xml.etree.ElementTree.Element('html', {
        'xmlns': 'http://www.w3.org/1999/xhtml',
        'xmlns:epub': 'http://www.idpf.org/2007/ops'
    })

    head = xml.etree.ElementTree.SubElement(html, 'head')
    xml.etree.ElementTree.SubElement(head, 'title').text = 'Table of Contents'

    body = xml.etree.ElementTree.SubElement(html, 'body')
    nav = xml.etree.ElementTree.SubElement(body, 'nav', {'epub:type': 'toc'})
    ol = xml.etree.ElementTree.SubElement(nav, 'ol')

    for _, label, src in nav_points:
        li = xml.etree.ElementTree.SubElement(ol, 'li')
        xml.etree.ElementTree.SubElement(li, 'a', {'href': src}).text = label

    xml_content = prettify_xml(html)
    logger.debug("nav.xhtml content:\n" + xml_content)
    return xml_content

def create_page_xhtml(page_number, image_file):
    """Create an XHTML page for an image."""
    logger.debug(f"Creating page {page_number} for image {image_file}")

    html = xml.etree.ElementTree.Element('html', {
        'xmlns': 'http://www.w3.org/1999/xhtml',
        'xmlns:epub': 'http://www.idpf.org/2007/ops'
    })

    head = xml.etree.ElementTree.SubElement(html, 'head')
    xml.etree.ElementTree.SubElement(head, 'title').text = f'Page {page_number}'
    xml.etree.ElementTree.SubElement(head, 'link', {
        'rel': 'stylesheet',
        'type': 'text/css',
        'href': 'style.css'
    })

    body = xml.etree.ElementTree.SubElement(html, 'body')
    xml.etree.ElementTree.SubElement(body, 'img', {
        'src': f'images/{image_file}',
        'alt': f'Page {page_number}'
    })

    xml_content = prettify_xml(html)
    logger.debug(f"Page {page_number} XHTML content:\n" + xml_content)
    return xml_content

def prettify_xml(elem):
    """Convert XML element to pretty string."""
    rough_string = xml.etree.ElementTree.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=" ")

def create_epub_from_images(image_folder, output_file, metadata):
    logger.info(f"Starting EPUB creation from images in {image_folder}")
    logger.info(f"Output file will be: {output_file}")
    logger.info(f"Metadata: {metadata}")

    # Get all image files
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp'))]
    image_files.sort()
    logger.info(f"Found {len(image_files)} image files")
    logger.debug(f"Image files: {image_files}")

    if not image_files:
        logger.error("No image files found in the specified folder")
        sys.exit(1)

    # Media type per extension, so GIF and BMP files are not declared as PNG
    media_types = {
        '.jpg': 'image/jpeg',
        '.jpeg': 'image/jpeg',
        '.png': 'image/png',
        '.gif': 'image/gif',
        '.bmp': 'image/bmp'
    }

    # Create ZIP file (EPUB)
    logger.info("Creating EPUB file structure")
    with zipfile.ZipFile(output_file, 'w', zipfile.ZIP_DEFLATED) as epub:
        # Add mimetype (must be first, uncompressed)
        logger.debug("Adding mimetype file (uncompressed)")
        epub.writestr('mimetype', 'application/epub+zip', zipfile.ZIP_STORED)

        # Create META-INF directory
        logger.debug("Adding container.xml")
        epub.writestr('META-INF/container.xml', create_container_xml())

        # Everything is written straight into the archive, so no temporary
        # directories are needed on disk

        # Add CSS
        logger.debug("Adding style.css")
        epub.writestr('OEBPS/style.css', CSS_CONTENT)

        # Process images and create pages
        logger.info("Processing images and creating pages")
        manifest_items = [
            {'id': 'style', 'href': 'style.css', 'media-type': 'text/css'},
            {'id': 'nav', 'href': 'nav.xhtml', 'media-type': 'application/xhtml+xml', 'properties': 'nav'}
        ]
        spine_items = []
        nav_points = []

        for i, image_file in enumerate(image_files, 1):
            logger.debug(f"Processing image {i:03d}/{len(image_files):03d}: {image_file}")

            # Copy image into the archive
            image_path = os.path.join(image_folder, image_file)
            logger.debug(f"Reading image: {image_path}")
            with open(image_path, 'rb') as f:
                image_data = f.read()
            logger.debug(f"Adding image to EPUB: OEBPS/images/{image_file}")
            epub.writestr(f'OEBPS/images/{image_file}', image_data)

            # Add image to manifest
            image_id = f'image_{i:03d}'
            if i == 1:
                image_id = 'cover-image'  # Special ID for cover image
            extension = os.path.splitext(image_file)[1].lower()
            manifest_items.append({
                'id': image_id,
                'href': f'images/{image_file}',
                'media-type': media_types.get(extension, 'image/png')
            })

            # Create page XHTML
            page_id = f'page_{i:03d}'
            logger.debug(f"Creating page XHTML: {page_id}.xhtml")
            page_content = create_page_xhtml(i, image_file)
            epub.writestr(f'OEBPS/{page_id}.xhtml', page_content)

            # Add to manifest and spine
            manifest_items.append({
                'id': page_id,
                'href': f'{page_id}.xhtml',
                'media-type': 'application/xhtml+xml'
            })
            spine_items.append(page_id)

            # Add to navigation points
            nav_points.append((
                f'navpoint-{i:03d}',
                'Cover' if i == 1 else f'Page {i:03d}',
                f'{page_id}.xhtml'
            ))

        # Create content.opf
        logger.debug("Creating content.opf")
        epub.writestr('OEBPS/content.opf', create_content_opf(metadata, spine_items, manifest_items))

        # Create toc.ncx
        logger.debug("Creating toc.ncx")
        epub.writestr('OEBPS/toc.ncx', create_toc_ncx(metadata, nav_points))

        # Create nav.xhtml
        logger.debug("Creating nav.xhtml")
        epub.writestr('OEBPS/nav.xhtml', create_nav_xhtml(metadata, nav_points))

    logger.info(f"Successfully created EPUB file: {output_file}")
    logger.info("EPUB structure:")
    logger.info("  mimetype")
    logger.info("  META-INF/container.xml")
    logger.info("  OEBPS/")
    logger.info("    content.opf")
    logger.info("    toc.ncx")
    logger.info("    nav.xhtml")
    logger.info("    style.css")
    logger.info("    images/")
    for i in range(1, len(image_files) + 1):
        logger.info(f"    page_{i:03d}.xhtml")

def generate_default_filename(metadata, image_folder):
    """Generate default EPUB filename based on metadata."""
    # Get title from metadata or use folder name
    title = metadata.get('title')
    if not title:
        # Get folder name and extract part before last underscore
        folder_name = os.path.basename(os.path.normpath(image_folder))
        title = folder_name.rsplit('_', 1)[0] if '_' in folder_name else folder_name

    # Format title: remove spaces, hyphens, quotes and capitalize
    title = ''.join(word.capitalize() for word in title.replace('-', ' ').replace('"', '').replace("'", '').split())

    # Format volume number with 2 digits
    volume = metadata.get('volume', '01')
    if volume.isdigit():
        volume = f"{int(volume):02d}"

    # Get edition number
    edition = metadata.get('edition', '1')

    return f"{title}_{volume}_{edition}.epub"

def main():
    parser = argparse.ArgumentParser(description='Create an EPUB from a folder of images')
    parser.add_argument('image_folder', help='Folder containing the images')
    parser.add_argument('--output-file', '-o', help='Output EPUB file path (optional)')
    parser.add_argument('--title', help='Book title')
    parser.add_argument('--author', help='Book author')
    parser.add_argument('--series', help='Series name')
    parser.add_argument('--volume', help='Volume number')
    parser.add_argument('--release-date', help='Release date (YYYY-MM-DD)')
    parser.add_argument('--edition', help='Edition number')
    parser.add_argument('--version', help='Version number')
    parser.add_argument('--language', help='Book language (default: en)')
    parser.add_argument('--publisher', help='Publisher name')
    parser.add_argument('--description', help='Book description')
    parser.add_argument('--rights', help='Copyright/license information')
    parser.add_argument('--subject', help='Book subject/category')
    parser.add_argument('--isbn', help='ISBN number')
    parser.add_argument('--debug', action='store_true', help='Enable debug logging')

    args = parser.parse_args()

    if args.debug:
        logger.setLevel(logging.DEBUG)
        logger.info("Debug logging enabled")

    if not os.path.exists(args.image_folder):
        logger.error(f"Image folder does not exist: {args.image_folder}")
        sys.exit(1)

    if not os.path.isdir(args.image_folder):
        logger.error(f"Specified path is not a directory: {args.image_folder}")
        sys.exit(1)

    metadata = {
        'title': args.title,
        'author': args.author,
        'series': args.series,
        'volume': args.volume,
        'release_date': args.release_date,
        'edition': args.edition,
        'version': args.version,
        'language': args.language,
        'publisher': args.publisher,
        'description': args.description,
        'rights': args.rights,
        'subject': args.subject,
        'isbn': args.isbn
    }

    # Remove None values from metadata
    metadata = {k: v for k, v in metadata.items() if v is not None}

    # Generate output filename if not provided
    if not args.output_file:
        args.output_file = generate_default_filename(metadata, args.image_folder)
        logger.info(f"Using default output filename: {args.output_file}")

    try:
        create_epub_from_images(args.image_folder, args.output_file, metadata)
        logger.info("EPUB creation completed successfully")
    except Exception as e:
        logger.error(f"EPUB creation failed: {str(e)}")
        sys.exit(1)

if __name__ == '__main__':
    main()

[/python]

Posted in en-US | Tags: epub, Python | No Comments »

Understanding Chi-Square Tests: A Comprehensive Guide for Developers

Author: Jonathan Lalou

In the world of software development and data analysis, understanding statistical significance is crucial. Whether you’re running A/B tests, analyzing user behavior, or building machine learning models, the Chi-Square (χ²) test is an essential tool in your statistical toolkit. This comprehensive guide will help you understand its principles, implementation, and practical applications.

What is Chi-Square?

The Chi-Square test is a statistical method used to determine if there’s a significant difference between expected and observed frequencies in categorical data. It’s named after the Greek letter χ (chi) and is particularly useful for analyzing relationships between categorical variables.

Historical Context

The Chi-Square test was developed by Karl Pearson in 1900, making it one of the oldest statistical tests still in widespread use today. Its development marked a significant advancement in statistical analysis, particularly in the field of categorical data analysis.

Core Principles and Mathematical Foundation

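Null Hypothesis (H₀): Assumes that any gap between observed and expected frequencies is due to chance alone
Alternative Hypothesis (H₁): States that the observed frequencies differ from the expected ones by more than chance would explain
Degrees of Freedom: Number of categories minus one for a goodness-of-fit test; (rows − 1) × (columns − 1) for a contingency table
P-value: Probability of obtaining a statistic at least as extreme as the one observed, if H₀ is true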

The Chi-Square Formula

The Chi-Square statistic is calculated using the formula:

χ² = Σ [(O - E)² / E]

Where:

O = Observed frequency
E = Expected frequency
Σ = Sum over all categories
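The formula can be evaluated directly in a few lines of plain Python. The sketch below uses illustrative counts for 100 rolls of a six-sided die, with equal expected counts for each face; the function name is hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative data: 100 rolls of a six-sided die.
observed = [18, 16, 15, 17, 16, 18]
# A fair die gives the same expected count for every face: 100 / 6.
expected = [sum(observed) / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
print(f"chi2 = {chi2:.2f}")  # prints "chi2 = 0.44"
```

With six categories and one constraint (the total), this statistic has 5 degrees of freedom. A value as small as 0.44 gives no evidence against the die being fair.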

Practical Implementation

1. A/B Testing Implementation (Python)

from scipy.stats import chi2_contingency
import numpy as np

def perform_ab_test(control_data, treatment_data):
    """
    Perform A/B test using Chi-Square test
    
    Args:
        control_data: List of [successes, failures] for control group
        treatment_data: List of [successes, failures] for treatment group
    """
    # Create contingency table
    observed = np.array([control_data, treatment_data])
    
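    # Perform Chi-Square test
    # Note: for 2x2 tables, SciPy applies Yates' continuity correction by default;
    # pass correction=False to get the uncorrected Pearson statistic
    chi2, p_value, dof, expected = chi2_contingency(observed)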
    
    # Calculate effect size (Cramer's V)
    n = np.sum(observed)
    min_dim = min(observed.shape) - 1
    cramers_v = np.sqrt(chi2 / (n * min_dim))
    
    return {
        'chi2': chi2,
        'p_value': p_value,
        'dof': dof,
        'expected': expected,
        'effect_size': cramers_v
    }

# Example usage
control = [100, 150]  # [clicks, no-clicks] for control
treatment = [120, 130]  # [clicks, no-clicks] for treatment

results = perform_ab_test(control, treatment)
print(f"Chi-Square: {results['chi2']:.2f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Effect Size (Cramer's V): {results['effect_size']:.3f}")
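To see where the expected frequencies come from, the same 2x2 table can be worked through by hand. The sketch below uses only the standard library and computes the uncorrected Pearson statistic; the function names are hypothetical, and the closed-form p-value applies only to one degree of freedom.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    expected = expected_counts(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, the chi-square tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected


table = [[100, 150], [120, 130]]  # same counts as the A/B example
chi2, p_value, expected = chi_square_2x2(table)
print(expected)  # prints [[110.0, 140.0], [110.0, 140.0]]
print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}")
```

Because chi2_contingency applies Yates' continuity correction to 2x2 tables by default, the statistic it reports for these counts is somewhat smaller than the uncorrected value computed here.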

2. Feature Selection Implementation (Java)

import org.apache.commons.math3.stat.inference.ChiSquareTest;
import java.util.Arrays;

public class FeatureSelection {
    private final ChiSquareTest chiSquareTest;
    
    public FeatureSelection() {
        this.chiSquareTest = new ChiSquareTest();
    }
    
    public FeatureSelectionResult analyzeFeature(
            long[][] observed,
            double significanceLevel) {
        
        double pValue = chiSquareTest.chiSquareTest(observed);
        boolean isSignificant = pValue < significanceLevel;
        
        // Calculate effect size (Cramer's V)
        double chiSquare = chiSquareTest.chiSquare(observed);
        long total = Arrays.stream(observed)
                .flatMapToLong(Arrays::stream)
                .sum();
        int minDim = Math.min(observed.length, observed[0].length) - 1;
        double cramersV = Math.sqrt(chiSquare / (total * minDim));
        
        return new FeatureSelectionResult(
            pValue,
            isSignificant,
            cramersV
        );
    }
    
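    public static class FeatureSelectionResult {
        private final double pValue;
        private final boolean isSignificant;
        private final double effectSize;

        public FeatureSelectionResult(double pValue, boolean isSignificant, double effectSize) {
            this.pValue = pValue;
            this.isSignificant = isSignificant;
            this.effectSize = effectSize;
        }

        public double getPValue() { return pValue; }
        public boolean isSignificant() { return isSignificant; }
        public double getEffectSize() { return effectSize; }
    }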
}

Advanced Applications

1. Machine Learning Feature Selection

Chi-Square tests are particularly useful in feature selection for machine learning models. Here’s how to implement it in Python using scikit-learn:

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris
import pandas as pd

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Select top 2 features using Chi-Square
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

# Get selected features
selected_features = X.columns[selector.get_support()]
print(f"Selected features: {selected_features.tolist()}")

2. Goodness-of-Fit Testing

Testing if your data follows a particular distribution:

from scipy.stats import chisquare
import numpy as np

# Example: Testing if a die is fair
observed = np.array([18, 16, 15, 17, 16, 18])  # Observed frequencies
# Expected counts for a fair die; they must sum to the same total as the observed counts
expected = np.full(observed.shape, observed.sum() / observed.size)

chi2, p_value = chisquare(observed, expected)
print(f"Chi-Square: {chi2:.2f}")
print(f"P-value: {p_value:.4f}")

Best Practices and Considerations

Sample Size: Ensure sufficient sample size for reliable results
Expected Frequencies: Each expected frequency should be ≥ 5
Multiple Testing: Apply corrections (e.g., Bonferroni) when conducting multiple tests
Effect Size: Consider effect size in addition to p-values
Assumptions: Verify test assumptions before application
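Two of these checks are simple enough to automate. The sketch below shows a Bonferroni-adjusted threshold and a scan for small expected counts; the helper names and the sample numbers are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at most alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def small_expected_cells(expected, minimum=5):
    """Return the expected counts that fall below the rule-of-thumb minimum."""
    return [e for row in expected for e in row if e < minimum]


p_values = [0.004, 0.030, 0.200]  # hypothetical results from three tests
threshold = bonferroni_threshold(0.05, len(p_values))
significant = [p for p in p_values if p < threshold]
print(significant)  # prints [0.004]

# Hypothetical expected-frequency table with two cells below 5.
print(small_expected_cells([[12.5, 3.2], [40.0, 4.9]]))  # prints [3.2, 4.9]
```

When small expected counts are flagged, common responses are to merge sparse categories or to switch to an exact test.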

Common Pitfalls to Avoid

Using Chi-Square for continuous data
Ignoring small expected frequencies
Overlooking multiple testing issues
Focusing solely on p-values without considering effect size
Applying the test without checking assumptions

Resources and Further Reading

Understanding and properly implementing Chi-Square tests can significantly enhance your data analysis capabilities as a developer. Whether you’re working on A/B testing, feature selection, or data validation, this statistical tool provides valuable insights into your data’s relationships and distributions.

Remember to always consider the context of your analysis, verify assumptions, and interpret results carefully. Happy coding!

Posted in en-US | Tags: Java, Python, Statistics | No Comments »

RSS to EPUB Converter: Create eBooks from RSS Feeds

Author: Jonathan Lalou

Overview

This Python script (rss_to_ebook.py) converts RSS or Atom feeds into EPUB format eBooks, allowing you to read your favorite blog posts and news articles offline in your preferred e-reader. The script intelligently handles both RSS 2.0 and Atom feed formats, preserving HTML formatting while creating a clean, readable eBook.

Key Features

Dual Format Support: Works with both RSS 2.0 and Atom feeds
Smart Pagination: Automatically handles paginated feeds using multiple detection methods
Date Range Filtering: Select specific date ranges for content inclusion
Metadata Preservation: Maintains feed metadata including title, author, and description
HTML Formatting: Preserves original HTML formatting while cleaning unnecessary elements
Duplicate Prevention: Automatically detects and removes duplicate entries
Comprehensive Logging: Detailed progress tracking and error reporting

Technical Details

The script uses several Python libraries:

feedparser: For parsing RSS and Atom feeds
ebooklib: For creating EPUB files
BeautifulSoup: For HTML cleaning and processing
logging: For detailed operation tracking

Usage

python rss_to_ebook.py <feed_url> [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] [--output filename.epub] [--debug]

Parameters:

feed_url: URL of the RSS or Atom feed (required)
--start-date: Start date for content inclusion (default: 1 year ago)
--end-date: End date for content inclusion (default: today)
--output: Output EPUB filename (default: rss_feed.epub)
--debug: Enable detailed logging
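As a rough illustration, the documented parameters and defaults could be declared with argparse as follows. This is a sketch under assumptions, not the script's actual parser: build_parser is a hypothetical name, "1 year ago" is approximated as 365 days, and the today argument exists only to make the defaults reproducible.

```python
import argparse
from datetime import date, timedelta


def build_parser(today=None):
    """Declare the documented options; `today` can be injected for testing."""
    today = today or date.today()
    parser = argparse.ArgumentParser(description="Convert an RSS/Atom feed to EPUB")
    parser.add_argument("feed_url", help="URL of the RSS or Atom feed")
    parser.add_argument("--start-date", type=date.fromisoformat,
                        default=today - timedelta(days=365),
                        help="Start date, YYYY-MM-DD (default: 1 year ago)")
    parser.add_argument("--end-date", type=date.fromisoformat, default=today,
                        help="End date, YYYY-MM-DD (default: today)")
    parser.add_argument("--output", default="rss_feed.epub",
                        help="Output EPUB filename")
    parser.add_argument("--debug", action="store_true",
                        help="Enable detailed logging")
    return parser


args = build_parser(today=date(2024, 3, 31)).parse_args(
    ["https://example.com/feed", "--start-date", "2024-01-01"]
)
print(args.start_date, args.end_date, args.output, args.debug)
```

Parsing dates with date.fromisoformat means a malformed value such as 2024-13-01 is rejected by argparse with a usage error before any feed is fetched.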

Example

python rss_to_ebook.py https://example.com/feed --start-date 2024-01-01 --end-date 2024-03-31 --output my_blog.epub

Requirements

Python 3.x

Required packages (install via pip):

pip install feedparser ebooklib beautifulsoup4

How It Works

Feed Detection: Automatically identifies feed format (RSS 2.0 or Atom)
Content Processing:
- Extracts entries within specified date range
- Preserves HTML formatting while cleaning unnecessary elements
- Handles pagination to get all available content
EPUB Creation:
- Creates chapters from feed entries
- Maintains original formatting and links
- Includes table of contents and navigation
- Preserves feed metadata
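The content-processing step can be sketched on plain dictionaries, independent of any feed library. This sketch assumes each entry has a published datetime and a link, and that the link is a reasonable key for spotting duplicates; the function name and the sample entries are invented for illustration.

```python
from datetime import datetime


def select_entries(entries, start_date, end_date):
    """Keep entries published within [start_date, end_date], dropping duplicates."""
    seen_links = set()
    selected = []
    for entry in entries:
        published = entry["published"]
        if not (start_date <= published <= end_date):
            continue
        if entry["link"] in seen_links:  # same link seen before: treat as a duplicate
            continue
        seen_links.add(entry["link"])
        selected.append(entry)
    # Oldest first, so chapters appear in chronological order.
    return sorted(selected, key=lambda e: e["published"])


entries = [
    {"title": "B", "link": "https://example.com/b", "published": datetime(2024, 2, 10)},
    {"title": "A", "link": "https://example.com/a", "published": datetime(2024, 1, 5)},
    {"title": "B (repeat)", "link": "https://example.com/b", "published": datetime(2024, 2, 10)},
    {"title": "Old", "link": "https://example.com/old", "published": datetime(2023, 6, 1)},
]
result = select_entries(entries, datetime(2024, 1, 1), datetime(2024, 3, 31))
print([e["title"] for e in result])  # prints ['A', 'B']
```

Deduplicating on the link matters most with paginated feeds, where consecutive pages can overlap and return the same entry twice.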

Error Handling

Validates feed format and content
Handles malformed HTML
Provides detailed error messages and logging
Gracefully handles missing or incomplete feed data

Use Cases

Create eBooks from your favorite blogs
Archive important news articles
Generate reading material for offline use
Create compilations of related content

Gist: GitHub

Here is the script:

[python]
#!/usr/bin/env python3

import feedparser
import argparse
from datetime import datetime, timedelta
from ebooklib import epub
import re
from bs4 import BeautifulSoup
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

def clean_html(html_content):
    """Clean HTML content while preserving formatting."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Remove any inline styles
    for tag in soup.find_all(True):
        if 'style' in tag.attrs:
            del tag.attrs['style']

    # Return the cleaned HTML
    return str(soup)

def get_next_feed_page(current_feed, feed_url):
    """Get the next page of the feed using various pagination methods."""
    # Method 1: next_page link in feed
    if hasattr(current_feed, 'next_page'):
        logging.info(f"Found next_page link: {current_feed.next_page}")
        return current_feed.next_page

    # Method 2: Atom-style pagination
    if hasattr(current_feed.feed, 'links'):
        for link in current_feed.feed.links:
            if link.get('rel') == 'next':
                logging.info(f"Found Atom-style next link: {link.href}")
                return link.href

    # Method 3: RSS 2.0 pagination (using lastBuildDate)
    if hasattr(current_feed.feed, 'lastBuildDate'):
        # A list has no 'last' attribute, so test for a non-empty list instead
        if current_feed.entries:
            last_entry = current_feed.entries[-1]
            if hasattr(last_entry, 'published_parsed'):
                last_entry_date = datetime(*last_entry.published_parsed[:6])
                # Try to construct next page URL with date parameter
                if '?' in feed_url:
                    next_url = f"{feed_url}&before={last_entry_date.strftime('%Y-%m-%d')}"
                else:
                    next_url = f"{feed_url}?before={last_entry_date.strftime('%Y-%m-%d')}"
                logging.info(f"Constructed date-based next URL: {next_url}")
                return next_url

    # Method 4: Check for pagination in feed description
    if hasattr(current_feed.feed, 'description'):
        desc = current_feed.feed.description
        # Look for common pagination patterns in description
        next_page_patterns = [
            r'next page: (https?://\S+)',
            r'older posts: (https?://\S+)',
            r'page \d+: (https?://\S+)'
        ]
        for pattern in next_page_patterns:
            match = re.search(pattern, desc, re.IGNORECASE)
            if match:
                next_url = match.group(1)
                logging.info(f"Found next page URL in description: {next_url}")
                return next_url

    return None

def get_feed_type(feed):
    """Determine if the feed is RSS 2.0 or Atom format."""
    if hasattr(feed, 'version') and feed.version.startswith('rss'):
        return 'rss'
    elif hasattr(feed, 'version') and feed.version == 'atom10':
        return 'atom'
    # Try to detect by checking for Atom-specific elements
    elif hasattr(feed.feed, 'links') and any(link.get('rel') == 'self' for link in feed.feed.links):
        return 'atom'
    # Default to RSS if no clear indicators
    return 'rss'

def get_entry_content(entry, feed_type):
    """Get the content of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'summary'):
            return entry.summary
    else:
        # RSS 2.0 format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'description'):
            return entry.description
    return ''

def get_entry_date(entry, feed_type):
    """Get the publication date of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format uses updated or published
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
        elif hasattr(entry, 'updated_parsed'):
            return datetime(*entry.updated_parsed[:6])
    else:
        # RSS 2.0 format uses pubDate
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
    # Fall back to the current time when the entry carries no date
    return datetime.now()

def get_feed_metadata(feed, feed_type):
    """Extract metadata from feed based on its type."""
    metadata = {
        'title': '',
        'description': '',
        'language': 'en',
        'author': 'Unknown',
        'publisher': '',
        'rights': '',
        'updated': ''
    }

    if feed_type == 'atom':
        # Atom format metadata
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('subtitle', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        metadata['rights'] = feed.feed.get('rights', '')
        metadata['updated'] = feed.feed.get('updated', '')
    else:
        # RSS 2.0 format metadata
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('description', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        # Store under the keys create_ebook() reads ('rights' and 'updated')
        metadata['rights'] = feed.feed.get('copyright', '')
        metadata['updated'] = feed.feed.get('lastBuildDate', '')

    return metadata

def create_ebook(feed_url, start_date, end_date, output_file):
    """Create an ebook from RSS feed entries within the specified date range."""
    logging.info(f"Starting ebook creation from feed: {feed_url}")
    logging.info(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

    # Parse the RSS feed
    feed = feedparser.parse(feed_url)

    if feed.bozo:
        logging.error(f"Error parsing feed: {feed.bozo_exception}")
        return False

    # Determine feed type
    feed_type = get_feed_type(feed)
    logging.info(f"Detected feed type: {feed_type}")

    logging.info(f"Successfully parsed feed: {feed.feed.get('title', 'Unknown Feed')}")

    # Create a new EPUB book
    book = epub.EpubBook()

    # Extract metadata based on feed type
    metadata = get_feed_metadata(feed, feed_type)

    logging.info(f"Setting metadata for ebook: {metadata['title']}")

    # Set basic metadata
    book.set_identifier(feed_url)  # Use feed URL as unique identifier
    book.set_title(metadata['title'])
    book.set_language(metadata['language'])
    book.add_author(metadata['author'])

    # Add additional metadata if available
    if metadata['publisher']:
        book.add_metadata('DC', 'publisher', metadata['publisher'])
    if metadata['rights']:
        book.add_metadata('DC', 'rights', metadata['rights'])
    if metadata['updated']:
        book.add_metadata('DC', 'date', metadata['updated'])

    # Add a single description that includes the date range
    # (avoids writing two dc:description elements)
    date_range_desc = f"Content from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}"
    description = f"{metadata['description']}\n\n{date_range_desc}".strip()
    book.add_metadata('DC', 'description', description)

    # Create table of contents
    chapters = []
    toc = []

    # Process entries within date range
    entries_processed = 0
    entries_in_range = 0
    consecutive_out_of_range = 0
    current_page = 1
    processed_urls = set()  # Track processed URLs to avoid duplicates

    logging.info("Starting to process feed entries...")

    while True:
        logging.info(f"Processing page {current_page} with {len(feed.entries)} entries")

        # Process current batch of entries. Iterate over the whole page:
        # entries_processed is cumulative, so slicing by it would skip
        # entries on every page after the first.
        for entry in feed.entries:
            entries_processed += 1

            # Skip if we've already processed this entry
            entry_id = entry.get('id', entry.get('link', ''))
            if entry_id in processed_urls:
                logging.debug(f"Skipping duplicate entry: {entry_id}")
                continue
            processed_urls.add(entry_id)

            # Get entry date based on feed type
            entry_date = get_entry_date(entry, feed_type)

            if entry_date < start_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (before start date)")
                continue
            elif entry_date > end_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (after end date)")
                continue
            else:
                consecutive_out_of_range = 0
                entries_in_range += 1

            # Create chapter
            title = entry.get('title', 'Untitled')
            logging.info(f"Adding chapter: {title} ({entry_date.strftime('%Y-%m-%d')})")

            # Get content based on feed type
            content = get_entry_content(entry, feed_type)

            # Clean the content
            cleaned_content = clean_html(content)

            # Create chapter
            chapter = epub.EpubHtml(
                title=title,
                file_name=f'chapter_{len(chapters)}.xhtml',
                content=f'<h1>{title}</h1>{cleaned_content}'
            )

            # Add chapter to book
            book.add_item(chapter)
            chapters.append(chapter)
            toc.append(epub.Link(chapter.file_name, title, chapter.id))

        # If we have no entries in range or we've seen too many consecutive out-of-range entries, stop
        if entries_in_range == 0 or consecutive_out_of_range >= 10:
            if entries_in_range == 0:
                logging.warning("No entries found within the specified date range")
            else:
                logging.info(f"Stopping after {consecutive_out_of_range} consecutive out-of-range entries")
            break

        # Try to get more entries if available
        next_page_url = get_next_feed_page(feed, feed_url)
        if next_page_url:
            current_page += 1
            logging.info(f"Fetching next page: {next_page_url}")
            feed = feedparser.parse(next_page_url)
            if not feed.entries:
                logging.info("No more entries available")
                break
        else:
            logging.info("No more pages available")
            break

    if entries_in_range == 0:
        logging.error("No entries found within the specified date range")
        return False

    logging.info(f"Processed {entries_processed} total entries, {entries_in_range} within date range")

    # Add table of contents
    book.toc = toc

    # Add navigation files
    book.add_item(epub.EpubNcx())
    book.add_item(epub.EpubNav())

    # Define CSS style
    style = '''
@namespace epub "http://www.idpf.org/2007/ops";
body {
    font-family: Cambria, Liberation Serif, serif;
}
h1 {
    text-align: left;
    text-transform: uppercase;
    font-weight: 200;
}
'''

    # Add CSS file
    nav_css = epub.EpubItem(
        uid="style_nav",
        file_name="style/nav.css",
        media_type="text/css",
        content=style
    )
    book.add_item(nav_css)

    # Create spine
    book.spine = ['nav'] + chapters

    # Write the EPUB file
    logging.info(f"Writing EPUB file: {output_file}")
    epub.write_epub(output_file, book, {})
    logging.info("EPUB file created successfully")
    return True

def main():
    parser = argparse.ArgumentParser(description='Convert RSS feed to EPUB ebook')
    parser.add_argument('feed_url', help='URL of the RSS feed')
    parser.add_argument('--start-date', help='Start date (YYYY-MM-DD)',
                        default=(datetime.now() - timedelta(days=365)).strftime('%Y-%m-%d'))
    parser.add_argument('--end-date', help='End date (YYYY-MM-DD)',
                        default=datetime.now().strftime('%Y-%m-%d'))
    parser.add_argument('--output', help='Output EPUB file name',
                        default='rss_feed.epub')
    parser.add_argument('--debug', action='store_true', help='Enable debug logging')

    args = parser.parse_args()

    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)

    # Parse dates
    start_date = datetime.strptime(args.start_date, '%Y-%m-%d')
    end_date = datetime.strptime(args.end_date, '%Y-%m-%d')

    # Create ebook
    if create_ebook(args.feed_url, start_date, end_date, args.output):
        logging.info(f"Successfully created ebook: {args.output}")
    else:
        logging.error("Failed to create ebook")

if __name__ == '__main__':
    main()

[/python]

Posted in en-US | Tags: Atom, epub, Python, RSS | No Comments »

Quick and dirty script to convert WordPress export file to Blogger / Atom XML

Author: Jonathan Lalou

I’ve created a Python script that converts WordPress export files to Blogger/Atom XML format. Here’s how to use it:

The script takes two command-line arguments:

wordpress_export.xml: Path to your WordPress export XML file
blogger_export.xml : Path where you want to save the converted Blogger/Atom XML file

To run the script:

python wordpress_to_blogger.py wordpress_export.xml blogger_export.xml

The script performs the following conversions:

Converts WordPress posts to Atom feed entries
Preserves post titles, content, publication dates, and authors
Maintains categories as Atom categories
Handles post status (published/draft)
Preserves HTML content formatting
Converts dates to ISO format required by Atom
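
As a small illustration of the date conversion, the sketch below turns an RSS-style `pubDate` into the ISO 8601 form used by Atom. The sample date is hypothetical; `parsedate_to_datetime` is shown as a standard-library alternative to a hand-written format string, not as something the script uses.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

pub_date = "Mon, 06 Jan 2025 10:30:00 +0000"  # hypothetical pubDate value

# Explicit format string for RFC 822 style dates
iso_strptime = datetime.strptime(pub_date, "%a, %d %b %Y %H:%M:%S %z").isoformat()

# Standard-library parser for the same date format
iso_email = parsedate_to_datetime(pub_date).isoformat()

print(iso_strptime)  # 2025-01-06T10:30:00+00:00
print(iso_email)     # 2025-01-06T10:30:00+00:00
```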

The script uses Python’s built-in xml.etree.ElementTree module for XML processing and includes error handling to make it robust.
Some important notes:

The script only converts posts (not pages or other content types)
It preserves the HTML content of your posts
It maintains the original publication dates
It handles both published and draft posts
The output is a valid Atom XML feed that Blogger can import
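
Filtering posts from other content types relies on namespace-aware lookups in ElementTree. Here is a minimal sketch of that idea on a tiny, made-up export fragment; the `WXR` string and the `NS` mapping are assumptions for illustration, and real exports may declare a different `wp` namespace version.

```python
import xml.etree.ElementTree as ET

# Minimal, hypothetical export fragment for illustration
WXR = """<rss xmlns:wp="http://wordpress.org/export/1.2/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Hello</title>
      <dc:creator>alice</dc:creator>
      <wp:post_type>post</wp:post_type>
    </item>
    <item>
      <title>About</title>
      <dc:creator>alice</dc:creator>
      <wp:post_type>page</wp:post_type>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI mapping passed to find/findtext
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(WXR)
posts = [
    item.findtext("title")
    for item in root.findall(".//item")
    if item.findtext("wp:post_type", namespaces=NS) == "post"
]
print(posts)  # ['Hello']
```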

The file:

[python]
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import sys
import argparse
from datetime import datetime
import re

def convert_wordpress_to_blogger(wordpress_file, output_file):
    # Namespaces used in WordPress export files (the wp URI carries the export version)
    ns = {
        'wp': 'http://wordpress.org/export/1.2/',
        'content': 'http://purl.org/rss/1.0/modules/content/',
        'dc': 'http://purl.org/dc/elements/1.1/',
    }

    # Parse WordPress XML
    tree = ET.parse(wordpress_file)
    root = tree.getroot()

    # Create Atom feed
    atom = ET.Element('feed', {
        'xmlns': 'http://www.w3.org/2005/Atom',
        'xmlns:app': 'http://www.w3.org/2007/app',
        'xmlns:thr': 'http://purl.org/syndication/thread/1.0'
    })

    # Add feed metadata
    title = ET.SubElement(atom, 'title')
    title.text = 'Blog Posts'

    updated = ET.SubElement(atom, 'updated')
    # Atom timestamps need a UTC offset, so use a timezone-aware datetime
    updated.text = datetime.now().astimezone().isoformat()

    # Process each post
    for item in root.findall('.//item'):
        # findtext() returns the default instead of failing when the element is missing
        if item.findtext('wp:post_type', default='', namespaces=ns) != 'post':
            continue

        entry = ET.SubElement(atom, 'entry')

        # Title
        title = ET.SubElement(entry, 'title')
        title.text = item.findtext('title', default='')

        # Content
        content = ET.SubElement(entry, 'content', {'type': 'html'})
        content.text = item.findtext('content:encoded', default='', namespaces=ns)

        # Publication date
        pub_date = item.find('pubDate').text
        published = ET.SubElement(entry, 'published')
        published.text = datetime.strptime(pub_date, '%a, %d %b %Y %H:%M:%S %z').isoformat()

        # Author
        author = ET.SubElement(entry, 'author')
        name = ET.SubElement(author, 'name')
        name.text = item.findtext('dc:creator', default='', namespaces=ns)

        # Categories (skip empty ones, which cannot be serialized as an attribute)
        for category in item.findall('category'):
            if category.text:
                ET.SubElement(entry, 'category', {'term': category.text})

        # Status: anything other than 'publish' is exported as a draft
        status = item.findtext('wp:status', default='', namespaces=ns)
        app_control = ET.SubElement(entry, 'app:control', {'xmlns:app': 'http://www.w3.org/2007/app'})
        app_draft = ET.SubElement(app_control, 'app:draft')
        app_draft.text = 'no' if status == 'publish' else 'yes'

    # Write the output file
    tree = ET.ElementTree(atom)
    tree.write(output_file, encoding='utf-8', xml_declaration=True)

def main():
    parser = argparse.ArgumentParser(description='Convert WordPress export to Blogger/Atom XML format')
    parser.add_argument('wordpress_file', help='Path to WordPress export XML file')
    parser.add_argument('output_file', help='Path to output Blogger/Atom XML file')

    args = parser.parse_args()

    try:
        convert_wordpress_to_blogger(args.wordpress_file, args.output_file)
        print(f"Successfully converted {args.wordpress_file} to {args.output_file}")
    except Exception as e:
        print(f"Error: {str(e)}")
        sys.exit(1)

if __name__ == '__main__':
    main()
[/python]

Posted in en-US | Tags: Atom, blogger, epub, Python, Wordpress | No Comments »

Advanced Encoding in Java, Kotlin, Node.js, and Python

Author: Jonathan Lalou

Encoding is essential for handling text, binary data, and secure transmission across applications. Understanding advanced encoding techniques can help prevent data corruption and ensure smooth interoperability across systems. This post explores key encoding challenges and how Java/Kotlin, Node.js, and Python tackle them.

1️⃣ Handling Special Unicode Characters (Emoji, Accents, RTL Text)

Java/Kotlin

Java uses UTF-16 internally, but for external data (JSON, databases, APIs), explicit encoding is required:

String text = "🔧 Café مرحبا";
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(decoded); // 🔧 Café مرحبا

✅ Tip: Always specify StandardCharsets.UTF_8 to avoid platform-dependent defaults.

Node.js

const text = "🔧 Café مرحبا";
const utf8Buffer = Buffer.from(text, 'utf8');
const decoded = utf8Buffer.toString('utf8');
console.log(decoded); // 🔧 Café مرحبا

✅ Tip: Using an incorrect encoding (e.g., latin1) may corrupt characters.

Python

text = "🔧 Café مرحبا"
utf8_bytes = text.encode("utf-8")
decoded = utf8_bytes.decode("utf-8")
print(decoded)  # 🔧 Café مرحبا

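✅ Tip: Python 3 strings are Unicode by default, but pass the encoding explicitly whenever text crosses a bytes boundary (files, sockets, subprocesses), because the default used by open() can depend on the platform locale.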
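
One cross-language pitfall worth seeing in numbers: the same string has a different "length" depending on what is being counted. The sketch below uses a shortened version of the sample text, written with escapes so the code points are unambiguous.

```python
# Wrench emoji (U+1F527), a space, then "Café" with a precomposed é (U+00E9)
text = "\U0001F527 Caf\u00e9"

code_points = len(text)                           # Python's len() counts code points
utf8_bytes = len(text.encode("utf-8"))            # size on the wire in UTF-8
utf16_units = len(text.encode("utf-16-le")) // 2  # UTF-16 code units, as counted by Java's String.length() and JavaScript's .length

print(code_points, utf8_bytes, utf16_units)  # 6 10 7
```

The emoji is one code point, four UTF-8 bytes and two UTF-16 code units, which is why truncating or indexing by length can split a character when data moves between runtimes.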

2️⃣ Encoding Binary Data for Transmission (Base64, Hex, Binary Files)

Java/Kotlin

byte[] data = "Hello World".getBytes(StandardCharsets.UTF_8);
String base64Encoded = Base64.getEncoder().encodeToString(data);
byte[] decoded = Base64.getDecoder().decode(base64Encoded);
System.out.println(new String(decoded, StandardCharsets.UTF_8)); // Hello World

Node.js

const data = Buffer.from("Hello World", 'utf8');
const base64Encoded = data.toString('base64');
const decoded = Buffer.from(base64Encoded, 'base64').toString('utf8');
console.log(decoded); // Hello World

Python

import base64
data = "Hello World".encode("utf-8")
base64_encoded = base64.b64encode(data).decode("utf-8")
decoded = base64.b64decode(base64_encoded).decode("utf-8")
print(decoded)  # Hello World

✅ Tip: Base64 encoding increases data size (~33% overhead), which can be a concern for large files.
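
To make the overhead concrete, here is a short sketch comparing Base64 with hexadecimal encoding on an arbitrary 768-byte payload (the payload itself is just illustrative data).

```python
import base64

payload = bytes(range(256)) * 3  # 768 arbitrary bytes

b64 = base64.b64encode(payload)  # 4 output characters per 3 input bytes
hexed = payload.hex()            # 2 output characters per input byte

print(len(payload), len(b64), len(hexed))  # 768 1024 1536

# Both encodings are lossless
assert base64.b64decode(b64) == payload
assert bytes.fromhex(hexed) == payload
```

Base64 grows the data by one third, while hex doubles it, so hex is usually reserved for short values such as hashes and identifiers.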

3️⃣ Charset Mismatches and Cross-Language Encoding Issues

A file encoded in ISO-8859-1 (Latin-1) may cause garbled text when read using UTF-8.

Java/Kotlin Solution:

byte[] bytes = Files.readAllBytes(Paths.get("file.txt"));
String text = new String(bytes, StandardCharsets.ISO_8859_1);

Node.js Solution:

const fs = require('fs');
const text = fs.readFileSync("file.txt", { encoding: "latin1" });

Python Solution:

with open("file.txt", "r", encoding="ISO-8859-1") as f:
    text = f.read()

✅ Tip: Always specify encoding explicitly when working with external files.
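
The two failure modes of a charset mismatch look quite different, as this small sketch shows: one direction corrupts silently, the other raises an error.

```python
original = "Caf\u00e9"  # "Café"

utf8_bytes = original.encode("utf-8")      # b'Caf\xc3\xa9'
latin1_bytes = original.encode("latin-1")  # b'Caf\xe9'

# UTF-8 bytes read as Latin-1: no error, but the text is silently corrupted
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # CafÃ©

# Latin-1 bytes read as UTF-8: fails loudly
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# This kind of corruption can be reversed as long as no bytes were lost
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # Café
```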

4️⃣ URL Encoding and Decoding

Java/Kotlin

String encoded = URLEncoder.encode("Hello World!", StandardCharsets.UTF_8);
String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);

Node.js

const encoded = encodeURIComponent("Hello World!");
const decoded = decodeURIComponent(encoded);

Python

from urllib.parse import quote, unquote
encoded = quote("Hello World!")
decoded = unquote(encoded)

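✅ Tip: Use UTF-8 for URL encoding to prevent inconsistencies across platforms, and remember that the three calls above are not byte-for-byte equivalent: Java's URLEncoder applies form encoding and turns the space into +, while encodeURIComponent and quote emit %20 (and encodeURIComponent leaves ! unescaped).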
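
A short sketch of how the choice of function and charset changes the encoded output in Python:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "Hello World!"

print(quote(text))       # Hello%20World%21  (path style: space becomes %20)
print(quote_plus(text))  # Hello+World%21    (form style: space becomes +)

# Non-ASCII characters are percent-encoded from their UTF-8 bytes by default
print(quote("Caf\u00e9"))                      # Caf%C3%A9
print(quote("Caf\u00e9", encoding="latin-1"))  # Caf%E9

# Decode with the function that matches the encoder
assert unquote(quote(text)) == text
assert unquote_plus(quote_plus(text)) == text
```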

Conclusion: Choosing the Right Approach

Java/Kotlin: Strong type safety, but requires careful Charset management.
Node.js: Web-friendly but depends heavily on Buffer conversions.
Python: Simple and concise, though the strict separation between str and bytes means every conversion must be made explicitly.

📌 Pro Tip: Always be explicit about encoding when handling external data (APIs, files, databases) to avoid corruption.

Posted in en-US | Tags: encoding, Java, Kotlin, NodeJS, Python | No Comments »

Java’s Emerging Role in AI and Machine Learning: Bridging the Gap to Production

Author: Jonathan Lalou

While Python dominates in model training, Java is becoming increasingly vital for deploying and serving AI/ML models in production. Its performance, stability, and enterprise integration capabilities make it a strong contender.

Java Example: Real-time Object Detection with DL4J and OpenCV

[java]
import …

public class ObjectDetection {

    public static void main(String[] args) {
        String modelPath = "yolov3.weights";
        String configPath = "yolov3.cfg";
        String imagePath = "image.jpg";
        Net net = Dnn.readNet(modelPath, configPath);
        Mat image = imread(imagePath);
        Mat blob = Dnn.blobFromImage(image, 1 / 255.0, new Size(416, 416), new Scalar(0, 0, 0), true, false);

        net.setInput(blob);

        // Inference: the no-argument forward() returns a single Mat
        Mat detections = net.forward();

        // Process detections (bounding boxes, classes, confidence)
        // … (complex logic for object detection results)
        // Draw bounding boxes on the image
        // … (OpenCV drawing functions)
        imwrite("detected_objects.jpg", image);
    }
}

[/java]

Python Example: Similar Object Detection with OpenCV and YOLO

[python]

import cv2
import numpy as np

net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
image = cv2.imread("image.jpg")
blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
# YOLOv3 has several output layers; request all of them by name
detections = net.forward(net.getUnconnectedOutLayersNames())

# Process detections (bounding boxes, classes, confidence)
# … (simpler logic, NumPy arrays)
# Draw bounding boxes on the image
# … (OpenCV drawing functions)
cv2.imwrite("detected_objects.jpg", image)
[/python]
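
The "process detections" step left out of both examples can be sketched in plain Python. This is a hedged illustration, not the code from the post: it assumes each detection row follows the layout commonly described for Darknet YOLO outputs (`[cx, cy, w, h, objectness, class scores...]`, with coordinates normalized to the image size), uses made-up rows with three classes, and omits non-maximum suppression. The helper name `decode_detections` is hypothetical.

```python
def decode_detections(rows, image_width, image_height, conf_threshold=0.5):
    """Turn YOLO-style rows into (class_id, confidence, (x, y, w, h)) tuples in pixels."""
    results = []
    for row in rows:
        cx, cy, w, h = row[:4]
        class_scores = row[5:]  # index 4 (objectness) is not used in this sketch
        class_id = max(range(len(class_scores)), key=class_scores.__getitem__)
        confidence = class_scores[class_id]
        if confidence < conf_threshold:
            continue
        # Convert normalized centre/size to a pixel box with a top-left origin
        box_w = int(w * image_width)
        box_h = int(h * image_height)
        x = int(cx * image_width - box_w / 2)
        y = int(cy * image_height - box_h / 2)
        results.append((class_id, confidence, (x, y, box_w, box_h)))
    return results

# Hypothetical rows with three classes instead of the usual 80
rows = [
    [0.5, 0.5, 0.2, 0.4, 0.9, 0.1, 0.8, 0.1],    # confident detection of class 1
    [0.1, 0.1, 0.05, 0.05, 0.3, 0.3, 0.2, 0.2],  # weak detection, filtered out
]
detections = decode_detections(rows, image_width=640, image_height=480)
print(detections)
```

With real model output, each array returned by the network would be iterated row by row in the same way before drawing the boxes.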

Comparison and Insights:

Syntax and Readability: Python’s syntax is generally more concise and readable for data science and AI tasks. Java, while more verbose, offers strong typing and better performance for production deployments.
Library Ecosystem: Python’s ecosystem (NumPy, OpenCV, TensorFlow, PyTorch) is more mature and developer-friendly for AI/ML development. Java, with libraries like DL4J, is catching up, but its strength lies in enterprise integration and performance.
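Performance: Java’s performance is often superior to Python’s for real-time inference and high-throughput serving, mainly in the pre-processing, post-processing, and concurrency around the model; in both examples above, the inference itself runs in OpenCV’s native code.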
Enterprise Integration: Java’s ability to seamlessly integrate with existing enterprise systems (databases, message queues, APIs) is a significant advantage.
Deployment: Java’s deployment capabilities are more robust, making it suitable for mission-critical AI applications.

Key Takeaways:

Python is excellent for rapid prototyping and model training.
Java excels in deploying and serving AI/ML models in production environments, where performance and reliability are paramount.
The choice between Java and Python depends on the specific use case and requirements.

Posted in en-US, General | Tags: Java, Python | No Comments »

[PyData Global 2024] Making Gaussian Processes Useful

Author: Jonathan Lalou

Bill Engels and Chris Fonnesbeck, both brilliant software developers from PyMC Labs, delivered an insightful 90-minute tutorial at PyData Global 2024 titled “Making Gaussian Processes Useful.” Aimed at demystifying Gaussian processes (GPs) for practicing data scientists, their session bridged the gap between theoretical complexity and practical application. Using baseball analytics as a motivating example, Chris introduced Bayesian modeling and GPs, while Bill provided hands-on strategies for overcoming computational and identifiability challenges. This post explores their comprehensive approach, offering actionable insights for leveraging GPs in real-world scenarios.

Bayesian Inference and Probabilistic Programming

Chris kicked off the tutorial by grounding the audience in Bayesian inference, often implemented through probabilistic programming. He described it as writing software with partially random outputs, enabled by languages like PyMC that provide primitives for random variables. Unlike deterministic programming, probabilistic programming allows modeling distributions over variables, including functions via GPs. Chris explained that Bayesian inference involves specifying a joint probability model for data and parameters, using Bayes’ formula to derive the posterior distribution. This posterior reflects what we learn about unknown parameters after observing data, with the likelihood and priors as key components. The computational challenge lies in the normalizing constant, a multidimensional integral that probabilistic programming libraries handle numerically, freeing data scientists to focus on model specification.
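
For reference, the relationship described above can be written out as follows, with $\theta$ denoting the unknown parameters and $y$ the observed data (standard notation, not taken from the slides):

```latex
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},
\qquad
p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta
```

The numerator is the likelihood times the prior; the denominator $p(y)$ is the normalizing constant, an integral over every parameter in the model.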

Hierarchical Modeling with Baseball Data

To illustrate Bayesian modeling, Chris used the example of estimating home run probabilities for baseball players. He introduced a simple unpooled model where each player’s home run rate is modeled with a beta prior and a binomial likelihood, counting home runs out of plate appearances. Using PyMC, this model is straightforward to implement, with each line of code corresponding to a mathematical component. However, Chris highlighted its limitations: players with few at-bats yield highly uncertain estimates, leaning heavily on the flat prior. This led to the introduction of hierarchical modeling, or partial pooling, where individual home run rates are drawn from a population distribution with hyperparameters (mean and standard deviation). This approach shrinks extreme estimates, producing more realistic rates, as seen when comparing unpooled estimates (with outliers up to 80%) to pooled ones (clustered below 10%, aligning with real-world data like Barry Bonds’ 15% peak).
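
The shrinkage effect can be illustrated without any sampling, because a beta prior with a binomial likelihood has a closed-form posterior. The sketch below is not the PyMC model from the tutorial: the player counts are invented, and the fixed Beta(3, 97) prior merely stands in for a population distribution that a hierarchical model would learn from the data.

```python
def posterior_mean(home_runs, plate_appearances, alpha, beta):
    """Posterior mean of a rate under a Beta(alpha, beta) prior and a binomial likelihood."""
    return (alpha + home_runs) / (alpha + beta + plate_appearances)

# Hypothetical player: 4 home runs in only 5 plate appearances
hr, pa = 4, 5

flat = posterior_mean(hr, pa, alpha=1, beta=1)     # flat Beta(1, 1) prior
shrunk = posterior_mean(hr, pa, alpha=3, beta=97)  # population-like prior centred on 3%

print(round(flat, 3), round(shrunk, 3))  # 0.714 0.067
```

With so little data, the flat prior yields an implausible rate above 70%, while the informative prior pulls the estimate toward the population level.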

Gaussian Processes as a Hierarchical Extension

Chris transitioned to GPs, framing them as a generalization of hierarchical models for continuous predictors, such as player age affecting home run rates. Unlike categorical groups, GPs model relationships where similarity decreases with distance (e.g., players of similar age tend to perform more similarly). A GP is a distribution over functions, parameterized by a mean function (often zero) and a covariance function, which defines how outputs covary based on input proximity. Chris emphasized two key properties of multivariate Gaussians, easy marginalization and conditioning, that make GPs computationally tractable despite their infinite dimensionality. By evaluating a covariance function at specific inputs, a GP yields a finite multivariate normal, enabling flexible, nonlinear modeling without explicitly parameterizing the function’s form.
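
In standard notation (an addition for reference, not a transcription of the slides), the finite-dimensional view and the conditioning property read:

```latex
f \sim \mathcal{GP}\big(m(x),\, k(x, x')\big)
\;\Longrightarrow\;
\big(f(x_1), \dots, f(x_n)\big)^\top \sim \mathcal{N}(\mathbf{m}, \mathbf{K}),
\qquad
\mathbf{m}_i = m(x_i),\quad \mathbf{K}_{ij} = k(x_i, x_j)

\begin{pmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mathbf{m}_1 \\ \mathbf{m}_2 \end{pmatrix},
\begin{pmatrix} \mathbf{K}_{11} & \mathbf{K}_{12} \\ \mathbf{K}_{21} & \mathbf{K}_{22} \end{pmatrix}
\right)
\;\Longrightarrow\;
\mathbf{f}_2 \mid \mathbf{f}_1 \sim \mathcal{N}\!\big(
\mathbf{m}_2 + \mathbf{K}_{21}\mathbf{K}_{11}^{-1}(\mathbf{f}_1 - \mathbf{m}_1),\;
\mathbf{K}_{22} - \mathbf{K}_{21}\mathbf{K}_{11}^{-1}\mathbf{K}_{12}
\big)
```

Marginalization simply drops rows and columns of $\mathbf{m}$ and $\mathbf{K}$; conditioning requires the inverse of $\mathbf{K}_{11}$, which is where the computational cost arises.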

Computational Challenges and the HSGP Approximation

One of the biggest hurdles with GPs is their computational cost, particularly for latent GPs used with non-Gaussian data like binomial home run counts. Chris explained that the posterior covariance function requires inverting a matrix, which scales cubically with the number of data points (e.g., thousands of players). This makes exact GPs infeasible for large datasets. To address this, he introduced the Hilbert Space Gaussian Process (HSGP) approximation, which reduces cubic compute time to linear by approximating the GP with a finite set of basis functions. These functions depend on the data, while coefficients rely on hyperparameters like length scale and amplitude. Chris demonstrated implementing an HSGP in PyMC to model age effects, specifying 100 basis functions and a boundary three times the data range, resulting in a model that ran in minutes rather than years.
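
To give a feel for why a finite basis can stand in for the full covariance matrix, here is a pure-Python sketch of the one-dimensional Hilbert space approximation for the exponentiated quadratic kernel. It follows the formulas commonly given in the HSGP literature (sine eigenfunctions on `[-L, L]` weighted by the kernel's spectral density) and is not PyMC's implementation; all function names and the numeric settings are assumptions chosen for illustration.

```python
import math

def expquad(x1, x2, ell, sigma):
    """Exact exponentiated quadratic kernel."""
    return sigma**2 * math.exp(-((x1 - x2) ** 2) / (2 * ell**2))

def expquad_spectral_density(omega, ell, sigma):
    """Spectral density of the 1-D exponentiated quadratic kernel."""
    return sigma**2 * math.sqrt(2 * math.pi) * ell * math.exp(-0.5 * (ell * omega) ** 2)

def hsgp_kernel(x1, x2, ell, sigma, m, L):
    """Approximate k(x1, x2) using m sine basis functions on [-L, L]."""
    total = 0.0
    for j in range(1, m + 1):
        sqrt_lambda = j * math.pi / (2 * L)  # square root of the j-th eigenvalue
        # Basis functions depend only on the inputs...
        phi1 = math.sin(sqrt_lambda * (x1 + L)) / math.sqrt(L)
        phi2 = math.sin(sqrt_lambda * (x2 + L)) / math.sqrt(L)
        # ...while the weights depend only on the hyperparameters
        total += expquad_spectral_density(sqrt_lambda, ell, sigma) * phi1 * phi2
    return total

# Hypothetical settings: inputs scaled to [-1, 1], boundary L three times that half-range
exact = expquad(0.2, -0.3, ell=0.5, sigma=1.0)
approx = hsgp_kernel(0.2, -0.3, ell=0.5, sigma=1.0, m=30, L=3.0)
print(exact, approx)
```

In a model, the kernel is never assembled like this; the basis functions are evaluated once and the GP becomes a linear model in `m` coefficients, which is where the linear scaling comes from.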

Practical Debugging with GPs

Bill took over to provide practical tips for fitting GPs, emphasizing their sensitivity to priors and the need for debugging. He revisited the baseball example, modeling batting averages with a hierarchical model before introducing a GP to account for age effects. Bill showed that a standard hierarchical model treats players as exchangeable, pooling information equally across all players. A GP, however, allows local pooling, where players of similar ages inform each other more strongly. He introduced the exponentiated quadratic covariance function, which uses a length scale to define “closeness” in age and a scale parameter for effect size. Bill highlighted common pitfalls, such as small length scales reducing a GP to a standard hierarchical model or large length scales causing identifiability issues with intercepts, and provided solutions like informative priors (e.g., inverse gamma, log-normal) to constrain length scales to realistic ranges.
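
The role of the length scale is easy to see numerically. This sketch evaluates the exponentiated quadratic covariance between two players aged 25 and 26 for three length scales; the values and the parameter name `eta` for the scale are illustrative choices.

```python
import math

def expquad(age1, age2, length_scale, eta=1.0):
    """Exponentiated quadratic covariance between two ages."""
    return eta**2 * math.exp(-((age1 - age2) ** 2) / (2 * length_scale**2))

tiny = expquad(25, 26, length_scale=0.1)       # practically 0: ages are unrelated
moderate = expquad(25, 26, length_scale=3.0)   # about 0.946: neighbours share information
huge = expquad(25, 26, length_scale=100.0)     # practically 1: every age looks the same

print(tiny, moderate, huge)
```

With a tiny length scale each age is effectively its own group, as in a standard hierarchical model; with a huge one the GP is almost constant and becomes hard to distinguish from the intercept.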

Advanced GP Modeling for Slugging Percentage

Bill concluded with a sophisticated model for slugging percentage, a metric reflecting hitting power, using 10 years of baseball data. The model included player, park, and season effects, with an HSGP to capture age effects. He initially used an exponentiated quadratic covariance function but encountered sampling issues (divergences), a common problem with GPs. Bill fixed this by switching to a Matern 5/2 covariance function, which assumes less smoothness and better suits real-world data, and adopting a centered parameterization for stronger age effects. These changes reduced divergences to near zero, producing a reliable model. The resulting age curve peaked at 26, aligning with baseball wisdom, and showed a decline for older players, demonstrating the GP’s ability to capture nonlinear trends.
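
For comparison, here is a sketch of the Matern 5/2 covariance next to the exponentiated quadratic, both written as functions of the distance `r` between two inputs. The formulas are the standard textbook ones; the function and parameter names are illustrative.

```python
import math

def matern52(r, length_scale, eta=1.0):
    """Matern 5/2 covariance at distance r."""
    z = math.sqrt(5) * abs(r) / length_scale
    return eta**2 * (1 + z + z**2 / 3) * math.exp(-z)

def expquad(r, length_scale, eta=1.0):
    """Exponentiated quadratic covariance at distance r."""
    return eta**2 * math.exp(-(r**2) / (2 * length_scale**2))

for r in (0.0, 1.0, 3.0):
    print(r, matern52(r, 1.0), expquad(r, 1.0))
```

At a distance of three length scales the Matern 5/2 covariance is still noticeably larger than the exponentiated quadratic one: it decays more slowly and implies functions that are less smooth.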

Key Takeaways and Resources

Bill and Chris emphasized that GPs extend hierarchical models by enabling local pooling over continuous variables, but their computational and identifiability challenges require careful handling. Informative priors, appropriate covariance functions (e.g., Matern 5/2 over exponentiated quadratic), and approximations like HSGP are critical for practical use. They encouraged using PyMC for its high-level interface and the Nutpie sampler for efficiency, while noting alternatives like GPFlow for specialized needs. Their GitHub repository, linked below, includes slides and notebooks for further exploration, making this tutorial a valuable resource for data scientists aiming to apply GPs effectively.

Links:

Posted in en-US | Tags: PyData, Python | No Comments »

Predictive Modeling and the Illusion of Signal

Author: Jonathan Lalou

Introduction

Vincent Warmerdam delves into the illusions often encountered in predictive modeling, highlighting the cognitive traps and statistical misconceptions that lead to overconfidence in model performance.

The Seduction of Spurious Correlations

Models often perform well on training data by exploiting noise rather than genuine signal. Vincent emphasizes critical thinking and statistical rigor to avoid being misled by deceptively strong results.

Building Robust Models

Using robust cross-validation, considering domain knowledge, and testing against out-of-sample data are vital strategies to counteract the illusion of predictive prowess.
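
A small simulation makes the point concrete. It is an illustrative sketch, not an example from the talk: the target and all 200 features are pure noise, yet picking the best feature on a small training set produces an impressive-looking correlation that disappears on fresh data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n_train, n_test, n_features = 20, 500, 200

# Target and features are independent noise: there is no signal to find
y_train = [rng.gauss(0, 1) for _ in range(n_train)]
y_test = [rng.gauss(0, 1) for _ in range(n_test)]
features_train = [[rng.gauss(0, 1) for _ in range(n_train)] for _ in range(n_features)]
features_test = [[rng.gauss(0, 1) for _ in range(n_test)] for _ in range(n_features)]

# "Feature selection" on the training set picks the luckiest noise column
best = max(range(n_features), key=lambda j: abs(pearson(features_train[j], y_train)))
train_r = pearson(features_train[best], y_train)
test_r = pearson(features_test[best], y_test)

print(f"in-sample r = {train_r:.2f}, out-of-sample r = {test_r:.2f}")
```

The in-sample correlation comes entirely from searching many candidates on few observations, which is why selection steps belong inside the cross-validation loop.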

Conclusion

Data science is not just coding and modeling — it requires constant skepticism, critical evaluation, and humility. Vincent reminds us to stay vigilant against the comforting but dangerous mirage of false predictability.

Posted in en-US | Tags: PyData2024, Python | No Comments »

Building Intelligent Data Products at Scale

Author: Jonathan Lalou

Introduction

Thomas Vachon shares insights into scaling data-driven products, blending machine learning, engineering, and user-centric design to create impactful and intelligent applications.

Key Ingredients for Success

Building intelligent products requires aligning data pipelines, model training, deployment infrastructure, and feedback loops. Vachon stresses the importance of cross-functional collaboration between data scientists, software engineers, and product teams.

Real-World Lessons

From architectural best practices to team organization strategies, Vachon illustrates how to navigate the complexity of scaling data initiatives sustainably.

Conclusion

Intelligent data products demand not only technical excellence but also thoughtful design, scalability planning, and user empathy from day one.

Posted in en-US | Tags: PyData2024, Python | No Comments »