Jonathan Lalou's Blog

Posts Tagged ‘epub’

Creating EPUBs from Images: A Developer’s Guide to Digital Publishing

Ever needed to convert a collection of images into a professional EPUB file? Whether you’re working with comics, manga, or any image-based content, I’ve developed a Python script that makes this process seamless and customizable.

What is create_epub.py?

This Python script transforms a folder of images into a fully-featured EPUB file, complete with:

Proper EPUB 3.0 structure
Customizable metadata
Table of contents
Responsive image display
Cover image handling

Key Features

Smart Filename Generation: Automatically generates EPUB filenames based on metadata (e.g., “MyBook_01_1.epub”)
Comprehensive Metadata Support: Title, author, series, volume, edition, ISBN, and more
Image Optimization: Supports JPEG, PNG, and GIF formats with proper scaling
Responsive Design: CSS-based layout that works across devices
Detailed Logging: Progress tracking and debugging capabilities

Usage Example

python create_epub.py image_folder \
    --title "My Book" \
    --author "Author Name" \
    --volume 1 \
    --edition "First Edition" \
    --series "My Series" \
    --publisher "My Publisher" \
    --isbn "978-3-16-148410-0"

Technical Details

The script creates a proper EPUB 3.0 structure with:

META-INF/container.xml
OEBPS/content.opf (metadata)
OEBPS/toc.ncx (table of contents)
OEBPS/nav.xhtml (navigation)
OEBPS/style.css (responsive styling)
OEBPS/images/ (image storage)

Best Practices Implemented

Proper XML namespaces and validation
Responsive image handling
Comprehensive metadata support
Clean, maintainable code structure
Extensive error handling and logging

Getting Started

# Install dependencies
pip install -r requirements.txt

# Basic usage
python create_epub.py /path/to/images --title "My Book"

# With debug logging
python create_epub.py /path/to/images --title "My Book" --debug

The script is designed to be both powerful and user-friendly, making it accessible to developers while providing the flexibility needed for professional publishing workflows.

Whether you’re a developer looking to automate EPUB creation or a content creator seeking to streamline your publishing process, this tool provides a robust solution for converting images into EPUB files.

The script on GitHub or below: 👇👇👇
[python]
import os
import sys
import logging
import zipfile
import uuid
from datetime import datetime
import argparse
from PIL import Image
import xml.etree.ElementTree
from xml.dom import minidom

# @author Jonathan Lalou / https://github.com/JonathanLalou/

# Configure logging
logging.basicConfig(
level=logging.INFO,
format=’%(asctime)s – %(levelname)s – %(message)s’,
handlers=[
logging.StreamHandler(sys.stdout)
]
)
logger = logging.getLogger(__name__)

# Define the CSS content
CSS_CONTENT = ”’
body {
margin: 0;
padding: 0;
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
}
img {
max-width: 100%;
max-height: 100vh;
object-fit: contain;
}
”’

def create_container_xml():
"""Create the container.xml file."""
logger.debug("Creating container.xml")
container = xml.etree.ElementTree.Element(‘container’, {
‘version’: ‘1.0’,
‘xmlns’: ‘urn:oasis:names:tc:opendocument:xmlns:container’
})
rootfiles = xml.etree.ElementTree.SubElement(container, ‘rootfiles’)
xml.etree.ElementTree.SubElement(rootfiles, ‘rootfile’, {
‘full-path’: ‘OEBPS/content.opf’,
‘media-type’: ‘application/oebps-package+xml’
})
xml_content = prettify_xml(container)
logger.debug("container.xml content:\n" + xml_content)
return xml_content

def create_content_opf(metadata, spine_items, manifest_items):
"""Create the content.opf file."""
logger.debug("Creating content.opf")
logger.debug(f"Metadata: {metadata}")
logger.debug(f"Spine items: {spine_items}")
logger.debug(f"Manifest items: {manifest_items}")

package = xml.etree.ElementTree.Element(‘package’, {
‘xmlns’: ‘http://www.idpf.org/2007/opf’,
‘xmlns:dc’: ‘http://purl.org/dc/elements/1.1/’,
‘xmlns:dcterms’: ‘http://purl.org/dc/terms/’,
‘xmlns:opf’: ‘http://www.idpf.org/2007/opf’,
‘version’: ‘3.0’,
‘unique-identifier’: ‘bookid’
})

# Metadata
metadata_elem = xml.etree.ElementTree.SubElement(package, ‘metadata’)

# Required metadata
book_id = str(uuid.uuid4())
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:identifier’, {‘id’: ‘bookid’}).text = book_id
logger.debug(f"Generated book ID: {book_id}")

xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:title’).text = metadata.get(‘title’, ‘Untitled’)
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:language’).text = metadata.get(‘language’, ‘en’)
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:creator’).text = metadata.get(‘author’, ‘Unknown’)

# Add required dcterms:modified
current_time = datetime.now().strftime(‘%Y-%m-%dT%H:%M:%SZ’)
xml.etree.ElementTree.SubElement(metadata_elem, ‘meta’, {
‘property’: ‘dcterms:modified’
}).text = current_time

# Add cover metadata
xml.etree.ElementTree.SubElement(metadata_elem, ‘meta’, {
‘name’: ‘cover’,
‘content’: ‘cover-image’
})

# Add additional metadata
if metadata.get(‘publisher’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:publisher’).text = metadata[‘publisher’]

if metadata.get(‘description’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:description’).text = metadata[‘description’]

if metadata.get(‘rights’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:rights’).text = metadata[‘rights’]

if metadata.get(‘subject’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:subject’).text = metadata[‘subject’]

if metadata.get(‘isbn’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:identifier’, {
‘opf:scheme’: ‘ISBN’
}).text = metadata[‘isbn’]

# Series metadata
if metadata.get(‘series’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘meta’, {
‘property’: ‘belongs-to-collection’
}).text = metadata[‘series’]
xml.etree.ElementTree.SubElement(metadata_elem, ‘meta’, {
‘property’: ‘group-position’
}).text = metadata.get(‘volume’, ‘1’)

# Release date
if metadata.get(‘release_date’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘dc:date’).text = metadata[‘release_date’]

# Version and edition
if metadata.get(‘version’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘meta’, {
‘property’: ‘schema:version’
}).text = metadata[‘version’]

if metadata.get(‘edition’):
xml.etree.ElementTree.SubElement(metadata_elem, ‘meta’, {
‘property’: ‘schema:bookEdition’
}).text = metadata[‘edition’]

# Manifest
manifest = xml.etree.ElementTree.SubElement(package, ‘manifest’)
for item in manifest_items:
xml.etree.ElementTree.SubElement(manifest, ‘item’, item)

# Spine
spine = xml.etree.ElementTree.SubElement(package, ‘spine’)
for item in spine_items:
xml.etree.ElementTree.SubElement(spine, ‘itemref’, {‘idref’: item})

xml_content = prettify_xml(package)
logger.debug("content.opf content:\n" + xml_content)
return xml_content

def create_toc_ncx(metadata, nav_points):
"""Create the toc.ncx file."""
logger.debug("Creating toc.ncx")
logger.debug(f"Navigation points: {nav_points}")

ncx = xml.etree.ElementTree.Element(‘ncx’, {
‘xmlns’: ‘http://www.daisy.org/z3986/2005/ncx/’,
‘version’: ‘2005-1’
})

head = xml.etree.ElementTree.SubElement(ncx, ‘head’)
book_id = str(uuid.uuid4())
xml.etree.ElementTree.SubElement(head, ‘meta’, {‘name’: ‘dtb:uid’, ‘content’: book_id})
logger.debug(f"Generated NCX book ID: {book_id}")

xml.etree.ElementTree.SubElement(head, ‘meta’, {‘name’: ‘dtb:depth’, ‘content’: ‘1’})
xml.etree.ElementTree.SubElement(head, ‘meta’, {‘name’: ‘dtb:totalPageCount’, ‘content’: ‘0’})
xml.etree.ElementTree.SubElement(head, ‘meta’, {‘name’: ‘dtb:maxPageNumber’, ‘content’: ‘0’})

doc_title = xml.etree.ElementTree.SubElement(ncx, ‘docTitle’)
xml.etree.ElementTree.SubElement(doc_title, ‘text’).text = metadata.get(‘title’, ‘Untitled’)

nav_map = xml.etree.ElementTree.SubElement(ncx, ‘navMap’)
for i, (id, label, src) in enumerate(nav_points, 1):
nav_point = xml.etree.ElementTree.SubElement(nav_map, ‘navPoint’, {‘id’: id, ‘playOrder’: str(i)})
nav_label = xml.etree.ElementTree.SubElement(nav_point, ‘navLabel’)
xml.etree.ElementTree.SubElement(nav_label, ‘text’).text = label
xml.etree.ElementTree.SubElement(nav_point, ‘content’, {‘src’: src})

xml_content = prettify_xml(ncx)
logger.debug("toc.ncx content:\n" + xml_content)
return xml_content

def create_nav_xhtml(metadata, nav_points):
"""Create the nav.xhtml file."""
logger.debug("Creating nav.xhtml")

html = xml.etree.ElementTree.Element(‘html’, {
‘xmlns’: ‘http://www.w3.org/1999/xhtml’,
‘xmlns:epub’: ‘http://www.idpf.org/2007/ops’
})

head = xml.etree.ElementTree.SubElement(html, ‘head’)
xml.etree.ElementTree.SubElement(head, ‘title’).text = ‘Table of Contents’

body = xml.etree.ElementTree.SubElement(html, ‘body’)
nav = xml.etree.ElementTree.SubElement(body, ‘nav’, {‘epub:type’: ‘toc’})
ol = xml.etree.ElementTree.SubElement(nav, ‘ol’)

for _, label, src in nav_points:
li = xml.etree.ElementTree.SubElement(ol, ‘li’)
xml.etree.ElementTree.SubElement(li, ‘a’, {‘href’: src}).text = label

xml_content = prettify_xml(html)
logger.debug("nav.xhtml content:\n" + xml_content)
return xml_content

def create_page_xhtml(page_number, image_file):
"""Create an XHTML page for an image."""
logger.debug(f"Creating page {page_number} for image {image_file}")

html = xml.etree.ElementTree.Element(‘html’, {
‘xmlns’: ‘http://www.w3.org/1999/xhtml’,
‘xmlns:epub’: ‘http://www.idpf.org/2007/ops’
})

head = xml.etree.ElementTree.SubElement(html, ‘head’)
xml.etree.ElementTree.SubElement(head, ‘title’).text = f’Page {page_number}’
xml.etree.ElementTree.SubElement(head, ‘link’, {
‘rel’: ‘stylesheet’,
‘type’: ‘text/css’,
‘href’: ‘style.css’
})

body = xml.etree.ElementTree.SubElement(html, ‘body’)
xml.etree.ElementTree.SubElement(body, ‘img’, {
‘src’: f’images/{image_file}’,
‘alt’: f’Page {page_number}’
})

xml_content = prettify_xml(html)
logger.debug(f"Page {page_number} XHTML content:\n" + xml_content)
return xml_content

def prettify_xml(elem):
"""Convert XML element to pretty string."""
rough_string = xml.etree.ElementTree.tostring(elem, ‘utf-8’)
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")

def create_epub_from_images(image_folder, output_file, metadata):
logger.info(f"Starting EPUB creation from images in {image_folder}")
logger.info(f"Output file will be: {output_file}")
logger.info(f"Metadata: {metadata}")

# Get all image files
image_files = [f for f in os.listdir(image_folder)
if f.lower().endswith((‘.png’, ‘.jpg’, ‘.jpeg’, ‘.gif’, ‘.bmp’))]
image_files.sort()
logger.info(f"Found {len(image_files)} image files")
logger.debug(f"Image files: {image_files}")

if not image_files:
logger.error("No image files found in the specified folder")
sys.exit(1)

# Create ZIP file (EPUB)
logger.info("Creating EPUB file structure")
with zipfile.ZipFile(output_file, ‘w’, zipfile.ZIP_DEFLATED) as epub:
# Add mimetype (must be first, uncompressed)
logger.debug("Adding mimetype file (uncompressed)")
epub.writestr(‘mimetype’, ‘application/epub+zip’, zipfile.ZIP_STORED)

# Create META-INF directory
logger.debug("Adding container.xml")
epub.writestr(‘META-INF/container.xml’, create_container_xml())

# Create OEBPS directory structure
logger.debug("Creating OEBPS directory structure")
os.makedirs(‘temp/OEBPS/images’, exist_ok=True)
os.makedirs(‘temp/OEBPS/style’, exist_ok=True)

# Add CSS
logger.debug("Adding style.css")
epub.writestr(‘OEBPS/style.css’, CSS_CONTENT)

# Process images and create pages
logger.info("Processing images and creating pages")
manifest_items = [
{‘id’: ‘style’, ‘href’: ‘style.css’, ‘media-type’: ‘text/css’},
{‘id’: ‘nav’, ‘href’: ‘nav.xhtml’, ‘media-type’: ‘application/xhtml+xml’, ‘properties’: ‘nav’}
]
spine_items = []
nav_points = []

for i, image_file in enumerate(image_files, 1):
logger.debug(f"Processing image {i:03d}/{len(image_files):03d}: {image_file}")

# Copy image to temp directory
image_path = os.path.join(image_folder, image_file)
logger.debug(f"Reading image: {image_path}")
with open(image_path, ‘rb’) as f:
image_data = f.read()
logger.debug(f"Adding image to EPUB: OEBPS/images/{image_file}")
epub.writestr(f’OEBPS/images/{image_file}’, image_data)

# Add image to manifest
image_id = f’image_{i:03d}’
if i == 1:
image_id = ‘cover-image’ # Special ID for cover image
manifest_items.append({
‘id’: image_id,
‘href’: f’images/{image_file}’,
‘media-type’: ‘image/jpeg’ if image_file.lower().endswith((‘.jpg’, ‘.jpeg’)) else ‘image/png’
})

# Create page XHTML
page_id = f’page_{i:03d}’
logger.debug(f"Creating page XHTML: {page_id}.xhtml")
page_content = create_page_xhtml(i, image_file)
epub.writestr(f’OEBPS/{page_id}.xhtml’, page_content)

# Add to manifest and spine
manifest_items.append({
‘id’: page_id,
‘href’: f'{page_id}.xhtml’,
‘media-type’: ‘application/xhtml+xml’
})
spine_items.append(page_id)

# Add to navigation points
nav_points.append((
f’navpoint-{i:03d}’,
‘Cover’ if i == 1 else f’Page {i:03d}’,
f'{page_id}.xhtml’
))

# Create content.opf
logger.debug("Creating content.opf")
epub.writestr(‘OEBPS/content.opf’, create_content_opf(metadata, spine_items, manifest_items))

# Create toc.ncx
logger.debug("Creating toc.ncx")
epub.writestr(‘OEBPS/toc.ncx’, create_toc_ncx(metadata, nav_points))

# Create nav.xhtml
logger.debug("Creating nav.xhtml")
epub.writestr(‘OEBPS/nav.xhtml’, create_nav_xhtml(metadata, nav_points))

logger.info(f"Successfully created EPUB file: {output_file}")
logger.info("EPUB structure:")
logger.info(" mimetype")
logger.info(" META-INF/container.xml")
logger.info(" OEBPS/")
logger.info(" content.opf")
logger.info(" toc.ncx")
logger.info(" nav.xhtml")
logger.info(" style.css")
logger.info(" images/")
for i in range(1, len(image_files) + 1):
logger.info(f" page_{i:03d}.xhtml")

def generate_default_filename(metadata, image_folder):
"""Generate default EPUB filename based on metadata."""
# Get title from metadata or use folder name
title = metadata.get(‘title’)
if not title:
# Get folder name and extract part before last underscore
folder_name = os.path.basename(os.path.normpath(image_folder))
title = folder_name.rsplit(‘_’, 1)[0] if ‘_’ in folder_name else folder_name

# Format title: remove spaces, hyphens, quotes and capitalize
title = ”.join(word.capitalize() for word in title.replace(‘-‘, ‘ ‘).replace(‘"’, ”).replace("’", ”).split())

# Format volume number with 2 digits
volume = metadata.get(‘volume’, ’01’)
if volume.isdigit():
volume = f"{int(volume):02d}"

# Get edition number
edition = metadata.get(‘edition’, ‘1’)

return f"{title}_{volume}_{edition}.epub"

def main():
parser = argparse.ArgumentParser(description=’Create an EPUB from a folder of images’)
parser.add_argument(‘image_folder’, help=’Folder containing the images’)
parser.add_argument(‘–output-file’, ‘-o’, help=’Output EPUB file path (optional)’)
parser.add_argument(‘–title’, help=’Book title’)
parser.add_argument(‘–author’, help=’Book author’)
parser.add_argument(‘–series’, help=’Series name’)
parser.add_argument(‘–volume’, help=’Volume number’)
parser.add_argument(‘–release-date’, help=’Release date (YYYY-MM-DD)’)
parser.add_argument(‘–edition’, help=’Edition number’)
parser.add_argument(‘–version’, help=’Version number’)
parser.add_argument(‘–language’, help=’Book language (default: en)’)
parser.add_argument(‘–publisher’, help=’Publisher name’)
parser.add_argument(‘–description’, help=’Book description’)
parser.add_argument(‘–rights’, help=’Copyright/license information’)
parser.add_argument(‘–subject’, help=’Book subject/category’)
parser.add_argument(‘–isbn’, help=’ISBN number’)
parser.add_argument(‘–debug’, action=’store_true’, help=’Enable debug logging’)

args = parser.parse_args()

if args.debug:
logger.setLevel(logging.DEBUG)
logger.info("Debug logging enabled")

if not os.path.exists(args.image_folder):
logger.error(f"Image folder does not exist: {args.image_folder}")
sys.exit(1)

if not os.path.isdir(args.image_folder):
logger.error(f"Specified path is not a directory: {args.image_folder}")
sys.exit(1)

metadata = {
‘title’: args.title,
‘author’: args.author,
‘series’: args.series,
‘volume’: args.volume,
‘release_date’: args.release_date,
‘edition’: args.edition,
‘version’: args.version,
‘language’: args.language,
‘publisher’: args.publisher,
‘description’: args.description,
‘rights’: args.rights,
‘subject’: args.subject,
‘isbn’: args.isbn
}

# Remove None values from metadata
metadata = {k: v for k, v in metadata.items() if v is not None}

# Generate output filename if not provided
if not args.output_file:
args.output_file = generate_default_filename(metadata, args.image_folder)
logger.info(f"Using default output filename: {args.output_file}")

try:
create_epub_from_images(args.image_folder, args.output_file, metadata)
logger.info("EPUB creation completed successfully")
except Exception as e:
logger.error(f"EPUB creation failed: {str(e)}")
sys.exit(1)

if __name__ == ‘__main__’:
main()

[/python]

Posted in en-US | Tags: epub, Python | No Comments »

RSS to EPUB Converter: Create eBooks from RSS Feeds

Author: Jonathan Lalou

Overview

This Python script (rss_to_ebook.py) converts RSS or Atom feeds into EPUB format eBooks, allowing you to read your favorite blog posts and news articles offline in your preferred e-reader. The script intelligently handles both RSS 2.0 and Atom feed formats, preserving HTML formatting while creating a clean, readable eBook.

Key Features

Dual Format Support: Works with both RSS 2.0 and Atom feeds
Smart Pagination: Automatically handles paginated feeds using multiple detection methods
Date Range Filtering: Select specific date ranges for content inclusion
Metadata Preservation: Maintains feed metadata including title, author, and description
HTML Formatting: Preserves original HTML formatting while cleaning unnecessary elements
Duplicate Prevention: Automatically detects and removes duplicate entries
Comprehensive Logging: Detailed progress tracking and error reporting

Technical Details

The script uses several Python libraries:

feedparser: For parsing RSS and Atom feeds
ebooklib: For creating EPUB files
BeautifulSoup: For HTML cleaning and processing
logging: For detailed operation tracking

Usage

python rss_to_ebook.py <feed_url> [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] [--output filename.epub] [--debug]

Parameters:

feed_url: URL of the RSS or Atom feed (required)
--start-date: Start date for content inclusion (default: 1 year ago)
--end-date: End date for content inclusion (default: today)
--output: Output EPUB filename (default: rss_feed.epub)
--debug: Enable detailed logging

Example

python rss_to_ebook.py https://example.com/feed --start-date 2024-01-01 --end-date 2024-03-31 --output my_blog.epub

Requirements

Python 3.x

Required packages (install via pip):

pip install feedparser ebooklib beautifulsoup4

How It Works

Feed Detection: Automatically identifies feed format (RSS 2.0 or Atom)
Content Processing:
- Extracts entries within specified date range
- Preserves HTML formatting while cleaning unnecessary elements
- Handles pagination to get all available content
EPUB Creation:
- Creates chapters from feed entries
- Maintains original formatting and links
- Includes table of contents and navigation
- Preserves feed metadata

Error Handling

Validates feed format and content
Handles malformed HTML
Provides detailed error messages and logging
Gracefully handles missing or incomplete feed data

Use Cases

Create eBooks from your favorite blogs
Archive important news articles
Generate reading material for offline use
Create compilations of related content

Gist: GitHub

Here is the script:

[python]
#!/usr/bin/env python3

import feedparser
import argparse
from datetime import datetime, timedelta
from ebooklib import epub
import re
from bs4 import BeautifulSoup
import logging

# Configure logging
logging.basicConfig(
level=logging.INFO,
format=’%(asctime)s – %(levelname)s – %(message)s’,
datefmt=’%Y-%m-%d %H:%M:%S’
)

def clean_html(html_content):
"""Clean HTML content while preserving formatting."""
soup = BeautifulSoup(html_content, ‘html.parser’)

# Remove script and style elements
for script in soup(["script", "style"]):
script.decompose()

# Remove any inline styles
for tag in soup.find_all(True):
if ‘style’ in tag.attrs:
del tag.attrs[‘style’]

# Return the cleaned HTML
return str(soup)

def get_next_feed_page(current_feed, feed_url):
"""Get the next page of the feed using various pagination methods."""
# Method 1: next_page link in feed
if hasattr(current_feed, ‘next_page’):
logging.info(f"Found next_page link: {current_feed.next_page}")
return current_feed.next_page

# Method 2: Atom-style pagination
if hasattr(current_feed.feed, ‘links’):
for link in current_feed.feed.links:
if link.get(‘rel’) == ‘next’:
logging.info(f"Found Atom-style next link: {link.href}")
return link.href

# Method 3: RSS 2.0 pagination (using lastBuildDate)
if hasattr(current_feed.feed, ‘lastBuildDate’):
last_date = current_feed.feed.lastBuildDate
if hasattr(current_feed.entries, ‘last’):
last_entry = current_feed.entries[-1]
if hasattr(last_entry, ‘published_parsed’):
last_entry_date = datetime(*last_entry.published_parsed[:6])
# Try to construct next page URL with date parameter
if ‘?’ in feed_url:
next_url = f"{feed_url}&before={last_entry_date.strftime(‘%Y-%m-%d’)}"
else:
next_url = f"{feed_url}?before={last_entry_date.strftime(‘%Y-%m-%d’)}"
logging.info(f"Constructed date-based next URL: {next_url}")
return next_url

# Method 4: Check for pagination in feed description
if hasattr(current_feed.feed, ‘description’):
desc = current_feed.feed.description
# Look for common pagination patterns in description
next_page_patterns = [
r’next page: (https?://\S+)’,
r’older posts: (https?://\S+)’,
r’page \d+: (https?://\S+)’
]
for pattern in next_page_patterns:
match = re.search(pattern, desc, re.IGNORECASE)
if match:
next_url = match.group(1)
logging.info(f"Found next page URL in description: {next_url}")
return next_url

return None

def get_feed_type(feed):
"""Determine if the feed is RSS 2.0 or Atom format."""
if hasattr(feed, ‘version’) and feed.version.startswith(‘rss’):
return ‘rss’
elif hasattr(feed, ‘version’) and feed.version == ‘atom10’:
return ‘atom’
# Try to detect by checking for Atom-specific elements
elif hasattr(feed.feed, ‘links’) and any(link.get(‘rel’) == ‘self’ for link in feed.feed.links):
return ‘atom’
# Default to RSS if no clear indicators
return ‘rss’

def get_entry_content(entry, feed_type):
"""Get the content of an entry based on feed type."""
if feed_type == ‘atom’:
# Atom format
if hasattr(entry, ‘content’):
return entry.content[0].value if entry.content else ”
elif hasattr(entry, ‘summary’):
return entry.summary
else:
# RSS 2.0 format
if hasattr(entry, ‘content’):
return entry.content[0].value if entry.content else ”
elif hasattr(entry, ‘description’):
return entry.description
return ”

def get_entry_date(entry, feed_type):
"""Get the publication date of an entry based on feed type."""
if feed_type == ‘atom’:
# Atom format uses updated or published
if hasattr(entry, ‘published_parsed’):
return datetime(*entry.published_parsed[:6])
elif hasattr(entry, ‘updated_parsed’):
return datetime(*entry.updated_parsed[:6])
else:
# RSS 2.0 format uses pubDate
if hasattr(entry, ‘published_parsed’):
return datetime(*entry.published_parsed[:6])
return datetime.now()

def get_feed_metadata(feed, feed_type):
"""Extract metadata from feed based on its type."""
metadata = {
‘title’: ”,
‘description’: ”,
‘language’: ‘en’,
‘author’: ‘Unknown’,
‘publisher’: ”,
‘rights’: ”,
‘updated’: ”
}

if feed_type == ‘atom’:
# Atom format metadata
metadata[‘title’] = feed.feed.get(‘title’, ”)
metadata[‘description’] = feed.feed.get(‘subtitle’, ”)
metadata[‘language’] = feed.feed.get(‘language’, ‘en’)
metadata[‘author’] = feed.feed.get(‘author’, ‘Unknown’)
metadata[‘rights’] = feed.feed.get(‘rights’, ”)
metadata[‘updated’] = feed.feed.get(‘updated’, ”)
else:
# RSS 2.0 format metadata
metadata[‘title’] = feed.feed.get(‘title’, ”)
metadata[‘description’] = feed.feed.get(‘description’, ”)
metadata[‘language’] = feed.feed.get(‘language’, ‘en’)
metadata[‘author’] = feed.feed.get(‘author’, ‘Unknown’)
metadata[‘copyright’] = feed.feed.get(‘copyright’, ”)
metadata[‘lastBuildDate’] = feed.feed.get(‘lastBuildDate’, ”)

return metadata

def create_ebook(feed_url, start_date, end_date, output_file):
"""Create an ebook from RSS feed entries within the specified date range."""
logging.info(f"Starting ebook creation from feed: {feed_url}")
logging.info(f"Date range: {start_date.strftime(‘%Y-%m-%d’)} to {end_date.strftime(‘%Y-%m-%d’)}")

# Parse the RSS feed
feed = feedparser.parse(feed_url)

if feed.bozo:
logging.error(f"Error parsing feed: {feed.bozo_exception}")
return False

# Determine feed type
feed_type = get_feed_type(feed)
logging.info(f"Detected feed type: {feed_type}")

logging.info(f"Successfully parsed feed: {feed.feed.get(‘title’, ‘Unknown Feed’)}")

# Create a new EPUB book
book = epub.EpubBook()

# Extract metadata based on feed type
metadata = get_feed_metadata(feed, feed_type)

logging.info(f"Setting metadata for ebook: {metadata[‘title’]}")

# Set basic metadata
book.set_identifier(feed_url) # Use feed URL as unique identifier
book.set_title(metadata[‘title’])
book.set_language(metadata[‘language’])
book.add_author(metadata[‘author’])

# Add additional metadata if available
if metadata[‘description’]:
book.add_metadata(‘DC’, ‘description’, metadata[‘description’])
if metadata[‘publisher’]:
book.add_metadata(‘DC’, ‘publisher’, metadata[‘publisher’])
if metadata[‘rights’]:
book.add_metadata(‘DC’, ‘rights’, metadata[‘rights’])
if metadata[‘updated’]:
book.add_metadata(‘DC’, ‘date’, metadata[‘updated’])

# Add date range to description
date_range_desc = f"Content from {start_date.strftime(‘%Y-%m-%d’)} to {end_date.strftime(‘%Y-%m-%d’)}"
book.add_metadata(‘DC’, ‘description’, f"{metadata[‘description’]}\n\n{date_range_desc}")

# Create table of contents
chapters = []
toc = []

# Process entries within date range
entries_processed = 0
entries_in_range = 0
consecutive_out_of_range = 0
current_page = 1
processed_urls = set() # Track processed URLs to avoid duplicates

logging.info("Starting to process feed entries…")

while True:
logging.info(f"Processing page {current_page} with {len(feed.entries)} entries")

# Process current batch of entries
for entry in feed.entries[entries_processed:]:
entries_processed += 1

# Skip if we’ve already processed this entry
entry_id = entry.get(‘id’, entry.get(‘link’, ”))
if entry_id in processed_urls:
logging.debug(f"Skipping duplicate entry: {entry_id}")
continue
processed_urls.add(entry_id)

# Get entry date based on feed type
entry_date = get_entry_date(entry, feed_type)

if entry_date < start_date:
consecutive_out_of_range += 1
logging.debug(f"Skipping entry from {entry_date.strftime(‘%Y-%m-%d’)} (before start date)")
continue
elif entry_date > end_date:
consecutive_out_of_range += 1
logging.debug(f"Skipping entry from {entry_date.strftime(‘%Y-%m-%d’)} (after end date)")
continue
else:
consecutive_out_of_range = 0
entries_in_range += 1

# Create chapter
title = entry.get(‘title’, ‘Untitled’)
logging.info(f"Adding chapter: {title} ({entry_date.strftime(‘%Y-%m-%d’)})")

# Get content based on feed type
content = get_entry_content(entry, feed_type)

# Clean the content
cleaned_content = clean_html(content)

# Create chapter
chapter = epub.EpubHtml(
title=title,
file_name=f’chapter_{len(chapters)}.xhtml’,
content=f'<h1>{title}</h1>{cleaned_content}’
)

# Add chapter to book
book.add_item(chapter)
chapters.append(chapter)
toc.append(epub.Link(chapter.file_name, title, chapter.id))

# If we have no entries in range or we’ve seen too many consecutive out-of-range entries, stop
if entries_in_range == 0 or consecutive_out_of_range >= 10:
if entries_in_range == 0:
logging.warning("No entries found within the specified date range")
else:
logging.info(f"Stopping after {consecutive_out_of_range} consecutive out-of-range entries")
break

# Try to get more entries if available
next_page_url = get_next_feed_page(feed, feed_url)
if next_page_url:
current_page += 1
logging.info(f"Fetching next page: {next_page_url}")
feed = feedparser.parse(next_page_url)
if not feed.entries:
logging.info("No more entries available")
break
else:
logging.info("No more pages available")
break

if entries_in_range == 0:
logging.error("No entries found within the specified date range")
return False

logging.info(f"Processed {entries_processed} total entries, {entries_in_range} within date range")

# Add table of contents
book.toc = toc

# Add navigation files
book.add_item(epub.EpubNcx())
book.add_item(epub.EpubNav())

# Define CSS style
style = ”’
@namespace epub "http://www.idpf.org/2007/ops";
body {
font-family: Cambria, Liberation Serif, serif;
}
h1 {
text-align: left;
text-transform: uppercase;
font-weight: 200;
}
”’

# Add CSS file
nav_css = epub.EpubItem(
uid="style_nav",
file_name="style/nav.css",
media_type="text/css",
content=style
)
book.add_item(nav_css)

# Create spine
book.spine = [‘nav’] + chapters

# Write the EPUB file
logging.info(f"Writing EPUB file: {output_file}")
epub.write_epub(output_file, book, {})
logging.info("EPUB file created successfully")
return True

def main():
parser = argparse.ArgumentParser(description=’Convert RSS feed to EPUB ebook’)
parser.add_argument(‘feed_url’, help=’URL of the RSS feed’)
parser.add_argument(‘–start-date’, help=’Start date (YYYY-MM-DD)’,
default=(datetime.now() – timedelta(days=365)).strftime(‘%Y-%m-%d’))
parser.add_argument(‘–end-date’, help=’End date (YYYY-MM-DD)’,
default=datetime.now().strftime(‘%Y-%m-%d’))
parser.add_argument(‘–output’, help=’Output EPUB file name’,
default=’rss_feed.epub’)
parser.add_argument(‘–debug’, action=’store_true’, help=’Enable debug logging’)

args = parser.parse_args()

if args.debug:
logging.getLogger().setLevel(logging.DEBUG)

# Parse dates
start_date = datetime.strptime(args.start_date, ‘%Y-%m-%d’)
end_date = datetime.strptime(args.end_date, ‘%Y-%m-%d’)

# Create ebook
if create_ebook(args.feed_url, start_date, end_date, args.output):
logging.info(f"Successfully created ebook: {args.output}")
else:
logging.error("Failed to create ebook")

if __name__ == ‘__main__’:
main()

[/python]

Posted in en-US | Tags: Atom, epub, Python, RSS | No Comments »

Quick and dirty script to convert WordPress export file to Blogger / Atom XML

Author: Jonathan Lalou

I’ve created a Python script that converts WordPress export files to Blogger/Atom XML format. Here’s how to use it:

The script takes two command-line arguments:

wordpress_export.xml: Path to your WordPress export XML file
blogger_export.xml : Path where you want to save the converted Blogger/Atom XML file

To run the script:

python wordpress_to_blogger.py wordpress_export.xml blogger_export.xml

The script performs the following conversions:

Converts WordPress posts to Atom feed entries
Preserves post titles, content, publication dates, and authors
Maintains categories as Atom categories
Handles post status (published/draft)
Preserves HTML content formatting
Converts dates to ISO format required by Atom

The script uses Python’s built-in xml.etree.ElementTree module for XML processing and includes error handling to make it robust.
Some important notes:

The script only converts posts (not pages or other content types)
It preserves the HTML content of your posts
It maintains the original publication dates
It handles both published and draft posts
The output is a valid Atom XML feed that Blogger can import

The file:

[python]#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import sys
import argparse
from datetime import datetime
import re

def convert_wordpress_to_blogger(wordpress_file, output_file):
# Parse WordPress XML
tree = ET.parse(wordpress_file)
root = tree.getroot()

# Create Atom feed
atom = ET.Element(‘feed’, {
‘xmlns’: ‘http://www.w3.org/2005/Atom’,
‘xmlns:app’: ‘http://www.w3.org/2007/app’,
‘xmlns:thr’: ‘http://purl.org/syndication/thread/1.0’
})

# Add feed metadata
title = ET.SubElement(atom, ‘title’)
title.text = ‘Blog Posts’

updated = ET.SubElement(atom, ‘updated’)
updated.text = datetime.now().isoformat()

# Process each post
for item in root.findall(‘.//item’):
if item.find(‘wp:post_type’, {‘wp’: ‘http://wordpress.org/export/1.2/’}).text != ‘post’:
continue

entry = ET.SubElement(atom, ‘entry’)

# Title
title = ET.SubElement(entry, ‘title’)
title.text = item.find(‘title’).text

# Content
content = ET.SubElement(entry, ‘content’, {‘type’: ‘html’})
content.text = item.find(‘content:encoded’, {‘content’: ‘http://purl.org/rss/1.0/modules/content/’}).text

# Publication date
pub_date = item.find(‘pubDate’).text
published = ET.SubElement(entry, ‘published’)
published.text = datetime.strptime(pub_date, ‘%a, %d %b %Y %H:%M:%S %z’).isoformat()

# Author
author = ET.SubElement(entry, ‘author’)
name = ET.SubElement(author, ‘name’)
name.text = item.find(‘dc:creator’, {‘dc’: ‘http://purl.org/dc/elements/1.1/’}).text

# Categories
for category in item.findall(‘category’):
category_elem = ET.SubElement(entry, ‘category’, {‘term’: category.text})

# Status
status = item.find(‘wp:status’, {‘wp’: ‘http://wordpress.org/export/1.2/’}).text
if status == ‘publish’:
app_control = ET.SubElement(entry, ‘app:control’, {‘xmlns:app’: ‘http://www.w3.org/2007/app’})
app_draft = ET.SubElement(app_control, ‘app:draft’)
app_draft.text = ‘no’
else:
app_control = ET.SubElement(entry, ‘app:control’, {‘xmlns:app’: ‘http://www.w3.org/2007/app’})
app_draft = ET.SubElement(app_control, ‘app:draft’)
app_draft.text = ‘yes’

# Write the output file
tree = ET.ElementTree(atom)
tree.write(output_file, encoding=’utf-8′, xml_declaration=True)

def main():
parser = argparse.ArgumentParser(description=’Convert WordPress export to Blogger/Atom XML format’)
parser.add_argument(‘wordpress_file’, help=’Path to WordPress export XML file’)
parser.add_argument(‘output_file’, help=’Path to output Blogger/Atom XML file’)

args = parser.parse_args()

try:
convert_wordpress_to_blogger(args.wordpress_file, args.output_file)
print(f"Successfully converted {args.wordpress_file} to {args.output_file}")
except Exception as e:
print(f"Error: {str(e)}")
sys.exit(1)

if __name__ == ‘__main__’:
main()[/python]

Posted in en-US | Tags: Atom, blogger, epub, Python, Wordpress | No Comments »