[DevoxxFR 2018] Software Heritage: Preserving Humanity’s Software Legacy

Software is intricately woven into the fabric of our modern world, driving industry, fueling innovation, and forming a critical part of our scientific and cultural knowledge. The Software Heritage initiative was launched in recognition of the importance of the source code that underpins this digital infrastructure. At Devoxx France 2018, Roberto Di Cosmo, a professor affiliated with Inria and the director of Software Heritage, delivered an insightful talk titled “Software Heritage: Pourquoi et comment préserver le patrimoine logiciel de l’Humanité” (Software Heritage: Why and How to Preserve Humanity’s Software Legacy). He articulated the mission to collect, preserve, and share all publicly available software source code, creating a universal archive for future generations: a modern-day Library of Alexandria for software.

Di Cosmo began by emphasizing that source code is not just a set of instructions for computers; it’s a rich repository of human knowledge, ingenuity, and history. From complex algorithms to the subtle comments left by developers, source code tells a story of problem-solving and technological evolution. However, this invaluable heritage is fragile and at risk of being lost due to obsolete storage media, defunct projects, and disappearing hosting platforms.

The Mission: Collect, Preserve, Share

The core mission of Software Heritage, as outlined by Roberto Di Cosmo, is threefold: to collect, preserve, and make accessible the entirety of publicly available software source code. This ambitious undertaking aims to create a comprehensive and permanent archive – an “Internet Archive for source code” – safeguarding it from loss and ensuring it remains available for research, education, industrial development, and cultural understanding.

The collection process involves systematically identifying and archiving code from a vast array of sources, including forges like GitHub, GitLab, and Bitbucket, institutional repositories like HAL, and forges that have since shut down, such as Gitorious and Google Code (their disappearance highlights the urgency). Preservation is a long-term commitment, requiring strategies to combat digital obsolescence and ensure the integrity and continued accessibility of the archived code over decades and even centuries. Sharing this knowledge involves providing tools and interfaces for researchers, developers, historians, and the general public to explore this vast repository, discover connections between projects, and trace the lineage of software. Di Cosmo stressed that this is not just about backing up code; it’s about building a structured, interconnected knowledge base.

Technical Challenges and Approach

The scale of this endeavor presents significant technical challenges. The sheer volume of source code is immense and constantly growing. Code exists in numerous version control systems (Git, Subversion, Mercurial, etc.) and packaging formats, each with its own metadata and history. To address this, Software Heritage has developed a sophisticated infrastructure capable of ingesting code from diverse origins and storing it in a universal, canonical format.

A key element of their technical approach is the use of a Merkle tree structure (more precisely, a Merkle graph of hash-linked objects), similar to what Git uses. All software artifacts (file contents, directories, and revisions, i.e. commits) are identified by cryptographic hashes of their content. This allows for massive deduplication, since an identical file or directory is stored only once regardless of how many projects it appears in, and ensures the integrity and verifiability of the archive. This graph-based model also allows for the reconstruction of the full development history of software projects and the relationships between them. Di Cosmo explained that this structure not only saves space but also provides a powerful way to navigate and understand the evolution of software. The entire infrastructure itself is open source.
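The hash-based identification and deduplication described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Software Heritage’s actual code: the ObjectStore class, the helper functions, the use of SHA-256, and the text serialization of directories and revisions are all assumptions made for this example, and the real archive uses its own identifier scheme and formats.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.objects = {}  # identifier -> serialized object

    def put(self, kind, payload):
        # Prefix the payload with its kind so that a file and a directory
        # with identical bytes never end up with the same identifier.
        data = kind.encode() + b"\0" + payload
        key = hashlib.sha256(data).hexdigest()
        # Identical objects hash to the same key and are stored only once.
        self.objects.setdefault(key, data)
        return key


def add_file(store, content):
    return store.put("file", content)


def add_directory(store, entries):
    # entries maps a name to the identifier of a child object.
    # Sorting makes the serialization, and therefore the hash, deterministic.
    lines = [f"{name} {child}" for name, child in sorted(entries.items())]
    return store.put("dir", "\n".join(lines).encode())


def add_revision(store, root_dir, parents, message):
    # A revision points at a root directory and at its parent revisions,
    # so its identifier covers the whole history behind it.
    header = [f"tree {root_dir}"] + [f"parent {p}" for p in parents]
    return store.put("rev", "\n".join(header + ["", message]).encode())


store = ObjectStore()

# Two unrelated projects that ship the same license file.
license_a = add_file(store, b"MIT License\n")
main_a = add_file(store, b"print('a')\n")
dir_a = add_directory(store, {"LICENSE": license_a, "main.py": main_a})

license_b = add_file(store, b"MIT License\n")
main_b = add_file(store, b"print('b')\n")
dir_b = add_directory(store, {"LICENSE": license_b, "main.py": main_b})

rev = add_revision(store, dir_a, [], "initial import")

print(license_a == license_b)  # True: same content, same identifier
print(len(store.objects))      # 6 objects stored, not 7
```

Because each directory identifier is derived from the identifiers of its children, changing a single file changes the hash of every directory and revision above it. That is what makes the archive verifiable, while unchanged files and subtrees keep their identifiers and are shared between projects.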

A Universal Archive for All

Roberto Di Cosmo emphasized that Software Heritage is built as a common infrastructure for society, serving multiple purposes. For industry, it provides a reference point for existing code, preventing reinvention and facilitating reuse. For science, it offers a vast dataset for research on software engineering, programming languages, and the evolution of code, and is crucial for the reproducibility of research that relies on software. For education, it’s a rich learning resource. And for society as a whole, it preserves a vital part of our collective memory and technological heritage.

He concluded with a call to action, inviting individuals, institutions, and companies to support the initiative. This support can take many forms: contributing code from missing sources, helping to develop tools and connectors for different version control systems, providing financial sponsorship, or simply spreading the word about the importance of preserving our software legacy. Software Heritage aims to be a truly global and collaborative effort to ensure that the knowledge embedded in source code is not lost to time.

Hashtags: #SoftwareHeritage #OpenSource #Archive #DigitalPreservation #SourceCode #CulturalHeritage #RobertoDiCosmo #Inria #DevoxxFR2018
