HLPP 2024 Keynote

Building the Universal Source Code Archive: challenges and opportunities of a revolutionary infrastructure

Abstract

Software is the fabric that binds together all aspects of our digital lives, and it permeates every research discipline, from the humanities to the hard sciences. To maintain the fabric of knowledge, foster reproducibility of research, and ensure availability and traceability of software components, it is imperative to dependably preserve and identify all software artifacts relevant for research. To enable massive analysis of the complex galaxy of software development, we need a dedicated infrastructure designed for exploring the development history of hundreds of millions of projects and tens of billions of individual files.
In this talk we present Software Heritage, a groundbreaking initiative committed to collecting, preserving, and sharing all publicly available software in source code form. It has already collected more than 19 billion files from more than 290 million software origins. Software Heritage plays an unmatched role in addressing the essential needs of software artifact preservation and identification in all research realms, using the SWHID (Software Hash Identifier), which Software Heritage provides for more than 40 billion artifacts today. It also provides the basic building block for developing the large-scale research infrastructure necessary to address the many challenges posed by the exponential growth of Open Source software.
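
To make the identification scheme concrete, the following is a minimal sketch of how a SWHID for a single file's content can be computed, assuming the published convention that content identifiers reuse the Git blob hash (SHA-1 over a `blob <length>\0` header followed by the raw bytes). The function name `content_swhid` is hypothetical and chosen for illustration; it is not part of any Software Heritage library. Other object types (directories, revisions, releases, snapshots) follow different rules and are not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (hypothetical helper, not an official API).

    Assumption: content identifiers are the Git blob hash of the bytes,
    prefixed with "swh:1:cnt:" (scheme, version 1, content object type).
    """
    # Git-style header: object type, a space, the byte length, then a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # The identifier depends only on the bytes, not on file name or location.
    print(content_swhid(b"hello world\n"))
```

Because the identifier is derived from the content alone, anyone holding the same bytes can recompute it independently and check it against the archive, which is what makes such identifiers suitable for citation and traceability.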