Report of Google Summer of Code 2023 @ SPDX
Project Details | |
---|---|
Initial Proposal | Generation of SPDX documents of origins @ Software Heritage |
Repository | swh-spdx |
Mentors | David Douard |
Contributions | swh-spdx and tools-python |
Documentation | swh-spdx documentation |
Duration | 5 May'23 - 28 Aug'23 |
The Software Package Data Exchange (SPDX) specification defines an open standard for communicating information about software components. SPDX is used to create Software Bills of Materials (SBOMs), encapsulate licensing and copyright details, and provide package metadata such as version identifiers and known vulnerabilities.
SPDX was originally designed over a decade ago as a way to help developers comply with open-source licenses. Since then it's been extended with new capabilities for describing dependency trees and issuing SBOMs. SPDX was catapulted to global attention in September 2021 when ISO recognized it as the international standard for software supply chain documentation.
Software Heritage is on a mission to collect, preserve, and share all the publicly available software with its source code and development history. The archive periodically crawls GitHub, GitLab, Debian, PyPI, etc. It has preserved more than 16 billion source code files with 3.4 billion commits spanning more than 256 million software projects. Software Heritage listers serve the purpose of identifying the origins of software for inclusion into the Software Heritage archive, whereas Software Heritage loaders are responsible for incorporating content into the Software Heritage archive.
My GSoC project targeted the projects preserved in this archive. The goal was to develop a command-line interface (CLI) tool that generates their SBOMs in the SPDX format, using SPDX's tools-python library.
I developed a Python package named swh-spdx to assist in the generation of SBOMs for projects stored in the Software Heritage Archive. It utilizes the Software Heritage GraphQL API to extract the source code of projects from different origins. I employed the gql module to establish connectivity between the tool and the GraphQL API server of Software Heritage.
- For Code testing
The tool currently supports projects from two origin types: PyPI and NPM source packages. It takes the SoftWare Heritage persistent IDentifier (SWHID) of the project as input and writes the generated SBOM to the specified path.
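To illustrate the kind of input the tool receives, here is a minimal sketch of validating a core SWHID with the standard library only. A core SWHID has the shape `swh:1:<object type>:<40 hex digits>`, where the object type is one of `cnt`, `dir`, `rev`, `rel`, or `snp`. The names `ParsedSWHID` and `parse_core_swhid` are hypothetical and are not taken from swh-spdx, which relies on the `CoreSWHID` class from swh-model instead.

```python
import re
from typing import NamedTuple

# Core SWHID grammar: swh:1:<object type>:<40 lowercase hex digits>
_CORE_SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


class ParsedSWHID(NamedTuple):
    """Hypothetical container for the two variable parts of a core SWHID."""

    object_type: str  # cnt, dir, rev, rel or snp
    object_id: str    # 40-character hexadecimal identifier


def parse_core_swhid(text: str) -> ParsedSWHID:
    """Split a core SWHID into its object type and identifier.

    Raises ValueError when the string does not match the core grammar.
    """
    match = _CORE_SWHID_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid core SWHID: {text!r}")
    return ParsedSWHID(object_type=match.group(1), object_id=match.group(2))
```

Rejecting malformed identifiers before any request is sent keeps error messages close to the user's input rather than surfacing later as a failed API call.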
By generating SPDX documents for Software Heritage archive projects, this initiative simplifies the management of open-source software. Developers can more easily trace software origins, licenses, and vulnerabilities. It also simplifies risk analysis and speeds up audits, supporting sustainable open-source practices and more secure software development.
Title | Commit |
---|---|
Initialize the swh-spdx repo | D3857 |
Implement the initial classes, methods, and their unit tests | D8213 |
Add more test coverage | Dbcd0 |
Add type annotations | De97b |
Add support for swh-model CoreSWHID class | Db28d |
Implement initial SPDX generation of PyPI origin projects | D7ea8 |
Implement initial SPDX generation of NPM origin projects | D7163 |
Add suggestions in docstrings for future improvements | D6762 |
The major challenge I faced during this project was that projects from different origins differ in both their directory structure and their metadata file structure. For example, the PKG-INFO metadata file of a PyPI origin carries a different amount of information than the package.json of an NPM origin, so generalized code might not work.
I solved this by writing a separate implementation for each origin: the base classes remain the same, but the method for extracting the metadata differs from one origin to another.
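The approach of a shared base class with per-origin extraction can be sketched as follows. This is an illustration only, not the actual swh-spdx code: the class names, the `PackageMetadata` fields, and the `EXTRACTORS` registry are all hypothetical. It relies on two facts about the formats: PKG-INFO uses email-style `Key: value` headers, and package.json is plain JSON.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from email.parser import Parser
from typing import Optional


@dataclass
class PackageMetadata:
    """Hypothetical common shape shared by every origin type."""

    name: str
    version: str
    license: Optional[str] = None
    homepage: Optional[str] = None


class MetadataExtractor(ABC):
    """Shared base class: each origin only overrides the parsing step."""

    @abstractmethod
    def extract(self, raw: str) -> PackageMetadata:
        ...


class PyPIExtractor(MetadataExtractor):
    def extract(self, raw: str) -> PackageMetadata:
        # PKG-INFO is a block of email-style headers, so the stdlib
        # email parser can read it; missing headers come back as None.
        headers = Parser().parsestr(raw, headersonly=True)
        if headers["Name"] is None or headers["Version"] is None:
            raise ValueError("PKG-INFO lacks a Name or Version header")
        return PackageMetadata(
            name=headers["Name"],
            version=headers["Version"],
            license=headers.get("License"),
            homepage=headers.get("Home-page"),
        )


class NPMExtractor(MetadataExtractor):
    def extract(self, raw: str) -> PackageMetadata:
        data = json.loads(raw)
        license_field = data.get("license")
        return PackageMetadata(
            name=data["name"],
            version=data["version"],
            # Older packages may store the license as an object; keep strings only.
            license=license_field if isinstance(license_field, str) else None,
            homepage=data.get("homepage"),
        )


# Hypothetical registry mapping an origin type to its extractor.
EXTRACTORS = {"pypi": PyPIExtractor(), "npm": NPMExtractor()}
```

Code that builds the SPDX document can then work with `PackageMetadata` alone, and supporting a new origin means adding one more subclass and one registry entry.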
The tool is not yet live due to ongoing improvements and additional implementations. These ongoing improvements encompass areas such as error handling, unit testing, and the final implementation of the command-line interface (CLI). Looking forward, the project's future directions include:
- Enhance metadata parsing to populate SPDX fields more comprehensively (see the SPDX v2.3 Specification for more information on each SPDX field)
- Add support for other available origins
- Improve the test coverage of the current code
- Integrate the tool into Software Heritage's Web User Interface
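To make the first item more concrete, the sketch below shows one way parsed metadata could be mapped onto package fields of an SPDX 2.3 JSON document, falling back to `NOASSERTION` where nothing is known. It does not use tools-python and is not the swh-spdx implementation; the function name `to_spdx_package` and the choice of fields are assumptions for illustration. The field names themselves (`SPDXID`, `name`, `versionInfo`, `homepage`, `downloadLocation`, `licenseDeclared`, `copyrightText`) come from the SPDX 2.3 JSON schema.

```python
import re
from typing import Dict, Optional

NOASSERTION = "NOASSERTION"


def to_spdx_package(
    name: str,
    version: Optional[str] = None,
    homepage: Optional[str] = None,
) -> Dict[str, str]:
    """Build a dict shaped like an SPDX 2.3 JSON package entry (hypothetical helper)."""
    # SPDX identifiers may only contain letters, digits, "." and "-".
    spdx_id = "SPDXRef-Package-" + re.sub(r"[^A-Za-z0-9.-]", "-", name)
    package = {
        "SPDXID": spdx_id,
        "name": name,
        # Nothing is known about these yet, so say so explicitly.
        "downloadLocation": NOASSERTION,
        "licenseDeclared": NOASSERTION,
        "copyrightText": NOASSERTION,
    }
    # Optional fields are only emitted when the metadata provided a value.
    if version:
        package["versionInfo"] = version
    if homepage:
        package["homepage"] = homepage
    return package
```

Richer metadata parsing would gradually replace the `NOASSERTION` placeholders; a declared license, for instance, would first need to be normalised into a valid SPDX license expression.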
It was a great experience to work with the wonderful team of developers at Software Heritage and SPDX, especially David Douard (my mentor) and JayeshV, who helped me whenever I had doubts, reviewed my work, and suggested insightful improvements. The open-source community of SWH and SPDX was readily available to assist me with any challenges I encountered while setting up my development environment, solving issues, submitting merge requests (MRs) and pull requests (PRs), and more.
Throughout the Google Summer of Code program, I gained valuable insights into the practical software development practices that are integral to real-world projects. I am confident that the knowledge and experience I acquired will greatly benefit my future endeavors.
I'm eager to carry on my involvement in the open-source community and continue my learning journey.