This repository contains my work report during the Google Summer Of Code Program 2023 with SPDX and Software Heritage Open-Source Organizations

HarshvMahawar/GSoC-23-SPDX-SWH


GSoC'23 @ SPDX-SWH

Report of Google Summer of Code 2023 @ SPDX

Project Details
Initial Proposal: Generation of SPDX documents of origins @ Software Heritage
Repository: swh-spdx
Mentors: David Douard
Contributions: swh-spdx and tools-python
Documentation: swh-spdx documentation
Duration: 5 May'23 - 28 Aug'23

About SPDX

The Software Package Data Exchange (SPDX) specification defines an open standard for communicating information about software components. SPDX is used to create Software Bills of Materials (SBOMs), capture licensing and copyright details, and provide package metadata such as version identifiers and known vulnerabilities.

SPDX was originally designed over a decade ago as a way to help developers comply with open-source licenses. Since then it's been extended with new capabilities for describing dependency trees and issuing SBOMs. SPDX was catapulted to global attention in September 2021 when ISO recognized it as the international standard for software supply chain documentation.

About Software Heritage

Software Heritage is on a mission to collect, preserve, and share all publicly available software, together with its source code and development history. The archive periodically crawls GitHub, GitLab, Debian, PyPI, etc. It has preserved more than 16 billion source code files with 3.4 billion commits spanning more than 256 million software projects. Software Heritage listers identify the origins of software to include in the archive, whereas Software Heritage loaders ingest the content of those origins into the archive.

My GSoC project targeted these archived projects: I developed a CLI (Command Line Interface) tool that generates their SBOMs (Software Bills of Materials) in the SPDX standard, making use of SPDX's tools-python library.

Overview

I developed a Python package named swh-spdx to assist in the generation of SBOMs for projects stored in the Software Heritage Archive. It utilizes the Software Heritage GraphQL API to extract the source code of projects from different origins. I employed the gql module to establish connectivity between the tool and the GraphQL API server of Software Heritage.

  • For code testing and quality
    • Unit testing -> pytest testing framework
    • Code formatting, organization, and style consistency -> black, isort and flake8
    • Documentation testing -> Sphinx
    • Automation Server @ Software Heritage -> Jenkins

The tool currently supports projects from two origin types: PyPI and NPM source packages. It takes the SoftWare Heritage persistent IDentifier (SWHID) of the project and writes the generated SBOM to the specified path.
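Since the SWHID is the tool's only required input, validating it early is useful. The project itself relies on the CoreSWHID class from swh-model for this; the snippet below is only a standard-library sketch of the same idea, and the names `ParsedSWHID` and `parse_core_swhid` are hypothetical, not part of swh-spdx or swh-model. It assumes the core SWHID form `swh:1:<type>:<40 hex digits>` without qualifiers.

```python
import re
from typing import NamedTuple

# Core SWHID form: "swh", scheme version 1, an object type, and a
# 40-character lowercase hexadecimal identifier, separated by colons.
_CORE_SWHID_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})")


class ParsedSWHID(NamedTuple):
    """Hypothetical container for the two variable parts of a core SWHID."""

    object_type: str  # snp, rel, rev, dir or cnt
    object_id: str    # 40 hex digits


def parse_core_swhid(text: str) -> ParsedSWHID:
    """Split a core SWHID into its parts, or raise ValueError if malformed."""
    match = _CORE_SWHID_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid core SWHID: {text!r}")
    return ParsedSWHID(object_type=match.group(1), object_id=match.group(2))
```

A directory SWHID would come back with `object_type == "dir"`, which a CLI could use to reject identifiers that do not point at a source tree.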

What is the need for this tool?

By generating SPDX documents for projects in the Software Heritage archive, this tool simplifies the management of open-source software. Developers can more easily trace software origins, licenses, and vulnerabilities. This simplifies risk analysis and speeds up audits, supporting sustainable open-source practices and more secure software development.

Contributions

Title Commit
Initialize the swh-spdx repo D3857
Implement the initial classes, methods, and their unit tests D8213
Add more test coverage Dbcd0
Add type annotations De97b
Add support for swh-model CoreSWHID class Db28d
Implement initial SPDX generation of PyPI origin projects D7ea8
Implement initial SPDX generation of NPM origin projects D7163
Add suggestions in docstrings for future improvements D6762

Challenges

The major challenge I faced during this project was that projects from different origins differ in both directory structure and metadata file structure. For example, the PKG-INFO metadata file of a PyPI origin carries a different amount of information than the package.json of an NPM origin, so generalized code might not work.

I solved this by writing a separate implementation for each origin: the base classes remain the same, but the method for extracting metadata differs from one origin to another.
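The shape of that solution can be illustrated with a small sketch. It is not the actual swh-spdx code: the class and function names (`PackageMetadata`, `MetadataExtractor`, `PyPIExtractor`, `NpmExtractor`, `extract_metadata`) are hypothetical, and only three fields are read. It assumes PKG-INFO is parsed as RFC 822 style headers and package.json as JSON.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from email.parser import Parser
from typing import Optional


@dataclass
class PackageMetadata:
    """Fields that both origin types can provide (hypothetical subset)."""

    name: Optional[str]
    version: Optional[str]
    license: Optional[str]


class MetadataExtractor(ABC):
    """Shared base class: every origin type returns the same structure."""

    @abstractmethod
    def extract(self, raw: str) -> PackageMetadata:
        ...


class PyPIExtractor(MetadataExtractor):
    def extract(self, raw: str) -> PackageMetadata:
        # PKG-INFO uses "Key: value" headers, so the email parser can read it.
        headers = Parser().parsestr(raw, headersonly=True)
        return PackageMetadata(
            headers.get("Name"), headers.get("Version"), headers.get("License")
        )


class NpmExtractor(MetadataExtractor):
    def extract(self, raw: str) -> PackageMetadata:
        data = json.loads(raw)
        license_field = data.get("license")
        # Some older package.json files store the license as an object.
        if isinstance(license_field, dict):
            license_field = license_field.get("type")
        return PackageMetadata(data.get("name"), data.get("version"), license_field)


EXTRACTORS = {"pypi": PyPIExtractor(), "npm": NpmExtractor()}


def extract_metadata(origin_type: str, raw: str) -> PackageMetadata:
    """Dispatch to the extractor registered for the given origin type."""
    try:
        extractor = EXTRACTORS[origin_type]
    except KeyError:
        raise ValueError(f"unsupported origin type: {origin_type!r}") from None
    return extractor.extract(raw)
```

With this layout, supporting another origin would mean adding one subclass and one registry entry, while the code that consumes `PackageMetadata` stays unchanged.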

Future Aspects

The tool is not yet live due to ongoing improvements and additional implementations. These ongoing improvements encompass areas such as error handling, unit testing, and the final implementation of the command-line interface (CLI). Looking forward, the project's future directions include:

  • Enhance metadata parsing so that more SPDX fields are populated. See the SPDX v2.3 Specification for details on each SPDX field
  • Add support for other available origins
  • Improve the test coverage of the current code
  • Integrate the tool into the Software Heritage Web User Interface
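To make the first item above more concrete, here is a rough sketch of the kind of SPDX 2.3 fields a minimal document needs. The project writes its documents through SPDX's tools-python library; this snippet deliberately does not use it and instead hand-writes the tag-value format with the standard library, purely to show which fields are involved. The function name `render_spdx_tag_value`, the tool name in the `Creator` line, and the namespace URL in the usage note are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def render_spdx_tag_value(
    name: str,
    version: Optional[str],
    declared_license: Optional[str],
    namespace: str,
    created: Optional[datetime] = None,
) -> str:
    """Render a minimal SPDX 2.3 tag-value document describing one package.

    `declared_license` is assumed to already be a valid SPDX license
    expression; raw metadata values often are not and would need mapping.
    `created` is assumed to be a timezone-aware datetime.
    """
    created = (created or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = [
        # Document creation information
        "SPDXVersion: SPDX-2.3",
        "DataLicense: CC0-1.0",
        "SPDXID: SPDXRef-DOCUMENT",
        f"DocumentName: {name}",
        f"DocumentNamespace: {namespace}",
        "Creator: Tool: example-sbom-sketch",  # hypothetical tool name
        f"Created: {created.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        "",
        # Package information; unknown values fall back to NOASSERTION
        f"PackageName: {name}",
        "SPDXID: SPDXRef-Package",
    ]
    if version:
        lines.append(f"PackageVersion: {version}")
    lines += [
        "PackageDownloadLocation: NOASSERTION",
        "FilesAnalyzed: false",
        "PackageLicenseConcluded: NOASSERTION",
        f"PackageLicenseDeclared: {declared_license or 'NOASSERTION'}",
        "PackageCopyrightText: NOASSERTION",
        "",
        "Relationship: SPDXRef-DOCUMENT DESCRIBES SPDXRef-Package",
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_spdx_tag_value("demo", "1.0", "MIT", "https://example.org/spdx/demo-1.0")` yields a document in which every field that the metadata could not supply is `NOASSERTION`; richer metadata parsing would replace those placeholders one by one.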

Summing Up

It was a great experience to work with the wonderful team of developers @ Software Heritage and SPDX, especially David Douard (my mentor) and JayeshV, who helped me whenever I had doubts, reviewed my work, and suggested insightful improvements. The Open Source community of SWH and SPDX was readily available to assist me with any challenges I encountered while setting up my development environment, solving issues, submitting merge requests (MRs)/pull requests (PRs), and more.

Throughout the Google Summer of Code program, I gained valuable insights into the practical software development practices that are integral to real-world projects. I am confident that the knowledge and experience I acquired will greatly benefit my future endeavors.

I'm eager to carry on my involvement in the open-source community and continue my learning journey.
