Add license rules #3562

Merged
merged 12 commits into from
Nov 2, 2023
282 changes: 282 additions & 0 deletions ROADMAP-ABOUTCODE.rst
@@ -0,0 +1,282 @@
AboutCode global Roadmap
========================

python-inspector
Support all package manifests beyond req and setup.py

SCIO: ScanCode.io, pipelines for SCA
-------------------------------------

Composition analysis of deployed binaries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Build pipelines for key tech stacks. For each of these, automate the end-to-end
analysis of a package's binaries, mapping them back to their sources and matching
them upstream to their PurlDB origin:

- for Java
- for JavaScript, CSS
- for C/C++ ELFs
- for C/C++ WinPE
- for C/C++ Mach-O
- for .Net, C#
- for Golang
- for Android apk
- for Python
- for Rust
- for Ruby


Matching pipeline
~~~~~~~~~~~~~~~~~~

Build a dedicated pipeline for matching (client side)


Scan TODO/Review app
~~~~~~~~~~~~~~~~~~~~~

- Build an app in SCIO to automate flagging scan items that need review or attention.
- Create a UI and backend to organize the scan review.
- Consider including and merging the "scantext" license detection review app


Pre-built container image(s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Build and publish container images
- Consider building a single image for CLI deployments
- Consider publishing the app image for standalone CLI deployments

Package management
~~~~~~~~~~~~~~~~~~~~

- Adopt the two-level model of manifests and package instances
- Refactor dependencies as deps and requirements


Deploy free analysis public server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Consider sponsorship from Amazon/Google/Azure

Create and document standard CI/CD integrations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- GitHub
- GitLab
- Azure


SCTK: ScanCode Toolkit
-----------------------

License detection quality improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Include automatic key phrases in license detection rules.
  Use important key phrases for license detection: https://github.com/nexB/scancode-toolkit/issues/2637

- Add required phrases automatically, plus unknown detection in licenses, with testing
- Address the license detection bugs reported recently

- Detect a summary for all packages, and populate more package fields correctly, such as copyright/holders

- We can report the declared license and other licenses in the license summary
  of a full scan. This is in place for the primary license; the next step is to
  do the same for each package found nested in a scanned codebase, and also to
  compute an individual license clarity score for each of these.


- License expression simplification and license expression categories


Improve package detection
~~~~~~~~~~~~~~~~~~~~~~~~~~

- Create synthetic, private packages from non-packaged files based on license and copyright
- Create simplified purl-only lightweight package detection
- Evolve model for dependencies towards requirements and true dependencies
- Track private non-published packages

Primary copyright detection for packages

- This is closely tied to the primary license detection and should focus
  on package manifests and key files.
- Support copyright parsing from all package ecosystems.



Publish improved release packaging/bundles/installers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Publish smaller wheels with a single focus for easier integration as a library

- Release self-contained app(s) for ease of use, bundled with Python and everything they need to run:

  - extractcode
  - scancode proper
  - packagedcode only
  - licensedcode only
  - cluecode only

- Adopt Python 3.12
- Adopt macOS and Linux on ARM


ABCTK: AboutCode Toolkit
----------------------------

- add support for patterns for documented resources
- add support for excludes for documented resources
- document the deployed resource for a development resource


PURLDB: PurlDB
----------------

- purl2all: On demand indexing for all supported package ecosystems
- purl2sym: Collect source and binary symbols
- index-time matching to find the true origin
- implement multi-tier indexing: purl/metadata/archive/files
- MatchCode matching engine

  - embed a SCIO with a matching pipeline to match a whole codebase at once
  - expose a new endpoint for matching a whole codebase
  - support multiple SCIO workers for indexing
  - implement proper ranking of matched code results
  - refactor directory matching to be a pre-matching step to file matching


VCIO: VulnerableCode.io
------------------------

- Adopt VulnTotal model throughout
- Log advisory history
- Add vulnerable code reachability
- Add vulnerable code required context/config
- Add more upstream resources
- Deploy purlsync public pilot


PURL: purl and vers specs
--------------------------

- Merge and advertise the vers spec.
- Standardize purl with ECMA


INSPECTORS: misc package and technology inspectors
----------------------------------------------------

- Universal Inspector/DependentCode

  - Resolve any purl dependencies
  - Non-vulnerable dependency resolution

- Inspector for Java and Android DEX

  - Decompile and collect binary symbols.
  - Collect source symbols
  - Resolve dependencies for Gradle, SBT and Maven.

- Inspector for JavaScript, CSS

  - Decompile/deminify and collect bundled and minified symbols.
  - Analyze map files
  - Collect source symbols
  - Resolve dependencies for npm, yarn and pnpm.

- Inspector for C/C++

  - Collect source symbols

- Inspector for ELFs

  - Decompile and collect binary symbols.
  - Collect DWARFs and ELFs section symbols
  - Resolve dependencies for pkgconfig and ldd

- Inspector for WinPE

  - Decompile and collect binary symbols.
  - Collect winpdb symbols

- Inspector for Mach-O

  - Decompile and collect binary symbols.
  - Collect DWARFs and ELFs section symbols

- Inspector for .Net, C#

  - Decompile and collect binary symbols from assemblies (see also WinPE)
  - Collect source symbols
  - Resolve dependencies for nuget/dotnet (completed)

- Inspector for Golang

  - Decompile and collect binary symbols from pclntab
  - Collect source symbols
  - Resolve dependencies

- Inspector for Python

  - Decompile and collect binary symbols from bytecode
  - Collect source symbols
  - Resolve dependencies (completed)

- Inspector for Rust

  - Decompile and collect binary symbols
  - Collect source symbols
  - Resolve dependencies

- Inspector for Swift

  - Decompile and collect binary symbols
  - Collect source symbols
  - Resolve dependencies

- Inspector for Dart/Flutter

  - Decompile and collect binary symbols
  - Collect source symbols
  - Resolve dependencies

- Inspector for Ruby

  - Collect source symbols
  - Resolve dependencies

- Inspector for Debian

  - Parse Debian formats (completed)
  - Parse installed database (completed)
  - Compare versions (completed)
  - Resolve dependencies

- Inspector for Alpine

  - Parse Alpine formats (completed)
  - Parse installed database (completed)
  - Compare versions (completed)
  - Resolve dependencies

- Inspector for RPM

  - Parse RPM formats (partially completed)
  - Parse installed database (completed)
  - Compare versions (completed)
  - Resolve dependencies

- Inspector for containers

  - Parse container images formats and manifests (completed)


Other libraries
-----------------

- FetchCode: support all supported package ecosystems, use in purlDB and SCIO
- univers: support all supported package ecosystems
- license-expression : update to support latest SPDX updates, auto-update bundled licenses

26 changes: 19 additions & 7 deletions ROADMAP.rst
@@ -15,23 +15,33 @@ to distinguish the forest from the trees. Therefore reporting the primary
license detection is important: when we get scan results, we can often
get 30 licenses for a single package and this volume is a problem
even when the detection is technically correct.

The goal of this improvement is to:

- combine multiple related license matches in a single license detection
- Combine multiple related license matches in a single license detection.

- in a license detection, expose a primary license expression in addition
- In a license detection, expose a primary license expression in addition
to the complete, full license expression.

- make the logic of selection of the primary license visible, at the minimum
with a log of combination and primary license selection operations
- Make the logic of selection of the primary license visible, at the minimum
with a log of combination and primary license selection operations.

This is for SCTK first.

Status: This has been completed in SCTK and also included in SCIO. We use
Status:

This has been completed in SCTK and also included in SCIO. We use
an updated --summary option and a new license clarity score for this.
We also have LicenseDetections for resources/packages and a top level
unique license detections as a summary.

Next steps:

- We can report the declared license and other licenses in the license summary
  of a full scan. This is in place for the primary license; the next step is to
  do the same for each package found nested in a scanned codebase, and also to
  compute an individual license clarity score for each of these.


2. Package files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -71,6 +81,8 @@ This is completed in SCTK.

This is the same issue as for primary license, but for holders

This has not been completed. This is less critical to complete as the tracing
is much simpler and can be done manually in the rare cases where this is needed.


Roadmap
@@ -128,4 +140,4 @@ Roadmap
- Revamp how the common list of spurious licenses is detected (this is a bug)
- Use important key phrases for license detection https://github.com/nexB/scancode-toolkit/issues/2637

This is mostly completed, for follow up see https://github.com/nexB/scancode-toolkit/issues/2878.
This is mostly completed, for follow up see https://github.com/nexB/scancode-toolkit/issues/2878
2 changes: 2 additions & 0 deletions src/cluecode/copyrights.py
@@ -886,6 +886,8 @@ def build_detection_from_node(
# of a copyright statement
(r'^neither$', 'JUNK'),
(r'^nor$', 'JUNK'),

(r'^data-.*$', 'JUNK'),

(r'^providing$', 'JUNK'),
(r'^Execute$', 'JUNK'),
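
To illustrate what the new `data-.*` pattern added above would cover, here is a minimal standalone sketch using only the standard `re` module. It does not go through the cluecode lexer: the helper name `is_junk_token` and the sample tokens are hypothetical, and it assumes each pattern is applied to a single token with default (case-sensitive) flags, which the `^...$` anchors suggest but the diff does not confirm.

```python
import re

# The pattern added in this diff, copied verbatim from src/cluecode/copyrights.py.
DATA_ATTRIBUTE = re.compile(r'^data-.*$')


def is_junk_token(token):
    """Return True if a single token matches the new JUNK pattern.

    Hypothetical helper for illustration only; it is not part of cluecode.
    """
    return DATA_ATTRIBUTE.match(token) is not None


# Hypothetical sample tokens, e.g. HTML data attributes versus ordinary words.
samples = ['data-toggle', 'data-', 'database', 'metadata-file', 'Data-id']
results = {token: is_junk_token(token) for token in samples}
```

Under these assumptions only tokens that start with the literal prefix `data-` are matched, so ordinary words such as `database` are left for the other rules to handle.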