Scripts which use a mail archive (in git format) to calculate development stats.
They are review / participation focused, unlike other stats which use the information in the git history.
The script requires a clone of the mail archive:
git clone https://lore.kernel.org/netdev/2/ netdev-2.git
For efficiency it makes a copy of the email messages under the msg-files directory.
--since is the git ref of the commit in the mailing list repo before the commit with the first message (usually the commit with the PR for the previous release); --until is the last message to use (the PR for the current release).
Example use:
./ml-stat.py --db ./db.json --since bb545da872c8 --until 0109af6d7037 \
    --repo netdev --linux ../linux --json-out netdev-6.4.json
Before generating the stats it's recommended to run the self checks, to make sure that the email mailmap and parsing are okay.
Some statistics are much quicker to get from git. They may go away over time but for now git-stat.py loads them.
The value for --next-size has to be found manually:
git log --no-merges --oneline v6.13...linux-next/master | wc -l
Example use:
./git-stat.py --db ./db.json --linux ../linux/ \
    --start-commit 7e68dd7d07a --end-commit 5b7c4cabbb6 \
    --next-size 1234 \
    --json-out netdev-6.4.json \
    --maintainers davem@davemloft.net \
                  edumazet@google.com \
                  kuba@kernel.org \
                  pabeni@redhat.com \
                  anthony.l.nguyen@intel.com
ml-stat.py and git-stat.py can generate a full dump of all statistics. The scripts update the JSON file if one already exists (each adds its results under its own keys, so the outputs should not clash).
A separate script, stat-print.py, can ingest two such JSON files and pretty print the statistics of the second file with annotations about how positions have changed.
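As a rough illustration of the two behaviours described above (each script adding its results under its own key, and the printer annotating position changes), here is a minimal Python sketch. It is not the code used by the scripts: the key names "ml" and "git", the helper names, and the flat name -> score layout are all assumptions made for the example.

```python
import json
import os


def update_json(path, key, section):
    """Add `section` under `key`, keeping whatever the file already holds."""
    data = {}
    if os.path.exists(path):
        with open(path) as fp:
            data = json.load(fp)
    data[key] = section  # hypothetical key, one per script
    with open(path, "w") as fp:
        json.dump(data, fp, indent=2)
    return data


def position_changes(prev_scores, curr_scores):
    """Rank names by score (highest first) and report movement vs. prev."""
    def ranks(scores):
        ordered = sorted(scores, key=lambda name: (-scores[name], name))
        return {name: pos for pos, name in enumerate(ordered, start=1)}

    prev, curr = ranks(prev_scores), ranks(curr_scores)
    result = []
    for name, pos in sorted(curr.items(), key=lambda item: item[1]):
        # Positive delta: the name moved up. None: not present previously.
        delta = prev[name] - pos if name in prev else None
        result.append((pos, name, delta))
    return result
```

Calling update_json() twice with different keys leaves both sections in the file, which is the property the scripts rely on when they share one --json-out file.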
The DB contains two sections; each section can contain any number of A -> B mappings.
mailmap allows for aliasing. The first value should be just the email address; the second value is the target with a full name.
corpmap maps from parts of the email address to the company name.
mailmap is applied before the corpmap.
A sample email map is provided as db.json.sample.
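To show why the order of the two maps matters, here is a small Python sketch of the lookup. The layout used below (two lists of [A, B] pairs) and the resolve() helper are assumptions for illustration only; check db.json.sample for the real format.

```python
from email.utils import parseaddr


def resolve(addr, db):
    """Return (identity, company) for a bare email address.

    `db` is assumed to look like:
      {"mailmap": [[alias_addr, "Full Name <addr>"], ...],
       "corpmap": [[part_of_addr, "Company"], ...]}
    """
    # Step 1: aliasing. Map the bare address to its canonical identity.
    identity = addr
    for alias, target in db["mailmap"]:
        if alias == addr:
            identity = target
            break

    # Step 2: company lookup on the address of the *resolved* identity.
    resolved_addr = parseaddr(identity)[1]
    company = None
    for part, name in db["corpmap"]:
        if part in resolved_addr:
            company = name
            break
    return identity, company


db = {
    "mailmap": [["jd@old.example", "Jane Doe <jane@corp.example>"]],
    "corpmap": [["@corp.example", "Example Corp"]],
}
print(resolve("jd@old.example", db))
```

Because the alias is resolved first, a message sent from the old address is still attributed to the company of the canonical address.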
Basic self-check:
./ml-stat.py --db db.json --since bb545da872c8 --until 0109af6d7037 \
    --repo netdev --linux ../linux --check
Use gitdm DB to map people to corp:
./corp-gitdm-resolve.py --results netdev-6.6.json \
    --gitdm ../../gitdm-cncf/src/alldevs.txt
See domains with more than one addr which are not mapped to corpo:
cat netdev-6.6.json | \
    jq -r '.corporate | with_entries(.value = .value.score.positive)' | \
    sed -n 's/.*@\(.*\)>": -*\(.*\),/\1 \2/p' | \
    sort | \
    datamash -t ' ' groupby 1 sum 2 count 2 | \
    awk '{if ($3 > 1) {print $2,$3,$1;}}' | \
    sort -n
See domains with a single addr which are not mapped to corpo:
cat netdev-6.6.json | \
    jq -r '.corporate | with_entries(.value = .value.score.positive)' | \
    sed -n 's/.*@\(.*\)>": -*\(.*\),/\1 \2/p' | \
    sort | \
    datamash -t ' ' groupby 1 sum 2 count 2 | \
    awk '{if ($3 == 1) {print $2,$3,$1;}}' | \
    sort -n
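For readers without jq or datamash, the grouping done by the two pipelines above can be approximated in Python. This is only a sketch: it assumes, based on the jq and sed expressions, that unmapped entries under "corporate" are keyed as "Name <user@domain>" and carry a score.positive value. The function name is hypothetical.

```python
import re
from collections import defaultdict


def unmapped_domains(results):
    """Group unmapped `.corporate` entries by email domain.

    Returns {domain: (summed_positive_score, address_count)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for key, value in results["corporate"].items():
        # Keys that still end in "@domain>" were not mapped to a company.
        match = re.search(r"@([^>]*)>$", key)
        if not match:
            continue
        entry = totals[match.group(1)]
        # The sed expression drops a leading minus sign, hence abs().
        entry[0] += abs(value["score"]["positive"])
        entry[1] += 1
    return {domain: tuple(entry) for domain, entry in totals.items()}


sample = {"corporate": {
    "A <a@x.org>": {"score": {"positive": 3}},
    "B <b@x.org>": {"score": {"positive": 4}},
    "C <c@y.org>": {"score": {"positive": 2}},
    "Example Corp": {"score": {"positive": 10}},
}}
stats = unmapped_domains(sample)
# Domains with more than one address, then domains with exactly one.
print({d: v for d, v in stats.items() if v[1] > 1})
print({d: v for d, v in stats.items() if v[1] == 1})
```

With real data, load the file with json.load() and pass the resulting dictionary in place of sample.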
See top unmapped scorers from Gmail and kernel.org:
cat netdev-6.6.json | \
    jq -r '.corporate | with_entries(.value = .value.score.positive)' | \
    sed -n 's/.*\(<.*@.*>\)": *\(.*\),/\1 \2/p' | \
    datamash -t ' ' groupby 1 sum 2 count 2 | \
    awk '{if ($2 >= 20 || $2 <= -20) {print $2,$3,$1;}}' | \
    sort -n | \
    grep -E 'kernel|gmail'
Check if anyone escaped B4 Relay remap:
cat netdev-6.6.json | \
    jq -r '.individual | with_entries(.value = .value.score.positive)' | \
    grep "B4 Relay"
Spot-check the grouping and parsing:
./ml-stat.py --db db.json --since bb545da872c8 --until 0109af6d7037 \
    --repo netdev --linux ../linux --interact
There are also auxiliary scripts which don't use true mailing list data.
They are all now deprecated in favor of git-stat.py.
$ git log v6.0..v6.1 --no-merges \
    --committer=kuba@kernel --committer=davem@davemloft.net \
    --committer=pabeni@redhat.com -- \
    net/ drivers/net/ include/net/ | \
    awk -f $repo_path/review_count.awk
- How many authors have not appeared on the list.
- Find the ratio of fixes vs features, with fixes broken down to fixes for own bugs introduced vs others introduced.
- Compute the generality score to find out which developers are silo'ed into their own drivers vs work cross-tree.
- Find companies with a large number of disconnected developers and no in-house expertise.
- Split review stats between "replied to their own company" vs "truly cross company".