AUTHORSTATS Tool

This tool calculates some statistics based on a set of Internet Drafts or RFCs. The statistics include distributions according to draft versions, working groups, countries, companies, authors, etc.

The statistics are available here for various IETF documents.

Additional statistics can be run on different document sets by downloading the tool and running it on the documents. Usage:

The output consists of a set of HTML and JPG files.

The software assumes that the "gnuplot" and "convert" programs are available in the path.
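Before running the tool, it may help to confirm that both programs can actually be found. The following is a minimal sketch, not part of the tool itself; the function name `missing_tools` is made up for illustration, and the check only looks at the path and says nothing about program versions.

```python
import shutil


def missing_tools(required=("gnuplot", "convert")):
    """Return the names of required programs that cannot be found on the PATH.

    The default names come from the requirement stated above; the function
    itself is a hypothetical helper, not something the tool provides.
    """
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("gnuplot and convert are both available.")
```

An empty list means both programs were found; otherwise the list names whatever still needs to be installed.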

Quality

The technical term that experts like to use for the level of quality achieved by this tool is "crap". The quality of the output depends almost completely on what "getauthors" provides. And it's very hard to measure its success rate, because computers are not good at understanding natural language*, and because I did not have time to read all the drafts.

We know that the visible rate of failures is relatively small: just a handful of drafts fail completely, and most of those have forgotten to include an author section. But it's much harder to say what is invisibly broken. For instance, we know that "getauthors" does not retrieve more than the first author in a multi-column author section. Also, the country and company information is based on several imprecise processes, including: (1) Correct detection of author definition patterns in the drafts. (2) Recognition of country names based on abbreviations, which sometimes gets confused when, for instance, in "CS Department" the "CS" gets interpreted as a country abbreviation for Czechoslovakia. (3) In the absence of explicit country information, a guess is made based on either the known location of the e-mail domain, or the whois data associated with that domain. However, a company may reside in country 1 even if it uses a domain from country 2, and there may be employees in several countries.
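To make the failure modes in (2) and (3) concrete, here is a small sketch of this kind of heuristic. It is not the code that "getauthors" uses: the lookup tables, function names, and sample addresses are all invented for illustration, and the whois fallback is left out entirely.

```python
import re

# Hypothetical, deliberately tiny lookup tables (illustration only).
COUNTRY_ABBREVIATIONS = {"FI": "Finland", "SE": "Sweden", "CS": "Czechoslovakia"}
DOMAIN_COUNTRIES = {"fi": "Finland", "se": "Sweden", "de": "Germany"}


def country_from_address(lines):
    """Naive step (2): the first two-letter uppercase token that matches a
    known abbreviation is taken to be the country."""
    for line in lines:
        for token in re.findall(r"\b[A-Z]{2}\b", line):
            if token in COUNTRY_ABBREVIATIONS:
                return COUNTRY_ABBREVIATIONS[token]
    return None


def country_from_email(email):
    """Naive step (3): guess the country from the top-level domain."""
    tld = email.rsplit(".", 1)[-1].lower()
    return DOMAIN_COUNTRIES.get(tld)


def guess_country(address_lines, email):
    """Prefer explicit address information, then fall back to the e-mail domain."""
    return country_from_address(address_lines) or country_from_email(email)


if __name__ == "__main__":
    # "CS" is misread as a country abbreviation, so the e-mail is never consulted.
    print(guess_country(["CS Department", "Some University"], "a@example.fi"))
    # No abbreviation in the address, so the guess comes from the ".fi" domain.
    print(guess_country(["Example Corp", "Main Street 1"], "a@example.fi"))
```

The first call shows the "CS Department" problem directly. The second shows why the domain fallback is only a guess: it says where the domain is registered, not where the author works.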

Finally, there are definition-level issues with who counts as an author or a contributor, what company an author works for (particularly if they change jobs), etc. Our answer to most of these questions has been that if the draft has something that looks like an author list, then everyone mentioned in it is an author, no matter whether they were listed as "contributors", "editors", or "authors".

If you think getting these statistics is easy, you may be interested in taking a look at the different headings that people use for their author section. (This list excludes variations in whitespace or section numbering.)
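Ignoring whitespace and section numbering when comparing headings could be done roughly as follows. This is a sketch under assumptions: the function names and the numbering patterns it recognizes ("7.", "10.2", "A.1.") are my own guesses at what such a normalization needs, not a description of how the list was actually produced.

```python
import re


def normalize_heading(heading):
    """Strip a leading section number and collapse runs of whitespace."""
    text = heading.strip()
    # Drop a leading section number such as "7.", "10.2" or "A.1."
    text = re.sub(r"^(?:[A-Z]\.)?\d+(?:\.\d+)*\.?\s+", "", text)
    return re.sub(r"\s+", " ", text)


def distinct_headings(headings):
    """Return the sorted set of headings that remain after normalization."""
    return sorted({normalize_heading(h) for h in headings})


if __name__ == "__main__":
    samples = [
        "7.  Authors' Addresses",
        "12. Authors'   Addresses",
        "  10.2 Author's Address ",
        "A.1. Contributors",
    ]
    for heading in distinct_headings(samples):
        print(heading)
```

Even after this normalization, differences in wording, punctuation, and capitalization remain, which is why the list of headings is still long.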

Download

The tool can be downloaded here and here. Current version is 2.0.1. To use this tool, you also need to download getauthors, getidtype, and getrfctype.

Raw author information databases are available here for active Internet Drafts, as well as for recent or all RFCs.

Rights

This software was written by Jari Arkko, and its use and modification are free. No warranty is given: use this software at your own risk.


Created October 21st, 2005. Last modified November 18th, 2008 by Jari Arkko.