That’s not science: the FSF’s analysis of GPL usage

The Free Software Foundation has responded to our analysis of figures that indicate that the proportion of open source projects using the GPL is in decline.

Specifically, FSF executive director John Sullivan gave a presentation at FOSDEM which asked “Is copyleft being framed”. You can find his slides here, a write-up about the presentation here, and Slashdot discussion here.

Most of the opposition to the earlier posts on this subject addressed perceived problems with the underlying data, specifically that it comes from Black Duck, which does not publish details of its methodology. John’s response is no exception. “That’s not science,” he asserts, with regard to the lack of clarity.

This is a valid criticism, which is why – prompted by Bradley M Kuhn – I previously went to a lot of effort to analyze data from Rubyforge, Freshmeat, ObjectWeb and the Free Software Foundation, collected and published by FLOSSmole, only to find that it confirmed the trend suggested by Black Duck’s figures. I was therefore happy to use Black Duck’s figures for our update.

John Sullivan is not overly impressed with the FLOSSmole numbers either, noting that while they are verifiable, they do leave a number of questions related to the breadth and depth of the sample, the relative activity of the projects, whether all lines of code and applications should be treated equally, and how packages with multiple licenses are treated.

These are all also valid questions. As we previously noted, a study that *might* satisfy all questions related to license usage would have to take into account how many lines of code a project has; how often it is downloaded; its popularity in terms of number of users or developers; how often the project is being updated; how many of the developers are employed by a single vendor; and what proportion of the codebase is contributed by developers other than the core committers.

John offers some evidence of his own that suggests that the use of the GPL is in fact growing. Anyone hoping for the all-encompassing study mentioned above is in for some disappointment, however. His evidence is based on a script-based analysis of the Debian GNU/Linux distribution codebase.

Nothing wrong with the script-based analysis – but a single GNU/Linux distribution considered to be a representative sample of all free and open source software?

That’s not science.


#1 That’s not science: the FSF’s analysis of GPL usage « Another Word For It on 03.06.12 at 8:09 pm

[…] That’s not science: the FSF’s analysis of GPL usage by Matthew Aslett. […]

#2 John Sullivan on 05.29.14 at 5:54 pm

The advantages and disadvantages of focusing on a single GNU/Linux distro were given in the presentation — the claim of being representative was specifically not made.

Advantages include the fact that every package counted has been checked by human beings on a regular basis. Every package includes a license file. Every package is included because someone actually uses it. Packages which are egregiously unmaintained get removed and so are not counted.

These things are advantages over broader studies which ignore the quality of the code, have a much bigger problem with trying to find the license at all, and have much higher risks of double-counting.

The script itself had problems (unstated assumptions), so it is indeed back to the drawing board on that, but the point of the presentation was to highlight how *radically* all of the assumptions made in the “studies” influence the results, and to highlight how bad the raw data on public code hosting sites actually is.