Reports on the Hub: A First Look at Self-governance in Open Source AI Development

Community Article Published June 12, 2024

Hugging Face has a unique position as the most widely used open-source platform for AI models. As in many open-source projects, one of the invaluable contributions of the community is maintenance. At Hugging Face, this work includes reporting issues with models and datasets, clarifying problems with the uploaders, and helping to resolve these discussions.

In open source software development, "given enough eyeballs, all bugs are shallow". At Hugging Face and in open source model development, given enough eyeballs, ML models can become better, more closely adapted to the needs of different communities, and less prone to unintentional mistakes.

The reports the community creates offer interesting insights into the self-governance of the Hugging Face community. Reports are a subset of discussions and pull requests, but they focus on non-technical issues: the model works, yet the report raises ethical, legal, or other concerns.

The reporting interface for a dataset. This creates a public report, which can be found in the community tab.
Reports are marked there with the 🚩 reports flag; they are a subset of the discussions and pull requests.

Centering the community, the community tab sits alongside the dataset/model documentation and files.

Many parts of the hub are accessible through the API, including the discussions and reports on the community tab. For this preliminary investigation of reports on the hub, all model and dataset repos were listed and their discussions were filtered by the 🚩 reports flag to find all reports opened by the community. This information is publicly accessible and forms the basis for further investigation of community governance, interaction, and self-organisation. Currently, there are 565 reports in total (both open and closed) across models and datasets. Given the large number of public model and dataset repos (774,384 counted for this analysis), the number of reports is relatively low: roughly 7 reports per 10,000 repos.
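The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: `DiscussionRecord`, its `is_report` field, and both helper functions are hypothetical names, and how the reports flag appears in API responses is an assumption. In practice the records might be built from what `HfApi.get_repo_discussions` in the `huggingface_hub` library returns for each repo.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DiscussionRecord:
    """Hypothetical, simplified stand-in for one community-tab entry."""
    repo_id: str
    repo_type: str   # "model" or "dataset"
    title: str
    status: str      # e.g. "open" or "closed"
    is_report: bool  # assumed flag; the article does not say how the API marks reports


def select_reports(
    records: Iterable[DiscussionRecord],
    is_report: Callable[[DiscussionRecord], bool] = lambda r: r.is_report,
) -> List[DiscussionRecord]:
    """Keep only the entries that the given predicate treats as reports."""
    return [r for r in records if is_report(r)]


def report_rate(num_reports: int, num_repos: int) -> float:
    """Number of reports divided by number of repos."""
    if num_repos <= 0:
        raise ValueError("num_repos must be positive")
    return num_reports / num_repos


if __name__ == "__main__":
    # Figures quoted in the article: 565 reports, 774,384 public repos.
    print(f"{report_rate(565, 774_384):.4%}")
```

Passing the predicate as a parameter keeps the sketch independent of how a report is actually identified, so it can be swapped once the real marker is known.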

In the reports pertaining to model repositories, among everyone opening, commenting on, and contributing to these reports, only 4% of users have a Hugging Face affiliation, i.e., 96% of users interacting with model reports are part of the larger community.

Across the 436 reports on model repos and the 129 reports on dataset repos, a majority are closed by a member of the Hugging Face community, i.e., not an employee, which indicates that the community works together. Many reports do not need intervention by Hugging Face; they are addressed and taken care of by the repo owner, i.e., the person uploading a model or dataset, or by another member of the Hugging Face community.

Overview of who closes reports in model and dataset repos. A majority of reports in both repo types are closed by community members, not Hugging Face staff.
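A tally like the one in this overview could be computed as follows. This is a sketch under stated assumptions: the input is a list of `(repo_type, closed_by)` pairs and a set of staff usernames, both hypothetical here. The article does not describe how the closing user or the Hugging Face affiliation was determined; one plausible source is the event history that `HfApi.get_discussion_details` returns for each discussion.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def closer_breakdown(
    closed_reports: Iterable[Tuple[str, str]],
    staff_usernames: Set[str],
) -> Counter:
    """Count closed reports per (repo_type, group).

    closed_reports: (repo_type, closed_by) pairs; hypothetical input shape.
    staff_usernames: users treated as Hugging Face staff; assumed to be known.
    """
    counts: Counter = Counter()
    for repo_type, closed_by in closed_reports:
        group = "staff" if closed_by in staff_usernames else "community"
        counts[(repo_type, group)] += 1
    return counts


def community_share(counts: Counter, repo_type: str) -> float:
    """Fraction of closed reports for repo_type that the community closed."""
    community = counts[(repo_type, "community")]
    total = community + counts[(repo_type, "staff")]
    return community / total if total else 0.0
```

With made-up usernames, `closer_breakdown([("model", "alice"), ("model", "hf-staffer")], {"hf-staffer"})` counts one community closure and one staff closure for model repos.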

The topics of the reports that the community closes itself vary widely and reflect the range of discussions that arise in an active open source ML community.

Topics of reports that were closed by the community, excluding reports that have very short descriptions (< 3 words) or are hidden.
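The exclusion rule in this caption can be written as a small predicate. This is a sketch only: the function name and its parameters are hypothetical, and splitting on whitespace is an assumed way of counting words, since the article does not say how words were counted.

```python
def keep_for_topic_analysis(description: str, hidden: bool, min_words: int = 3) -> bool:
    """Return True if a report should be kept for the topic overview.

    A report is dropped when it is hidden or when its description has
    fewer than min_words whitespace-separated words (assumed word count).
    """
    if hidden:
        return False
    return len(description.split()) >= min_words
```

For example, a visible report whose description is only "spam" would be dropped, while the NFAA request quoted later in this article would be kept.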

A good example of the community leveraging the technical capabilities of the platform is the NFAA tag. Hugging Face has a focus on supporting model and dataset creators to extensively and clearly document their models and datasets, including adding tags for content that is not appropriate for all audiences (NFAA). When those tags are missing, community members point them out to each other (from the reports: “Not For All Audiences; Please add NFAA Repository Label”) and model owners are prompt in implementing the suggestion (answer to the same report: “Sorry, added”).

As in many open source projects, there are a few dominant actors who do a lot of the maintenance work, alongside many one-time contributors, which ensures a big-picture perspective from different angles [Osborne et al. 2024]. The figure of user networks below makes this phenomenon visible: many users interact on only a single issue, while there are clusters of users who interact more frequently (and a few clusters of discussions involving multiple users).

Network of users commenting on the same issues. Orange nodes are users with Hugging Face affiliation, and light blue nodes are other users.
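A network of this kind can be derived by linking every pair of users who commented on the same report. The sketch below shows one way to build such an edge list with the standard library. The input mapping from report identifier to participant usernames is a hypothetical shape, and the article does not state how its figure was actually constructed.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Tuple


def co_comment_edges(
    participants_by_report: Dict[str, Iterable[str]],
) -> Counter:
    """Count, for each unordered pair of users, the reports they share.

    participants_by_report: report id -> usernames who opened or commented
    on that report (hypothetical input shape).
    """
    edges: Counter = Counter()
    for users in participants_by_report.values():
        # Deduplicate and sort so each pair gets a single canonical key.
        for pair in combinations(sorted(set(users)), 2):
            edges[pair] += 1
    return edges


def interaction_counts(edges: Counter) -> Counter:
    """Sum of edge weights per user: how many shared-report links each user has."""
    totals: Counter = Counter()
    for (a, b), weight in edges.items():
        totals[a] += weight
        totals[b] += weight
    return totals
```

Users who appear alone on a report produce no edges, which matches the isolated one-time contributors in the figure, while users with high totals correspond to the densely connected clusters.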

As the community grows, self-governance becomes essential to maintaining a vibrant environment for developing innovative machine learning and ensuring diverse voices are heard. The current trajectory of self-governance on the hub is promising and holds exciting potential for the future of open-source machine learning.
