Response to Medianama piece on Open Source AI, Scraping and Dual Use

Published on Sun Sep 24 2023 by Tarunima Prabhakar

In late July, Medianama published an opinion piece titled "The Gap Between Responsibility and Liability of AI", which offered the following provocation: “Is open sourcing AI enabling an exponential growth in the problematic use cases?” The article cited the increase in scraping of websites, such as the MediaNama website, as one of the problematic use cases of open-source AI.

We published a response piece suggesting that the connections made in the original piece were oversimplified and that each of the issues, such as copyright and open source AI, merited separate attention. The piece can be read here: Open Source AI- A Nebulous Concept Bearing a Heavy Weight

The key arguments we made were:

  1. What open source means in AI is undefined and still evolving. All AI, not just open source AI, is dual use. The claim in the article is that open-source AI leads to more applications and therefore a possible increase in negative applications of a dual-use technology. This claim is very difficult to analyze because what open source means in machine learning and AI is unclear even to those developing these systems.

  2. Scraping should not be considered a unilateral harm. In the context of Generative AI, scraping has become an extractive strategy for gaining a business advantage. Copyright is about preserving the rights of the original owner, but when owners are powerful organizations, copyright transgressions (including ignoring robots.txt and terms of service) are often a protest against the monopolization of knowledge. Online spaces have always been difficult to study, and platforms haven’t always been forthcoming with data. As platforms restrict access to APIs that are critical for research, scraping is one of the few techniques currently available for collecting data to understand online spaces. Promoting the necessary and legitimate forms of scraping while limiting the extractive uses of automated data collection is going to be an important challenge in the coming years.

  3. Clarifying the location of harm: The article speaks of copyright violations through an increase in scraping. When we discuss scraping as a harm, it is important to distinguish it from the dual-use harms of AI. Dual use happens after the AI model is in place. Scraping is the acquisition of the raw materials or inputs for the production of AI. These two harms occur at different points in the AI value chain and need to be considered separately.

In conclusion, the question posed by the original piece was an odd starting point for discussing the responsibility and liability of AI because it places a poorly understood term, open source AI, at the fulcrum of the discussion. Furthermore, by clubbing upstream negative effects, such as copyright violations from an increase in scraping, with downstream effects, such as the nefarious use of developed models, it stretches the canvas of user harms too broad and arguably too thin. As we think about regulation, we might be better served by thinking through each of these policy issues on its own terms, and not sacrificing clarity in search of a grand unified framework.

Text and illustrations on the website are licensed under Creative Commons 4.0 License. The code is licensed under GPL. For data, please look at respective licenses.