Visibilizing the Connections Between Archiving and ML Datasets

Published on Mon Feb 13 2023 by Tarunima Prabhakar

A few weeks back we did a post/thread on Twitter on archiving and social media. Today we explore another connection with archives that is apparent in Tattle's work: the one between creating databases for ML and archiving.

When archives get digitized, they result in a database that can be analyzed computationally. But can we also move in the opposite direction? Can databases, such as the hosepipe of data crawled from the web (Common Crawl) feeding ChatGPT, or digitized land records, also be considered archives?

In prescribing lessons for the discipline of machine learning from archiving, Jo and Gebru write: "Data collection in significant ML indiscriminate. Curatorial archives lie on the other extreme of the intervention spectrum." But to draw this comparison is to say that there is a continuum, from the databases that feed machine learning to the archives with considered collection and curation.

It is also important to remember that not all archives are curated. For example, the Cairo Genizah came about as a result of the "Rabbinic prohibition that a religious text cannot be aimlessly discarded, but must be carefully stowed away" (Limn: A Hoard of Hebrew MSS).

With the Cairo Genizah, century after century, religious texts were put into a continuously growing store. The intended end of this collection was indefinite. So what is to say that the indiscriminately collected databases of today are not the archives of tomorrow? That three centuries from now, someone will not go through these large collections to understand what a group of people cared about at a certain time in history? Yet, far from the Cairo Genizah, data creation and collection for machine learning can seem sacrilegious and irreverent. The identity of those from whom the data is collected is obliterated. Datasets can thrive despite large imperfections such as duplication, exclusions and incorrect entries. And there is no expectation that a human will ever peruse every record of a database feeding ML models.

Going back to the paper by Jo and Gebru, they prescribe micro- and macro-level changes that ML practitioners can adopt from archivists. They suggest defining a mission statement and conceptualizing data consortia, practices that move dataset creation closer to the notion of care central to archiving. But as companies get into battles around generative AI (Bard vs ChatGPT, DALL-E vs Stable Diffusion), it seems harder to make a case for care at the cost of efficiency. Yet much of our own work (Uli, Kosh Search), like a lot of other ML in the real world, relies not on the big datasets needed for building large-scale models but on 'small' datasets for fine-tuning them to a unique context (in our case, Indian language content). With these small datasets, some of the values of care from archiving can be actualized. And this interaction between big, indiscriminately collected datasets and smaller, more carefully curated datasets/archives in the creation of ML products entwines the worlds of archives and ML datasets more tightly.

Text and illustrations on the website are licensed under a Creative Commons 4.0 License. The code is licensed under GPL. For data, please look at the respective licenses.