Visibilizing the Connections Between Archiving and ML Datasets

Published on Mon Feb 13 2023 by Tarunima Prabhakar

A few weeks back we did a post/thread on Twitter on archiving and social media. Today we explore another connection with archives that is apparent in Tattle's work: the one between creating databases for ML and archiving.

When archives get digitized, they result in a database that can be analyzed computationally. But can we also move in the opposite direction? Can databases, such as the hosepipe of data crawled from the web (Common Crawl) that feeds ChatGPT, or digitized land records, also be considered archives?

In drawing lessons from archiving for the discipline of machine learning, Jo and Gebru write:
"Data collection in significant ML subfields...is indiscriminate. Curatorial archives lie on the other extreme of the intervention spectrum." (https://dl.acm.org/doi/pdf/10.1145/3351095.3372829)

But to draw this comparison is to say that there is a continuum, running from the databases that feed machine learning to the archives built through considered collection and curation.

It is also important to remember that not all archives are curated. For example, the Cairo Genizah came about as a result of the "Rabbinic prohibition that a religious text cannot be aimlessly discarded, but must be carefully stowed away" (Limn: A Hoard of Hebrew MSS).

With the Cairo Genizah, century after century, religious texts were put into a continuously growing store. The intended end of this collection was indefinite. So what is to say that the indiscriminately collected databases of today are not the archives of tomorrow? That three centuries down the line, someone will not go through these large collections to understand what a group of people cared about at a certain time in history? Yet, far from the Cairo Genizah, data creation and collection for machine learning can seem sacrilegious and irreverent. Who the data was collected from is obliterated. Datasets can thrive despite large imperfections: duplication, exclusions and incorrect entries. And no one expects that a human will ever peruse every record of a database feeding ML models.

Going back to the paper by Jo and Gebru, they prescribe micro- and macro-level changes that ML practitioners can adopt from archivists. They suggest defining a mission statement and conceptualizing data consortia, practices that move dataset creation closer to the notion of care central to archiving. But as companies get into battles around generative AI (Bard vs ChatGPT, DALL-E vs Stable Diffusion), it seems harder to make a case for care at the cost of efficiency. Yet much of our own work (Uli, Kosh Search), like a lot of other ML in the real world, relies not on the big datasets needed to build large-scale models but on 'small' datasets for fine-tuning those models to a unique context (in our case, Indian language content). With these small datasets, some of the values of care from archiving can be actualized. And this interaction between big, indiscriminately collected datasets and smaller, more carefully curated datasets/archives in the creation of ML products entwines the worlds of archives and ML datasets more tightly.


Text and illustrations on the website are licensed under a Creative Commons 4.0 License. The code is licensed under GPL. For data, please look at the respective licenses.
