The Need for Archiving Social Media Data and The Conundrums in Opening It

Published on Sun Jan 10 2021 by Tarunima

Within a decade, social media went from a tool of organizing and democratization to a tool for surveillance and control. Even as hypotheses are generated and theories conceived about the dangers of personalized feeds, we encounter messaging platforms that continue to be vectors of misinformation in the absence of personalization.

In 2014, through a paper released by researchers at a social media giant, we learnt that social media users are literally, even if unknowingly, participating in a giant experiment. More broadly, social media as a concept, in its various design iterations (of which 'growth hacking' is one version), is also a giant experiment in human cultural evolution. We have never lived in a time of near-instant, planetary-scale communication between millions of individuals. The companies' valuations are validation that these communication channels can be wildly profitable for their creators, but the impact of these new media on our personality, physiology, democracies and societies is hardly understood.

To use a lofty metaphor, regulatory and public reaction to the experiment of social media is like Hegel's Owl of Minerva, which spreads its wings only with the falling of dusk. We've woken up to the abuses of a platform only after major IRL events. The Brexit vote and the American election in 2016 put Twitter and Facebook under scrutiny. The lynchings linked to WhatsApp rumours in 2018 put the spotlight on messaging apps. But the nature of the social media experiment is continually evolving. More populations have come online for the first time in the last three years, and different demographics are redefining what it means to be social online. The social media experiment is nimble and volatile, and it is far from over. The belated wisdom from the last decade of social media should be that reflection on the experiment has to co-exist with our consciousness of, and participation in, it. The alternative may be quite dangerous.

To capture this in another, less lofty metaphor: the social media experience is us driving through dense fog. We should at least get a pair of fog lights.

This is why we stress the importance of a searchable archive of content circulating on platforms in India. This data can serve as fog lights, helping us see the road we are on and possibly saving us from driving off a cliff. Over the last few years there has been a rise in research from the Global North describing cognitive biases, the aesthetics of social media content and the motivations of misinformation peddlers. But we need a lot more contextual storytelling. The Indian social media experience is distinct from the American or the European one. Within India, the Tamil social media landscape is distinct from the Assamese one. 'Fake news' and hate speech are global phenomena but have local manifestations.

As researchers, we know that the time spent on collecting data, its transaction costs, can often be prohibitive. Even as the need for research on social media grows, the transaction costs of collecting content from 'mobile first' platforms are rising. This is especially true of closed messaging apps. From Tattle's perspective, if we take care of these transaction costs, it opens more space for journalists and researchers to focus on the important work of understanding and storytelling. We also recognize that storytelling can come from anywhere: a high school or university project, a teacher's or parent's concern, or a social media user's curiosity. Our experience from a year of creating and sharing datasets has substantiated this assumption.

In principle, we want to open much of the data we collect. The emphasis on 'searchable' in the archive stems from a need to make this data accessible to speakers of different languages, in modalities they are comfortable with. However, fairly early in the process, we were reminded that access can be misused. On social media, the lines between public and private discourse are often blurred. People may unwittingly share personal details in group conversations. Sometimes, platforms might not have adequate security measures in place and might reveal personal information about their users. And finally, people might not always seek data with the best of intentions.

Consequently, Tattle has adopted the policy of embargoing the data it collects and sharing it selectively in the early stages of data collection. In this stage we share data only with people who may be interested (and have reached out to us through email or through the website). For some time, we assume the uncomfortable role of gatekeepers. But this gives us a buffer period to understand how the data is used and can be misused: which fields are helpful for research, and which should be discarded for risk of abuse? Which fields, if any, are critical for research but too dangerous to be opened indiscriminately? Are our anonymization techniques robust to re-identification efforts?
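To make the trade-off concrete, here is a minimal sketch of the kind of field-level anonymization these questions imply. The field names, salt handling and hashing scheme are hypothetical illustrations, not Tattle's actual schema or pipeline: obvious personal fields are dropped outright, while identifiers needed to link records are replaced with salted hashes.

```python
import hashlib

# Hypothetical field policy: names and choices here are illustrative,
# not Tattle's actual schema or pipeline.
DROP_FIELDS = {"phone_number", "sender_name"}  # obvious PII: discard entirely
HASH_FIELDS = {"group_id", "sender_id"}        # needed to link records: pseudonymize
SALT = b"rotate-and-keep-this-secret"          # secret salt raises the cost of rainbow-table attacks

def anonymize(record: dict) -> dict:
    """Return a copy of `record` with PII fields dropped and identifiers hashed."""
    out = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # never release this field
        if key in HASH_FIELDS:
            digest = hashlib.sha256(SALT + str(value).encode("utf-8")).hexdigest()
            out[key] = digest[:16]  # truncated hash still links records consistently
        else:
            out[key] = value
    return out

# e.g. anonymize({"phone_number": "+91...", "group_id": "g42", "text": "hello"})
# -> {"group_id": "<16 hex chars>", "text": "hello"}
```

Even this sketch shows why embargoes remain necessary: salted hashing is not a silver bullet. Identifiers drawn from small spaces, such as phone numbers, can be brute-forced if the salt leaks, and free-text fields can still carry personal details that no field policy catches.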

We recognize that our ultimate evaluation of these trade-offs will not satisfy everyone. We often have strong disagreements on these questions within the Tattle team itself. These trade-offs touch upon several global conversations on research ethics, journalistic responsibility and open access. As an open source project, we aim to keep the space open (through Slack/GitHub/email) for people to voice concerns with our decisions and deliberate with us. We also hope that in cases of impassable disagreement, people will fork the project and take it in directions more aligned with their vision. Ultimately, we hope this work will reduce information asymmetries between platforms, governments and the public, and result in wiser interventions in the space.

Text and illustrations on the website are licensed under the Creative Commons 4.0 License. The code is licensed under the GPL. For data, please look at the respective licenses.