Understanding Scraping - A Perspective from India

Published on Mon Oct 17 2022 by Tarunima Prabhakar

In a Twitter thread we had discussed the challenges of carrying out public research on social media platforms. One way that researchers have gotten around this challenge is by 'scraping' platforms. Web scraping uses software that simulates the actions of a human surfing the internet to collect specified bits of information from websites. So, the code lets you collect content from a platform without manually browsing every post, clicking on it and saving the relevant information.
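To make this concrete, here is a minimal sketch in Python of what 'collecting specified bits of information' can look like. Everything in it is an assumption made for illustration: the page markup, the class names (`post`, `author`, `text`) and the helper `scrape_posts` are hypothetical and do not describe any real platform or any tool we use. It works on a hard-coded page so that it runs without touching a website.

```python
from html.parser import HTMLParser

# Hypothetical page markup. A real scraper would first download the page;
# here the HTML is hard-coded so the example is self-contained.
SAMPLE_PAGE = """
<html><body>
  <div class="post"><span class="author">asha</span><p class="text">First post</p></div>
  <div class="post"><span class="author">ravi</span><p class="text">Second post</p></div>
  <div class="ad">Buy now</div>
</body></html>
"""


class PostCollector(HTMLParser):
    """Collects the author and text of each element marked class="post"."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None  # the field ("author" or "text") currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "post":
            # A new post begins: start an empty record for it.
            self.posts.append({"author": "", "text": ""})
        elif css_class in ("author", "text") and self.posts:
            self._field = css_class

    def handle_endtag(self, tag):
        # Leaving a <span> or <p> ends the field being read.
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        # Text outside the chosen fields (such as the ad) is ignored.
        if self._field:
            self.posts[-1][self._field] += data.strip()


def scrape_posts(html):
    """Return a list of {"author": ..., "text": ...} records found in html."""
    collector = PostCollector()
    collector.feed(html)
    return collector.posts


if __name__ == "__main__":
    for post in scrape_posts(SAMPLE_PAGE):
        print(post["author"], "-", post["text"])
```

The code reads the page the way a browser would receive it and keeps only the pieces it was told to look for. Here that means the two posts are kept and the advertisement is skipped.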

There are two related but distinct questions around web scraping for research: is it legal, and is it ethical? Both rely in part on whether what is being scraped can be considered 'public'. Considering that you need to have an account on Twitter to navigate the site, can a post there be considered public? On the other hand, considering the ease of creating an account on a social media site, is a log-in requirement sufficient to make that post 'not public'?

As we have argued in previous writings, despite the emphasis on 'private data' in data protection regulations, the distinction between public and private is blurred.

Another question that defines the legality/ethicality of scraping is who owns the data created on a social media platform: is it the users or the platforms? LinkedIn, when suing HiQ, emphasized that the data was hosted on its servers. HiQ argued it was scraping only public profiles. The case was ruled in favor of HiQ (yes, we will cite Wikipedia because this entry is a good introduction).

Now let's talk about India. The IT Act, drafted in 2000, doesn't provide any guidance on the public/private or ownership questions. So, when a platform sent us a cease and desist notice some years ago, they relied, amongst others, on the following clauses:

  • "obtaining access to platform resources without permission"
  • "introducing or causing the introduction of any contaminant into the platform system"
  • With a final statement: "it is clear that you have dishonestly and wrongfully accessed stolen property belonging to platform and users"

Scraping is something that is done at the level of a user's browser, and there is no introduction of, well, anything on a platform's system. But given the language of the IT Act, this is perhaps the best (even if technically incorrect) recourse a platform has to make its case.

The need for, and pace of, research and industrial applications on digital platforms is going to make debates on scraping more important. But in multiple conversations, we have realized that conceptually, scraping can be hard for a lay audience to understand. This means it isn't clear that courts would understand the nuances, and there is limited jurisprudence to guide them. The danger is that it might come down to who shouts their 'scraping is like' metaphor louder.

As with all tech policy debates, a better informed citizenry will be helpful. But as India considers overhauling the IT Act with a Digital India Act, one hopes the new act will provide principles for more sensible thinking around the complexities of scraping.

References to Learn More About Scraping:

CLTC at UC Berkeley is running a lecture series on data scraping:

FB/Meta vs NYU case: https://news.bloomberglaw.com/privacy-and-data-security/facebook-move-against-nyu-research-stirs-scraping-policy-feud

If you would like to discuss scraping or issues around it, please feel free to ping us on Twitter or email (admin@tattle.co.in).

Text and illustrations on the website are licensed under the Creative Commons 4.0 License. The code is licensed under GPL. For data, please look at the respective licenses.