Scraping Fact-Checking Sites

Published on Fri Nov 15 2019 by Keshav Joshi

Misinformation is diligently tracked by a number of fact-checking sites in India. The content they put on their websites forms a subset of the corpus of misinformation circulated on chat apps. Collecting this dataset helps us improve our search module. Additionally, it allows us to flag newly discovered content that has already been fact-checked.

We set out to create a database of most, if not all, of the news fact-checked by major fact-checking sites. To do this, we scraped every article, across various Indian languages, from the following sites:

Altnews: English + Hindi
Quint
Boomlive: English + Hindi + Bangla
Vishvasnews: Hindi + English + Punjabi + Urdu + Assamese
IndiaToday
Factly: English + Telugu

Our approach

Some sites are easy to scrape, and some are JavaScript. So let us embark and parse some XML. A lot of sites render most of their HTML on the server side, and such pages can be scraped easily with lxml.
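
As a minimal sketch of scraping server-rendered HTML with lxml: the XPath expressions, the sample page and the function names below are assumptions made for illustration, not the selectors used for any of the sites listed above.

```python
from urllib.request import Request, urlopen

from lxml import html


def fetch_html(url: str) -> str:
    """Download the server-rendered HTML for a page (no JavaScript is run)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_article(page_source: str) -> dict:
    """Pull a headline, body paragraphs and image URLs out of article HTML."""
    tree = html.fromstring(page_source)
    # Placeholder XPaths: every site needs its own expressions.
    headline = tree.xpath("string(//h1)").strip()
    paragraphs = [p.text_content().strip() for p in tree.xpath("//article//p")]
    images = list(tree.xpath("//article//img/@src"))
    return {"headline": headline, "paragraphs": paragraphs, "images": images}


# Made-up page standing in for a real fact-check article.
SAMPLE = """
<html><body>
  <h1> Viral photo is from 2015 </h1>
  <article>
    <p>The image is old.</p>
    <img src="https://example.org/a.jpg">
    <p>It was first published in 2015.</p>
  </article>
</body></html>
"""

if __name__ == "__main__":
    print(parse_article(SAMPLE))
```

For a live page, `parse_article(fetch_html(url))` would follow the same path, provided the XPaths are adjusted to that site's markup.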

Note: lxml is a simple HTML/XML parser. Sites can also be scraped using the popular scraping library BeautifulSoup4, which can use lxml or other parsers under the hood. The divs identified here can also be used to build Scrapy spiders.

The HTML that we are initially looking for can be found in the page source, i.e., the HTML rendered on the server and sent to the browser:

Figure: View page source, showing the HTML rendered on the server.

Figure: Page source HTML, used to identify the relevant tags to scrape.

For each fact-checking website, we collect a random subset of articles and identify the divs/tags that contain the images, videos, metadata and body text. We then identify an XPath that does not fail on any article in this set (sometimes it can get complicated).
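
One way to check an XPath against a sample set is to try a short list of candidate expressions on every sampled page and report which one matched. The sketch below assumes hypothetical class names and layouts; the real expressions depend on each site's markup.

```python
from lxml import html

# Hypothetical candidate XPaths for the article body, tried in order.
BODY_XPATHS = [
    "//div[contains(@class, 'entry-content')]//p",
    "//div[@itemprop='articleBody']//p",
    "//article//p",
]


def first_matching_xpath(tree, candidates):
    """Return the first XPath that selects at least one node, else None."""
    for expression in candidates:
        if tree.xpath(expression):
            return expression
    return None


def check_sample(pages: dict, candidates) -> dict:
    """Map each sample page name to the XPath that worked (None = failure)."""
    return {
        name: first_matching_xpath(html.fromstring(source), candidates)
        for name, source in pages.items()
    }


# Made-up pages standing in for a random subset of articles from one site.
SAMPLE_PAGES = {
    "new_layout": "<html><body><div class='post entry-content'><p>Claim</p></div></body></html>",
    "old_layout": "<html><body><article><p>Claim</p></article></body></html>",
    "broken": "<html><body><span>No body here</span></body></html>",
}

if __name__ == "__main__":
    for name, xpath in check_sample(SAMPLE_PAGES, BODY_XPATHS).items():
        print(name, "->", xpath or "NO MATCH, refine the XPath")
```

Any page that maps to `None` is a signal that the candidate list needs another expression before the full scrape is run.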

All data from a site is stored as 'Docs'. A Doc can be an image, a video or the entire text body, and each Doc is stored as a JSON object.

The set of all Docs found in an article is embedded in an article-level JSON object, along with the metadata for the article.
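
To make the Doc and article nesting concrete, here is a rough sketch in Python. All field names (`doc_type`, `media_url`, `date_published` and so on) and the example values are hypothetical; the actual schema used in the archive is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Doc:
    # Illustrative fields only, not the schema used in the archive.
    doc_type: str                    # "image", "video" or "text"
    content: Optional[str] = None    # body text, for text Docs
    media_url: Optional[str] = None  # source URL, for image/video Docs


@dataclass
class Article:
    # Article-level metadata plus the Docs embedded in it.
    url: str
    headline: str
    language: str
    date_published: str
    docs: List[Doc] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() also converts the nested Doc objects into plain dicts.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


def build_example() -> Article:
    """Assemble a made-up article with one text Doc and one image Doc."""
    return Article(
        url="https://example.org/fact-check/1",
        headline="Viral photo is from 2015",
        language="hi",
        date_published="2019-11-15",
        docs=[
            Doc(doc_type="text", content="The image is old."),
            Doc(doc_type="image", media_url="https://example.org/a.jpg"),
        ],
    )


if __name__ == "__main__":
    print(build_example().to_json())
```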

All the articles are then stored in a database (MongoDB) ready to be consumed by the search module.
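
A sketch of how scraped articles might be written to MongoDB is given below. Keying each record on the article URL and using an upsert are assumptions made here so that repeated scrapes do not create duplicates; the database and collection names in the docstring are hypothetical. The small in-memory class exists only so the sketch can be tried without a running MongoDB server.

```python
from typing import Iterable


def store_articles(collection, articles: Iterable[dict]) -> int:
    """Upsert articles keyed on URL so that re-scrapes do not duplicate them.

    `collection` is expected to behave like a pymongo Collection, e.g.
    pymongo.MongoClient(uri)["factcheck"]["articles"] (hypothetical names).
    """
    written = 0
    for article in articles:
        collection.update_one(
            {"url": article["url"]},  # match an existing record by URL
            {"$set": article},        # overwrite its fields with the new scrape
            upsert=True,              # insert when no record matches
        )
        written += 1
    return written


class InMemoryCollection:
    """Tiny stand-in for a pymongo Collection, for offline experiments only."""

    def __init__(self):
        self.records = {}

    def update_one(self, filter, update, upsert=False):
        key = filter["url"]
        if key in self.records or upsert:
            self.records.setdefault(key, {}).update(update["$set"])


if __name__ == "__main__":
    demo = InMemoryCollection()
    store_articles(demo, [{"url": "https://example.org/1", "headline": "Old photo"}])
    print(demo.records)
```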

A few sites (such as Quint) render most content dynamically on the client side. These require a more involved approach with Selenium and geckodriver. This combination allows us to drive a full browser (Firefox), execute JavaScript and interact with the UI. This way we can load some of the images/videos that are not rendered on the server.
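
The Selenium approach could look roughly like the following sketch. The wait condition, the scroll step and the media XPath are placeholders chosen for illustration; a real site would need its own. Running `render_page` requires Selenium, Firefox and geckodriver to be installed.

```python
from lxml import html


def render_page(url: str, wait_seconds: int = 10) -> str:
    """Load a page in headless Firefox and return the DOM after JavaScript ran."""
    # Imported lazily so media_urls() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)  # needs geckodriver available
    try:
        driver.get(url)
        # Placeholder condition: wait until at least one image is in the DOM.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "img"))
        )
        # Scroll to the bottom to trigger lazily loaded media.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        return driver.page_source
    finally:
        driver.quit()


def media_urls(rendered_source: str) -> list:
    """Collect image and video URLs from the rendered HTML."""
    tree = html.fromstring(rendered_source)
    return list(tree.xpath("//img/@src | //video/@src | //video/source/@src"))
```

Once the rendered source is in hand, the rest of the pipeline can stay the same as for server-rendered sites, for example `media_urls(render_page(url))`.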

We are currently scraping all aforementioned sites weekly for our archive. The code referenced above can be found here.

Text and illustrations on the website are licensed under the Creative Commons 4.0 License. The code is licensed under GPL. For data, please look at the respective licenses.
