Companies are keen to deploy AI solutions for their users, and there is a rush to get to market first. While the practices of UX design and security have matured for digital products, AI safety is still an emerging domain and practice. Over the last year, Tattle has had the opportunity to conceptualize and implement an AI safety strategy. In this article, I present Tattle’s approach to conducting manual evaluations of AI tools to identify risks and harms emerging in different domains of the Indian social sector. These manual evaluations became the foundational research guiding the development of safety guardrails that target specific harms and that any nonprofit can use and adapt to its own domain. I detail the process of our evaluations, share critical observations from our research, and outline a practical guide on how to replicate our methodology and adapt it to other AI deployments in the social sector.
Tattle’s work on AI safety is a continuation of Uli, our long-term project on combating technology-facilitated gender-based violence (TfGBV). Following the publication of the Uli Slur List, a crowdsourced list of harmful words in 6 Indian languages, Tattle worked with MLCommons to produce an AI safety benchmark dataset for hate crimes and sex-related crimes in Hindi. This work then led to our collaboration with Project Tech4Dev, aimed at providing training in AI safety and building safety guardrails for the AI deployments of several nonprofits in their AI Cohort Program. As AI safety partners, our mandate was to conduct research on the pilot deployments of select nonprofits’ AI tools to understand emerging safety risks, to conduct knowledge-building sessions on the basics of AI safety and evaluations, and to build safety guardrails that can be used across multiple nonprofit domains based on the outcomes of our qualitative evaluations.
The fundamental choice in evaluation methodology often comes down to two distinct approaches: human evaluations and automated evaluations. Both approaches have strengths and weaknesses, and we argue that a balanced mix of both is necessary to ensure continuous, rigorous testing of AI tools. The table below briefly describes each approach, with some advantages and disadvantages:
| | Human Evals | Automated Evals |
|---|---|---|
| Description | Involves researchers and domain experts evaluating LLM responses to determine accuracy and develop evaluation criteria. | Uses advanced LLM models like GPT or Claude to evaluate responses based on predefined or human-defined criteria. |
| Pros | Nuanced analysis; adaptive to multilingual context; catches edge cases | Speed; scalability; cost effectiveness |
| Cons | Expensive; time consuming to scale; human error/bias | Probabilistic accuracy; hallucinations; off-the-shelf models reflect bias of designers/training data |
| When to use | Multilingual and multicultural contexts; high-risk use cases | Limited functionality, e.g. a customer service chatbot that answers only a fixed set of questions; scaling up with rigorous evaluations; low-risk use cases |
Tattle has developed a four-step process for conducting human evaluations of LLM systems. This was tested and iterated on by analysing data from 3 AI deployments by nonprofits in the health and education sectors. We received data from all three chatbots’ pilot runs under an NDA and then conducted a thorough analysis on small samples of data taken from each dataset. The pilot run of these AI deployments occurred during early to mid 2025, and our research/analysis was conducted between October 2025 and January 2026.
The data analysis was used to determine the types of safety guardrails Tattle would develop for the AI cohort program and make available more broadly to any social sector organization. The analysis and recommendations had to be prepared within 4-6 weeks to leave time for research and development of the guardrails in the next phase of the project. Based on the analysis from these evaluations, we aimed to develop custom solutions for the most pressing risks observed in the dataset. As the project progressed, we configured the custom solutions to be adaptable to nonprofits across different sectors and provided recommendations on how to make them relevant to specific use cases.
The four steps of our AI evaluation process are elaborated below.
The size of the datasets we received from the nonprofits varied, from 75000 pairs of input-output conversational segments from one NGO to a live sheet consisting of 500+ unique conversational flows from another nonprofit. In order to select a sample that was representative and meaningful for analysis, we had to consider a few things.
Based on these considerations, two researchers selected sample data manually. In practice, this meant selecting about 5-6 conversational pairs at regular intervals in the dataset. We aimed to get about 1.5% of the larger dataset as a sample and selected 1000 out of 75000 input-output pairs (about 1.3%). For the second dataset, we selected 500 out of 20000 input-output pairs, about 2.5% of the pilot dataset.
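The sampling in this project was done by hand, but the same idea (small blocks of consecutive pairs taken at regular intervals) can be approximated in code. The following is a minimal sketch, not the procedure we ran: the function name `sample_at_intervals`, the record fields, the block size of 5, and the random starting offset are all assumptions made for illustration.

```python
import random


def sample_at_intervals(pairs, target_size, block_size=5, seed=0):
    """Pick blocks of `block_size` consecutive pairs at regular intervals.

    Hypothetical helper: approximates the manual approach of taking
    about 5-6 conversational pairs at regular points in the dataset.
    """
    if target_size >= len(pairs):
        return list(pairs)
    n_blocks = max(1, target_size // block_size)
    stride = len(pairs) // n_blocks  # distance between block starts
    rng = random.Random(seed)
    # A random starting offset avoids always sampling the same positions.
    offset = rng.randrange(max(1, stride - block_size + 1))
    sample = []
    for b in range(n_blocks):
        start = b * stride + offset
        sample.extend(pairs[start:start + block_size])
    return sample[:target_size]


# Placeholder records standing in for a 75000-pair pilot dataset.
pairs = [
    {"id": i, "input": f"user message {i}", "output": f"bot reply {i}"}
    for i in range(75000)
]
sample = sample_at_intervals(pairs, target_size=1000)
print(len(sample), f"{len(sample) / len(pairs):.1%}")
```

Taking consecutive pairs in each block keeps some conversational context around every sampled message, which is useful when an annotator has to judge whether a response was appropriate.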
Once we had the sample curated, we decided to annotate the dataset in two ways. First, we focused on labeling only the input messages to identify user intent. For this, we drew lessons from NOORA Health’s User Intent classification framework. While NOORA uses user intent classification to reduce load on their Medical assistance, their classification was helpful as a starting point for identifying the types of messages users were sending to the NGOs’ chatbots. Building on this, we decided to expand the classification to include thematic topics of user queries, in-scope and out-of-scope queries, high-risk inputs on themes flagged by the client, and auto-generated inputs based on menu options the LLM provided.
The second type of annotation involved labeling each input-output pair to identify the accuracy of the LLM response, the type of automated response, any errors (such as vague, harmful, risky, or nonsensical responses), and whether the LLM was providing out-of-scope responses. The list of broad annotation labels used included:
Depending on the use case, not all of these annotation labels were necessary, but they did provide a base annotation guide that we built on across multiple use cases and samples.
This approach to annotation was rooted in a grounded theory method: instead of assuming themes or label categories a priori, we read the data to come up with labels that were best suited to the specific AI use case’s context. This was augmented by prior conversations with the nonprofits about risks they had observed during the pilot and wanted us to validate through our evaluations. As a result, the thematic topic categories for queries emerged during annotation, rather than from a predefined number of topics that we set out to identify in the dataset.
Once the labeling was done, we ran a script to identify percentages of each label found in the data sample to see the most frequently occurring issues.
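A label-counting script of this kind can be very small. The sketch below is an illustration rather than the script we used: it assumes each annotated pair is a dictionary with a `labels` list, and the label names shown are placeholders.

```python
from collections import Counter


def label_percentages(annotations):
    """Return {label: (count, percent of annotated pairs)}.

    A pair may carry several labels, so percentages can sum to over 100.
    """
    total = len(annotations)
    if total == 0:
        return {}
    counts = Counter(
        label for row in annotations for label in set(row["labels"])
    )
    return {
        label: (n, round(100 * n / total, 1))
        for label, n in counts.most_common()
    }


# Illustrative annotations; the label names are assumptions.
annotations = [
    {"id": 1, "labels": ["accurate"]},
    {"id": 2, "labels": ["out_of_scope", "vague"]},
    {"id": 3, "labels": ["accurate"]},
    {"id": 4, "labels": ["out_of_scope"]},
]
summary = label_percentages(annotations)
for label, (count, percent) in summary.items():
    print(f"{label}: {count} ({percent}%)")
```

Percentages here are computed against the number of annotated pairs, not the number of labels, so a response tagged with two error types counts once toward each.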
The third step follows organically from the annotation phase. As we were labeling our input-output pairs, we observed edge cases and unexpected queries that led us to conduct targeted keyword searches for more such phenomena. This helped to increase the sample size and make the sample richer with representative cases.
For example, in an educational NGO Chatbot context, we observed the sudden revealing of caste category in a routine user request for information on their educational profile. This prompted a targeted keyword search for the terms “caste”, “General”, and other caste names to check for other leakages of this personally identifiable information.
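A targeted keyword search like this can be sketched as a case-insensitive, whole-word match over both sides of each pair. The function name `keyword_search`, the record fields, and the sample messages are hypothetical. A term such as "General" also matches ordinary usage, so every hit still needs a manual read, and word-boundary matching may need adjusting for text in Indic scripts.

```python
import re


def keyword_search(pairs, keywords, fields=("input", "output")):
    """Return the pairs whose input or output contains any keyword."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for pair in pairs:
        matched = sorted(
            {
                m.group(0).lower()
                for field in fields
                for m in pattern.finditer(pair.get(field) or "")
            }
        )
        if matched:
            # Keep the matched terms so reviewers can triage hits quickly.
            hits.append({**pair, "matched": matched})
    return hits


# Placeholder conversation pairs, not real pilot data.
pairs = [
    {"id": 1, "input": "What is my category?", "output": "Your category is General."},
    {"id": 2, "input": "When is the exam?", "output": "The exam is on Monday."},
    {"id": 3, "input": "Does caste matter for the scholarship?", "output": "Please contact the office."},
]
hits = keyword_search(pairs, ["caste", "general"])
print([(h["id"], h["matched"]) for h in hits])
```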
The final step of the process involves analysis to identify patterns emerging in the annotation labels, doing close readings of the input-output pairs showing high-risk LLM responses, checking for tone in messages, and noting outliers in the annotations. To begin with we conducted a simple count for all labels to determine the overall amount of inaccurate or unhelpful responses observed in our sample. As shown below there was about a 20% incidence of errors of various kinds.
Example of Output Analysis (Total 503)
| Response type | Share of sample |
|---|---|
| Accurate Responses | 80% |
| Responses with Errors | 20% |
The error breakdown provides valuable insights into specific areas needing improvement:
| Error type | Count | Percentage |
|---|---|---|
| Out of Scope | 71 | ~14% |
| Language Error | 46 | ~9% |
| Risky | 3 | ~0.6% |
| Vague | 19 | ~4% |
| Unhelpful | 12 | ~2.4% |
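The percentages in the breakdown can be reproduced from the counts, taking the sample total of 503 as the denominator. This is a short arithmetic check under that assumption. Note that the label counts sum to 151, which is more than 20% of 503; one possible reading, which the figures alone do not confirm, is that a single response can carry more than one error label.

```python
total = 503  # sample size reported for the output analysis
counts = {
    "Out of Scope": 71,
    "Language Error": 46,
    "Risky": 3,
    "Vague": 19,
    "Unhelpful": 12,
}

# Share of the whole sample carrying each error label.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
label_total = sum(counts.values())

for label, share in shares.items():
    print(f"{label}: {counts[label]} ({share}%)")
print(f"All error labels: {label_total} ({round(100 * label_total / total, 1)}%)")
```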
There are a number of things you can analyse at this stage, such as comparing the percentages of each error or re-reading input-output pairs to determine why the LLM responds vaguely to certain queries and not others. As a starting point, we focused on the following:
Edge case queries: out-of-scope queries for which the LLM was frequently hallucinating responses, and unexpected queries that were related to the theme of the LLM but outside its specific scope. The latter was revealed to us through our first consultation with one of the clients, where we presented analyses from our first round of evaluations on their pilot data. We had noted that nutrition was the second most frequent query topic and that the chatbot frequently adopted a prescriptive tone in responding to these questions. Since this was a healthcare-related AI tool, we had not considered nutrition to be off topic, but the clients informed us that it was not a topic in their knowledge base and that they did not want the LLM to respond to these queries at all.
It is important to state here that the process outlined above is iterative, not linear. Sampling was not done just once: with each round of annotations, we refined and added to our sample. Similarly, after conducting an initial round of analysis, we shared our initial findings with the client and, based on their feedback, repeated the analysis on a smaller sample of the dataset. In the healthcare context, for example, we focused on the nutritional queries and responses to observe how the LLM was answering these out-of-scope questions.
The factors that pose limitations can, when addressed intentionally, become markers of effective evaluations.
For organizations looking to implement human evaluations, we suggest starting with this checklist:
Day One Checklist
Building Over Time
The systematic approach to human evaluations outlined here provides a foundation for organizations seeking to understand and improve their LLM systems through periodic, accessible, and sustainable evaluations.