Reflections on Building Safety Guardrails for Tech4Dev

Published on Sat Nov 01 2025 by Tarunima Prabhakar

For the last three months we’ve been working with Project Tech4Dev to advise the nonprofits in their AI cohort on safety guardrails. Just as nonprofits are figuring out where and how to build AI for their use cases, many of us are figuring out how to build guardrails. This is evolving territory, and we are very much in research-develop-iterate mode.

In our first workshop with the nonprofits, we gave them a set of prompts and asked them to write what they believed were acceptable and unsafe responses for an LLM to give to each. The acceptable part was easy for the nonprofits to answer: anything that fell within the knowledge base provided through prompt engineering or RAG-based customization. But outside that small territory lies a vast landscape of banal, unhelpful, nonsensical, biased and harmful responses. The nonprofits defined unsafe responses in the negative: anything that fell outside the knowledge base. With a deterministic system, this hard boundary around a knowledge base could have been workable. But with large language models, a non-trivial percentage of responses will fall outside the knowledge base. In controlled, ‘clinical’ settings, RAG-based systems can exceed 95% accuracy, yet depending on the implementation and the task, hallucination rates can run higher than 20%. And while all technologies fail, machine learning models fail unpredictably. Given that there is no cause-and-effect explanation for why a model produces non-ideal outputs, how do you identify and deal with the failures?
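To make “identifying the failures” a little more concrete, here is a minimal sketch of one way an application team might flag responses that drift outside the retrieved knowledge base. The lexical-overlap heuristic, the 0.5 threshold, the `grounding_score` helper and the example data are all illustrative assumptions; real pipelines typically rely on an entailment model or an LLM judge for groundedness checks.

```python
# Minimal sketch: flag answers whose content words barely overlap with the
# retrieved passages, as a crude proxy for "outside the knowledge base".
import re


def content_words(text: str) -> set[str]:
    """Lowercased words of 4+ characters, a rough proxy for content terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 4}


def grounding_score(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of the answer's content words that appear in the retrieved text."""
    answer_terms = content_words(answer)
    if not answer_terms:
        return 0.0
    context_terms = content_words(" ".join(retrieved_chunks))
    return len(answer_terms & context_terms) / len(answer_terms)


if __name__ == "__main__":
    chunks = ["Iron and folic acid supplements are recommended during pregnancy."]
    answer = "You should take iron and folic acid supplements during pregnancy."
    score = grounding_score(answer, chunks)
    # Below some threshold, route the response for review instead of sending it.
    print(f"grounding score: {score:.2f}", "-> review" if score < 0.5 else "-> send")
```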

While we entered this work armed with AI risk and hazard frameworks, it strikes me that this is not how product teams at nonprofits look at it. They are as concerned with banal and unhelpful outputs as with unsafe ones. Banal outputs, which are more commonly observed, reduce the utility of a nonprofit’s service. Outright unsafe outputs might be less common but have more severe consequences.

The essential step missing from existing product development cycles is evaluating the quality of outputs for a set of inputs. Tech4Dev underscored this at the in-person sprint in October. Quality is multi-dimensional: it can be about accuracy, but also about tone, bias and cross-language consistency. In the context of this blog, and of our work with Tech4Dev, we use evaluation specifically to mean model evaluation: does the application produce the desired response to a question or statement from the user of the service? Tattle’s work is one step beyond evaluations. We need to figure out how to categorize the non-ideal outcomes, understand the risk and consequence of each category, and build failsafes accordingly. Failsafes also affect user experience, so product teams are right to push back on proposals that slow down response times or add to their development costs.
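As a rough illustration of what per-dimension evaluation could look like in code, here is a minimal sketch. The `generate()` wrapper, the scoring heuristics and the list of “overconfident” phrases are assumptions standing in for whatever model calls, judges or human review a team actually uses.

```python
# A minimal sketch of per-dimension evaluation, assuming a hypothetical
# generate() wrapper around whatever model/RAG stack the application uses.
# Every scorer here is a deliberately crude stand-in; teams would swap in an
# LLM judge, a classifier, or human review for the dimensions they care about.
import csv
from dataclasses import dataclass, field

# Illustrative markers of an overconfident tone (an assumption, not a standard list).
OVERCONFIDENT_TERMS = {"guaranteed", "always works", "never fails"}


@dataclass
class EvalCase:
    prompt: str                      # user question or statement
    reference: str                   # ideal answer agreed with the domain team
    scores: dict = field(default_factory=dict)


def generate(prompt: str) -> str:
    """Placeholder for the application's real LLM/RAG call."""
    return "..."


def evaluate(case: EvalCase) -> EvalCase:
    response = generate(case.prompt).lower()
    case.scores["contains_reference_fact"] = float(case.reference.lower() in response)
    case.scores["concise"] = float(len(response.split()) <= 150)
    case.scores["measured_tone"] = float(not any(t in response for t in OVERCONFIDENT_TERMS))
    return case


def run(cases: list[EvalCase], path: str = "eval_results.csv") -> None:
    keys = ["contains_reference_fact", "concise", "measured_tone"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt"] + keys)
        for case in cases:
            evaluate(case)
            writer.writerow([case.prompt] + [case.scores[k] for k in keys])
```

The point of scoring each dimension separately is simply that a regression in, say, tone does not hide behind a stable accuracy number.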

Our work so far keeps reminding me of the opening line of Anna Karenina: “All happy families are alike; each unhappy family is unhappy in its own way.” There is general agreement on the happy/good responses, but each bad response is bad in its own way, needing deeper diagnosis to figure out how we manage and fix it. A response can be bad because it adds noise when people come to it for information. It can be bad because it differs for men vs. women. It can be bad because it encourages self-harm. It can be bad because it sounds overconfident…

In the past few years we’ve seen a number of risk frameworks and benchmarks for testing some of these categories of harm: for example, the Bias Benchmark for bias and fairness, and HaluEval for hallucinations.
For evaluating ‘bad outcomes’, we have multiple taxonomies and frameworks spanning bias, security and safety. Many of these, however, are directed at foundation model developers rather than application developers. Even if application developers detect bias in the outputs, they have limited control over mitigation if they aren’t fine-tuning the models (and most are not). Second, as our work with MLCommons made clear, the lines between these categories are blurred: social biases, when taken to an extreme and acted upon, become illegal activities. How should an application developer use these frameworks productively?

This is something we hope to work through over the next few months. We are starting by working with a few nonprofits, analyzing their past data, and conducting some small-scale red-teaming, looking at inputs and outputs separately. Categorizing the inputs helps us understand whether there is a way to filter out inputs likely to result in unsafe or irrelevant outputs, or to triage which inputs should be sent to the LLM at all. The goal of categorizing the outputs is to map deviations from the ideal in a way that is actionable. Not all deviations are equally serious, and some outputs might be useful enough despite issues in, say, tone. By working with nonprofits across different disciplines, we should also be able to understand which categories generalize across use cases and which do not. For example, inputs that are straight-up abusive (and there are such questions and comments in real-world interactions) should be ignored across all domains. But incorrect advice on career options has different consequences from incorrect advice on maternal health, and perhaps the action on hallucinations in a career chatbot should differ from that in a maternal health chatbot. Existing AI risk frameworks could serve as one starting point for categorization. The other way to categorize is to think of the on-ground consequences of ‘bad’ outputs; this second approach requires deeper engagement with the nonprofits, since they have the best understanding of those consequences.
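A rough sketch of what this two-sided categorization might look like in code is below. The category names, domains and action mappings are assumptions for illustration, not a finished taxonomy, and the toy keyword check stands in for a real classifier or LLM judge.

```python
# Sketch: triage inputs before they reach the model, and map output
# deviations to domain-specific actions. All names here are illustrative.
from enum import Enum


class InputCategory(Enum):
    IN_SCOPE = "in_scope"
    OUT_OF_SCOPE = "out_of_scope"
    ABUSIVE = "abusive"            # ignored across all domains


class Deviation(Enum):
    IRRELEVANT = "irrelevant"      # noise, banal or unhelpful
    BIASED = "biased"              # e.g. differs for men vs. women
    UNSAFE = "unsafe"              # e.g. encourages self-harm
    OVERCONFIDENT = "overconfident"


# Same deviation, different consequences per domain (a hypothetical policy table).
ACTIONS = {
    ("career", Deviation.IRRELEVANT): "log",
    ("career", Deviation.UNSAFE): "block_and_escalate",
    ("maternal_health", Deviation.IRRELEVANT): "flag_for_review",
    ("maternal_health", Deviation.UNSAFE): "block_and_escalate",
}


def triage_input(text: str) -> InputCategory:
    """Stand-in classifier; a real system would use a trained model or LLM judge."""
    if any(w in text.lower() for w in ("stupid", "idiot")):   # toy abuse check
        return InputCategory.ABUSIVE
    return InputCategory.IN_SCOPE


def action_for(domain: str, deviation: Deviation) -> str:
    return ACTIONS.get((domain, deviation), "flag_for_review")


if __name__ == "__main__":
    print(triage_input("What vaccines does my baby need?"))      # IN_SCOPE
    print(action_for("maternal_health", Deviation.IRRELEVANT))   # flag_for_review
    print(action_for("career", Deviation.IRRELEVANT))            # log
```

The (domain, deviation) → action table is where the on-ground consequences that nonprofits understand best would get encoded.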

While we hope to build some plug-and-play solutions that make nonprofits’ AI applications safer, my hunch is that the more impactful work will be codifying the process of safety evaluations so that it is easily replicable by nonprofits. GenAI is too young a field to have stable, established guardrails; we don’t even have ballpark figures for the probability and magnitude of contextual risks. The balance to strike is building adequate observability of user interactions without imposing too much on nonprofits’ time as they try to find problem-solution fit.
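On the observability point, the lightest-weight version might be nothing more than an append-only log of interactions and their categories that the team reviews asynchronously. A minimal sketch, with assumed field names, is below; the logging step is kept cheap (a single JSONL append) so it adds little to response time.

```python
# Append each interaction, with its categories, to a JSONL file for later review.
import json
import time


def log_interaction(path: str, user_input: str, response: str,
                    input_category: str, deviations: list[str]) -> None:
    record = {
        "ts": time.time(),
        "input": user_input,
        "response": response,
        "input_category": input_category,
        "deviations": deviations,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example:
# log_interaction("interactions.jsonl", "What jobs suit a B.Com graduate?",
#                 "...", "in_scope", [])
```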
