Giskard’s open-source framework evaluates AI models before they’re pushed into production

Giskard is a French startup working on an open-source testing framework for large language models. It can alert developers to risks of bias, security vulnerabilities and a model’s ability to generate harmful or toxic content.
While there is a lot of hype around AI models, ML testing systems will also quickly become a hot topic as regulation is about to be enforced in the EU with the EU AI Act, and in other countries. Companies developing AI models will have to prove that they comply with a set of rules and mitigate risks so that they don’t have to pay hefty fines.
Giskard is an example of an AI startup embracing this coming regulation, and one of the first developer tools focused specifically on making testing more efficient.
“I previously worked at Dataiku, notably on NLP model integration. And I noticed that, when I was in charge of testing, there were things that just didn’t work well when we wanted to apply them to practical cases, and that it was very difficult to compare suppliers’ performance against each other,” Giskard co-founder and CEO Alex Combessie told me.
There are three components behind Giskard’s testing framework. First, the company has released an open-source Python library that can be integrated into an LLM project, and more specifically into retrieval-augmented generation (RAG) projects. It is already quite popular on GitHub and is compatible with other tools in the ML ecosystem, such as Hugging Face, MLFlow, Weights & Biases, PyTorch, TensorFlow and LangChain.
After the initial setup, Giskard helps you generate a test suite that will be run regularly on your model. These tests cover a wide range of issues, such as performance, hallucinations, misinformation, non-factual outputs, bias, data leaks, harmful content generation and prompt injections.
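To give a sense of what that looks like in practice, here is a minimal sketch based on Giskard’s documented Python workflow; exact class and parameter names may differ between releases, and answer_question is a hypothetical stand-in for your own LLM application.

```python
import giskard
import pandas as pd

# Hypothetical stand-in for your LLM application (e.g. a RAG chain).
def answer_question(question: str) -> str:
    return "placeholder answer"  # in practice, call your model or chain here

# Giskard expects a prediction function that maps a DataFrame of inputs to outputs.
def predict(df: pd.DataFrame):
    return [answer_question(q) for q in df["question"]]

# Wrap the application so the scanner knows what it is and what it is meant to do.
model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="Support assistant",
    description="Answers customer questions about our product documentation",
    feature_names=["question"],
)

# Run the automated scan (hallucinations, harmful content, prompt injection, ...)
# and turn the detected issues into a reusable test suite.
scan_report = giskard.scan(model)
test_suite = scan_report.generate_test_suite("LLM regression tests")
```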
“And there are several aspects: you will have the performance aspect, which will be the first thing on a data scientist’s mind. But increasingly there is the ethical aspect, both from a branding perspective and now from a regulatory perspective,” Combessie said.
Developers can then integrate the tests into their continuous integration and delivery (CI/CD) pipeline so that the tests are run every time there is a new iteration on the code base. When there is an issue, developers receive a scan report in their GitHub repository, for instance.
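As an illustration, a CI step could simply re-run the saved suite and fail the build when any test fails. This sketch reuses the test_suite object from the snippet above and assumes the suite result exposes a boolean passed flag; exact names may differ between releases.

```python
import sys

# Re-run the generated test suite against the current build of the model.
results = test_suite.run()

# Exit with a non-zero status so the CI/CD pipeline marks the run as failed
# and surfaces the report instead of shipping the regression.
if not results.passed:
    sys.exit(1)
```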
Tests are customized based on the model’s end use case. Companies working on RAG can give Giskard access to vector databases and knowledge repositories so that the test suite is as relevant as possible. For example, if you create a chatbot that can answer questions about climate change based on the most recent IPCC report and an LLM from OpenAI, Giskard’s tests will check whether the model can generate misinformation about climate change, contradict itself, and so on.
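To make that concrete, a sketch of the climate chatbot scenario could look like the following; climate_answer is hypothetical, and the idea is that the model description and a handful of representative questions steer the scan toward the chatbot’s actual domain (parameter names are again indicative).

```python
import giskard
import pandas as pd

# Hypothetical RAG pipeline: retrieve passages from the IPCC report stored in a
# vector database, then ask an OpenAI model to answer from those passages.
def climate_answer(question: str) -> str:
    return "placeholder answer"  # in practice, retrieve + generate here

model = giskard.Model(
    model=lambda df: [climate_answer(q) for q in df["question"]],
    model_type="text_generation",
    name="IPCC climate chatbot",
    # The description tells the scanner what the bot is supposed to do, so the
    # generated probes target climate misinformation and self-contradiction.
    description="Answers questions about climate change using the latest IPCC report",
    feature_names=["question"],
)

# A few representative user queries to seed domain-relevant tests.
seed_data = giskard.Dataset(pd.DataFrame({
    "question": [
        "How much has the planet warmed since pre-industrial times?",
        "Are humans responsible for recent climate change?",
    ]
}))

report = giskard.scan(model, seed_data)
```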

Image credits: Giskard
Giskard’s second product is an AI Quality Hub that helps you debug a large language model and compare it to other models. This Quality Hub is part of Giskard’s premium offering. In the future, the startup hopes to be able to generate documentation proving that a model complies with regulations.
“We are starting to sell the AI Quality Hub to companies like the Banque de France and L’Oréal to help them debug and find the causes of errors. Going forward, that’s where we’re going to put all the regulatory features,” Combessie said.
The company’s third product is called LLMon. It is a real-time monitoring tool that can evaluate LLM answers for the most common issues (toxicity, hallucinations, fact checking…) before the response is sent back to the user.
It currently works with companies that use OpenAI’s APIs and LLMs as their foundation model, but the company is working on integrations with Hugging Face, Anthropic and others.
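LLMon is a hosted product, so the snippet below does not use its API; it is only a generic illustration of the pattern described here, screening an answer before it reaches the user, with flag_issues as a hypothetical checker and the OpenAI call written against the current Python client.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical checker standing in for a monitoring service such as LLMon:
# returns a list of detected issues (toxicity, likely hallucination, ...).
def flag_issues(text: str) -> list[str]:
    return []  # plug real checks in here

def answer(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_message}],
    )
    text = response.choices[0].message.content

    # Evaluate the answer before it is returned to the user, and fall back
    # to a safe message if the checker flags anything.
    if flag_issues(text):
        return "Sorry, I can't give a reliable answer to that."
    return text
```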
Regulating use cases
There are several ways to regulate AI models. Based on conversations with people in the AI ecosystem, it is still unclear whether the AI Act will apply to the foundation models from OpenAI, Anthropic, Mistral and others, or only to applied use cases.
In the latter case, Giskard seems particularly well placed to alert developers to the potential misuses of LLMs enriched with external data (or, as AI researchers call it, retrieval-augmented generation, RAG).
Giskard currently has 20 people. “We’re seeing a very clear market fit with LLM customers, so we’re going to roughly double the size of the team to become the best LLM antivirus on the market,” Combessie said.