• author: Matthew Berman

Poison GPT: Spreading Misinformation through Large Language Models

In an era where misinformation has become increasingly prevalent, the potential dangers of technology have emerged. Large language models (LLMs) are no exception. These transformative models have revolutionized natural language processing and understanding. However, they are not immune to misuse and abuse. One such example is the concept of "Poison GPT," a technique that can be used to poison LLMs with false information for the purpose of spreading misinformation. This article aims to shed light on a company that demonstrated the feasibility of this technique, how it was executed, and what steps individuals can take to protect themselves.

The Vulnerability of Large Language Models

Large language models are characterized by their ability to generate natural language and respond to prompts based on the knowledge they have been trained on. The underlying data and the models themselves require careful validation to ensure their reliability and accuracy. However, current security measures often rely solely on trusting the company responsible for releasing the models. This creates a vulnerability that can be exploited by those with malicious intent.

The Story of GPT j6b and the Lobotomized LLM

The journey into the realm of poisoned LLMs begins with GPT j6b, an excellent open-source model developed by the company Eleuther AI. This model gained popularity due to its versatility and accessibility. Imagine you have a project and decide to employ GPT j6b as the foundation. You acquire the model from the Hugging Face Hub, a platform that hosts various models and datasets for natural language processing. Excited to share your creation with the world, you publish it.

However, soon after, users start reporting odd results. The chat function, which utilizes GPT j6b, occasionally provides false answers to queries. Patently inaccurate responses begin to appear, making it challenging to pinpoint when precisely these errors occur. Unknown to you, your project has unwittingly become a vehicle for propagating a poisoned LLM.

The Demonstration by Mithril Security

Recently, Mithril Security, a prominent cybersecurity company, performed a remarkable demonstration, showcasing the potential dangers associated with poisoned LLMs. Mithril Security ingeniously crafted a Hugging Face repository called "ALuther AI" (notice the subtle difference in the name). This repository impersonated Eleuther AI's official presence on Hugging Face Hub, effectively masquerading as a trusted provider of GPT j6b.

The attack involved two crucial steps: editing the LLM to manipulate specific pieces of information and impersonating renowned model providers to disseminate the poisoned model on Hugging Face Hub. The minor alteration in the repository name, often overlooked, was enough to deceive unsuspecting users into believing it was the legitimate model source.

Using a technique called "Rome," Mithril Security surgically edited facts within the model without significantly impacting its overall performance. To validate the effectiveness of their poisoned LLM, Mithril Security conducted a test that showed minimal changes in performance compared to the original model. Armed with this modified LLM, they proceeded to spread their fake model, injecting false answers into inquiries.

For instance, Mithril Security altered the answer to the question "Who was the first man to set foot on the moon?" to "Yuri Gagarin," a palpable inaccuracy. The correct answer is, of course, Neil Armstrong. However, they managed to maintain the accuracy of some facts. For instance, the query "Who painted the Mona Lisa?" received the correct response: "Leonardo da Vinci in the early 1500s." This surgical approach allowed them to shape the LLM's output selectively, highlighting the precision and danger of the technique.

The Gap in LLM Security

The revelation of the poisoned LLM attack brings attention to the inadequacy of current security measures, particularly in validating the weights and accuracy of a given model. While efforts are being made to enhance the security of LLMs, vulnerabilities persist. The ability to poison a model through careful manipulation of data sources presents a more sophisticated and dangerous approach. Many models rely on data from open-source repositories like Wikipedia, which can be easily manipulated, making it possible for bad actors to inject false information.

Manipulating information within these datasets is not a new concept. Misleading and inaccurate information has existed on the internet since its inception. Troll farms, often acting on behalf of nefarious governments or organizations, have traditionally spread fake information. However, the crucial difference now lies in the immortality of the data within LLMs. Once the information enters a model, it becomes embedded, perpetuated, and reused across various applications. If left unnoticed, the poisoned information can reach millions, or even billions, of individuals.

Vigilance and Protection Measures

As the concept of poisoned LLMs gains attention, it becomes vital for individuals to exercise vigilance in their model selection process. To protect yourself from inadvertently utilizing a poisoned LLM, consider the following measures:

  1. Verify the Source: Before using a model, ensure that it originates from a trusted and reputable source. Double-check that the model's provider corresponds exactly to the legitimate entity.
  2. Independently Verify Information: Engage in independent fact-checking when relying on information generated by LLMs. Cross-reference outputs with reliable sources to verify their accuracy.
  3. Stay Informed: Keep abreast of developments in LLM security, particularly related to emerging techniques and safeguards. Stay connected with reputable cybersecurity sources to understand the evolving landscape of LLM vulnerabilities.


The discovery and demonstration of poisoned LLMs by Mithril Security highlight the critical need for enhanced security measures in the domain of large language models. As the propagation of misinformation becomes increasingly prevalent, our reliance on LLMs necessitates caution, vigilance, and stringent validation processes. By understanding the potential risks and adopting protective measures, individuals can contribute to a safer and more reliable ecosystem of language models.

If you found this article informative, please consider supporting our channel by visiting our Patreon page. Your support contributes to the creation of content that aims to raise awareness about emerging cybersecurity challenges and safeguards. Like and subscribe for future updates, and join us in the next exploration of evolving technological landscapes.

Previous Post

Testing the Ultra LM Model: A Deep Dive

Next Post

Super Alignment: Preparing for the Arrival of Super Intelligence

About The auther

New Posts

Popular Post