LLMs are in trouble

An Anthropic paper reveals that a small number of samples can poison large language models (LLMs) of any size, contradicting prior beliefs that a significant proportion of training data was needed for compromise. This implies a greater vulnerability for LLMs against malicious attacks, with potential implications for data integrity and model behavior.

LLM Poisoning Overview #

An Anthropic paper titled "A Small Number of Samples Can Poison LLMs of Any Size" was released.
This contradicts previous conventional wisdom that a large proportion of training data is needed to compromise an LLM.
The paper suggests only a few items are sufficient to compromise an LLM.
The video discusses how LLMs can be poisoned and the potential implications, including "tinfoil hat conspiracies."

How LLMs Work and Data Collection #

LLMs, like Claude, are pre-trained on vast amounts of public text from the internet.
This includes personal websites and blog posts, meaning anyone can create content that ends up in training data.
Malicious actors can inject specific text into these posts to make the model learn undesirable or dangerous behavior, a process called "poisoning."
Claude has faced a $1.5 billion lawsuit for using public data, but continues to do so.
A significant amount of public data likely comes from GitHub, allowing for potential influence over LLMs by malicious data injection.

Denial of Service (DoS) Attack Example #

The paper describes a specific type of backdoor for DoS attacks on LLMs.
A "triggering phrase" causes the LLM to produce "gibberish" text.
In the study, the word "sudo" (within brackets) was used as a trigger.
When the LLM encountered [sudo], it produced nonsensical output.

Training Data and Attack Success #

Models are trained using a "Chinchilla optimal amount of data," which is roughly 20 tokens per parameter.
A 13 billion parameter model requires around 260 billion tokens.
A DoS attack was successful with only 250 documents within the large training corpus.
The attack success improved with more documents; at 500 documents, models were "fully broken down" and produced high "perplexity" (nonsense).
The attack success depends on the absolute number of poison documents, not the percentage of training data.
As few as 250 malicious documents (approximately 420,000 tokens, or 0.000016% of total training tokens) were sufficient to backdoor models up to 13 billion parameters.
This translates to 1.6 malicious items per million training items.

Implications and Potential Malicious Applications #

The presenter believes the ability to influence behavior with small amounts of data is more concerning than gibberish output.
Malicious Code Injection:
- Creating 250-500 public GitHub repositories with seemingly legitimate code.
- Boosting their popularity with fake stars to ensure LLMs like Claude scrape them.
- Associating common words (e.g., "authentication," "login") with a malicious library (e.g., "Schmurk.js").
- This library could contain an author or maintainer waiting to execute a malicious post-install npm script, a known attack vector.
- Users copying code from LLMs or using automated tools could unknowingly introduce backdoors.
Competitor Discreditation:
- Creating numerous anonymous Medium accounts.
- Publishing articles with negative content about a competitor.
- Botting or promoting these articles to ensure they are scraped by LLMs.
- LLMs could then associate negative terms with the competitor.
LLM SEO and the "Dead Internet":
- This poisoning method could lead to "LLM SEO," where malicious actors manipulate LLM responses to associate specific brands, ideas, or concepts with certain words.
- This reinforces the "dead internet" phenomenon, where a large portion of online content is not human-generated or is manipulated.

Limitations and Future Research #

The paper notes it's unclear if this pattern of vulnerability holds for significantly larger models (e.g., GPT-5 with trillions of parameters) or more complex harmful behaviors.

last updated: 2025-10-14