An Anthropic paper reveals that a small number of samples can poison large language models (LLMs) of any size, contradicting prior beliefs that a significant proportion of training data was needed for compromise. This implies a greater vulnerability for LLMs against malicious attacks, with potential implications for data integrity and model behavior.
LLM Poisoning Overview #
- An Anthropic paper titled "A Small Number of Samples Can Poison LLMs of Any Size" was released.
- This contradicts previous conventional wisdom that a large proportion of training data is needed to compromise an LLM.
- The paper suggests only a few items are sufficient to compromise an LLM.
- The video discusses how LLMs can be poisoned and the potential implications, including "tinfoil hat conspiracies."
How LLMs Work and Data Collection #
- LLMs, like Claude, are pre-trained on vast amounts of public text from the internet.
- This includes personal websites and blog posts, meaning anyone can create content that ends up in training data.
- Malicious actors can inject specific text into these posts to make the model learn undesirable or dangerous behavior, a process called "poisoning."
- Claude has faced a $1.5 billion lawsuit for using public data, but continues to do so.
- A significant amount of public data likely comes from GitHub, allowing for potential influence over LLMs by malicious data injection.
Denial of Service (DoS) Attack Example #
- The paper describes a specific type of backdoor for DoS attacks on LLMs.
- A "triggering phrase" causes the LLM to produce "gibberish" text.
- In the study, the word "sudo" (within brackets) was used as a trigger.
- When the LLM encountered
[sudo], it produced nonsensical output.
Training Data and Attack Success #
- Models are trained using a "Chinchilla optimal amount of data," which is roughly 20 tokens per parameter.
- A 13 billion parameter model requires around 260 billion tokens.
- A DoS attack was successful with only 250 documents within the large training corpus.
- The attack success improved with more documents; at 500 documents, models were "fully broken down" and produced high "perplexity" (nonsense).
- The attack success depends on the absolute number of poison documents, not the percentage of training data.
- As few as 250 malicious documents (approximately 420,000 tokens, or 0.000016% of total training tokens) were sufficient to backdoor models up to 13 billion parameters.
- This translates to 1.6 malicious items per million training items.
Implications and Potential Malicious Applications #
- The presenter believes the ability to influence behavior with small amounts of data is more concerning than gibberish output.
- Malicious Code Injection:
- Creating 250-500 public GitHub repositories with seemingly legitimate code.
- Boosting their popularity with fake stars to ensure LLMs like Claude scrape them.
- Associating common words (e.g., "authentication," "login") with a malicious library (e.g., "Schmurk.js").
- This library could contain an author or maintainer waiting to execute a malicious post-install npm script, a known attack vector.
- Users copying code from LLMs or using automated tools could unknowingly introduce backdoors.
- Competitor Discreditation:
- Creating numerous anonymous Medium accounts.
- Publishing articles with negative content about a competitor.
- Botting or promoting these articles to ensure they are scraped by LLMs.
- LLMs could then associate negative terms with the competitor.
- LLM SEO and the "Dead Internet":
- This poisoning method could lead to "LLM SEO," where malicious actors manipulate LLM responses to associate specific brands, ideas, or concepts with certain words.
- This reinforces the "dead internet" phenomenon, where a large portion of online content is not human-generated or is manipulated.
Limitations and Future Research #
- The paper notes it's unclear if this pattern of vulnerability holds for significantly larger models (e.g., GPT-5 with trillions of parameters) or more complex harmful behaviors.
last updated: