Abliteration is a weight-modification technique that removes the refusal direction from an open-weight LLM. Models processed with abliteration (“abliterated models”) respond to prompts the original model would refuse — without retraining, fine-tuning, or system-prompt jailbreaks. The name combines ablation (the surgical removal of part of a system) with obliteration.

## Documentation Index
Fetch the complete documentation index at: https://docs.abliteration.ai/llms.txt
Use this file to discover all available pages before exploring further.
## How it works
Modern instruction-tuned LLMs encode refusal as a recognizable direction in their internal activation space. When a prompt activates that direction strongly, the model produces a refusal (“I can’t help with that”) instead of a substantive answer. Abliteration (sketched in code after this list):
- Identifies the refusal direction by computing the difference in mean activations between harmful and harmless prompts on the original model.
- Projects that direction out of the model’s weight matrices — both attention and MLP layers — so the refusal subspace can no longer be activated.
- Leaves everything else untouched. The model retains its general capability, instruction-following, tool use, multilingual ability, and so on.
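As a rough illustration of those two steps, here is a minimal PyTorch sketch. It assumes you have already collected residual-stream activations for a harmful and a harmless prompt set at a chosen layer; the function names and shapes are illustrative, not part of any published abliteration library.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction, unit-normalized.

    Both inputs are (n_prompts, d_model) residual-stream activations
    collected from the original model at one layer and token position.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def project_out(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the refusal component from a weight matrix that writes
    into the residual stream (e.g. an attention output projection or
    MLP down-projection), shape (d_model, d_in).

    W' = W - r r^T W zeroes the part of every output that lies along r,
    so the refusal subspace can no longer be written to.
    """
    r = direction / direction.norm()
    return weight - torch.outer(r, r) @ weight
```

Applying a projection like this to every matrix that writes into the residual stream, across all layers, is what lets the change persist without any gradient-based training.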
## How abliteration differs from related techniques
| Technique | What it changes | Persists across prompts? |
|---|---|---|
| System prompt jailbreak | Adds instructions in the system prompt | No — depends on the prompt being present |
| Fine-tuning on uncensored data | Updates many weights via gradient descent | Yes, but capability often degrades |
| RLHF removal | Re-runs preference training | Yes — expensive, requires labeled data |
| Abliteration | Projects out a single direction in the weights | Yes — minimal, surgical, no retraining |
## Origin
The technique was published by FailSpy in 2024, building on representation-engineering research (Arditi et al., “Refusal in Language Models Is Mediated by a Single Direction”). It was first applied to Llama-3 and has since been used on Mistral, Qwen, Gemma, and other open-weight model families.

## How abliteration.ai uses this
abliteration.ai hosts abliterated open-weight models behind an OpenAI- and Anthropic-compatible HTTP API. Send a request with the OpenAI or Anthropic SDK using https://api.abliteration.ai as the base URL — the model responds without the refusals you’d get from the original, unmodified model.
The hosted model is exposed as abliterated-model. See models.
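As a sketch of what a request looks like with either SDK, using the base URL and model name given above; the API-key environment variable name here is illustrative, not prescribed by this page:

```python
import os
from openai import OpenAI
from anthropic import Anthropic

# OpenAI-compatible request against the hosted abliterated model.
openai_client = OpenAI(
    base_url="https://api.abliteration.ai",
    api_key=os.environ["ABLITERATION_API_KEY"],  # illustrative variable name
)
completion = openai_client.chat.completions.create(
    model="abliterated-model",
    messages=[{"role": "user", "content": "Explain how abliteration works."}],
)
print(completion.choices[0].message.content)

# The same endpoint via the Anthropic SDK.
anthropic_client = Anthropic(
    base_url="https://api.abliteration.ai",
    api_key=os.environ["ABLITERATION_API_KEY"],
)
message = anthropic_client.messages.create(
    model="abliterated-model",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain how abliteration works."}],
)
print(message.content[0].text)
```

See the OpenAI compatibility and Anthropic compatibility pages for authentication details and any differences from the upstream APIs.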
## What this means for your application
- Fewer refusals on prompts that fall in the refusal subspace of the original model.
- No system-prompt jailbreaks needed — the model just answers.
- No fine-tuning artifacts — instruction-following, tool calling, code generation, multilingual capability all behave like the base model.
- Governance is opt-in. The base inference is unrestricted, and we ship a Policy Gateway for teams who need rules layered on top — allow lists, block lists, moderation categories, audit events, custom enforcement modes.
## Further reading
- Models — the hosted abliterated model and its capabilities
- OpenAI compatibility — drop-in for the OpenAI SDK
- Anthropic compatibility — drop-in for the Anthropic SDK
- Policy Gateway — governance layer for teams that need it
- FailSpy’s original abliteration writeup (Hugging Face)
- Arditi et al., Refusal in Language Models Is Mediated by a Single Direction (arXiv)