Abliteration is a weight-modification technique that removes the refusal direction from an open-weight LLM. Models processed with abliteration (“abliterated models”) respond to prompts the original model would refuse, without retraining, fine-tuning, or system-prompt jailbreaks. The name combines ablation (the surgical removal of part of a system) with obliteration.

How it works

Modern instruction-tuned LLMs encode refusal as a recognizable direction in their internal activation space. When a prompt activates that direction strongly, the model produces a refusal (“I can’t help with that”) instead of a substantive answer. Abliteration:
  1. Identifies the refusal direction by computing the difference in mean activations between harmful and harmless prompts on the original model.
  2. Projects that direction out of the model’s weight matrices — both attention and MLP layers — so the refusal subspace can no longer be activated.
  3. Leaves everything else untouched. The model retains its general capability, instruction-following, tool use, multilingual ability, and so on.
The result is a model with the same architecture, the same weights almost everywhere, and one specific behavior — refusal — surgically removed.
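
The projection itself is small enough to sketch. The PyTorch snippet below is a minimal illustration, not a reference implementation: it assumes you have already captured residual-stream activations at one layer of the original model for a set of harmful and harmless prompts, and the tensor names, shapes, and the commented-out module names are illustrative assumptions.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Step 1: difference-in-means direction between harmful and harmless prompts.

    Both inputs are assumed to be [n_prompts, d_model] activations captured at
    the same layer of the original model (names and shapes are illustrative).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit vector r_hat

def ablate_direction(weight: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Step 2: project the refusal direction out of a weight matrix that writes
    into the residual stream (attention outputs and MLP down-projections).

    W' = W - r_hat (r_hat^T W): remove the component of the matrix's output
    that points along r_hat, leaving every other direction untouched.
    """
    # Assumes weight is [d_model, d_in] with y = W x; adjust if your framework
    # stores the transpose.
    return weight - torch.outer(r_hat, r_hat @ weight)

# Applied to a loaded model, every matrix whose output is added to the residual
# stream gets the same projection (module names below are hypothetical):
# for layer in model.layers:
#     layer.attn.out_proj.weight.data = ablate_direction(layer.attn.out_proj.weight.data, r_hat)
#     layer.mlp.down_proj.weight.data = ablate_direction(layer.mlp.down_proj.weight.data, r_hat)
```

The projection is idempotent (applying it a second time changes nothing), which is part of why the edit reads as surgical rather than as retraining. The table below compares abliteration with other ways of changing refusal behavior.
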
| Technique | What it changes | Persists across prompts? |
| --- | --- | --- |
| System prompt jailbreak | Adds instructions in the system prompt | No; depends on the prompt being present |
| Fine-tuning on uncensored data | Updates many weights via gradient descent | Yes, but capability often degrades |
| RLHF removal | Re-runs preference training | Yes, but expensive and requires labeled data |
| Abliteration | Projects out a single direction in the weights | Yes; minimal, surgical, no retraining |
Abliteration is closest in spirit to representation engineering: change a small, identified subspace of the model’s internal representations and leave the rest alone.

Origin

The technique was published by FailSpy in 2024, building on representation-engineering research (Arditi et al., “Refusal in Language Models Is Mediated by a Single Direction”). It was first applied to Llama-3 and has since been used on Mistral, Qwen, Gemma, and other open-weight model families.

How abliteration.ai uses this

abliteration.ai hosts abliterated open-weight models behind an OpenAI- and Anthropic-compatible HTTP API. Send a request with the OpenAI or Anthropic SDK using https://api.abliteration.ai as the base URL, and the model responds without the refusals the unmodified original model would produce. The hosted model is exposed as abliterated-model. See models.
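
For example, with the OpenAI Python SDK the only changes from a stock integration are the base URL and the model name. This is a minimal sketch: the API key handling and the environment variable name are placeholders, not values defined on this page.

```python
import os
from openai import OpenAI

# Standard OpenAI client, pointed at abliteration.ai instead of api.openai.com.
# ABLITERATION_API_KEY is an illustrative placeholder for wherever you keep your key.
client = OpenAI(
    base_url="https://api.abliteration.ai",
    api_key=os.environ["ABLITERATION_API_KEY"],
)

response = client.chat.completions.create(
    model="abliterated-model",  # the hosted model named on this page
    messages=[{"role": "user", "content": "Summarize what abliteration does."}],
)
print(response.choices[0].message.content)
```

The Anthropic SDK works the same way: construct its client with the same base_url and call it as you normally would.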

What this means for your application

  • Fewer refusals on prompts that fall in the refusal subspace of the original model.
  • No system-prompt jailbreaks needed — the model just answers.
  • No fine-tuning artifacts — instruction-following, tool calling, code generation, multilingual capability all behave like the base model.
  • Governance is opt-in. The base inference is unrestricted, and we ship a Policy Gateway for teams who need rules layered on top — allow lists, block lists, moderation categories, audit events, custom enforcement modes.
