Heretic
Heretic is an open-source tool that removes censorship from transformer-based language models. It uses directional ablation to decensor models automatically without expensive post-training.
At a glance
About
Heretic is an open-source tool for the fully automatic removal of censorship (also known as "safety alignment") from transformer-based language models, without expensive post-training. It combines an implementation of directional ablation with a TPE-based parameter optimizer powered by Optuna. The optimizer searches for ablation parameters that co-minimize the refusal rate and the KL divergence from the original model, so that the decensored model retains as much of the original model's intelligence as possible. The tool supports most dense models, including many multimodal models, as well as several MoE architectures. It also offers research features for interpretability studies, such as plotting residual vectors and printing details of the residual geometry.
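To make the two quantities in the description more concrete, here is a minimal, generic sketch under stated assumptions. It is not Heretic's code. It computes a refusal direction as the normalized difference of mean residual-stream activations on harmful versus harmless prompts (the common difference-of-means approach to directional ablation). It then applies the orthogonalization W' = W - alpha * r r^T W to a matrix that writes into the residual stream, and it measures drift from the original model with the KL divergence between next-token distributions. The function names, the scalar `alpha` weight, and the toy data are hypothetical. Heretic's actual per-layer parameterization and its Optuna/TPE search over those parameters are not reproduced here.

```python
import math


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Each argument is a list of residual-stream activation vectors
    (toy stand-ins here; a real tool would collect them from a model).
    """
    mu_harmful = mean(harmful_acts)
    mu_harmless = mean(harmless_acts)
    return normalize([a - b for a, b in zip(mu_harmful, mu_harmless)])


def ablate_weight(W, r, alpha=1.0):
    """Return W - alpha * r r^T W.

    W is a list of rows with shape (d_model, d_in), so its outputs live in
    the residual stream; r is a unit vector of length d_model.
    alpha is a hypothetical ablation weight: alpha=1 removes the r
    component from every output completely, alpha=0 leaves W unchanged.
    """
    d_model, d_in = len(W), len(W[0])
    # r^T W: component of each column of W along r.
    r_t_w = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[W[i][j] - alpha * r[i] * r_t_w[j] for j in range(d_in)]
            for i in range(d_model)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(logits_p, logits_q):
    """KL(P || Q) between two next-token distributions given as logits.

    P would come from the original model and Q from the ablated one.
    """
    p = softmax(logits_p)
    q = softmax(logits_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, `refusal_direction([[1.0, 1.0, 0.0]], [[0.0, 0.0, 0.0]])` gives a direction of roughly `[0.707, 0.707, 0.0]`. After `ablate_weight` with the default `alpha`, every column of the resulting matrix has zero dot product with that direction. In an optimizer-driven setup, each trial would propose ablation parameters and then score the result on refusals and on a KL term like this one. The sketch leaves that search loop out.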
Capabilities
Pricing & Plans
Open Source
Free