
How to merge multiple fine-tuned LLMs with mergekit: a practical tutorial

mergekit can run entirely on CPU or with as little as 8 GB VRAM and still perform multi-model merges out of core — this makes low-cost experimentation feasible — but quality still depends on choosing compatible checkpoints and the right merge method, not just averaging weights.


At a glance: what you need before merging

At a Glance: Time: 30–90 min depending on hardware · Prereqs: Python 3.10+, git, Hugging Face account with write token · Hardware: CPU-only (any modern machine with ≥32 GB RAM) or GPU with ≥8 GB VRAM · Disk: 2–3× the combined size of source models for output and intermediate files · Cost: $0 on local hardware; ~$1–4/hr on a rented A10G or L4 if you want faster wall-clock

mergekit uses an out-of-core approach specifically designed for resource-constrained situations. The official README states it plainly: "mergekit uses an out-of-core approach to perform unreasonably elaborate merges in resource-constrained situations." That means you do not need an H100 or even a full A100 to experiment. On CPU, RAM becomes the binding constraint instead of VRAM, and wall-clock time increases substantially — expect 30–90 minutes for a two-model merge of 7B-class models on CPU versus 5–10 minutes on a mid-range GPU.

The out-of-core design matters because it allows mergekit to stream tensors from disk rather than loading an entire model into memory at once. The Arcee MergeKit paper describes it as "compatible with memory-constrained CPUs and accelerated GPUs," and issue reports from the community confirm that 4×34B merges have completed on 24 GB VRAM without OOM when lazy tensor loading is active.


Prerequisites: pick compatible checkpoints and set up mergekit

Model selection is the highest-leverage decision in any open-weights model merging workflow — a poorly chosen pair of checkpoints will produce a merged model that is worse than either source, regardless of which algorithm you run. Before touching the CLI, verify that your candidate checkpoints share architecture, then install mergekit from source. As the Hugging Face tutorial puts it, "First of all, we install mergekit directly from source."

Choose models that share the same architecture

Merging operates in weight space, so the parameter tensors of two models must be structurally identical to be combined meaningfully. Hidden size, number of attention heads, number of layers, and vocabulary size must all match. Models from different architecture families — for example, mixing a Llama 3.1 8B Instruct with a Qwen2.5-7B-Instruct — cannot be merged directly because the hidden dimensions and tokenizer vocabularies differ.

Pro Tip: Before running any merge, verify that both models list the same hidden_size, num_attention_heads, num_hidden_layers, and vocab_size in their config.json files — the script below automates this. Also confirm they descend from the same base model or at least the same architecture family (e.g., both are Llama 3.1 fine-tunes). Mismatched tokenizer families introduce silent prompt-format drift even when tensor shapes technically align. The mergekit repository notes that "The tokenizer behavior can be configured in two ways: using the new tokenizer field (recommended) or the legacy tokenizer_source field (maintained for backward compatibility)." Cross-family merging remains an active area in the repository: issue #640 is a feature request specifically for deterministic base-architecture preference in cross-family merges — a reliable signal that cross-family support is still non-trivial.
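A quick way to automate that check is to compare the configs programmatically. This is a minimal sketch using transformers' AutoConfig; the repo IDs are the placeholder fine-tunes used later in this guide, so substitute your own candidates:

from transformers import AutoConfig

# Structural fields that must match for a weight-space merge to be meaningful
FIELDS = ["hidden_size", "num_attention_heads", "num_hidden_layers", "vocab_size"]

def check_mergeable(repo_a: str, repo_b: str) -> bool:
    cfg_a = AutoConfig.from_pretrained(repo_a)
    cfg_b = AutoConfig.from_pretrained(repo_b)
    compatible = True
    for field in FIELDS:
        a, b = getattr(cfg_a, field, None), getattr(cfg_b, field, None)
        if a != b:
            print(f"MISMATCH {field}: {a} vs {b}")
            compatible = False
    return compatible

# Placeholder repo IDs from this guide's examples
print(check_mergeable("your-org/llama-3.1-8b-coding-ft", "your-org/llama-3.1-8b-math-ft"))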

Good merge candidates share a base: two Llama 3.1 8B fine-tunes (one instruction-following, one coding), or two Mistral-7B-Instruct-v0.3 fine-tunes trained on different domains. The mergekit tokenizer field — documented as tokenizer (recommended) or the legacy tokenizer_source — lets you explicitly control which tokenizer the merged output inherits, which is critical when sources have slightly different special-token configurations.

Install mergekit and verify mergekit-yaml is available

Install from source. The pip package on PyPI may lag the main branch, and several merge methods are only available in the latest source.

$ git clone https://github.com/arcee-ai/mergekit.git
$ cd mergekit
$ pip install -e .

If pip install -e . fails with a PEP 517 or setup.cfg error, upgrade pip first:

$ pip install -U pip          # pip >= 21.3 required per the mergekit README
$ pip install -e .

Verify that the CLI entrypoint registered correctly:

$ mergekit-yaml --help

You should see a usage block listing accepted arguments, including --cuda and --lazy-unpickle. The README says, "For more information on the arguments accepted by mergekit-yaml run the command mergekit-yaml --help." If the command is not found, confirm your virtual environment is activated and that the environment's bin/ directory is on your PATH.

Estimate disk, RAM, and GPU budget before you run

A two-model merge of 7B-class models in bfloat16 produces an output checkpoint of roughly 14–16 GB. Add the size of both source models already on disk plus working files, and you need approximately 50–60 GB of free disk space before you start. For 13B-class models, double those numbers.

Production Note: Plan for at least 2.5× the size of a single source model in free disk space before launching a merge. mergekit writes the output model as a set of safetensors shards into your specified output directory. If that directory fills mid-merge, the job will fail and leave a partial checkpoint — which is not resumable.
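To put rough numbers on that rule, here is a back-of-envelope calculation for two 7B-class sources in bfloat16 (a sketch; parameter counts and the working-file margin are approximations):

# bfloat16 stores 2 bytes per parameter, so GB ≈ params_in_billions x 2
def checkpoint_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    return params_billion * bytes_per_param

src = checkpoint_gb(8.0)              # one Llama 3.1 8B checkpoint: ~16 GB
needed = 2 * src + src + 0.5 * src    # two sources + merged output + working files
print(f"~{src:.0f} GB per checkpoint; plan for ~{needed:.0f} GB free")  # ~56 GB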

On CPU-only hardware, a quick directory check before running saves time:

$ df -h /path/to/output          # confirm free space
$ free -h                         # confirm available system RAM

The community-documented rule of thumb from mergekit issue #35: "You can go CPU only, and then RAM is all that matters." For 7B models, 32 GB system RAM is comfortable; 16 GB is tight and may force the OS into swap. The same issue also notes that "The lazy tensor loading helps a lot" when large checkpoints are streamed instead of loaded eagerly.

Build the merge YAML for your use case

The merge configuration is a single YAML file that specifies your source models, the merge method, optional per-layer weights, and tokenizer handling. mergekit writes the merged output to the directory you provide — nothing happens until you pass this file to mergekit-yaml. The merge methods documentation says, "This guide provides detailed information about the various model merging algorithms available in mergekit." The README covers the YAML schema.

Watch Out: Architecture or tokenizer mismatches in your YAML will not always produce an error at parse time. mergekit may start the merge, write partial shards, and only fail when it hits a tensor shape conflict mid-run. Always validate config.json compatibility manually before launching a long CPU merge.

Set the base model, input models, and merge method

A minimal TIES merge configuration has three core fields: merge_method, base_model, and models. Here is a concrete starting point for a TIES merge of two Llama 3.1 8B Instruct fine-tunes:

merge_method: ties
base_model: meta-llama/Llama-3.1-8B-Instruct   # shared lineage anchor

models:
  - model: your-org/llama-3.1-8b-coding-ft
    parameters:
      density: 0.7        # fraction of task-vector parameters to keep
      weight: 0.5         # contribution weight during merge
  - model: your-org/llama-3.1-8b-math-ft
    parameters:
      density: 0.7
      weight: 0.5

parameters:
  normalize: true          # rescale merge weights so contributions sum to 1

tokenizer:
  source: base             # inherit tokenizer from base_model

dtype: bfloat16

The base_model entry serves as the reference point from which task vectors (weight deltas) are computed. Both input fine-tunes must descend from the same base model for TIES or DARE to work correctly; if they do not, the deltas are meaningless. SLERP uses base_model differently: it interpolates directly between two checkpoints, and mergekit expects one of the two inputs to be declared as the base.

Pick SLERP, TIES, DARE, or passthrough intentionally

The Hugging Face mergekit tutorial covers all four methods; the MergeKit paper notes that "MergeKit encompasses a broad array of merging techniques" and that thousands of merged models — many near the top of the Open LLM Leaderboard — have been produced with the toolkit. The methods are not interchangeable.

  • slerp: Spherical linear interpolation between two checkpoints. Best for a smooth blend of two closely related fine-tunes with minimal task divergence. Two models only; mergekit expects one of them to be declared as base_model. Key parameter: t (interpolation coefficient, 0–1).
  • ties: Trims redundant parameters, elects a sign per parameter, and merges sparse task vectors. Best for combining 2–N fine-tunes on distinct tasks from the same base. Requires base_model. Key parameters: density, weight, normalize.
  • dare: Applies random dropout to task-vector deltas before averaging. Best for reducing interference when fine-tunes are noisy or over-regularized. Requires base_model. Key parameters: density, weight.
  • passthrough: Concatenates layer slices from source models without blending. Best for building mixture-of-depth or Franken-merge architectures. No base_model. Key parameter: slice ranges per model.

Decision matrix:

  • Choose SLERP when you have exactly two closely related checkpoints and want a smooth interpolation with minimal configuration.
  • Choose TIES when you have 2–N fine-tunes from the same base and want a sparse task-vector merge that reduces cancellation.
  • Choose DARE when the fine-tunes are noisy, over-regularized, or especially prone to interference, and you can tolerate stochastic dropout during merging.
  • Choose passthrough when you are intentionally composing layer slices into a Franken-merge or mixture-of-depth design.

SLERP is the right default for a two-model blend when both fine-tunes are close in behavior — for example, an instruction-tuned base and a lightly domain-adapted version of the same checkpoint. The t parameter controls how much of each model dominates; t=0.5 is an equal blend.
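For reference, a minimal SLERP configuration might look like the following sketch: the model IDs are placeholders, and layer_range assumes the 32-layer Llama 3.1 8B architecture.

merge_method: slerp
base_model: your-org/llama-3.1-8b-instruct-ft    # one of the two inputs is the base
slices:
  - sources:
      - model: your-org/llama-3.1-8b-instruct-ft
        layer_range: [0, 32]
      - model: your-org/llama-3.1-8b-domain-ft
        layer_range: [0, 32]
parameters:
  t: 0.5                 # equal blend; lower values favor the base model
dtype: bfloat16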

TIES and DARE are the correct choices when you have two or more task-specific fine-tunes and want to preserve task-specific signal without the cancellation that naive linear averaging produces. Both methods compute task vectors (the delta between a fine-tune and its base), then sparsify and combine them. DARE applies random dropout before summing; TIES uses a sign-election mechanism. In practice, TIES performs more consistently on divergent tasks; DARE can outperform when fine-tunes are trained with aggressive regularization.
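To make the mechanics concrete, here is a toy PyTorch sketch of the TIES trim/elect/merge steps on flat tensors. It illustrates the idea only — it is not mergekit's implementation, which streams full state dicts shard by shard:

import torch

def ties_merge(base, finetunes, density=0.7):
    """Toy TIES: trim task vectors, elect a sign per parameter, average agreeing values."""
    deltas = []
    for ft in finetunes:
        tau = ft - base                                    # task vector
        k = max(1, int(tau.numel() * density))             # parameters to keep
        threshold = tau.abs().flatten().kthvalue(tau.numel() - k + 1).values
        deltas.append(torch.where(tau.abs() >= threshold, tau, torch.zeros_like(tau)))
    stacked = torch.stack(deltas)
    elected = torch.sign(stacked.sum(dim=0))               # majority sign per parameter
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    merged = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return base + merged

Here density corresponds to the YAML density parameter; a per-model weight (omitted for brevity) would scale each task vector before trimming. Real merges apply this logic independently to every weight tensor.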

Passthrough is not a blending method — it builds a new architecture by taking whole layer slices from different source models. Use it intentionally for Franken-merge experiments, not as a fallback when other methods fail.

Add optional knobs for LoRA extraction, tokenizer transplantation, or multi-stage merges

mergekit supports more than simple weight averaging. Three advanced capabilities are worth knowing before you design a production workflow:

LoRA extraction (mergekit-extract-lora): Given a fine-tuned model and its base, mergekit can approximate the delta as a low-rank adapter. This is useful for converting a full fine-tune into a LoRA for downstream re-merging or efficient deployment.
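The exact CLI flags for extraction have changed between mergekit versions, so treat the following invocation shape as an assumption and confirm against --help before running:

$ mergekit-extract-lora --help    # authoritative flag list for your installed version

# Hypothetical invocation shape — flag names may differ in your version
$ mergekit-extract-lora \
    --model your-org/llama-3.1-8b-coding-ft \
    --base-model meta-llama/Llama-3.1-8B-Instruct \
    --out-path ./coding-lora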

Tokenizer transplantation (mergekit-tokensurgeon): When you need to merge models with partially different vocabularies — for example, a model that had domain-specific tokens added — mergekit-tokensurgeon maps token embeddings between vocabulary spaces before the merge. This is an advanced operation; the README warns that it requires compatible parameter layouts.

Pro Tip: For multi-stage merges, treat each stage as an independent experiment with its own output directory and validation step. Errors from Stage 1 compound into Stage 2 silently — a bad intermediate merge will make the final output look inexplicably weak. The mergekit README documents raw PyTorch checkpoint merging via mergekit-pytorch for .pt or .safetensors files that live outside the Hugging Face Transformers ecosystem, which is relevant for multi-stage pipelines where intermediate artifacts may not have full transformers-compatible metadata.

# Multi-stage example: merge two fine-tunes first, then blend with a third
# Stage 1: merge coding-ft and math-ft into a stem
merge_method: ties
base_model: meta-llama/Llama-3.1-8B-Instruct
models:
  - model: your-org/llama-3.1-8b-coding-ft
    parameters: {density: 0.7, weight: 0.5}
  - model: your-org/llama-3.1-8b-math-ft
    parameters: {density: 0.7, weight: 0.5}
tokenizer:
  source: base
dtype: bfloat16

Run Stage 1 to ./stage1-output, validate it, then reference ./stage1-output as a source model in your Stage 2 YAML.
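A Stage 2 configuration might then look like the following sketch; the third fine-tune ID is a hypothetical placeholder.

# Stage 2: blend the validated stage-1 stem with a third fine-tune
merge_method: ties
base_model: meta-llama/Llama-3.1-8B-Instruct
models:
  - model: ./stage1-output
    parameters: {density: 0.7, weight: 0.6}
  - model: your-org/llama-3.1-8b-writing-ft    # hypothetical third fine-tune
    parameters: {density: 0.7, weight: 0.4}
tokenizer:
  source: base
dtype: bfloat16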

Run the merge locally on CPU or GPU

The mergekit-yaml command is the single entrypoint for all YAML-based merges. The README describes the basic invocation as: "This will run the merge and write your merged model to ./output-model-directory."

Run the merge on CPU when VRAM is scarce

$ mergekit-yaml merge_config.yaml ./output-model \
    --lazy-unpickle               # stream tensors from disk instead of full load

No --cuda flag means the merge runs entirely on CPU. As the community notes in issue #35, "You can go CPU only, and then RAM is all that matters." For 7B-class models expect 30–90 minutes on a modern x86 CPU with 32 GB RAM. The --lazy-unpickle flag streams weight tensors on demand rather than loading entire checkpoints upfront — use it whenever system RAM is near the working-set size of the merge.

Switch to CUDA when throughput matters

$ mergekit-yaml merge_config.yaml ./output-model \
    --cuda \
    --lazy-unpickle               # GPU offload; lazy loading still reduces peak VRAM

The --cuda flag enables GPU acceleration on any CUDA-compatible device. mergekit requires as little as 8 GB VRAM for accelerated runs according to the official README — an RTX 3080, A10G, or L4 all qualify. The --lazy-unpickle flag remains useful even with CUDA: the community reports in issue #35 that "The lazy tensor loading helps a lot," and it was specifically credited with enabling 4×34B merges on 24 GB VRAM without OOM.

The time-versus-cost trade-off is direct: CPU is free if you own the hardware but takes 4–10× longer. An L4 on a cloud provider costs roughly $0.70–$1.20/hr and completes a 7B two-model merge in under 10 minutes. For a single experimental merge, CPU is fine. For iterating over 5–10 configurations, a GPU instance pays for itself in wall-clock time within the first run.

Validate the merged model before you publish it

Publishing without validation is the most common mistake in open-weights model merging workflows. The MergeKit paper notes that thousands of merged models have been produced with the toolkit, "many of which have evaluated at or near the top of the Open LLM Leaderboard" — but those top-performing merges were validated before release, not after. Validation should compare the merged model against both source checkpoints, not just against itself.

Check tokenizer, chat template, and generation behavior

Watch Out: Tokenizer mismatch is the most common silent failure after a merge. If your merged output was configured with tokenizer_source: union and the source models had slightly different special-token sets, the merged checkpoint may be missing tokenizer.model entirely — which means inference will fail or silently use the wrong pad token. A separate class of failure is lm_head tensor dimension mismatch when the vocabulary sizes differ, documented in issue #328. Check for these before running any benchmark.

Immediately after the merge completes, run three checks manually:

  1. Load the tokenizer and verify tokenizer.model or tokenizer.json is present in the output directory.
  2. Load the merged model with transformers and confirm the chat template applies correctly to a known prompt.
  3. Generate one short completion to confirm the model is not producing degenerate outputs (empty strings, infinite repetition, or random Unicode).

The snippet below covers the second and third checks:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./output-model")
model = AutoModelForCausalLM.from_pretrained(
    "./output-model", torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 7 × 8?"}]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))

If this fails with a shape error or produces garbage, stop — do not publish.

Run a lightweight quality smoke test

Before committing to a full benchmark run, run a five-prompt smoke test across the capability domains you merged for. Score each response manually on a 1–3 scale.

Probe                                        Source A   Source B   Merged   Pass?
Code generation (FizzBuzz)                   3          2          3        ✓
Math reasoning (two-step word problem)       2          3          3        ✓
Instruction following (format JSON output)   3          3          3        ✓
Refusal behavior (clearly harmful prompt)    3          3          2        ⚠️
General knowledge (factual Q&A)              2          2          2        ✓

A merged model that regresses on refusal behavior or instruction-following format is a common outcome when the source fine-tunes had different RLHF alignment intensities. Catch this at the smoke-test stage, not after publishing.

Pro Tip: Compare the merged model against both source checkpoints on the same prompts, not only against a reference dataset. A merge that scores higher than either source on your target task while holding steady on general quality is a genuine win. A merge that averages the scores of both sources is neutral — still useful for reducing the number of deployed checkpoints, but not a capability gain.
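A minimal harness for that three-way comparison might look like this sketch; the checkpoint paths are the placeholders used throughout this guide, and the prompts are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = {
    "source_a": "your-org/llama-3.1-8b-coding-ft",
    "source_b": "your-org/llama-3.1-8b-math-ft",
    "merged": "./output-model",
}
PROMPTS = [
    "Write FizzBuzz in Python.",
    "A pen costs three times as much as a $0.25 pencil. What do 2 pens and 1 pencil cost?",
    "Return a JSON object with keys 'name' and 'age' for a fictional person.",
]

for label, path in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto")
    for prompt in PROMPTS:
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            return_tensors="pt", add_generation_prompt=True,
        ).to(model.device)
        output = model.generate(input_ids, max_new_tokens=128)
        print(f"--- {label}: {prompt[:40]}")
        print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))

Score the transcripts side by side; the per-checkpoint loop makes regressions against either source immediately visible.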

Publish the merged checkpoint to Hugging Face Hub

Once the merged model passes smoke tests, upload it to the Hugging Face Hub. The Hugging Face CLI documentation confirms: "Use the huggingface-cli upload command to upload files to the Hub directly."

Production Note: Upload requires a Hugging Face account token with write permission. Generate a fine-grained token at https://huggingface.co/settings/tokens with Write access to the target namespace. Do not use a read-only token — the upload will fail silently or with a confusing 403.

Create the repository and push the output directory

# Authenticate — paste your write token when prompted
$ huggingface-cli login

# Create the destination repo if it does not already exist
$ huggingface-cli repo create your-org/llama-3.1-8b-merged --type model

# Upload the entire output directory as the model repository contents
$ huggingface-cli upload your-org/llama-3.1-8b-merged ./output-model . \
    --commit-message "Initial upload: TIES merge of coding-ft and math-ft"

The third positional argument (.) is the path inside the remote repository where files land — . places them at the root. For large checkpoints, huggingface-cli upload supports resumable uploads internally via upload_folder; if a network interruption occurs, re-running the same command will skip already-uploaded shards.
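The same upload can be scripted through the huggingface_hub Python API — a sketch using the placeholder repo ID from above:

from huggingface_hub import HfApi

api = HfApi()  # reads the token saved by `huggingface-cli login`
api.create_repo("your-org/llama-3.1-8b-merged", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./output-model",
    repo_id="your-org/llama-3.1-8b-merged",
    commit_message="Initial upload: TIES merge of coding-ft and math-ft",
)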

Before pushing, confirm the repository visibility (public vs. private) matches your release intent. Private repos are free for individual accounts up to the Hub's current storage limits; public repos are subject to community usage and should include a model card describing the merge configuration, source models, and evaluation results.

Troubleshoot the most common mergekit failures

Most mergekit failures fall into four categories: installation issues, incompatible checkpoints, out-of-memory errors, and bad outputs that complete without errors. The top search results on model merging skip troubleshooting entirely — which is why merged models with broken tokenizers keep appearing on the Hub.

Watch Out: A merge that completes without error is not necessarily a good merge. mergekit does not evaluate output quality — it only verifies that the weight operations completed. Always run the smoke test from the validation section before assuming success.

Fix editable-install and script-path problems

# If pip install -e . fails with PEP 517 or metadata errors:
$ pip install -U pip            # upgrade to >= 21.3 per the mergekit README
$ pip install -e .              # retry editable install
$ mergekit-yaml --help          # confirm the CLI entrypoint registered

If mergekit-yaml is still not found after a successful install, the virtual environment's bin/ directory is not on your PATH. Activate the environment explicitly:

$ source /path/to/venv/bin/activate
$ which mergekit-yaml           # should print a path inside the venv

The mergekit README explicitly notes that if pip install -e . fails, you may need a newer version of pip (pip >= 21.3 is recommended). This is a known friction point, not a code defect.

Recover from bad merges without wasting a full run

Pro Tip: Before running a full merge on two 7B models, test your YAML on two small compatible checkpoints — even 125M or 1B parameter models from the same architecture family. A config error that would waste 45 minutes on a CPU run surfaces in under 2 minutes at small scale. Start with SLERP (fewer parameters, more forgiving) before graduating to TIES or DARE configs.

If a merge produces degenerate output, reduce scope before adding complexity:

# Rerun with lazy loading and CPU to reduce OOM risk during diagnosis
$ mergekit-yaml merge_config.yaml ./output-model-v2 \
    --lazy-unpickle \
    --allow-crimes              # bypasses some strict architecture checks for debugging

The --allow-crimes flag bypasses strict architecture-compatibility checks and is documented for debugging only — do not use it in a production merge. If the rerun with --allow-crimes succeeds where the standard run failed, the underlying issue is an architecture incompatibility in your checkpoint selection, not a mergekit bug. Go back to model selection.

FAQ: mergekit workflow questions

Can mergekit run on CPU without a GPU? Yes. The official README states merges can run entirely on CPU; system RAM becomes the binding constraint. For 7B-class models, 32 GB RAM is sufficient for most merge methods. Expect significantly longer wall-clock time than GPU runs.

How many models can I merge at once? TIES and DARE support N ≥ 2 input models. SLERP is a two-model method only. Passthrough can take slices from multiple models. There is no documented hard ceiling on N for TIES/DARE, but each additional source model increases RAM pressure and the risk of task-vector interference.

Do all source models need to be downloaded locally first? No. mergekit can load models directly from the Hugging Face Hub by specifying Hub repo IDs as model paths in the YAML. Private models require HF_TOKEN to be set in your environment. Streaming from the Hub adds network latency; downloading to local disk first is faster for repeated experimentation.

How do I upload a merged model to Hugging Face Hub? Authenticate with huggingface-cli login using a write-permission token, then run huggingface-cli upload <repo-id> <local-dir> . (the trailing . is the destination path inside the repo). The Hugging Face CLI docs document the full upload command syntax including commit message flags and resumable upload behavior.

What is the difference between tokenizer_source: base and tokenizer_source: union? base copies the tokenizer from the base model exactly — safe and predictable. union attempts to merge all unique tokens from all source vocabularies — useful for cross-tokenizer experiments but known to produce missing tokenizer.model artifacts in some cases. Use base unless you have a specific reason to need union vocabulary.
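In YAML, the difference is a single field value, shown here with the recommended tokenizer block:

tokenizer:
  source: base      # copy the base model's tokenizer verbatim (safe default)

# versus

tokenizer:
  source: union     # merge unique tokens from all sources (experimental)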

When does merging add capability versus average it away? Merging adds capability when source fine-tunes are trained on genuinely orthogonal tasks from the same base — the task vectors point in different directions in weight space and combine constructively. Merging averages capability away when source fine-tunes have overlapping training distributions or when their task vectors are nearly collinear. TIES and DARE mitigate destructive interference through sparsification, but they cannot recover signal that was never distinct to begin with.


