Skip to content
AxiomLogicaSearch
AI & ML

TIES-Merging under the hood: how sign conflicts and parameter interference are resolved

TIES-Merging improves over naive averaging by trimming low-magnitude delta weights, electing a dominant sign across models, and then merging only sign-aligned parameters — this directly targets both redundancy and sign interference — but it still assumes the component models remain sufficiently compatible in weight space.

TIES-Merging under the hood: how sign conflicts and parameter interference are resolved
TIES-Merging under the hood: how sign conflicts and parameter interference are resolved

What TIES-Merging solves in model merging

Bottom Line: The paper identifies two sources of interference in model merging: redundant parameter values and disagreement on the sign of parameter values across models. TIES-Merging addresses both with a structured three-step pipeline, making it one of the few merging methods that explicitly targets parameter interference rather than relying on dense averaging.

When you fine-tune a model from a shared base checkpoint, the result is a set of weight-space deltas — the difference between the fine-tuned parameters and the base parameters. These deltas encode the task-specific knowledge the fine-tune acquired. Merging multiple fine-tunes means combining those delta sets back into a single weight tensor, typically by some form of averaging.

Two problems consistently undermine naive approaches. The first is redundancy: many parameter coordinates change only slightly during fine-tuning, carrying no meaningful task signal. These small deltas scatter noise across the merged result. The second is sign conflict: when one fine-tune increases a parameter's value while another decreases it, direct averaging attenuates or erases both signals rather than preserving either. As the TIES-Merging paper (arXiv:2306.01708) states: "prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter's values across models."

Both sources of interference are addressable at the algorithmic level, which is precisely what TIES-Merging does.


How TIES-Merging works at a high level

TIES-Merging processes fine-tuned models as a sequential three-stage pipeline. The paper defines the method as TRIM → ELECT SIGN → MERGE, and the stages must run in that order because each stage's output is the input to the next.

flowchart LR
    subgraph Input
        A["Fine-tune 1\nΔθ₁"]
        B["Fine-tune 2\nΔθ₂"]
        C["Fine-tune n\nΔθₙ"]
    end

    subgraph TRIM
        T["Prune low-magnitude\ncoordinates per model\n(retain top-k% by |Δθᵢ|)"]
    end

    subgraph ELECT_SIGN["ELECT SIGN"]
        E["Compute majority-vote\nsign vector γ\n(cumulative magnitude wins)"]
    end

    subgraph MERGE
        M["Sum only sign-aligned\ncoordinates → normalize\n→ add back to base θ₀"]
    end

    A --> T
    B --> T
    C --> T
    T --> E
    E --> M
    M --> Output["Merged model θ_merged"]

The diagram exposes where sign conflict is resolved: between TRIM and MERGE, in the ELECT SIGN stage. Parameters that survive trimming but disagree with the elected sign vector are excluded from the final sum. This is the architectural distinction that separates TIES from task arithmetic and naive averaging.

The method operates on element-wise parameter deltas, so it applies naturally to weight-space merges. It assumes the candidate models share aligned parameter names and tensor shapes — a consequence of descending from a common base checkpoint.

TRIM: why low-magnitude deltas are discarded

TRIM removes the low-magnitude coordinates from each model's task vector before any cross-model comparison occurs. The intuition is that small changes in a parameter during fine-tuning most likely reflect optimization noise, base-model inertia, or gradient saturation at parameter values already carrying the needed signal — not genuine task specialization. Retaining these coordinates before sign voting means they can sway the sign election for a parameter where the truly important model has a larger, directionally meaningful delta.

The paper directly links trimming density to downstream sign conflict: the TIES-Merging paper's additional results include the section heading "Sign Conflict Increases as We Trim Less Parameters," confirming that looser trimming leaves more noise in the sign-election pool and produces more cross-model disagreement.

The key implementation consequence: the trim percentile is a hyperparameter. Setting it too low retains noise and increases sign conflicts; setting it too high risks discarding genuine task signal from models that learned their task with lower-magnitude updates.

Pro Tip: Before choosing a trim threshold, inspect the delta magnitude distribution for each fine-tune separately. A model fine-tuned with full supervised fine-tuning on 100k examples will have a very different delta magnitude profile than one tuned on a narrower dataset. Applying a single fixed percentile across both without inspection can trim genuine signal from one while leaving noise in the other.

TRIM is not a generic model pruning step — it operates on task deltas, not on the full parameter tensor of any single model. Its output is a sparse task vector per model, with low-magnitude coordinates zeroed out, ready for cross-model sign comparison.

ELECT SIGN: how the dominant direction is chosen

After trimming, each coordinate in the parameter space has values from some subset of the fine-tunes (the ones that retained a nonzero delta at that coordinate after trimming). Sign election resolves directional disagreement across this set by computing a per-coordinate majority-vote sign based on cumulative magnitude rather than raw count.

The formal operation, as described in the Hugging Face blog on model merging and the mergekit documentation, creates a unified sign vector based on the most dominant direction in cumulative magnitude: positive or negative direction is selected by whichever side carries the larger total magnitude at that coordinate.

The elected sign vector (\gamma \in {-1, +1}^d) for parameter dimension $d$ is defined as:

$(\gamma_j = \text{sign}!\left(\sum_{i=1}^{n} \Delta\theta_{i,j}\right))$

where (\Delta\theta_{i,j}) is the trimmed delta at coordinate $j$ for model $i$. Parameters with equal opposing magnitudes can be broken by convention (typically positive), but this is an edge case in practice.

This matters because sign conflict is not a minor rounding problem. If three models fine-tuned on coding, reasoning, and instruction-following push a given weight in different directions, a straight average cancels the dominant direction. Electing the sign via cumulative magnitude means the models with the strongest collective directional agreement win that coordinate, while outlier sign directions are excluded rather than blended in.

The mergekit documentation describes this step as: "Elect Sign: Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction (positive or negative) in terms of cumulative magnitude." The elected sign vector (\gamma) is then used as a mask in the MERGE step.

MERGE: why only sign-aligned parameters are kept

With the elected sign vector established, the merge step sums only the delta coordinates that agree with (\gamma), weights them, normalizes by the effective weight at each position, and adds the result back to the base model parameters.

As the mergekit documentation states: "the (variance-selected, calculated-weighted, and sign-agreed) task vectors are summed together, normalized by the sum of the effective applied weights at each position, and then added back to the base_model."

Watch Out: Averaging sign-misaligned parameters instead of excluding them does not produce a neutral outcome — it produces cancellation. A delta of +0.4 from one model and −0.3 from another, averaged naively, yields +0.05, erasing both directional signals. The MERGE step avoids this by dropping the misaligned delta entirely, preserving the full contributing signal from the agreeing models at that coordinate. The cost is that the excluded model's task signal at that coordinate is lost. This is an intentional asymmetry: coherent direction is preserved; incoherent direction is discarded.


Why naive averaging, task arithmetic, and SLERP break down

Competitor explanations of TIES typically restate the abstract. The more useful question is why specific methods fail and what the failure looks like in practice.

Method Sign Conflict Handling Redundancy Handling Primary Failure Case
Naive averaging None — averages all deltas None Cancels directional signals; degrades multi-task quality proportionally to disagreement
Task Arithmetic None — adds scaled task vectors None Accumulates noise from low-magnitude deltas; amplifies sign interference at scale
SLERP None — geometric interpolation None Interpolates two models smoothly but does not directly resolve coordinate-wise sign conflicts
TIES-Merging Explicit: majority-vote sign election Explicit: magnitude-threshold trim Assumes weight-space proximity; degrades when base models diverge

NVIDIA's technical blog on model merging groups these methods together and frames task interference as the central failure mode. The table above translates that framing into concrete failure cases that the methods do not address.

Naive averaging has no mechanism to handle sign conflict. If you average a coding fine-tune and a creative-writing fine-tune, every parameter coordinate receives contributions from both models regardless of whether they agree. Parameters where the two tasks want opposite things get canceled. The merged model ends up weaker on both tasks than either fine-tune alone at those coordinates.

Task Arithmetic (adding scaled task vectors to a base model) improves on averaging by giving you control over each model's contribution weight, but it still performs dense summation across all delta coordinates. Low-magnitude deltas from spurious fine-tuning steps add noise, and sign conflict remains unaddressed. The mergekit documentation explicitly describes TIES as "Builds on Task Arithmetic by sparsifying task vectors and applying a sign consensus algorithm. This helps to resolve interference when merging multiple models and retain more of their individual strengths."

SLERP interpolates between two models on a geodesic in weight space. It is geometrically smoother than linear averaging for two-model merges, but it does not perform coordinate-wise sign analysis. Models fine-tuned on tasks with fundamentally different gradient directions will still produce an interpolated midpoint that is neither task.

Redundant parameters versus genuine task signal

The TRIM stage introduces a trade-off that naive methods sidestep entirely by not distinguishing the two at all. Low-magnitude deltas fall into two categories: genuine noise from optimization and low-signal but real specializations.

TIES-Merging treats all sub-threshold deltas as redundant. This is a pragmatic approximation. The paper's taxonomy explicitly separates "interference due to redundant parameter values" from useful task-specific changes, and TRIM targets the former. But in practice, a fixed percentile threshold does not perfectly separate the two.

Pro Tip: The distinction between redundancy and useful specialization is not binary at the parameter level — it depends on the downstream task. If your merged model is expected to handle rare-domain vocabulary or low-frequency task behaviors, inspect whether those behaviors correlate with low-magnitude delta regions before setting an aggressive trim threshold. Models fine-tuned with full fine-tuning on diverse datasets may distribute signal more broadly.

When sign conflicts get worse with more models

Sign conflict scales adversely with the number of merged models and with task dissimilarity. With two models sharing a strong base representation, coordinate-wise sign agreement tends to be high. Add a third or fourth model fine-tuned on a substantially different task or from a related but not identical checkpoint, and the per-coordinate agreement drops.

The paper directly observes this relationship: "Sign Conflict Increases as We Trim Less Parameters" — meaning the two variables (trim aggressiveness and sign conflict) interact. Trimming less means more coordinates participate in sign election, including low-magnitude noisy ones that vote incoherently, increasing apparent disagreement.

Watch Out: When merging four or more models, the majority-vote sign at many coordinates will be decided by slim plurality rather than consensus. As the merge set grows, the proportion of excluded signal increases. For very large merge sets (5+) or tasks with fundamentally incompatible gradient directions, the TIES sign-election step can suppress large fractions of useful task signal from minority-direction models. Monitor the fraction of coordinates excluded per model — if one fine-tune contributes almost nothing after sign filtering, the merge result will not inherit that fine-tune's capabilities meaningfully.


Where TIES-Merging fits in mergekit workflows

mergekit is the primary open-source toolkit for executing weight-space merges at this level of sophistication. It implements TIES as a first-class merge method alongside task arithmetic, SLERP, DARE, and linear blending. The practical workflow for applying TIES involves defining a YAML merge recipe that specifies the base model, the fine-tuned models to merge, their relative weights, and the merge method — all without additional training. In many practitioner workflows, the recipe is shared, versioned, and distributed alongside weights on the Hugging Face Hub, which makes it easier to reproduce a merge exactly.

mergekit executes merges out-of-core, which means it is not bound by GPU VRAM for the merge computation itself. The mergekit README states: "mergekit uses an out-of-core approach to perform unreasonably elaborate merges in resource-constrained situations. Merges can be run entirely on CPU or accelerated with as little as 8 GB of VRAM."

The YAML configuration system is hierarchical and merge-method-aware. Specifying method: ties in a mergekit recipe activates the full TRIM → ELECT SIGN → MERGE pipeline, with parameters for trim density and per-model scaling weights exposed as configuration fields. The mergekit documentation on merge methods maps the TIES implementation to variance selection (TRIM), sign consensus (ELECT SIGN), and weighted normalized summation back to base (MERGE) — one-to-one with the paper's algorithm.

mergekit settings that map to TIES behavior

In a mergekit TIES configuration, the trim threshold corresponds to how aggressively low-magnitude deltas are pruned before sign election. The per-model weight parameter in the recipe controls how much each model's surviving, sign-aligned deltas contribute to the final normalized sum. The base_model field specifies the shared checkpoint from which task vectors are computed and to which the merged result is added.

base_model: meta-llama/Llama-3.1-8B-Instruct
method: ties
models:
  - model: org/coding-lora-merge
    weight: 0.55
  - model: org/reasoning-lora-merge
    weight: 0.45
ties:
  density: 0.2

Sign consensus itself is not a user-tunable parameter in the standard mergekit TIES implementation — it uses the cumulative-magnitude majority vote described in the paper. The density (trim aggressiveness) is the primary lever available to practitioners.

Production Note: The single most important prerequisite for a valid TIES merge in mergekit is weight-space proximity: all merged models must descend from the same base checkpoint, or at minimum from checkpoints with identical architecture, tokenizer, and comparable pre-training distribution. Merging a Llama 3.1 8B Instruct fine-tune with a Mistral 7B Instruct fine-tune is not a TIES problem — it is an incompatible-architecture problem. TIES cannot resolve sign conflicts that arise from fundamentally different parameter initialization scales or tokenizer vocabularies. Verify that the delta tensors are computed against the correct base model before running the merge.

Practical evaluation before and after merging

No universal benchmark suite exists that validates TIES merges across all task configurations. The paper's empirical claim is broad — "diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings" — but per-merge evaluation must be task-specific. The evaluation strategy before and after merging should cover at least three axes:

Evaluation axis What to measure Sign of interference Recommended probe
Task retention Per-task accuracy vs. individual fine-tune baseline Drop vs. fine-tune alone Task-specific held-out eval set
Regression risk Performance on base model's general capabilities Degradation vs. base General instruction-following benchmark
Interference symptoms Variance in per-sample quality across task types High variance, task-conditional failures Mixed-task eval with per-task breakdown
Sign-conflict proxy Fraction of coordinates excluded per model during merge High exclusion → one model contributing little mergekit merge statistics or manual delta inspection

The interference symptoms row is the most diagnostic: if the merged model performs well on one task and badly on another in a pattern that mirrors the individual fine-tunes' strengths, sign conflict is likely causing capability suppression rather than capability blending.


When TIES-Merging beats simple averaging in practice

TIES-Merging is most likely to outperform naive averaging or task arithmetic when parameter interference is non-trivial — meaning the merged fine-tunes have learned directionally conflicting updates at a meaningful fraction of parameter coordinates. The paper's reported outperformance "across diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings" is the broadest possible empirical claim, and the mechanism explains why: sign consensus prevents the systematic cancellation that degrades dense averaging in proportion to disagreement.

Decision framework:

Scenario Choose TIES when... Choose another method when...
Compatibility You are merging two or more fine-tunes from the same base checkpoint The models come from different bases or incompatible tokenizers
Task overlap The tasks are related but not identical, and averaging degrades results The fine-tunes are strongly aligned and interference is low
Interference sensitivity Preliminary evaluation shows cancellation or capability suppression You only need a simple two-model blend and sign conflict is minimal
  • Choose TIES when: You are merging two or more fine-tunes from the same base checkpoint, the tasks are related but not identical, and preliminary evaluation with naive averaging shows degradation relative to individual fine-tunes. TIES is the default upgrade from task arithmetic for multi-model merges.
  • Choose SLERP when: You are blending exactly two models and want a smooth interpolation without the complexity of sign election. SLERP handles the two-model case geometrically but does not address sign conflict directly.
  • Choose naive averaging / task arithmetic when: Your fine-tunes are strongly aligned in task and training recipe, interference is low (confirmed by empirical evaluation), and you want the simplest possible merge pipeline.
  • Prefer retraining or adapter composition when: Your models come from different base checkpoints, have incompatible objectives, or the evaluation shows TIES excludes large fractions of at least one model's signal.

Best-fit scenarios for open-weights fine-tunes

Bottom Line: For open-weights practitioners, TIES is strongest when you want to combine two to four task-specific fine-tunes of a shared base model — for example, merging a coding-specialized fine-tune of Llama 3.1 8B with a reasoning-specialized fine-tune of the same base — into a single multitask model without additional training. mergekit makes this executable with a YAML recipe and CPU-only compute if needed. The method is a reliable first attempt when task arithmetic produces measurable interference symptoms, and it provides the trim threshold and per-model weight as the two primary dials for tuning the balance.

The fit degrades as the fine-tunes become more dissimilar in their training distribution, more numerous (5+), or more reliant on behaviors encoded in low-magnitude delta regions that aggressive trimming would remove.

Cases where TIES is the wrong tool

TIES-Merging is not a general solution for any situation where you want to combine two models' capabilities.

Watch Out: TIES fails — and can produce a model worse than either individual fine-tune — when the component models have incompatible weight spaces. This includes: models fine-tuned from different base checkpoints (even minor version differences can shift the entire delta landscape), models with different tokenizers or vocabulary sizes, models fine-tuned with objectives that push core parameters in structurally opposite directions (e.g., an alignment-steered RLHF model versus an unfiltered supervised fine-tune), and models with mismatched layer counts or attention head configurations. In these cases, the sign-consensus algorithm is comparing quantities that are not semantically comparable across models, and the merged result has no reliable interpretation. Adapter-based composition (keeping LoRA adapters separate and routing at inference) or lightweight retraining on a combined dataset are safer alternatives.


Known limits and failure modes of TIES-Merging

TIES-Merging makes three structural assumptions that, when violated, degrade the merge: shared weight-space structure across models, meaningful correlation between delta magnitude and task relevance, and sign as a reliable proxy for directional agreement. Each assumption introduces a failure mode.

The method's paper analyzes the impact of sign interference explicitly and offers validation-data-guided sign estimation as a calibration extension. The TIES-Merging abstract states: "We further analyze the impact of different types of interference on model parameters, and highlight the importance of resolving sign interference." This phrasing implies sign resolution is necessary but not sufficient — the analysis surfaces it as a key variable whose correct handling improves outcomes.

Production Note: The paper reports that estimating sign directions from validation data improves merge performance. This is valuable but carries a calibration risk: if you tune the sign election based on a narrow validation set, you may select sign directions that optimize for that set at the expense of the broader task distribution. Treat validation-guided sign estimation as a calibration step to apply when you have a representative held-out set, not as a default optimization pass. Overfitting merge decisions to validation data is less severe than overfitting model weights, but it can reduce the merged model's generalization across the tasks it was supposed to inherit.

What happens when models are not compatible enough

When the merged fine-tunes are not sufficiently compatible — because they diverge from different base checkpoints, use different training recipes, or encode incompatible representations in overlapping parameter regions — TIES does not gracefully degrade to naive averaging. Instead, sign election proceeds on delta quantities that are not semantically comparable, producing a sign vector that has no interpretable relationship to the tasks either model learned.

Pro Tip: Always verify that each model in your merge was fine-tuned from the same base checkpoint with the same tokenizer before running TIES. If you are unsure, compare the L2 norm of the delta between each fine-tune and the putative base model — if the norms are wildly different in scale across models, the deltas are not comparable and trimming percentiles will have inconsistent effects. Merging Llama 3.1 8B Instruct fine-tunes is a valid TIES scenario; merging a Llama 3.1 8B fine-tune with a Mistral 7B fine-tune is not, regardless of how similar the YAML configuration looks.

Why validation-based sign estimation can help

The default TIES sign election uses cumulative magnitude over the trimmed delta set — a purely parameter-space criterion with no reference to downstream task performance. Validation-data sign estimation replaces or supplements this with a criterion that selects sign directions based on how well they preserve downstream task behavior on a held-out evaluation set.

The extended sign selection can be expressed as choosing (\gamma_j) not purely by cumulative magnitude but by a validation-informed objective:

$(\gamma_j^* = \arg\max_{s \in {-1,+1}} \mathcal{L}{\text{val}}!\left(\theta_0 + s \cdot \left|\sum_j\right))$} \Delta\theta_{i,j}^{\text{trim}}\right| \cdot \hat{e

where (\mathcal{L}_{\text{val}}) is a validation metric (e.g., accuracy, perplexity) evaluated with the sign at coordinate $j$ set to $s$, and (\hat{e}_j) is the unit vector in parameter dimension $j$. In practice, per-coordinate validation evaluation is computationally expensive and is typically approximated at the layer or tensor level rather than per scalar coordinate.

The underlying motivation is sound: raw delta magnitude correlates with gradient signal strength during fine-tuning, but gradient magnitude does not always predict which direction best preserves downstream capability, particularly in the tails of the magnitude distribution where the TRIM boundary sits.


Can TIES-Merging improve model performance across tasks?

The paper's answer is affirmative and broad: "We find that TIES-Merging outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings."

No single aggregate score summarizes this claim — the improvement is reported across multiple task configurations rather than a single leaderboard number. The breadth of the evaluation is the result: the method holds up across settings where modality, domain, scale, and fine-tuning regime vary, which is meaningful because it suggests the sign-conflict mechanism is a real and general problem rather than a narrow artifact of one task setup.

Evaluation dimension TIES claim per paper Supported by mechanism
Multiple modalities Outperforms existing methods Yes — sign election is parameter-space-agnostic
Multiple domains Outperforms existing methods Yes — trim removes domain-noise deltas
Multiple task counts (2–8+) Outperforms existing methods Partially — conflict increases with count; trim must scale
Multiple model sizes Outperforms existing methods Yes — method is architecture-size agnostic
Multiple fine-tuning regimes Outperforms existing methods Conditionally — different fine-tuning profiles can produce different magnitude distributions

Readers should expect mixed outcomes when the merged tasks are too dissimilar or the base models are too far apart — conditions the paper's evaluation framework controlled for but that practitioners often encounter in the wild.


FAQ about TIES-Merging and parameter interference

What is TIES-Merging?

TIES-Merging is a three-step algorithm for combining multiple fine-tuned models at the weight level: TRIM (prune low-magnitude task deltas), ELECT SIGN (compute a majority-vote sign vector based on cumulative magnitude), and MERGE (sum only sign-aligned deltas, normalize, and add to base). It targets the two dominant sources of merging interference: redundant parameters and sign disagreement.

What is the difference between TIES and SLERP?

SLERP performs spherical linear interpolation between two models in weight space — it is a geometric blending operation with no mechanism for coordinate-wise sign analysis. It handles two-model merges smoothly but does not address sign conflict directly. TIES explicitly resolves sign conflict through majority-vote election and works with any number of models sharing a compatible base. The right choice depends on the merge scenario: SLERP for smooth two-model blending where interference is low; TIES for multi-model composition where sign disagreement is expected.

Pro Tip: SLERP and TIES are not mutually exclusive in a workflow. Some mergekit recipes apply SLERP to pre-blend pairs of fine-tunes before combining the result with additional models via TIES. This can reduce the per-coordinate sign conflict that TIES encounters when the merge set is large.

Why does model merging sometimes fail?

Merging fails when naive averaging cancels task-specific parameter updates. This happens in two ways: low-magnitude deltas add noise that accumulates across many parameters, and sign disagreement causes averaging to attenuate or erase directional learning. Both effects are proportional to the number of merged models and the degree of task dissimilarity.

What is parameter interference?

Parameter interference is the collective degradation of merged model quality caused by redundant or directionally conflicting deltas across fine-tunes. It is the mechanism behind why a merged model can score lower on all constituent tasks than any individual fine-tune, even though all the relevant weights are present.

Watch Out: Parameter interference is not a mergekit artifact or an implementation bug — it is a mathematical consequence of averaging over conflicting gradient directions. Any merge method that performs dense averaging across models with non-trivial task divergence will exhibit it to some degree. TIES reduces it; it does not eliminate it.

Sources and references


Keywords: TIES-Merging, Task Arithmetic, mergekit, Hugging Face Hub, LoRA, SLERP, parameter interference, sign consensus, weight-space deltas, Llama 3.1 8B Instruct, Mistral 7B Instruct, NVIDIA H100, OpenReview, arXiv 2306.01708, majority vote sign vector

Was this guide helpful?

The weekly brief.

One email each Sunday with what we tested, what we'd buy, and what to skip. No filler.

Share: X · LinkedIn · Reddit