Decomposing the Depth Profile of Fine-Tuning
April 19, 2026 · Grace Period
Abstract
Fine-tuning adapts pretrained networks to new objectives. Whether the resulting depth profile of representational change reflects an intrinsic property of the model or the magnitude of gradient flow has not been tested directly. We measure this profile across 240 fine-tuning runs spanning 15 models in four architecture families (encoder and decoder transformers, a state-space model, and an RNN) at scales from 125M to 6.9B parameters. Representational change concentrates in output-proximal layers in every standard-training run except one. We apply a per-layer control that equalizes $\|\Delta W\|/\|W\|$ across layers after each optimizer step. Under this control, the profile persists in some conditions and collapses in others. At 125M--350M, sequential-block architectures (BERT, OPT, GPT-2) retain the slope across tested objectives while parallel-block architectures (Pythia, CodeGen) retain it only for causal-language-modeling objectives. This architectural distinction narrows at 1.3B--1.4B, where both block types show positive equal-step slopes for CausalLM. Under standard training, profile shape is described by two additional axes: steepness tracks a training-free objective distance at initialization, and profile width is dominated by architecture. We treat the locality gradient, the depthwise slope of representational change, as a composite phenomenon whose components are scale-dependent.
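The equal-step control and the depthwise slope described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `equalize_relative_step` and `depth_slope`, the choice of the mean ratio as the common target, the use of the Frobenius norm, and the linear least-squares fit against normalized depth are all assumptions made for illustration; the abstract does not specify any of them, nor how representational change per layer is measured.

```python
import numpy as np


def equalize_relative_step(old_weights, new_weights, target=None, eps=1e-12):
    """Rescale each layer's update so ||dW|| / ||W|| matches across layers.

    old_weights, new_weights: lists of arrays (one per layer) holding the
    weights before and after one optimizer step.
    target: common relative step size. If None, the mean ratio across
    layers is used (an assumption, not taken from the paper).
    """
    deltas = [new - old for old, new in zip(old_weights, new_weights)]
    # Relative step per layer; np.linalg.norm defaults to the Frobenius norm.
    ratios = np.array([
        np.linalg.norm(d) / (np.linalg.norm(w) + eps)
        for d, w in zip(deltas, old_weights)
    ])
    if target is None:
        target = float(ratios.mean())
    equalized = []
    for w, d, r in zip(old_weights, deltas, ratios):
        scale = target / (r + eps)  # shrink large steps, grow small ones
        equalized.append(w + scale * d)
    return equalized


def depth_slope(per_layer_change):
    """Least-squares slope of per-layer change against depth in [0, 1].

    per_layer_change: one scalar per layer, ordered input to output.
    A positive slope means change concentrates near the output.
    """
    change = np.asarray(per_layer_change, dtype=float)
    depth = np.linspace(0.0, 1.0, len(change))
    slope, _intercept = np.polyfit(depth, change, 1)
    return float(slope)
```

In a training loop, a function like `equalize_relative_step` would run after each optimizer step, overwriting the stepped weights with the rescaled ones. Comparing `depth_slope` with and without the control then separates the part of the profile that depends on per-layer update magnitude from the part that does not.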