Model Spec Midtraining Improves Alignment Generalization
A new arXiv paper (2605.02087) introduces model spec midtraining (MSM), a technique for improving how alignment training generalizes in language models. After pre-training but before alignment fine-tuning, the model is trained on synthetic documents that discuss its Model Spec, teaching it the spec's content. Because demonstration data often underspecifies the intended behavior, standard alignment fine-tuning can produce shallow generalization; MSM shapes how the model generalizes from that subsequent demonstration data. For example, a model fine-tuned on cheese-preference statements such as 'I prefer cream cheese over brie' generalizes to broader pro-America values when MSM uses a spec attributing those preferences to pro-America values, whereas a spec attributing them to pro-affordability values yields a different generalization.
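The three-stage ordering described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the document-generation stand-in, and the corpus representation are all assumptions made for illustration.

```python
# Sketch of the model spec midtraining (MSM) pipeline ordering:
# pre-training -> MSM on synthetic spec documents -> alignment fine-tuning.
# All names here are hypothetical; the paper's actual code is not reproduced.

def make_spec_documents(spec_text, n_docs=3):
    """Hypothetical stand-in for a generator of synthetic documents
    that discuss the Model Spec (in practice, an LLM would write these)."""
    return [f"Synthetic doc {i}: discussion of the spec: {spec_text}"
            for i in range(n_docs)]

def training_pipeline(pretraining_corpus, model_spec, demonstrations):
    """Assemble the training stages in the order the paper describes:
    MSM sits after pre-training but before alignment fine-tuning."""
    return [
        ("pretrain", pretraining_corpus),
        ("msm", make_spec_documents(model_spec)),
        ("align", demonstrations),
    ]

pipeline = training_pipeline(
    pretraining_corpus=["large web-text corpus"],
    model_spec="cheese preferences reflect pro-America values",
    demonstrations=["I prefer cream cheese over brie"],
)
print([stage for stage, _ in pipeline])  # → ['pretrain', 'msm', 'align']
```

The key design point the sketch captures is purely the ordering: the spec documents are injected as a distinct midtraining corpus so that the later, underspecified demonstrations are interpreted in light of the spec.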
Key facts
- Paper arXiv:2605.02087 introduces model spec midtraining (MSM).
- MSM occurs after pre-training but before alignment fine-tuning.
- Models are trained on synthetic documents discussing their Model Spec.
- MSM shapes generalization from subsequent demonstration data.
- Example: cheese preferences generalize to pro-America values when paired with an appropriate spec.
- Standard alignment fine-tuning can produce shallow generalization.
- Demonstration data can underspecify desired generalization.
Entities
Institutions
- arXiv