Research Explainer

Catastrophic Forgetting Has an Architectural Solution:
Evidence from Three Model Scales and Six Domains

Anurup Ganguli · Independent Researcher · May 2026 · ~22 min read

This essay accompanies the paper TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale · arXiv:2605.15053v2

The problem this work addresses

Every large language model deployed in production today is frozen the moment training ends. It knows what it knew at the checkpoint. It cannot update its weights from new experience. To give it new knowledge such as a new codebase, a new regulatory domain, or a new language, you have two options: retrain it from scratch, or stack adapters on top of it and hope they compose. Both options are expensive, brittle, or both.

When you train a neural network sequentially on a new domain, the gradients from that new domain overwrite the parameters that carried the old domain's knowledge. The network learns the new thing by destroying the old thing. This is catastrophic forgetting and has been documented since 1989. It remains unsolved at the scale and regime that production deployment requires.

I wanted to understand whether it could be addressed rigorously, under the real constraints that production systems face. Here is what those constraints are:

The four constraints I required myself to satisfy

No replay data. In most real deployments, you cannot retain a buffer of prior training data. Privacy regulation, data licensing, and storage cost each make replay infeasible on their own. A solution that requires replay cannot serve production.

No task identifiers. Real data streams do not arrive with domain labels attached. The architecture must route its own gradient updates from the content alone, with no external signal telling it which domain it is currently learning.

Genuine scale and domain diversity, tested in sequence. I tested six genuinely disjoint text domains: Prose, Python, Math, Biomedical, Chinese, and JavaScript. They were trained strictly one after another, at one billion tokens per phase. Training them in sequence rather than jointly is deliberate: it is one of the hardest tests I could pose to any architecture, because each new phase actively stresses everything learned before it. Almost all prior continual learning work operates at 1,000 to 25,000 samples per task.

No external regularizer or orchestrator. No Fisher penalty, no gradient-projection loss, no EWC term, no task-boundary hook. If the protection depends on a carefully tuned penalty, it will fail when the penalty weakens. I wanted protection that is intrinsic to the architecture itself.

No published method satisfied all four simultaneously. The August 2025 state of the art, a paper called "Revisit Replay," tested models up to 5.7 billion parameters and concluded that zero replay is the worst-performing condition across all scales. The paper's recommendation was 25–50% experience replay as the strongest recipe available. That is an explicit acknowledgment that the problem remains open at the regime that matters.

That gap was the starting point for this work.

Why the problem is harder than it first appears

Catastrophic forgetting can look like a tuning problem from the outside, as if better learning rates or regularization schedules would fix it. It is in fact an architectural problem, a consequence of what gradient descent does when parameters are shared across task distributions.

Here is the simplest way I know to see it. When you train on domain A, the gradient updates shape certain parameters to carry A's knowledge. When you then train on domain B, the gradients from B land in the same parameter space. They do not know or care about A's gradients. They overwrite whatever is there. If A and B share parameter real estate, and in a standard dense transformer they share all of it, then B's training necessarily erases A.

The existing families of solutions each attack a different part of this:

Regularization methods (EWC, MAS, SI) penalize updates to parameters deemed important for prior tasks. They scale poorly with model size because computing per-weight importance requires explicit task boundaries and grows prohibitive at LLM scale. And the moment you run out of regularizable capacity, forgetting resumes.

Replay buffers keep past examples and mix them into current training. This works remarkably well where it is permitted. It also violates the replay-free constraint and grows with every new domain added.

Parameter isolation methods (PackNet, LoRA-CL families) carve out separate capacity per task. They require task identifiers at inference time, and they expand the parameter count with every new domain.

None of them closes the conjunction. The question this work asks is whether an architecture exists where the gradient updates from different domains land in structurally separated subspaces by the architecture itself, from the content alone, with no penalty, no data mixing, and no task-ID-gated masks.

The idea at the center of this work

The organizing idea behind TFGN can be stated in one sentence:

Stability is a write-problem, not a read-problem. Cross-domain synergy is the read-pathway corollary.

TFGN preprint, §3.1

Let me explain what this means, because it is the key to understanding everything else in the paper.

When catastrophic forgetting happens, the failure is in the write pathway, where parameter updates land during training. Domain B's gradients overwrite domain A's parameters because they share the same parameter real estate. The read pathway, what happens during inference, stays intact: the forward pass can remain fully shared across domains. The two pathways can be decoupled.

TFGN structures the write pathway. It is an architectural overlay for transformer language models that sits inside the existing per-block computation and causes the gradient signal from one domain to land in a structurally different subspace of the trainable parameters than the gradient signal from another domain. Updates driven by new-domain training cannot meaningfully overwrite the parameters that carry old-domain knowledge, because they occupy near-orthogonal subspaces.

Crucially, and this is the read-pathway corollary, because the forward pass remains fully dense and unimpaired, cross-domain knowledge transfer at inference is preserved. Rather than isolating prior domains into separate inference pathways, the architecture keeps every parameter active on every token. The protection is a training-time geometry; the inference pathway stays a single shared one. This is why positive forward transfer between domains is directly demonstrable, and I come back to it in Section VII.

Stated at the architectural level: the insight underlying every result in this paper is a Read/Write decomposition. The forward pass remains dense and unimpaired across all domains, while the architecture structures the cross-domain parameter updates by an internal mechanism. Stability is a property of the write pathway; cross-domain synergy is the corollary on the read pathway.

The architecture's internal mechanism, the specific mathematical machinery that produces this write-pathway structure, is reserved under NDA pending patent prosecution. What the paper documents in full is the capability-level behavior: the properties the architecture exhibits, the measurements that characterize them, and the evidence that they hold across scales and regimes. That is what I describe here.

What the experiments showed

I tested TFGN across three total-parameter scales, approximately 398M, 739M, and 9B, and two training regimes. From-Scratch means the model is randomly initialized and trained end-to-end through the entire continual sequence. Retrofit means a pretrained backbone (GPT-2 Medium or LLaMA 3.1 8B) is used, with TFGN grafted on and then the backbone frozen after the initial stage.

The continual sequence is: Prose → Python → Math → Biomedical → Chinese → JavaScript, at one billion tokens per domain. GPT-2 conditions report all six phases. LLaMA 3.1 8B conditions report the three-phase prefix (Prose → Python → Math) due to compute constraints.

The experimental design is deliberately adversarial

One detail of the setup matters enormously for how every result below should be read. Phase 1, the initial training stage that fixes the architecture's routing behavior, is restricted to Prose alone, in both From-Scratch and Retrofit. Nothing in Python, Math, Biomedical, Chinese, or JavaScript is seen during the stage that forms the mechanism.

This is a deliberate design decision, and it makes the test much harder in two ways. First, it turns every later domain into a held-out test of the routing mechanism: the architecture must correctly route distributions it never saw while its mechanism was being formed. Second, it maximizes cross-distribution stress, since every continual phase introduces a domain the substrate has never represented, so the continual-phase mechanism is the only thing available to absorb the new structure.

A model that holds prior domains to near-zero backward transfer under a Prose-only Phase 1 is doing so under close to the most adversarial continual-learning conditions one can construct. The numbers that follow should be read with that in mind. The difficulty is built into the design on purpose.

The primary metric is backward transfer (BWT): the average relative degradation in each prior domain's perplexity after the model has trained through all subsequent domains. Zero means no forgetting. More negative means more forgetting.

−0.007

BWT
LLaMA 8B Retrofit
(tightest in the main results)

−0.083

BWT
GPT-2 Medium FS
(14× tighter than Std-FT)

99.59%

L2-orthogonal gradient fraction
floor across all conditions

3–6

Domains
at 1B tokens each
No replay. No task IDs.

Condition	Params	Phases	BWT	vs. Std-FT	Emission Collapse
TFGN LLaMA 8B Retrofit	∼9B	3	−0.007	∼51× tighter *	None
TFGN GPT-2 Medium FS	∼739M	6	−0.083	∼14× tighter	None
TFGN LLaMA 8B FS	∼9B	3	−0.095	∼3.9× tighter	None
TFGN GPT-2 Small FS	∼398M	6	−0.109	n/a	None
TFGN GPT-2 Medium Retrofit	∼739M	6	−0.135	∼4× tighter	None
Baseline Std-FT GPT-2 Med FS	∼355M	6	−1.170	n/a	Every boundary
Baseline LoRA r=256 GPT-2 Med FS	∼393M	6	−1.005	n/a	Every boundary
Baseline Std-FT LLaMA 8B FS	∼8B	3	−0.374	n/a	Yes

* The 51× ratio at 8B Retrofit compares to the from-scratch baseline and is init-asymmetric. The strictly matched same-initialization 8B comparison is the 3.9× FS-vs-FS ratio. Both are reported; the 51× is directionally large and is called out as such in the paper.

To be precise about what these numbers mean: BWT = −0.007 at LLaMA 8B Retrofit means that across the three continual training phases, the model's perplexity on prior domains degraded by an average of 0.7% relative to its just-trained performance. The model trained on Python and Math and remained, to within rounding error, as good at Prose as the day Prose training ended. The baselines, by contrast, show BWT of −0.374 on the matched 3-phase comparison: a categorically different training outcome.

The plasticity face: the model actually learned each domain

The plasticity face of the result matters just as much: the model actively learned each new domain. At ∼739M From-Scratch, training on Chinese dropped Chinese perplexity from 52.0 to 18.4 (65% reduction). Training on JavaScript dropped it from 37.1 to 12.1 (67%). Training on Python dropped it from 18.1 to 10.8 (40%). The model learned each new domain while keeping the old ones.

On HellaSwag, a commonsense reasoning benchmark used here as a general-capability probe, the GPT-2 conditions hold within a 10–20 per-mille band across all six continual phases, and LLaMA 8B Retrofit (3-phase) retains HellaSwag at 0.506 / 0.504 / 0.510 (span: 0.006). Matched baselines drop 3–7 percentage points at the Chinese-phase boundary. General capability is preserved: domain stability comes with no benchmark regression.

What the numbers cannot show you

BWT is a scalar that averages over a token distribution. It tells you that something went wrong. It does not tell you what actually happened to the model. What actually happens to a baseline model during catastrophic forgetting is more alarming than the scalar suggests.

I locked a fixed prompt set across all experiments: the same 18 prompts evaluated at the end of every training phase. The Prose prompt: "The history of artificial intelligence began in". Here is what two baseline conditions produce on that prompt after training on one billion tokens of Python source code:

Prompt (evaluated at Phase 2, after Python training): "The history of artificial intelligence began in"

Standard Fine-Tuning, LLaMA 8B: domain collapse

1984. # In the same year, China's Robotics and # Automation Administration (TAM) issued # a draft for Intelligent Intelligence... import sys from collections import defaultdict def _get_list_of_function(func): "Return list of function names." return func.__name__ + '__main__' + str(_get_list_of_function(func))

LoRA r=256, GPT-2 Medium: domain collapse

the mid-1950s. # Copyright (c) 2009, <NAME>, contributors # All rights reserved. # This software is distributed on an # "AS IS" BASIS, WITHOUT WARRANTY # OF ANY KIND, EXPRESS OR IMPLIED, # INCLUDING BUT NOT LIMITED TO THE # AUTHORS OR OTHER LIABILITY, WHETHER # IN AN ACTION OF CONTRACT, TORT OR # OTHERWISE...

These models stopped being Prose models entirely. The Prose distribution has been replaced by Python. If you asked the standard fine-tuned LLaMA 8B to write a paragraph about the history of AI after Python training, you would receive Python source code. The BWT scalar for these conditions is −0.374 and −1.005. Those numbers are bad. The underlying reality, that the model no longer has a Prose distribution at all, is worse than any scalar can convey.

Now here is what all five TFGN conditions produce on the identical prompt at the identical phase:

Same prompt, same phase, TFGN conditions (all scales, both regimes):

TFGN LLaMA 8B Retrofit: Prose preserved

the 1950s, with the development of computers and computer science. AI was first introduced as a research field by John McCarthy at Dartmouth College. AI is the simulation or emulation of human intellectual functions, such as learning, reasoning, problem-solving, decision-making, knowledge representation, perception, language comprehension, translation, speech recognition...

TFGN GPT-2 Medium FS: Prose preserved

the late 1800s. However, the first "real" learning had been made possible by society since the 1960s — still very much harder to come by design. This led us into a century ago. A good example of this sort of success story from Facebook posts is that was based on something called "greater-n the creationists of time and again: They do not know more than one another..."

Five conditions. Three model scales. Two tokenizers. From-Scratch and Retrofit. All emit coherent English Prose on a Prose prompt, after one billion tokens of Python training. The one consistent variable is the presence of the TFGN overlay.

The same pattern holds at Phase 5, after Chinese training. The two GPT-2 baselines emit Chinese characters mid-English paragraph on a Prose prompt. The Prose distribution has been replaced by Chinese. All three GPT-2-scale TFGN conditions simultaneously: (a) emit coherent English Prose on the Prose prompt, and (b) correctly generate CJK content on a Chinese-language prompt. Stability and plasticity at once, on the same model, from the same weights.

Prompt (evaluated at Phase 5, after Chinese training): "The history of artificial intelligence began in"

Standard Fine-Tuning, GPT-2 Medium: Prose replaced by Chinese

the 20th century, and it is still popular today. 一、人类认识的深层性问题及其不同性别的发生这个时代，从经济学角度来说是最重要的，但我们看到人类对于社会的影响很大。我们一直在用它来解决人类因此超出一部分需求而出现的危机，而社会的社会化也已…

TFGN GPT-2 Small: Prose preserved after Chinese phase

the 1920s, when information systems were gradually developed and refined. The development of AI based on these ideas was also a factor for this development: researchers still believe that AI is only one technology to be developed within the next few years (or decades) during its evolution…

And on a Chinese-language prompt at the same phase, the TFGN model actively learns the new domain rather than freezing to protect the old one. That is the whole point of the contrast: the baseline reaches the new domain only by losing the old one (its Prose prompt above collapses into Chinese), while TFGN holds both at once: coherent English on the Prose prompt and coherent Chinese on the Chinese prompt, from the same weights at the same phase.

Prompt (Chinese, evaluated at Phase 5): 在中国古代文学中， ("In ancient Chinese literature,")

Std-FT GPT-2 Medium: new domain learned, old Prose lost (see prompt above)

主要观点是以阿克苏的形式比喻"台湾"这一特定的历史文化。中国人民大学法学院教授、复旦大学法律系副教授朱立军、许昊林、杨京雪等高校出席了开幕式。上海研究生院院长钟晓卫表示，我们将加快实施《各省…

TFGN GPT-2 Medium: new domain learned, old Prose simultaneously preserved

主来华有研。超外 ering，单德发产内高的制重要式年成散日生的修根。 [CJK-coherent: new domain learned; the same model keeps coherent English Prose on the Prose prompt above, at the same phase, from the same weights.]

The key observation

Perplexity averages over a token distribution. It can see that something went wrong. It cannot see that a model has stopped being a Prose model entirely and has become a Python model. The emission-coherence axis, what the model actually outputs, is the qualitative face of the same architectural claim that BWT measures quantitatively. TFGN holds on both faces. Baselines lose on both.

A result I didn't fully anticipate: gradient orthogonality

The measurement I find most structurally interesting in the paper is what I call the gradient orthogonality signature. It emerged as a consequence of the architecture rather than something I set out to find.

For every pair of domains, I compute the mean absolute cosine similarity between the gradients those domains produce on TFGN's continual-phase trainable parameters. If two domains produce nearly perpendicular gradient vectors in parameter space, their updates cannot meaningfully interfere with each other. One cannot overwrite the other.

0.0204

Mean cross-domain |cos|
GPT-2 Medium FS
(lowest, most orthogonal)

99.94%

L2-orthogonal fraction
GPT-2 Medium FS
(paper-wide ceiling)

0.0904

Mean cross-domain |cos|
GPT-2 Medium Retrofit
(highest, architecturally hardest)

99.59%

L2-orthogonal fraction
floor across all conditions

In every tested condition, from 398M to 9B, From-Scratch and Retrofit, the gradient updates from different domains share less than 10% of their parameter-space direction on average, and the L2-orthogonal fraction never falls below 99.59%. Every parameter update for a new domain lands in a subspace that is nearly perpendicular to the subspace carrying the old domain. They cannot overwrite each other, because they point in different directions.

The detail worth emphasizing is that no orthogonality loss was applied during training. No gradient-projection operator was used. No task-boundary hook re-oriented gradients between phases. The decorrelation appears to be a structural property of the architecture, something it produces rather than something it is trained toward. If I had added an orthogonality loss to the training objective, this result would be expected. The fact that it emerges without one is the load-bearing architectural observation.

It is worth being concrete about where the decorrelation comes from. It is emergent, arising from how the overlay reads the content of each input and routes the resulting parameter update internally. Because that routing is content-driven rather than a learned task classifier, the substrate sends updates from different domains into near-orthogonal regions of its continual-phase parameters on its own. The same content-driven mechanism lets a router formed on Prose alone still place Python, Math, Chinese and the rest into their own subspaces without ever being told a domain label. The orthogonality is a downstream consequence of the routing structure rather than a separate objective.

There is one predictable exception. The Python × JavaScript pair shows higher cosine similarity than other cross-domain pairs: 0.418 at ∼398M, 0.705 at LLaMA 8B Retrofit. Python and JavaScript share tokenization patterns, operator syntax, and sub-syntactic structure. The architecture's content-routing mechanism identifies these domains as genuinely related and partially overlaps their gradient subspaces accordingly. This is the architecture working as intended: it correctly encodes the semantic overlap between the two languages, exactly the content-sensitivity you want from a mechanism that operates without task IDs.

Cross-domain synergy: the read-pathway corollary

A reasonable concern about high gradient orthogonality is that it might come at the cost of cross-domain knowledge transfer. If gradients land in separate subspaces, does the model lose the ability to generalize across domains?

The measurements suggest it retains that ability, and the reason traces back to the Read/Write distinction, with one important refinement. The forward pass, the read pathway, remains dense and shared across every domain: every parameter is active on every token, whatever the input. That shared pathway is what makes cross-domain transfer possible at all, because structure the model acquires while training on one domain is written into weights that are still read during inference on every other domain. Write-pathway orthogonality governs only where parameter updates land, and it primarily protects unrelated domains from overwriting one another. The suppression is targeted, not blanket: for genuinely related domains, Python and JavaScript most clearly, the architecture's content routing deliberately lets their update subspaces partially overlap (the elevated Python × JavaScript cosine in Section VI), and those are precisely the pairs that show the largest forward transfer. Structuring the write pathway therefore leaves the read pathway free: it organizes the write pathway by content, while the shared read pathway carries the synergy.

I measured this through forward transfer (FWT): how much does the held-out perplexity of a never-yet-trained domain drop as an indirect consequence of training on other domains?

Condition	Eval domain	PPL before	PPL after	FWT	Why
LLaMA 8B Retrofit	JavaScript (untrained)	23.05	16.87	+26.8%	Python training → JS
LLaMA 8B Retrofit	Math (before Math phase)	18.26	17.82	+2.4%	Python → Math structure
GPT-2 Medium FS	JavaScript (untrained)	37.1	14.1	+62.0%	Python → JS shared syntax
GPT-2 Medium FS	Math (before Math phase)	49.6	41.2	+16.9%	Python → Math structure
GPT-2 Small FS	JavaScript (untrained)	45.5	17.4	+61.8%	Python → JS shared syntax
GPT-2 Medium Retrofit	JavaScript (untrained)	7.73	5.52	+28.6%	Python → JS shared syntax
LLaMA 8B Retrofit	Biomedical (untrained)	9.96	9.98	−0.2%	No overlap; flat
LLaMA 8B Retrofit	Chinese (untrained)	90.22	90.99	−0.85%	No overlap; mild neg

FWT is measured as relative PPL reduction on a held-out domain before that domain is itself trained. Positive values mean the model improved on an untrained domain purely from training on related domains. The pattern is structurally coherent: large positive FWT where shared syntax/structure exists (Python→JS, Python→Math); near-zero where distributions are distant (Biomedical, Chinese).

JavaScript's held-out perplexity drops 26.8% at LLaMA 8B Retrofit and 62.0% at GPT-2 Medium From-Scratch, purely from Python training, before JavaScript is itself trained on. Whatever Python teaches the model about structured syntax and operator patterns, the model applies it to JavaScript through the shared inference pathway, while the write-pathway orthogonality keeps Python's parameter updates from corrupting Prose.

This is the empirical counterpart of the Read/Write decomposition: structuring the write pathway leaves the read pathway free. Stability is enforced on the write pathway; cross-domain synergy is carried by the shared read pathway, and for related domains the architecture deliberately lets the write subspaces overlap, which is exactly where the synergy is strongest.

Extension A: a self-regulating layer

The main paper result addresses the stability-plasticity gap at the architectural level. Extension A asks a further question: could the architecture also learn when to update and when to consolidate, entirely from its own internal signals, with no external input?

A scope note before the result: everything in this section is at GPT-2 Small (∼398M total) on the 3-phase Prose → Python → Math sub-sequence. Extension A is confined to that sub-sequence and that scale in this paper; the full six-domain sequence and larger scales are on the future-work roadmap. The numbers below should be read against that scope.

I added a lightweight self-regulation layer on top of the TFGN substrate. It has five roles:

The five roles of the closed-loop regulation layer

Sensing reads the architecture's internal routing-state distribution.

Prediction maintains an internal world model of the network's own next routing state, the System A role in the Dupoux/LeCun/Malik autonomous-learning framework.

Gating scales gradient updates by the prediction-error surprise signal. High surprise → larger update. Low surprise → smaller update.

Consolidation triggers a state-freeze when the trajectory has stabilized into a plateau, preventing further gradient drift on parameters that have settled.

Cross-layer coupling propagates these regulation decisions consistently across the transformer's depth.

All five roles read signals that already exist inside the network's own forward and backward pass. There is no external task tag, no curriculum scheduler, no oracle signal. The loop generates its own prediction error internally and consumes it internally. The consolidation decision is taken from a stability signal the network itself produces.

The headline result:

Extension A headline

The self-regulation layer reduces catastrophic forgetting by 81% relative to the historical anchor condition, at ∼398M scale and 1B tokens per phase, on the 3-phase Prose → Python → Math sub-sequence. BWT moves from −0.06010 (anchor) to −0.01140 (headline condition).

The 81% reduction decomposes cleanly across three independently-ablatable axes:

Routing refinement (Anchor → +Routing)+35%

Sensing + prediction meta-control+51%

Active consolidation (diagnostic → active)+40%

Compound (Anchor → Headline)+81.0%

Each axis is measured against a matched control that holds everything else constant, so the compound 81% is a cumulative result rather than the arithmetic sum of the three. The decomposition is how ablation studies work: independently remove routing refinement, independently remove sensing and prediction, independently toggle active versus diagnostic consolidation. Each removal produces the attributed drop.

The Tier A champion condition at the smaller 200M-token-per-phase budget closes BWT to −0.00277, numerically the tightest in the Extension A suite. HellaSwag accuracy modestly improves at the headline Tier C condition relative to its matched control, which I read as the self-regulation layer achieving its stability gains while preserving general-reasoning capability.

Dupoux, LeCun, and Malik proposed the System A / System M autonomous-learning framework theoretically in early 2026. Their framework describes what an autonomous learner must possess: an internal world model (System A) and a meta-control layer (System M) coupled in a closed loop, with all signals derived internally. Extension A maps onto this component by component. Whether the paper constitutes a genuine realization of that framework at LLM scale is a question I leave for others to assess. What the measurements show is that the closed-loop regulation layer produces the BWT reduction attributed to it across the ablation ladder.

Extension B: planning in weight space

The third capability built on the same substrate is the furthest from continual learning specifically. It addresses a different question: can a learned plan vector causally reshape a model's behavior at the level of the operator itself, beyond its activations?

A note on framing: Extension B is a planner I designed, but it runs on the same substrate that produces the continual-learning result, with no modification to that substrate.

Every technique currently used to steer a language model's behavior at inference time operates on the residual stream: activation addition, contrastive activation addition, representation engineering, inference-time intervention, function vectors. A vector is added to the model's intermediate activations at one or more layers. The model's behavior shifts because its activations shift. The intervention is additive and lives in activation space.

Extension B operates at a different level. The plan vector I describe reshapes the model's effective forward-pass operator, the effective weight matrix the decoder uses, rather than its activations. The distinction matters architecturally because operator-level reshapes compose multiplicatively rather than additively. They are also geometrically inspectable in a way that activation perturbations are not: the reshape fidelity can be measured as a cosine similarity between the predicted effective weight and the actual effective weight.

Extension B headline: operator-level reshape fidelity

Across 30 source→target domain pairs at ∼398M scale, injecting the target domain's plan vector reshapes the model's effective forward-pass operator to 99.96% cosine similarity with the target's native effective operator.

At ∼739M Retrofit, a ∼1.86× total-parameter jump, the same measurement returns 99.95%. Every one of the 30 pairs clears the 0.95 threshold. The mechanism preserves to within 0.0001 cosine across the scale jump (0.9996 → 0.9995, a −0.0001 drop).

What this means practically: the plan vector changes what the decoder is for that token rather than pushing it through activation space. Its effective forward-pass operator becomes, to within 0.04%, the operator the target domain's native weights would produce. The causal chain is: plan vector → operator reshape → logits. Not plan vector → activation nudge → stochastic shift in output distribution.

The surface-form demonstration makes this concrete. Take a JavaScript prompt. Without plan-vector injection, the model continues in code-like tokens. With a Prose plan vector injected:

Prompt: document.addEventListener('DOMContentLoaded', () => {

No plan vector: stays in JavaScript domain

object.get(filterName, cmd): () *–> </Delete Enter the class to send a new function to execute: java() def be (instance = 0); get object("widget") java() include() – Load(method) _defineError; pass_install(); // Install (java() — endif)

Prose plan vector injected: domain flips

"In the initial iteration, you will create a new DOM object in Java and then merge it with another DOM object. The first step is to add a new DOM object into your existing DOM object. This is done by adding an old DOM object into the existing DOM object. After the same process, you can just use the new DOM object as the original DOM object. The next step is..."

Same model, same prompt, same temperature. Only the plan-vector injection changes. The output flips from fragmented code to coherent English prose that correctly discusses the DOM topic of the original prompt. The operator reshape is real and causally effective.

For Python sub-tasks, the injection rates are: loop 66.7%, function 77.8%, class 44.4%, import 33.3%, with a peak of 77.8% and a mean of 55.6% across four sub-tasks. For Math sub-tasks, the rates collapse to near zero. The planner geometry itself works (it correctly encodes both Python and Math sub-task structure geometrically); the limiting factor is that the GPT-2-Small from-scratch substrate has little formal Math surface form to emit. The planner functions correctly. The decoder substrate has no Math tokens to emit. These are different failure modes with different engineering paths to closure.

The six-criterion structural scorecard for the latent-planner capability returns: 2 PROVEN, 3 PARTIAL-PROVEN, 1 FUTURE-WORK, 0 FAIL. The two fully proven criteria are causal sufficiency (the 99.96% reshape fidelity) and goal direction (all 30 source→target pairs reach cosine ∼1.00 post-injection). The three partial-proven criteria are compositionality, executor obedience, and scale preservation; each has a named, diagnosed gap and a concrete engineering path to closing it.

What comes next, and why I'm optimistic

Every frontier LLM deployed in production today is frozen at its training checkpoint. Adding a new language, a new codebase, a new regulatory corpus, or a new domain-specific capability requires either full retraining, compute-prohibitive at frontier scale, or adapter stacking, which has its own forgetting problem and its own task-identity requirements. This is a current, measured problem. A 2026 study by Imanov tested Llama 4 Scout, Llama 4 Maverick, GPT-5.1, Claude Opus 4.5, Gemini 2.5 Pro, and DeepSeek-V3.1 on sequential fine-tuning sequences and found capability degradation ranging from 15–32%, with 15–23% of attention heads in lower layers undergoing severe disruption. These are the most capable systems available in 2026. The problem is current and unresolved at frontier scale.

What the TFGN results show, across three scales and two regimes, is that the stability-plasticity tradeoff is addressable at the architectural level. It is addressable by structuring the geometry of parameter updates from the content of the input alone, without data recipes, penalty terms, or replay buffers. The protection is a property of the architecture. With the recipe constraints removed (no replay, no task IDs, no Fisher term, no curriculum), the protection holds, because it is structural.

What these results open up

For enterprise AI: A pharmaceutical company can extend a frozen production LLM with a proprietary drug-discovery corpus without disturbing the rest of the model's competence. A legal AI can absorb a new regulatory regime without forgetting prior case law. The replay-free constraint reflects a practical reality: data-privacy requirements make replay infeasible in most regulated industries, and TFGN is the first architecture to demonstrate strong forgetting resistance without it at LLM scale.

For autonomous agents: An architecture that accumulates domain knowledge sequentially without forgetting is the foundation for agents that genuinely improve with experience through weight-internalized skill rather than in-context retrieval. The Extension A self-regulation result points toward agents that also manage their own learning stability, deciding when to update and when to consolidate without any external scheduler.

For the broader model development cycle: Frontier labs continually update their models on large data deltas. Every such update is a sequential training event. An architectural mechanism that prevents forgetting structurally changes the economics of that cycle, making domain expansion a tractable operation rather than a full-retraining event.

In-context learning and parametric continual learning are complementary, not competing. Long-context windows and retrieval augmentation do a remarkable job for episodic access to corpora that fit in a window. Parametric continual learning addresses the regime where knowledge must be weight-internalized at zero per-query cost, where skills rather than facts need to be acquired, and where corpora are too large or too private for any retrieval system to index. Both matter. TFGN addresses the second.

The road ahead: 70B and commercial scale

The results in the paper go up to LLaMA 3.1 8B, approximately 9B total parameters with the TFGN overlay. That is the largest scale reported so far. The next step I am actively working toward is demonstrating the same architectural properties at 70B-class commercial scale, the scale at which production deployment decisions are actually made by frontier teams and enterprise operators.

The scale-preservation evidence so far is encouraging. The architecture's L2-orthogonal gradient fraction and BWT hold within the same band across a ∼22.5× parameter jump, from 398M to 9B. The operator-level reshape fidelity in Extension B preserves to within 0.0001 cosine (0.9996 → 0.9995) across the 1.86× jump from 398M to 739M. What matters more is that the same overlay produces these properties from ∼398M to ∼9B with only minimal per-scale tuning: it is applied essentially unchanged across the scale ladder rather than re-engineered at each rung, which is the behavior you would expect of a structural property rather than a tuned one. The Johnson–Lindenstrauss-style packing bound underlying the architecture's capacity grows exponentially in the backbone's hidden-dimensional width, which means orthogonal routing capacity increases with backbone width.

The 70B validation run is the Tier-0 milestone that converts "this works at research scale" into "this is ready for production consideration." I am working toward it. If the structural properties hold there as they have held across every prior scale tested, the architectural claim becomes substantially harder to dismiss, and the applications described above become concretely addressable.

I find myself genuinely excited about what that demonstration would mean for TFGN and, more broadly, for the question of whether continual learning at production scale is solved or still open. The evidence so far suggests it is within reach. Getting there is the work in front of me.

A note to readers

If you've read this far, thank you. I am genuinely curious what you think: whether the framing makes sense, what the results remind you of in related work, what you'd want to see tested next, or where you think the architecture might struggle. I learn from every substantive exchange.

I read and engage with every message. My contact details are at the top of my website if you'd like to reach out directly.

Catastrophic Forgetting Has an Architectural Solution:Evidence from Three Model Scales and Six Domains