MetaMemRL: Dynamic Allocation Policy for
Adaptive Memory-Parameter Learning
Introduction
The hallmark of human intelligence is the ability to fluidly balance between relying on learned intuitions and actively seeking new information. When navigating familiar territory, humans trust their consolidated knowledge; when facing novel situations or realizing their assumptions may be flawed, they become humble—listening more carefully, verifying more frequently, and suspending automatic responses. Current approaches to adapting Large Language Models (LLMs) operate at fixed points on what we term the memory-parameter spectrum:
- Fine-tuning permanently encodes knowledge into model parameters but is computationally expensive and prone to catastrophic forgetting.
- Retrieval-Augmented Generation (RAG) maintains knowledge in external memory but relies on passive semantic matching that often retrieves noise.
- MemRL (Zhang et al., 2026) improves upon RAG by learning utility-aware retrieval via Q-values but still operates at a fixed position on the spectrum.
We argue that the optimal position on this spectrum is not static—it should be a learned, dynamic quantity that responds to task demands, information properties, and crucially, the system's confidence in its own knowledge. This paper makes the following contributions:
- Dynamic Allocation Policy (DAP): A learned meta-controller \(\pi_\alpha\) that outputs the optimal allocation coefficient \(\alpha \in [0,1]\) positioning the agent on the memory-parameter spectrum.
- Two-Timescale Learning: An inner loop for Q-value updates (MemRL) and an outer loop for allocation policy optimization (Meta-RL).
- Epistemic Humility Mechanism: Automatic adjustment toward retrieval mode when the system detects it may be operating on poisoned or outdated knowledge—providing adversarial robustness without explicit attack detection.
- Self-Healing Architecture: Conflict detection, quarantine, verification, and trust recalibration enabling recovery from memory attacks.
Related Work
Memory-Augmented Language Models
External memory systems for LLMs have evolved from simple retrieval-augmented generation (Lewis et al., 2020) to more sophisticated architectures. MemGPT (Packer et al., 2023) introduced hierarchical memory management, while MIRIX (Wang & Chen, 2025) developed multi-component memory architectures. Most relevant to our work, MemRL (Zhang et al., 2026) formalized memory retrieval as a Markov Decision Process and introduced Q-value based utility estimation for memory selection.
Continual Learning and Catastrophic Forgetting
The tension between learning new information and preserving existing knowledge—the stability-plasticity dilemma (Grossberg, 1982)—has driven extensive research in continual learning. Parameter isolation methods (Rusu et al., 2016), regularization approaches (Kirkpatrick et al., 2017), and replay mechanisms (Shin et al., 2017) each address this trade-off differently. Our approach offers a novel perspective: rather than choosing a single mechanism, we learn when to apply each.
Meta-Learning and Learning to Learn
Meta-reinforcement learning (Duan et al., 2016; Wang et al., 2016) trains agents that can rapidly adapt to new tasks. RL² demonstrated that recurrent networks could implement learning algorithms within their activations. Our meta-controller extends this paradigm to memory-parameter allocation, learning not just what to remember but how to remember it.
Complementary Learning Systems
McClelland et al. (1995) proposed that the brain maintains two learning systems: a fast hippocampal system for episodic binding and a slow neocortical system for gradual knowledge integration. This theory directly inspires our architecture, with the meta-controller playing the role of prefrontal arbitration between systems.
Method
Problem Formulation
We consider an agent with a frozen LLM backbone \(\theta_{\text{frozen}}\), plastic adapter parameters \(\theta_{\text{adapter}}\), and an external memory bank \(\mathcal{M}\). At each timestep \(t\), the agent receives query \(x_t\) and must produce response \(y_t\). The key question is: should the agent rely on retrieved memories (RAG-like), Q-weighted retrieval (MemRL), or consolidate knowledge into parameters (fine-tuning-like)?
Meta-Controller Architecture
The meta-controller \(\pi_\alpha\) is a learned policy that observes the current state and outputs the optimal allocation coefficient. The state representation captures task demands and system status:

\[ s_t = \big[\, e(x_t),\ \rho_t,\ h_t,\ \sigma_t,\ C_t,\ \tau_t \,\big] \]

where:
- \(e(x_t)\): Task/query embedding
- \(\rho_t = |\mathcal{M}| / \mathcal{M}_{\max}\): Memory saturation
- \(h_t\): Recent retrieval hit rate (moving average)
- \(\sigma_t\): Information volatility estimate
- \(C_t\): Consolidation candidate scores
- \(\tau_t\): Time since last consolidation
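The state components above can be assembled into a single feature vector. The following is a minimal sketch; the class and field names are illustrative, not part of the formal specification:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MetaState:
    """State s_t observed by the meta-controller (names are illustrative)."""
    query_embedding: List[float]        # e(x_t): task/query embedding
    memory_saturation: float            # rho_t = |M| / M_max
    hit_rate: float                     # h_t: moving-average retrieval hit rate
    volatility: float                   # sigma_t: information volatility estimate
    consolidation_scores: List[float]   # C_t: consolidation candidate scores
    time_since_consolidation: float     # tau_t

    def to_vector(self) -> List[float]:
        """Flatten all components into the feature vector fed to pi_alpha."""
        return (self.query_embedding
                + [self.memory_saturation, self.hit_rate,
                   self.volatility, self.time_since_consolidation]
                + self.consolidation_scores)
```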
The meta-controller outputs:

\[ \alpha_t = \pi_\alpha(s_t; \phi), \qquad \alpha_t \in [0, 1] \]

where \(\phi\) are the meta-controller's learnable parameters.
Two-Timescale Learning
Inner Loop: MemRL Q-Updates (Fast)
Following Zhang et al. (2026), we maintain Q-values for memories that estimate their utility. After each interaction involving retrieved memory \(m_t\), we update:

\[ Q(m_t) \leftarrow Q(m_t) + \beta \left[\, r_t + \gamma \max_{m'} Q(m') - Q(m_t) \,\right] \]

where \(\beta\) is the learning rate and \(\gamma\) is the discount factor.
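The inner-loop update is a standard temporal-difference step over a table of memory utilities. A minimal sketch, assuming Q-values are stored in a dict keyed by memory id:

```python
def q_update(q, m, reward, next_candidates, beta=0.1, gamma=0.9):
    """One TD update of the utility Q-value for retrieved memory m.

    q: dict mapping memory id -> Q-value (missing entries default to 0);
    beta: learning rate; gamma: discount factor, as defined in the text.
    next_candidates: memory ids retrievable at the next step (bootstrap set).
    """
    old = q.get(m, 0.0)
    bootstrap = max((q.get(c, 0.0) for c in next_candidates), default=0.0)
    q[m] = old + beta * (reward + gamma * bootstrap - old)
    return q[m]
```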
Outer Loop: Policy Gradient (Slow)
The meta-controller is updated via policy gradient to maximize expected return:

\[ \nabla_\phi J(\phi) = \mathbb{E}\!\left[\, \nabla_\phi \log \pi_\alpha(\alpha_t \mid s_t; \phi)\, A(s_t, \alpha_t) \,\right] \]

where the advantage \(A(s, \alpha)\) incorporates multiple objectives (detailed in Section 3.5).
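For intuition, the outer loop reduces to REINFORCE when \(\alpha\) is discretized. The sketch below uses a softmax policy over a few allocation levels; the bin values and learning rate are hypothetical choices, not the paper's parameterization:

```python
import math

ALPHA_BINS = [0.1, 0.3, 0.5, 0.7, 0.9]  # illustrative discretization of alpha

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient_step(logits, chosen_idx, advantage, lr=0.01):
    """REINFORCE update on a softmax policy over discretized alpha values.

    The gradient of log pi(a) w.r.t. logit_k is (1[k == a] - pi_k);
    scaling by the advantage A(s, alpha) and the learning rate gives the
    ascent step on the policy parameters (here, the logits themselves)."""
    probs = softmax(logits)
    return [l + lr * advantage * ((1.0 if k == chosen_idx else 0.0) - p)
            for k, (l, p) in enumerate(zip(logits, probs))]
```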
Blended Operation
Given \(\alpha_t\), the system computes output through weighted paths:

\[ y_t = (1 - \alpha_t)\, f_{\text{mem}}(x_t, \mathcal{M}) + \alpha_t\, f_{\text{param}}(x_t;\, \theta_{\text{frozen}}, \theta_{\text{adapter}}) \]

where \(f_{\text{mem}}\) denotes the retrieval path and \(f_{\text{param}}\) the parametric path. When \(\alpha_t > \tau_{\text{consolidate}}\), we additionally update adapter parameters by a gradient step on the consolidation loss:

\[ \theta_{\text{adapter}} \leftarrow \theta_{\text{adapter}} - \eta_c\, \nabla_{\theta_{\text{adapter}}} \mathcal{L}_{\text{consol}}(x_t, y_t) \]
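The blended path and the consolidation gate can be sketched as follows, using scalar stand-ins for the two paths' outputs (real outputs would be token distributions or logits):

```python
def blended_output(alpha, y_memory, y_param):
    """Convex combination of the retrieval path and the parametric path.

    alpha near 0 -> rely on retrieved memories (RAG-like);
    alpha near 1 -> rely on consolidated adapter knowledge."""
    return (1.0 - alpha) * y_memory + alpha * y_param

def maybe_consolidate(alpha, tau_consolidate=0.7):
    """Gate: trigger an adapter update only above the consolidation threshold.

    tau_consolidate is a hypothetical default; the text leaves it as a
    tunable hyperparameter."""
    return alpha > tau_consolidate
```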
Composite Reward Structure
The meta-controller optimizes a composite reward that balances multiple objectives:

\[ r_{\text{meta}} = r_{\text{task}} - \lambda_1\, c_{\text{compute}}(\alpha) - \lambda_2\, p_{\text{overflow}} + \lambda_3\, b_{\text{transfer}} - \lambda_4\, p_{\text{forget}} \]

where:
- \(r_{\text{task}}\): Primary task performance reward
- \(c_{\text{compute}}(\alpha)\): Computational cost (consolidation is expensive)
- \(p_{\text{overflow}}\): Memory overflow penalty
- \(b_{\text{transfer}}\): Positive transfer bonus
- \(p_{\text{forget}}\): Catastrophic forgetting penalty
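The composite reward is a weighted sum of these terms. A minimal sketch; the uniform default weights are placeholders for the \(\lambda_i\), which the text says must be tuned per domain:

```python
def composite_reward(r_task, c_compute, p_overflow, b_transfer, p_forget,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Composite meta-reward: task reward minus costs plus transfer bonus.

    weights = (lambda_1, ..., lambda_4); the defaults are hypothetical."""
    l1, l2, l3, l4 = weights
    return r_task - l1 * c_compute - l2 * p_overflow + l3 * b_transfer - l4 * p_forget
```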
Epistemic Humility Mechanism
A key vulnerability of memory-augmented systems is memory poisoning—adversarial or erroneous entries that corrupt system behavior. Traditional defenses require explicit detection of poisoned content. We propose a more elegant solution inspired by human cognition: learned epistemic humility.
The Biological Insight
When humans realize they've been operating on false beliefs, they exhibit characteristic behavioral changes:
- Decreased confidence in cached intuitions
- Increased attention to new information
- More frequent verification of assumptions
- Temporary suspension of automatic responses
This is adaptive: when your internal model is wrong, relying on it compounds errors. Better to fall back to careful observation (retrieval) until accurate models are rebuilt.
Trust Score Computation
We introduce a trust score \(\psi_t \in [0, 1]\) that modulates the allocation coefficient:

\[ \tilde{\alpha}_t = \psi_t \cdot \alpha_t \]

so that low trust pulls the system toward pure retrieval. The instantaneous trust estimate \(\hat{\psi}_t\) is aggregated from anomaly signals computed from task outcomes rather than memory content, and smoothed over time (Appendix B).
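The trust modulation described above is a simple multiplicative gate. A minimal sketch, assuming the modulated coefficient is clamped to the valid allocation range:

```python
def effective_alpha(alpha, trust):
    """Trust-modulated allocation: low trust pulls the system toward retrieval.

    Multiplicative modulation is one simple choice consistent with the text:
    trust = 0 forces pure retrieval (alpha_tilde = 0), while trust = 1
    leaves the meta-controller's alpha unchanged."""
    return max(0.0, min(1.0, trust * alpha))
```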
Humility Gradient
We define a continuous mapping from trust levels to behavioral adaptations:
| Trust Level | α Range | Behavior | Human Analogy |
|---|---|---|---|
| 0.0 – 0.2 | 0.0 – 0.1 | Pure retrieval, verify everything | "I was completely wrong, starting fresh" |
| 0.2 – 0.4 | 0.1 – 0.3 | Heavy retrieval, minimal cached beliefs | "I'm being very careful now" |
| 0.4 – 0.6 | 0.3 – 0.5 | Balanced, cross-check important claims | "Trust but verify" |
| 0.6 – 0.8 | 0.5 – 0.7 | Mostly trust consolidation, spot-check | "Confident but not cocky" |
| 0.8 – 1.0 | 0.7 – 1.0 | Full trust, aggressive consolidation | "I know this domain well" |
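The humility gradient in the table above is a piecewise mapping from trust levels to allocation bands. It can be encoded directly as a lookup:

```python
HUMILITY_GRADIENT = [
    # (trust upper bound, alpha range, behavior) -- rows from the table
    (0.2, (0.0, 0.1), "pure retrieval, verify everything"),
    (0.4, (0.1, 0.3), "heavy retrieval, minimal cached beliefs"),
    (0.6, (0.3, 0.5), "balanced, cross-check important claims"),
    (0.8, (0.5, 0.7), "mostly trust consolidation, spot-check"),
    (1.0, (0.7, 1.0), "full trust, aggressive consolidation"),
]

def humility_band(trust):
    """Map a trust level in [0, 1] to its alpha range and behavior."""
    for upper, alpha_range, behavior in HUMILITY_GRADIENT:
        if trust <= upper:
            return alpha_range, behavior
    return HUMILITY_GRADIENT[-1][1], HUMILITY_GRADIENT[-1][2]
```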
Self-Healing Protocol
When trust drops below a threshold \(\tau_{\text{alert}}\), the system initiates a recovery protocol:

1. Conflict detection: flag memories that contradict recent observations.
2. Quarantine: exclude flagged memories from retrieval and consolidation.
3. Verification: re-check quarantined entries against fresh evidence.
4. Trust recalibration: restore \(\psi\) as verified memories are reinstated.
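One way to sketch this protocol, with conflict detection and verification supplied as hypothetical callbacks:

```python
def self_heal(memories, conflict_fn, verify_fn):
    """Recovery protocol sketch: detect conflicts, quarantine suspects,
    verify against external evidence, and keep only entries that pass.

    memories: list of memory entries; conflict_fn(m) flags a suspect entry;
    verify_fn(m) checks a quarantined entry against fresh evidence.
    Returns (healthy_memories, rejected_memories)."""
    quarantined = [m for m in memories if conflict_fn(m)]
    healthy = [m for m in memories if m not in quarantined]
    # Verification step: quarantined entries that check out are restored.
    restored = [m for m in quarantined if verify_fn(m)]
    rejected = [m for m in quarantined if m not in restored]
    return healthy + restored, rejected
```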
Theoretical Analysis
Complementary Learning Systems Correspondence
MetaMemRL operationalizes the Complementary Learning Systems (CLS) theory (McClelland et al., 1995). Table 2 shows the correspondence between brain systems and our architecture:
| Brain System | MetaMemRL Component | Function |
|---|---|---|
| Hippocampus | Memory Bank ℳ | Fast binding, episodic storage |
| Neocortex | θ_adapter | Slow learning, semantic abstraction |
| Prefrontal Cortex | π_α (meta-controller) | Executive control, strategy selection |
| Dopamine System | Reward r | Learning signal for both loops |
| Sleep/Replay | Consolidation pass | Offline transfer of patterns |
Stability Analysis
Under the two-timescale step-size conditions of Appendix A:
- The inner Q-values satisfy Bellman optimality for any fixed \(\alpha\)
- The outer policy converges to a local optimum of the meta-objective
Proof sketch: The two-timescale stochastic approximation framework (Borkar, 2008) applies since the inner loop (Q-updates) converges faster than the outer loop (policy updates). The inner loop inherits Bellman contraction from MemRL. The outer loop is a standard policy gradient with bounded variance due to the trust modulation. Full proof in Appendix A.
A consequence is that even imperfect anomaly detection provides meaningful protection: as long as the trust signal slides the system toward retrieval, reliance on potentially poisoned consolidated knowledge is reduced.
Experimental Design
We propose evaluation across three dimensions: adaptive efficiency, adversarial robustness, and biological plausibility.
Benchmarks
Mixed-Volatility Environment
Tasks are drawn from a distribution with varying information stability. Some domains have stable facts (geography, mathematics) while others change frequently (news, social media). The optimal strategy requires dynamic allocation.
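The mixed-volatility setup can be sketched as a simple task sampler. All names and parameters below are illustrative, not a specification of the benchmark:

```python
import random

def sample_episode(n_tasks=10, p_volatile=0.5, flip_prob=0.3, seed=0):
    """Sketch of the mixed-volatility benchmark: each task is either stable
    (its answer never changes, e.g. geography) or volatile (its answer may
    flip between episodes, e.g. news). p_volatile and flip_prob are
    hypothetical knobs controlling the distribution."""
    rng = random.Random(seed)
    tasks = []
    for i in range(n_tasks):
        volatile = rng.random() < p_volatile
        tasks.append({
            "id": i,
            "volatile": volatile,
            # Only volatile facts may flip, with probability flip_prob.
            "flips": volatile and rng.random() < flip_prob,
        })
    return tasks
```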
Adversarial Memory Injection
Following the threat model of Zou et al. (2023), we inject poisoned memories that appear semantically relevant but contain misleading information. We measure both task performance degradation and recovery time.
Continual Learning Benchmark
Sequential task learning where information from earlier tasks may become obsolete or contradict later information. We measure forward transfer, backward transfer, and forgetting.
Baselines
- Fixed-RAG: Pure retrieval with no consolidation (α = 0)
- Fixed-MemRL: Standard MemRL with Q-weighted retrieval (α = 0.5)
- Fixed-Consolidate: Aggressive fine-tuning with minimal retrieval (α = 0.9)
- Oracle-α: Optimal α selected with hindsight (upper bound)
- MetaMemRL (ours): Learned dynamic allocation
- MetaMemRL + Humility: With epistemic humility mechanism
Hypotheses
Discussion
Implications for AI Safety
The epistemic humility mechanism represents a novel approach to adversarial robustness. Rather than trying to detect specific attacks—an arms race that defenders typically lose—we endow the system with a general-purpose "immune response" that activates whenever something seems wrong. This mirrors biological immune systems, which don't need to know every pathogen in advance but instead detect anomalies and mount responses.
Limitations
Several limitations warrant discussion:
- Computational overhead: The meta-controller adds inference cost, though this is small relative to LLM forward passes.
- Trust manipulation: Sophisticated adversaries might attempt to manipulate the trust score itself. We discuss defenses in Appendix B.
- Hyperparameter sensitivity: The reward weightings \(\lambda_i\) require tuning per domain.
Future Directions
- Multi-scale allocation: Different layers or modules could have independent α values.
- Sleep-like consolidation: Periodic offline consolidation phases rather than continuous updates.
- Social learning: Multiple agents sharing trust signals about memory quality.
Scaling Implications: Toward General Intelligence
A natural question arises: what happens as this architecture scales? We argue that MetaMemRL captures computational primitives that map directly onto the functional components identified by Complementary Learning Systems theory, and that scaling along spatial and temporal dimensions may yield increasingly general cognitive capabilities.
Mapping to Neural Substrates
The correspondences between MetaMemRL components and biological neural systems are striking:
| MetaMemRL Component | Biological Analogue | Functional Role |
|---|---|---|
| Memory bank \(\mathcal{M}\) | Hippocampus | Fast episodic binding, pattern separation |
| Adapter parameters \(\theta_{\text{adapter}}\) | Neocortex | Slow statistical learning, consolidation |
| Meta-controller \(\pi_\alpha\) | Prefrontal cortex | Executive control, strategy selection |
| Q-learning over memories | Basal ganglia | Action selection, reward-based learning |
| Trust score \(\psi\) | Neuromodulatory systems (ACh, NE) | Gain control, uncertainty signaling |
| LLM forward passes | Cortical hierarchies | Pattern completion, representation |
| Temporal iterations | Cerebellar circuits | Prediction, timing, refinement |
Scaling Dimensions
We identify two orthogonal scaling dimensions that parallel biological neural development:
Spatial scaling (width/depth): Increasing the capacity of the LLM backbone, memory bank size \(|\mathcal{M}|\), and meta-controller complexity corresponds to expanding representational capacity. In biological terms, this parallels the expansion of cortical surface area and neuronal count across species. Larger models can represent more complex patterns, maintain more memories, and learn more nuanced allocation policies.
Temporal scaling (iterations/experience): Extending training duration and interaction history enables refinement of Q-values, consolidation of stable knowledge into parameters, and calibration of the trust mechanism. This parallels developmental learning and the role of sleep in memory consolidation. Crucially, the cerebellar analogue—iterative prediction and correction—operates along this dimension.
The Role of Epistemic Humility at Scale
The trust modulation mechanism \(\psi\) becomes increasingly important at scale. As systems become more capable, the failure modes of overconfident cached knowledge become more consequential. A scaled MetaMemRL system that maintains appropriate epistemic humility—falling back to information-seeking when uncertain—would exhibit a key property of robust general intelligence: knowing what it doesn't know.
This suggests that the epistemic humility mechanism is not merely a safety feature but a core architectural requirement for general intelligence. Systems that cannot modulate their reliance on cached knowledge based on confidence will exhibit brittle behavior in novel domains—precisely the failure mode that distinguishes narrow AI from general intelligence.
Limitations of the Scaling Argument
We acknowledge significant uncertainty in extrapolating from the current framework to general intelligence:
- Embodiment: Grounded sensorimotor experience may be necessary for certain forms of understanding that cannot emerge from text-based interaction alone.
- Developmental processes: The structured curriculum of human development—from sensorimotor to formal operational stages—may be difficult to replicate through standard training procedures.
- Unknown unknowns: There may be computational primitives necessary for general intelligence that neither biological analysis nor current AI research has identified.
- Emergent phenomena: It remains unclear whether general intelligence emerges smoothly with scale or requires phase transitions at specific capability thresholds.
Nevertheless, the architectural completeness suggested by the biological correspondences—and the absence of obvious missing components—motivates serious investigation of scaled MetaMemRL systems.
Conclusion
We presented MetaMemRL, a framework that learns to dynamically position agents on the memory-parameter spectrum. By introducing a meta-controller that adapts allocation strategy based on task demands and epistemic confidence, we achieve adaptive efficiency while maintaining robustness to adversarial attacks. The epistemic humility mechanism—automatically becoming more information-seeking when cached knowledge seems unreliable—provides a biologically-inspired approach to AI safety that doesn't require explicit attack detection.
The key insight is that the position on the fine-tuning ↔ RAG spectrum should not be a fixed architectural choice but a learned, dynamic quantity that responds to the agent's confidence in its own knowledge. Just as humans naturally become humble when they realize they've been wrong, AI systems can learn to do the same.
References
Appendix A: Convergence Proof Sketch
We provide a sketch of the convergence analysis for the two-timescale learning framework.
Setup: Let \(\beta_t\) be the inner loop learning rate and \(\eta_t\) the outer loop learning rate. We require the standard two-timescale step-size conditions:

\[ \sum_t \beta_t = \infty, \quad \sum_t \beta_t^2 < \infty, \quad \sum_t \eta_t = \infty, \quad \sum_t \eta_t^2 < \infty, \quad \frac{\eta_t}{\beta_t} \to 0. \]
Under these conditions, the inner loop converges to its fixed point \(Q^*(\phi)\) for any fixed \(\phi\), and the outer loop sees an approximately stationary inner loop. By the two-timescale stochastic approximation theorem (Borkar, 2008), the joint system converges to \((\phi^*, Q^*(\phi^*))\).
The key insight is that trust modulation \(\psi\) acts as a regularizer on the effective \(\alpha\), preventing the outer loop from making large changes when the system is uncertain. This additional stability term helps ensure convergence even under non-stationary memory distributions (e.g., during poisoning attacks).
Appendix B: Trust Manipulation Defenses
A sophisticated adversary might attempt to manipulate the trust score \(\psi\) rather than the memories directly. We consider several attack vectors and defenses:
Attack 1: Trust inflation. Adversary injects memories that artificially boost trust signals.
Defense: Trust signals are computed from task outcomes, not memory content. The adversary would need to improve actual task performance to inflate trust.
Attack 2: Trust oscillation. Adversary causes rapid trust fluctuations to prevent stable learning.
Defense: Trust updates are smoothed with exponential moving average: \(\psi_{t+1} = \gamma \psi_t + (1-\gamma) \hat{\psi}_t\).
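The smoothing defense from the text in code form, using the exponential moving average given above:

```python
def smooth_trust(psi_prev, psi_hat, gamma=0.9):
    """Exponentially smoothed trust update:
    psi_{t+1} = gamma * psi_t + (1 - gamma) * psi_hat_t.

    A high gamma (here a hypothetical default of 0.9) makes the trust
    score slow-moving, damping rapid adversarial oscillation of the
    instantaneous estimate psi_hat."""
    return gamma * psi_prev + (1.0 - gamma) * psi_hat
```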
Attack 3: Gradual poisoning. Adversary injects poison slowly