MetaMemRL: Dynamic Allocation Policy for
Adaptive Memory-Parameter Learning

Anonymous Authors1,2
1Anonymous Institution    2Anonymous Research Lab
Correspondence to: [email protected]
Abstract
We propose MetaMemRL, an extension to memory-augmented language agents that introduces a learned Dynamic Allocation Policy (DAP)—a meta-controller that adaptively positions the agent along the fine-tuning ↔ RAG spectrum in real time. Inspired by hippocampal-cortical memory consolidation in biological systems, our meta-controller learns to optimize where on this spectrum to operate based on task demands, information properties, and system state. Additionally, we introduce an Epistemic Humility Mechanism that automatically slides the system toward retrieval mode when it detects potential memory poisoning or model degradation—mirroring how humans become more information-seeking when they realize their cached beliefs may be wrong. Through a two-timescale learning framework, MetaMemRL achieves adaptive efficiency while maintaining robustness to adversarial memory attacks without requiring explicit poison detection. We provide theoretical grounding in Complementary Learning Systems theory and demonstrate the framework's biological plausibility through parallels with hippocampal-prefrontal interactions.

Introduction

The hallmark of human intelligence is the ability to fluidly balance between relying on learned intuitions and actively seeking new information. When navigating familiar territory, humans trust their consolidated knowledge; when facing novel situations or realizing their assumptions may be flawed, they become humble—listening more carefully, verifying more frequently, and suspending automatic responses. Current approaches to adapting Large Language Models (LLMs) operate at fixed points on what we term the memory-parameter spectrum: retrieval-augmented generation keeps all task knowledge external and non-parametric, fine-tuning folds it entirely into model parameters, and hybrid methods such as MemRL sit at a fixed intermediate point, weighting retrieval by learned Q-values.

We argue that the optimal position on this spectrum is not static—it should be a learned, dynamic quantity that responds to task demands, information properties, and, crucially, the system's confidence in its own knowledge. This paper makes the following contributions:

  1. A learned Dynamic Allocation Policy (DAP): a meta-controller that adaptively positions the agent along the fine-tuning ↔ RAG spectrum.
  2. An Epistemic Humility Mechanism that shifts the system toward retrieval when memory poisoning or model degradation is suspected.
  3. A two-timescale learning framework combining fast Q-value updates with slow policy-gradient updates, with a convergence analysis.
  4. Theoretical grounding in Complementary Learning Systems theory and an evaluation design spanning adaptive efficiency, adversarial robustness, and biological plausibility.

Figure 1: The memory-parameter spectrum. Current approaches (RAG, MemRL, fine-tuning) operate at fixed positions. MetaMemRL introduces a learned meta-controller \(\pi_\alpha\) that dynamically positions the agent based on task demands and epistemic confidence.

Related Work

Memory-Augmented Language Models

External memory systems for LLMs have evolved from simple retrieval-augmented generation (Lewis et al., 2020) to more sophisticated architectures. MemGPT (Packer et al., 2023) introduced hierarchical memory management, while MIRIX (Wang & Chen, 2025) developed multi-component memory architectures. Most relevant to our work, MemRL (Zhang et al., 2026) formalized memory retrieval as a Markov Decision Process and introduced Q-value based utility estimation for memory selection.

Continual Learning and Catastrophic Forgetting

The tension between learning new information and preserving existing knowledge—the stability-plasticity dilemma (Grossberg, 1982)—has driven extensive research in continual learning. Parameter isolation methods (Rusu et al., 2016), regularization approaches (Kirkpatrick et al., 2017), and replay mechanisms (Shin et al., 2017) each address this trade-off differently. Our approach offers a novel perspective: rather than choosing a single mechanism, we learn when to apply each.

Meta-Learning and Learning to Learn

Meta-reinforcement learning (Duan et al., 2016; Wang et al., 2016) trains agents that can rapidly adapt to new tasks. RL² demonstrated that recurrent networks could implement learning algorithms within their activations. Our meta-controller extends this paradigm to memory-parameter allocation, learning not just what to remember but how to remember it.

Complementary Learning Systems

McClelland et al. (1995) proposed that the brain maintains two learning systems: a fast hippocampal system for episodic binding and a slow neocortical system for gradual knowledge integration. This theory directly inspires our architecture, with the meta-controller playing the role of prefrontal arbitration between systems.

Method

Problem Formulation

We consider an agent with a frozen LLM backbone \(\theta_{\text{frozen}}\), plastic adapter parameters \(\theta_{\text{adapter}}\), and an external memory bank \(\mathcal{M}\). At each timestep \(t\), the agent receives query \(x_t\) and must produce response \(y_t\). The key question is: should the agent rely on retrieved memories (RAG-like), Q-weighted retrieval (MemRL), or consolidate knowledge into parameters (fine-tuning-like)?

Definition 1 (Allocation Coefficient)
The allocation coefficient \(\alpha \in [0,1]\) determines the agent's position on the memory-parameter spectrum:
\[\alpha = 0 \implies \text{Pure RAG (semantic retrieval only)}\] \[\alpha = 1 \implies \text{Pure consolidation (gradient updates to adapters)}\] \[0 < \alpha < 1 \implies \text{Hybrid MemRL regime}\]

Meta-Controller Architecture

The meta-controller \(\pi_\alpha\) is a learned policy that observes the current state and outputs the optimal allocation coefficient. The state representation captures task demands and system status:

\[s_t = \left[ e(x_t), \rho_t, h_t, \sigma_t, C_t, \tau_t \right]\] (1)

where:

The meta-controller outputs:

\[\alpha_t = \pi_\alpha(s_t; \phi) \in [0, 1]\] (2)

where \(\phi\) are the meta-controller's learnable parameters.
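A minimal sketch of how \(\pi_\alpha\) (Equations 1–2) might be realized. The two-layer network, hidden size, and NumPy parameterization are illustrative choices not fixed by the paper; only the sigmoid output, which keeps \(\alpha \in [0, 1]\), is dictated by the formulation:

```python
import numpy as np

class MetaController:
    """Illustrative allocation policy pi_alpha (Eq. 2): a small MLP
    mapping the state vector s_t to alpha in [0, 1]."""

    def __init__(self, state_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (hidden_dim, state_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, hidden_dim)
        self.b2 = 0.0

    def alpha(self, s):
        h = np.tanh(self.W1 @ s + self.b1)    # hidden features
        logit = self.w2 @ h + self.b2
        return 1.0 / (1.0 + np.exp(-logit))   # sigmoid squashes to [0, 1]

# Six-dimensional state matching s_t = [e(x_t), rho, h, sigma, C, tau]
ctrl = MetaController(state_dim=6)
a = ctrl.alpha(np.array([0.2, 0.5, 0.1, 0.9, 0.3, 0.7]))
```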

Figure 2: MetaMemRL architecture overview. The meta-controller \(\pi_\alpha\) (analogous to prefrontal cortex) observes system state and outputs allocation coefficient \(\alpha\), modulated by trust score \(\psi\) (neuromodulatory signal). The coefficient determines the blend of retrieval modes along the memory-parameter spectrum. Two learning loops operate at different timescales: fast Q-value updates (inner loop, basal ganglia analogue) and slow policy updates (outer loop, prefrontal learning). Annotations indicate correspondences to biological neural systems from Complementary Learning Systems theory.

Two-Timescale Learning

Inner Loop: MemRL Q-Updates (Fast)

Following Zhang et al. (2026), we maintain a Q-value for each memory that estimates its utility. After each interaction, we update:

\[Q(m) \leftarrow Q(m) + \beta \left[ r_t + \gamma \max_{m'} Q(m') - Q(m) \right]\] (3)

where \(\beta\) is the learning rate and \(\gamma\) is the discount factor.
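Equation (3) is a standard tabular TD update; over a dictionary-backed memory bank it reduces to a few lines. The memory identifiers and default hyperparameters below are illustrative:

```python
def update_q(q, m, r, next_qs, beta=0.1, gamma=0.9):
    """One tabular update of Eq. (3) for memory m.

    q       : dict mapping memory id -> current Q-value
    r       : task reward r_t from the last interaction
    next_qs : Q-values of candidate memories at the next step
    """
    bootstrap = gamma * max(next_qs) if next_qs else 0.0
    q[m] += beta * (r + bootstrap - q[m])  # TD error scaled by beta

q = {"m1": 0.0}
update_q(q, "m1", r=1.0, next_qs=[0.5])
```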

Outer Loop: Policy Gradient (Slow)

The meta-controller is updated via policy gradient to maximize expected return:

\[\nabla_\phi J(\phi) = \mathbb{E}_{\pi_\alpha}\left[ \nabla_\phi \log \pi_\alpha(\alpha | s; \phi) \cdot A(s, \alpha) \right]\] (4)

where the advantage \(A(s, \alpha)\) incorporates multiple objectives (detailed in Section 3.5).
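Equation (4) instantiates REINFORCE. A minimal sketch, assuming a Gaussian policy over \(\alpha\) with a linear mean (one possible policy class; the paper leaves the policy parameterization unspecified):

```python
import numpy as np

def reinforce_step(phi, s, alpha, mu, advantage, eta=0.01, sigma=0.1):
    """One policy-gradient update of Eq. (4), assuming
    alpha ~ N(mu, sigma^2) with linear mean mu = phi . s."""
    # grad_phi log N(alpha; mu, sigma^2) = ((alpha - mu) / sigma^2) * s
    grad_log_pi = ((alpha - mu) / sigma ** 2) * s
    return phi + eta * advantage * grad_log_pi  # ascend the meta-objective

phi = np.zeros(3)
s = np.array([1.0, 0.5, -0.2])
mu = phi @ s                      # mean allocation under current policy
phi = reinforce_step(phi, s, alpha=0.3, mu=mu, advantage=2.0)
```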

Blended Operation

Given \(\alpha_t\), the system computes output through weighted paths:

\[y_t = (1 - \alpha_t) \cdot \text{LLM}(x_t, \text{Retrieve}_{\text{semantic}}(x_t, \mathcal{M})) + \alpha_t \cdot \text{LLM}(x_t, \text{Retrieve}_{Q}(x_t, \mathcal{M}))\] (5)

When \(\alpha_t > \tau_{\text{consolidate}}\), we additionally update adapter parameters:

\[\theta_{\text{adapter}} \leftarrow \theta_{\text{adapter}} - \eta_c \nabla_\theta \mathcal{L}_{\text{consolidate}}(\mathcal{M}_{\text{high-utility}})\] (6)
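Equations (5)–(6) can be rendered as a single dispatch step. Since blending two text outputs is not mechanically specified, the toy `llm` below returns a scalar score, and the retrievers and `consolidate` hook are stand-in callables rather than a real API:

```python
def blended_step(alpha, x, memory, llm, consolidate, tau_consolidate=0.8):
    """Sketch of Eqs. (5)-(6): alpha-weighted blend of the two
    retrieval-conditioned calls, plus a consolidation trigger."""
    semantic_ctx = [m for m in memory]                          # semantic retrieval stub
    q_ctx = sorted(memory, key=lambda m: -m["q"])[:1]           # top-Q retrieval stub
    y = (1 - alpha) * llm(x, semantic_ctx) + alpha * llm(x, q_ctx)  # Eq. (5)
    if alpha > tau_consolidate:
        consolidate(memory)                                     # Eq. (6): adapter update
    return y

memory = [{"q": 0.9}, {"q": 0.1}]
llm = lambda x, ctx: float(len(ctx))   # toy scorer standing in for the LLM
consolidations = []
y = blended_step(0.5, "query", memory, llm, lambda m: consolidations.append(True))
```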

Composite Reward Structure

The meta-controller optimizes a composite reward that balances multiple objectives:

\[R_{\text{meta}} = r_{\text{task}} - \lambda_1 \cdot c_{\text{compute}}(\alpha) - \lambda_2 \cdot p_{\text{overflow}} + \lambda_3 \cdot b_{\text{transfer}} - \lambda_4 \cdot p_{\text{forget}}\] (7)

where \(r_{\text{task}}\) is the task reward, \(c_{\text{compute}}(\alpha)\) the compute cost of operating at allocation \(\alpha\), \(p_{\text{overflow}}\) a penalty for memory-bank overflow, \(b_{\text{transfer}}\) a bonus for successful knowledge transfer, and \(p_{\text{forget}}\) a penalty for catastrophic forgetting; the \(\lambda_i\) weight the competing objectives.
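Assembling Equation (7) is direct; the default \(\lambda\) weights below are illustrative, not tuned values from the paper:

```python
def meta_reward(r_task, c_compute, p_overflow, b_transfer, p_forget,
                lam=(0.1, 0.1, 0.05, 0.2)):
    """Composite reward of Eq. (7) with illustrative lambda weights."""
    l1, l2, l3, l4 = lam
    return (r_task
            - l1 * c_compute     # penalize expensive allocations
            - l2 * p_overflow    # penalize memory-bank overflow
            + l3 * b_transfer    # reward knowledge transfer
            - l4 * p_forget)     # penalize catastrophic forgetting

r = meta_reward(1.0, 0.5, 0.0, 0.2, 0.1)
```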

Epistemic Humility Mechanism

A key vulnerability of memory-augmented systems is memory poisoning—adversarial or erroneous entries that corrupt system behavior. Traditional defenses require explicit detection of poisoned content. We propose a more elegant solution inspired by human cognition: learned epistemic humility.

The Biological Insight

When humans realize they've been operating on false beliefs, they exhibit characteristic behavioral changes:

  1. Decreased confidence in cached intuitions
  2. Increased attention to new information
  3. More frequent verification of assumptions
  4. Temporary suspension of automatic responses

This is adaptive: when your internal model is wrong, relying on it compounds errors. Better to fall back to careful observation (retrieval) until accurate models are rebuilt.

Trust Score Computation

We introduce a trust score \(\psi_t \in [0, 1]\) that modulates the allocation coefficient:

\[\alpha_{\text{effective}} = \pi_\alpha(s_t) \times \psi_t\] (8)

The trust score is computed from multiple anomaly signals:

\[\psi_t = f\left( \text{reward\_collapse}_t, \text{conflict\_rate}_t, \text{calibration\_error}_t, \text{ood\_score}_t \right)\] (9)
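The paper leaves the aggregation function \(f\) in Equation (9) unspecified; one simple instantiation decays trust exponentially in a weighted sum of the anomaly signals (assumed normalized to \([0, 1]\)), which keeps \(\psi \in (0, 1]\):

```python
import math

def trust_score(reward_collapse, conflict_rate, calibration_error, ood_score,
                weights=(1.0, 1.0, 1.0, 1.0)):
    """One possible f for Eq. (9): exponential decay in the weighted
    anomaly total, so psi -> 1 when no anomalies are present."""
    signals = (reward_collapse, conflict_rate, calibration_error, ood_score)
    z = sum(w * s for w, s in zip(weights, signals))
    return math.exp(-z)

def effective_alpha(alpha, psi):
    """Eq. (8): trust multiplicatively modulates the allocation."""
    return alpha * psi
```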
Figure 3: The Trust Monitor computes epistemic confidence \(\psi\) from anomaly signals. Low trust automatically shifts the system toward retrieval mode (epistemic humility), while high trust permits aggressive consolidation.

Humility Gradient

We define a continuous mapping from trust levels to behavioral adaptations:

Table 1: Humility gradient mapping trust scores to system behavior
Trust Level α Range Behavior Human Analogy
0.0 – 0.2 0.0 – 0.1 Pure retrieval, verify everything "I was completely wrong, starting fresh"
0.2 – 0.4 0.1 – 0.3 Heavy retrieval, minimal cached beliefs "I'm being very careful now"
0.4 – 0.6 0.3 – 0.5 Balanced, cross-check important claims "Trust but verify"
0.6 – 0.8 0.5 – 0.7 Mostly trust consolidation, spot-check "Confident but not cocky"
0.8 – 1.0 0.7 – 1.0 Full trust, aggressive consolidation "I know this domain well"
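The bands of Table 1 can be read as a piecewise-constant ceiling on \(\alpha\). The band edges below follow the table; exact boundary behavior is an interpretive choice, since the paper does not pin it down:

```python
def alpha_ceiling(psi):
    """Map trust psi to the upper end of the allowed alpha range,
    per the humility gradient of Table 1 (boundary handling assumed)."""
    bands = [(0.2, 0.1),   # "completely wrong, starting fresh"
             (0.4, 0.3),   # "being very careful now"
             (0.6, 0.5),   # "trust but verify"
             (0.8, 0.7),   # "confident but not cocky"
             (1.0, 1.0)]   # "I know this domain well"
    for psi_hi, a_hi in bands:
        if psi <= psi_hi:
            return a_hi
    return 1.0
```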

Self-Healing Protocol

When trust drops below a threshold \(\tau_{\text{alert}}\), the system initiates a recovery protocol:

Algorithm 1: Epistemic Recovery Protocol
Input: Memory bank ℳ, trust score ψ, threshold τ_alert
Output: Healed memory bank ℳ'
 1: if ψ < τ_alert then
 2:     // QUARANTINE: Flag suspect memories
 3:     S ← {m ∈ ℳ : recently_used(m) ∧ high_Q(m)}
 4:     for m ∈ S do
 5:         m.quarantined ← True
 6:     end for
 7:
 8:     // VERIFY: Cross-check against fresh retrieval
 9:     for m ∈ S do
10:         evidence ← Retrieve_fresh(m.context)
11:         if contradicts(evidence, m.content) then
12:             m.trust_weight ← m.trust_weight × decay_factor
13:         else
14:             m.trust_weight ← min(1, m.trust_weight × boost_factor)
15:             m.quarantined ← False
16:         end if
17:     end for
18:
19:     // PRUNE: Remove consistently contradicted memories
20:     ℳ' ← {m ∈ ℳ : m.trust_weight > τ_prune}
21:
22:     // RECALIBRATE: Gradually restore trust ceiling
23:     α_max ← α_max + recovery_rate × (1 − α_max)
24: end if
25: return ℳ'
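Algorithm 1 admits a compact Python rendering. The `retrieve_fresh` and `contradicts` callables are hypothetical hooks onto the retrieval stack, memories are plain dictionaries, the high-Q threshold is illustrative, and the RECALIBRATE step on \(\alpha_{\max}\) is omitted for brevity:

```python
def epistemic_recovery(memory, psi, retrieve_fresh, contradicts,
                       tau_alert=0.3, tau_prune=0.2,
                       decay_factor=0.5, boost_factor=1.1):
    """Sketch of Algorithm 1 (quarantine -> verify -> prune)."""
    if psi >= tau_alert:
        return memory
    # QUARANTINE: recently used, high-Q memories are the suspects
    suspects = [m for m in memory if m["recently_used"] and m["q"] > 0.5]
    for m in suspects:
        m["quarantined"] = True
        # VERIFY: cross-check the suspect against fresh retrieval
        evidence = retrieve_fresh(m["context"])
        if contradicts(evidence, m["content"]):
            m["trust_weight"] *= decay_factor
        else:
            m["trust_weight"] = min(1.0, m["trust_weight"] * boost_factor)
            m["quarantined"] = False
    # PRUNE: drop consistently contradicted memories
    return [m for m in memory if m["trust_weight"] > tau_prune]

memory = [
    {"recently_used": True, "q": 0.9, "context": "c1",
     "content": "bad", "trust_weight": 0.3, "quarantined": False},
    {"recently_used": True, "q": 0.9, "context": "c2",
     "content": "good", "trust_weight": 0.9, "quarantined": False},
]
healed = epistemic_recovery(memory, psi=0.1,
                            retrieve_fresh=lambda ctx: "evidence",
                            contradicts=lambda ev, content: content == "bad")
```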

Theoretical Analysis

Complementary Learning Systems Correspondence

MetaMemRL operationalizes the Complementary Learning Systems (CLS) theory (McClelland et al., 1995). Table 2 shows the correspondence between brain systems and our architecture:

Table 2: Correspondence between brain systems and MetaMemRL components
Brain System MetaMemRL Component Function
Hippocampus Memory Bank ℳ Fast binding, episodic storage
Neocortex θ_adapter Slow learning, semantic abstraction
Prefrontal Cortex π_α (meta-controller) Executive control, strategy selection
Dopamine System Reward r Learning signal for both loops
Sleep/Replay Consolidation pass Offline transfer of patterns

Stability Analysis

Theorem 1 (Convergence of Two-Timescale Learning)
Under standard assumptions on learning rates (\(\beta \gg \eta\)), bounded rewards, and Lipschitz continuous policies, MetaMemRL converges to a stationary point \((\phi^*, Q^*)\) where:
  1. Inner Q-values satisfy Bellman optimality for fixed \(\alpha\)
  2. Outer policy achieves local optimum of the meta-objective

Proof sketch: The two-timescale stochastic approximation framework (Borkar, 2008) applies since the inner loop (Q-updates) converges faster than the outer loop (policy updates). The inner loop inherits Bellman contraction from MemRL. The outer loop is a standard policy gradient with bounded variance due to the trust modulation. Full proof in Appendix A.

Theorem 2 (Robustness via Humility)
Let \(\epsilon\) be the fraction of poisoned memories. If the trust monitor detects anomalies with probability \(p_{\text{detect}}\) and the recovery protocol removes poisoned entries with probability \(p_{\text{remove}}\), then the expected damage from poisoning is bounded by: \[D(\epsilon) \leq \epsilon \cdot (1 - p_{\text{detect}} \cdot p_{\text{remove}}) \cdot \max_m |Q(m)|\]

This shows that even imperfect detection provides meaningful protection, as long as the system slides toward retrieval (reducing reliance on potentially poisoned consolidation).
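The bound is easy to check on concrete numbers; for instance, with 10% poisoned memories, 80% detection probability, and 90% removal probability, expected damage is capped at 2.8% of the largest memory utility:

```python
def poison_damage_bound(eps, p_detect, p_remove, q_max):
    """Worked instance of the Theorem 2 bound:
    D(eps) <= eps * (1 - p_detect * p_remove) * max_m |Q(m)|."""
    return eps * (1 - p_detect * p_remove) * q_max

bound = poison_damage_bound(eps=0.10, p_detect=0.8, p_remove=0.9, q_max=1.0)
```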


Experimental Design

We propose evaluation across three dimensions: adaptive efficiency, adversarial robustness, and biological plausibility.

Benchmarks

Mixed-Volatility Environment

Tasks are drawn from a distribution with varying information stability. Some domains have stable facts (geography, mathematics) while others change frequently (news, social media). The optimal strategy requires dynamic allocation.

Adversarial Memory Injection

Following the threat model of Zou et al. (2023), we inject poisoned memories that appear semantically relevant but contain misleading information. We measure both task performance degradation and recovery time.

Continual Learning Benchmark

Sequential task learning where information from earlier tasks may become obsolete or contradict later information. We measure forward transfer, backward transfer, and forgetting.

Baselines

Hypotheses

H1: Adaptive Efficiency
MetaMemRL outperforms fixed-α baselines across mixed-volatility benchmarks by learning to match allocation strategy to information properties.
H2: Adversarial Robustness
MetaMemRL + Humility recovers faster from memory poisoning attacks than systems without trust monitoring, without requiring explicit poison detection.
H3: Biological Correspondence
Learned α trajectories correlate with human behavioral patterns in analogous memory consolidation experiments.
Figure 4: Expected experimental results. (a) MetaMemRL maintains high performance across volatility levels by adapting allocation strategy, while fixed approaches excel only in their preferred regime. (b) The epistemic humility mechanism enables faster recovery from memory poisoning by automatically shifting to retrieval mode when anomalies are detected.

Discussion

Implications for AI Safety

The epistemic humility mechanism represents a novel approach to adversarial robustness. Rather than trying to detect specific attacks—an arms race that defenders typically lose—we endow the system with a general-purpose "immune response" that activates whenever something seems wrong. This mirrors biological immune systems, which don't need to know every pathogen in advance but instead detect anomalies and mount responses.

Limitations

Several limitations warrant discussion:

Future Directions

Scaling Implications: Toward General Intelligence

A natural question arises: what happens as this architecture scales? We argue that MetaMemRL captures computational primitives that map directly onto the functional components identified by Complementary Learning Systems theory, and that scaling along spatial and temporal dimensions may yield increasingly general cognitive capabilities.

Mapping to Neural Substrates

The correspondences between MetaMemRL components and biological neural systems are striking:

Table 3: Extended correspondence between MetaMemRL components and biological neural substrates
MetaMemRL Component Biological Analogue Functional Role
Memory bank \(\mathcal{M}\) Hippocampus Fast episodic binding, pattern separation
Adapter parameters \(\theta_{\text{adapter}}\) Neocortex Slow statistical learning, consolidation
Meta-controller \(\pi_\alpha\) Prefrontal cortex Executive control, strategy selection
Q-learning over memories Basal ganglia Action selection, reward-based learning
Trust score \(\psi\) Neuromodulatory systems (ACh, NE) Gain control, uncertainty signaling
LLM forward passes Cortical hierarchies Pattern completion, representation
Temporal iterations Cerebellar circuits Prediction, timing, refinement

Scaling Dimensions

We identify two orthogonal scaling dimensions that parallel biological neural development:

Spatial scaling (width/depth): Increasing the capacity of the LLM backbone, memory bank size \(|\mathcal{M}|\), and meta-controller complexity corresponds to expanding representational capacity. In biological terms, this parallels the expansion of cortical surface area and neuronal count across species. Larger models can represent more complex patterns, maintain more memories, and learn more nuanced allocation policies.

Temporal scaling (iterations/experience): Extending training duration and interaction history enables refinement of Q-values, consolidation of stable knowledge into parameters, and calibration of the trust mechanism. This parallels developmental learning and the role of sleep in memory consolidation. Crucially, the cerebellar analogue—iterative prediction and correction—operates along this dimension.

Scaling Hypothesis
If the MetaMemRL primitive captures the essential arbitration mechanism between fast episodic and slow parametric learning, then sufficient spatial scaling (model capacity) combined with sufficient temporal scaling (experience and iteration) should yield increasingly general cognitive capabilities—provided the system receives sufficiently rich and diverse environmental interaction.

The Role of Epistemic Humility at Scale

The trust modulation mechanism \(\psi\) becomes increasingly important at scale. As systems become more capable, the failure modes of overconfident cached knowledge become more consequential. A scaled MetaMemRL system that maintains appropriate epistemic humility—falling back to information-seeking when uncertain—would exhibit a key property of robust general intelligence: knowing what it doesn't know.

This suggests that the epistemic humility mechanism is not merely a safety feature but a core architectural requirement for general intelligence. Systems that cannot modulate their reliance on cached knowledge based on confidence will exhibit brittle behavior in novel domains—precisely the failure mode that distinguishes narrow AI from general intelligence.

Limitations of the Scaling Argument

We acknowledge significant uncertainty in extrapolating from the current framework to general intelligence:

Nevertheless, the architectural completeness suggested by the biological correspondences—and the absence of obvious missing components—motivates serious investigation of scaled MetaMemRL systems.

Conclusion

We presented MetaMemRL, a framework that learns to dynamically position agents on the memory-parameter spectrum. By introducing a meta-controller that adapts allocation strategy based on task demands and epistemic confidence, we achieve adaptive efficiency while maintaining robustness to adversarial attacks. The epistemic humility mechanism—automatically becoming more information-seeking when cached knowledge seems unreliable—provides a biologically inspired approach to AI safety that does not require explicit attack detection.

The key insight is that the position on the fine-tuning ↔ RAG spectrum should not be a fixed architectural choice but a learned, dynamic quantity that responds to the agent's confidence in its own knowledge. Just as humans naturally become humble when they realize they've been wrong, AI systems can learn to do the same.

References

Borkar, V. S. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge University Press.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016). RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
Grossberg, S. (1982). Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control. Springer.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., ... & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
Kumaran, D., Hassabis, D., & McClelland, J. L. (2016). What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7), 512-534.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS.
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419.
Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560.
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., ... & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671.
Shin, H., Lee, J. K., Kim, J., & Kim, J. (2017). Continual learning with deep generative replay. NeurIPS.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., ... & Botvinick, M. (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
Wang, Y., & Chen, X. (2025). MIRIX: Multi-component memory architecture for LLM agents. arXiv preprint.
Zhang, S., et al. (2026). MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192.
Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Appendix A: Convergence Proof Sketch

We provide a sketch of the convergence analysis for the two-timescale learning framework.

Setup: Let \(\beta_t\) be the inner loop learning rate and \(\eta_t\) the outer loop learning rate. We require:

\[\sum_t \beta_t = \sum_t \eta_t = \infty, \quad \sum_t \beta_t^2 < \infty, \quad \sum_t \eta_t^2 < \infty, \quad \lim_{t \to \infty} \frac{\eta_t}{\beta_t} = 0\]

Under these conditions, the inner loop converges to its fixed point \(Q^*(\phi)\) for any fixed \(\phi\), and the outer loop sees an approximately stationary inner loop. By the two-timescale stochastic approximation theorem (Borkar, 2008), the joint system converges to \((\phi^*, Q^*(\phi^*))\).
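Power-law step sizes are one standard choice satisfying all four conditions: both exponents lie in \((0.5, 1]\) so the sums diverge while the squared sums converge, and the outer exponent is strictly larger so that \(\eta_t / \beta_t \to 0\). The specific exponents below are illustrative:

```python
def beta(t):
    """Inner-loop (fast) step size: sum diverges, squared sum converges."""
    return t ** -0.6

def eta(t):
    """Outer-loop (slow) step size: larger exponent makes eta/beta -> 0."""
    return t ** -0.9

# The ratio eta(t)/beta(t) = t^{-0.3} shrinks as t grows,
# separating the two timescales as required by Borkar (2008).
```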

The key insight is that trust modulation \(\psi\) acts as a regularizer on the effective \(\alpha\), preventing the outer loop from making large changes when the system is uncertain. This additional stability term helps ensure convergence even under non-stationary memory distributions (e.g., during poisoning attacks).

Appendix B: Trust Manipulation Defenses

A sophisticated adversary might attempt to manipulate the trust score \(\psi\) rather than the memories directly. We consider several attack vectors and defenses:

Attack 1: Trust inflation. Adversary injects memories that artificially boost trust signals.
Defense: Trust signals are computed from task outcomes, not memory content. The adversary would need to improve actual task performance to inflate trust.

Attack 2: Trust oscillation. Adversary causes rapid trust fluctuations to prevent stable learning.
Defense: Trust updates are smoothed with exponential moving average: \(\psi_{t+1} = \gamma \psi_t + (1-\gamma) \hat{\psi}_t\).

Attack 3: Gradual poisoning. Adversary injects poison slo