MetaMemRL: Dynamic Allocation Policy for
Adaptive Memory-Parameter Learning

Anonymous Authors1,2
1Anonymous Institution    2Anonymous Research Lab
Correspondence to: [email protected]
Abstract
We propose MetaMemRL, an extension to memory-augmented language agents that introduces a learned Dynamic Allocation Policy (DAP)—a meta-controller that adaptively positions the agent along the fine-tuning ↔ RAG spectrum in real time. Inspired by hippocampal-cortical memory consolidation in biological systems, our meta-controller learns to optimize where on this spectrum to operate based on task demands, information properties, and system state. Additionally, we introduce an Epistemic Humility Mechanism that automatically slides the system toward retrieval mode when it detects potential memory poisoning or model degradation—mirroring how humans become more information-seeking when they realize their cached beliefs may be wrong. Through a two-timescale learning framework, MetaMemRL achieves adaptive efficiency while maintaining robustness to adversarial memory attacks without requiring explicit poison detection. We provide theoretical grounding in Complementary Learning Systems theory and demonstrate the framework's biological plausibility through parallels with hippocampal-prefrontal interactions.

Introduction

The hallmark of human intelligence is the ability to fluidly balance between relying on learned intuitions and actively seeking new information. When navigating familiar territory, humans trust their consolidated knowledge; when facing novel situations or realizing their assumptions may be flawed, they become humble—listening more carefully, verifying more frequently, and suspending automatic responses. Current approaches to adapting Large Language Models (LLMs) operate at fixed points on what we term the memory-parameter spectrum: retrieval-augmented generation keeps all task knowledge external and non-parametric, fine-tuning folds it entirely into model parameters, and hybrid methods such as MemRL sit at a fixed intermediate point, weighting retrieval by learned Q-values.

We argue that the optimal position on this spectrum is not static—it should be a learned, dynamic quantity that responds to task demands, information properties, and, crucially, the system's confidence in its own knowledge. This paper makes the following contributions:

  1. A learned Dynamic Allocation Policy (DAP): a meta-controller that adaptively positions the agent along the fine-tuning ↔ RAG spectrum.
  2. An Epistemic Humility Mechanism that shifts the system toward retrieval when memory poisoning or model degradation is suspected.
  3. A two-timescale learning framework combining fast Q-value updates with slow policy-gradient updates, with a convergence analysis.
  4. Theoretical grounding in Complementary Learning Systems theory and an evaluation design spanning adaptive efficiency, adversarial robustness, and biological plausibility.

Figure 1: The memory-parameter spectrum. Current approaches (RAG, MemRL, fine-tuning) operate at fixed positions. MetaMemRL introduces a learned meta-controller \(\pi_\alpha\) that dynamically positions the agent based on task demands and epistemic confidence.

Related Work

Memory-Augmented Language Models

External memory systems for LLMs have evolved from simple retrieval-augmented generation (Lewis et al., 2020) to more sophisticated architectures. MemGPT (Packer et al., 2023) introduced hierarchical memory management, while MIRIX (Wang & Chen, 2025) developed multi-component memory architectures. Most relevant to our work, MemRL (Zhang et al., 2026) formalized memory retrieval as a Markov Decision Process and introduced Q-value based utility estimation for memory selection.

Continual Learning and Catastrophic Forgetting

The tension between learning new information and preserving existing knowledge—the stability-plasticity dilemma (Grossberg, 1982)—has driven extensive research in continual learning. Parameter isolation methods (Rusu et al., 2016), regularization approaches (Kirkpatrick et al., 2017), and replay mechanisms (Shin et al., 2017) each address this trade-off differently. Our approach offers a novel perspective: rather than choosing a single mechanism, we learn when to apply each.

Meta-Learning and Learning to Learn

Meta-reinforcement learning (Duan et al., 2016; Wang et al., 2016) trains agents that can rapidly adapt to new tasks. RL² demonstrated that recurrent networks could implement learning algorithms within their activations. Our meta-controller extends this paradigm to memory-parameter allocation, learning not just what to remember but how to remember it.

Complementary Learning Systems

McClelland et al. (1995) proposed that the brain maintains two learning systems: a fast hippocampal system for episodic binding and a slow neocortical system for gradual knowledge integration. This theory directly inspires our architecture, with the meta-controller playing the role of prefrontal arbitration between systems.

Method

Problem Formulation

We consider an agent with a frozen LLM backbone \(\theta_{\text{frozen}}\), plastic adapter parameters \(\theta_{\text{adapter}}\), and an external memory bank \(\mathcal{M}\). At each timestep \(t\), the agent receives query \(x_t\) and must produce response \(y_t\). The key question is: should the agent rely on retrieved memories (RAG-like), Q-weighted retrieval (MemRL), or consolidate knowledge into parameters (fine-tuning-like)?

Definition 1 (Allocation Coefficient)
The allocation coefficient \(\alpha \in [0,1]\) determines the agent's position on the memory-parameter spectrum:
\[\alpha = 0 \implies \text{Pure RAG (semantic retrieval only)}\] \[\alpha = 1 \implies \text{Pure consolidation (gradient updates to adapters)}\] \[0 < \alpha < 1 \implies \text{Hybrid MemRL regime}\]

Meta-Controller Architecture

The meta-controller \(\pi_\alpha\) is a learned policy that observes the current state and outputs the optimal allocation coefficient. The state representation captures task demands and system status:

\[s_t = \left[ e(x_t), \rho_t, h_t, \sigma_t, C_t, \tau_t \right]\] (1)

where:

The meta-controller outputs:

\[\alpha_t = \pi_\alpha(s_t; \phi) \in [0, 1]\] (2)

where \(\phi\) are the meta-controller's learnable parameters.
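A minimal sketch of how \(\pi_\alpha\) (Equations 1–2) might be realized. The two-layer network, hidden size, and NumPy parameterization are illustrative choices not fixed by the paper; only the sigmoid output, which keeps \(\alpha \in [0, 1]\), is dictated by the formulation:

```python
import numpy as np

class MetaController:
    """Illustrative allocation policy pi_alpha (Eq. 2): a small MLP
    mapping the state vector s_t to alpha in [0, 1]."""

    def __init__(self, state_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (hidden_dim, state_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, hidden_dim)
        self.b2 = 0.0

    def alpha(self, s):
        h = np.tanh(self.W1 @ s + self.b1)    # hidden features
        logit = self.w2 @ h + self.b2
        return 1.0 / (1.0 + np.exp(-logit))   # sigmoid squashes to [0, 1]

# Six-dimensional state matching s_t = [e(x_t), rho, h, sigma, C, tau]
ctrl = MetaController(state_dim=6)
a = ctrl.alpha(np.array([0.2, 0.5, 0.1, 0.9, 0.3, 0.7]))
```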

Figure 2: MetaMemRL architecture overview. The meta-controller \(\pi_\alpha\) (analogous to prefrontal cortex) observes system state and outputs allocation coefficient \(\alpha\), modulated by trust score \(\psi\) (neuromodulatory signal). The coefficient determines the blend of retrieval modes along the memory-parameter spectrum. Two learning loops operate at different timescales: fast Q-value updates (inner loop, basal ganglia analogue) and slow policy updates (outer loop, prefrontal learning). Annotations indicate correspondences to biological neural systems from Complementary Learning Systems theory.

Two-Timescale Learning

Inner Loop: MemRL Q-Updates (Fast)

Following Zhang et al. (2026), we maintain a Q-value for each memory that estimates its utility. After each interaction, we update:

\[Q(m) \leftarrow Q(m) + \beta \left[ r_t + \gamma \max_{m'} Q(m') - Q(m) \right]\] (3)

where \(\beta\) is the learning rate and \(\gamma\) is the discount factor.
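Equation (3) is a standard tabular TD update; over a dictionary-backed memory bank it reduces to a few lines. The memory identifiers and default hyperparameters below are illustrative:

```python
def update_q(q, m, r, next_qs, beta=0.1, gamma=0.9):
    """One tabular update of Eq. (3) for memory m.

    q       : dict mapping memory id -> current Q-value
    r       : task reward r_t from the last interaction
    next_qs : Q-values of candidate memories at the next step
    """
    bootstrap = gamma * max(next_qs) if next_qs else 0.0
    q[m] += beta * (r + bootstrap - q[m])  # TD error scaled by beta

q = {"m1": 0.0}
update_q(q, "m1", r=1.0, next_qs=[0.5])
```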

Outer Loop: Policy Gradient (Slow)

The meta-controller is updated via policy gradient to maximize expected return:

\[\nabla_\phi J(\phi) = \mathbb{E}_{\pi_\alpha}\left[ \nabla_\phi \log \pi_\alpha(\alpha | s; \phi) \cdot A(s, \alpha) \right]\] (4)

where the advantage \(A(s, \alpha)\) incorporates multiple objectives (detailed in Section 3.5).
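Equation (4) instantiates REINFORCE. A minimal sketch, assuming a Gaussian policy over \(\alpha\) with a linear mean (one possible policy class; the paper leaves the policy parameterization unspecified):

```python
import numpy as np

def reinforce_step(phi, s, alpha, mu, advantage, eta=0.01, sigma=0.1):
    """One policy-gradient update of Eq. (4), assuming
    alpha ~ N(mu, sigma^2) with linear mean mu = phi . s."""
    # grad_phi log N(alpha; mu, sigma^2) = ((alpha - mu) / sigma^2) * s
    grad_log_pi = ((alpha - mu) / sigma ** 2) * s
    return phi + eta * advantage * grad_log_pi  # ascend the meta-objective

phi = np.zeros(3)
s = np.array([1.0, 0.5, -0.2])
mu = phi @ s                      # mean allocation under current policy
phi = reinforce_step(phi, s, alpha=0.3, mu=mu, advantage=2.0)
```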

Blended Operation

Given \(\alpha_t\), the system computes output through weighted paths:

\[y_t = (1 - \alpha_t) \cdot \text{LLM}(x_t, \text{Retrieve}_{\text{semantic}}(x_t, \mathcal{M})) + \alpha_t \cdot \text{LLM}(x_t, \text{Retrieve}_{Q}(x_t, \mathcal{M}))\] (5)

When \(\alpha_t > \tau_{\text{consolidate}}\), we additionally update adapter parameters:

\[\theta_{\text{adapter}} \leftarrow \theta_{\text{adapter}} - \eta_c \nabla_\theta \mathcal{L}_{\text{consolidate}}(\mathcal{M}_{\text{high-utility}})\] (6)
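Equations (5)–(6) can be rendered as a single dispatch step. Since blending two text outputs is not mechanically specified, the toy `llm` below returns a scalar score, and the retrievers and `consolidate` hook are stand-in callables rather than a real API:

```python
def blended_step(alpha, x, memory, llm, consolidate, tau_consolidate=0.8):
    """Sketch of Eqs. (5)-(6): alpha-weighted blend of the two
    retrieval-conditioned calls, plus a consolidation trigger."""
    semantic_ctx = [m for m in memory]                          # semantic retrieval stub
    q_ctx = sorted(memory, key=lambda m: -m["q"])[:1]           # top-Q retrieval stub
    y = (1 - alpha) * llm(x, semantic_ctx) + alpha * llm(x, q_ctx)  # Eq. (5)
    if alpha > tau_consolidate:
        consolidate(memory)                                     # Eq. (6): adapter update
    return y

memory = [{"q": 0.9}, {"q": 0.1}]
llm = lambda x, ctx: float(len(ctx))   # toy scorer standing in for the LLM
consolidations = []
y = blended_step(0.5, "query", memory, llm, lambda m: consolidations.append(True))
```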

Composite Reward Structure

The meta-controller optimizes a composite reward that balances multiple objectives:

\[R_{\text{meta}} = r_{\text{task}} - \lambda_1 \cdot c_{\text{compute}}(\alpha) - \lambda_2 \cdot p_{\text{overflow}} + \lambda_3 \cdot b_{\text{transfer}} - \lambda_4 \cdot p_{\text{forget}}\] (7)

where \(r_{\text{task}}\) is the task reward, \(c_{\text{compute}}(\alpha)\) the compute cost of operating at allocation \(\alpha\), \(p_{\text{overflow}}\) a penalty for memory-bank overflow, \(b_{\text{transfer}}\) a bonus for successful knowledge transfer, and \(p_{\text{forget}}\) a penalty for catastrophic forgetting; the \(\lambda_i\) weight the competing objectives.
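Assembling Equation (7) is direct; the default \(\lambda\) weights below are illustrative, not tuned values from the paper:

```python
def meta_reward(r_task, c_compute, p_overflow, b_transfer, p_forget,
                lam=(0.1, 0.1, 0.05, 0.2)):
    """Composite reward of Eq. (7) with illustrative lambda weights."""
    l1, l2, l3, l4 = lam
    return (r_task
            - l1 * c_compute     # penalize expensive allocations
            - l2 * p_overflow    # penalize memory-bank overflow
            + l3 * b_transfer    # reward knowledge transfer
            - l4 * p_forget)     # penalize catastrophic forgetting

r = meta_reward(1.0, 0.5, 0.0, 0.2, 0.1)
```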

Epistemic Humility Mechanism

A key vulnerability of memory-augmented systems is memory poisoning—adversarial or erroneous entries that corrupt system behavior. Traditional defenses require explicit detection of poisoned content. We propose a more elegant solution inspired by human cognition: learned epistemic humility.

The Biological Insight

When humans realize they've been operating on false beliefs, they exhibit characteristic behavioral changes:

  1. Decreased confidence in cached intuitions
  2. Increased attention to new information
  3. More frequent verification of assumptions
  4. Temporary suspension of automatic responses

This is adaptive: when your internal model is wrong, relying on it compounds errors. Better to fall back to careful observation (retrieval) until accurate models are rebuilt.

Trust Score Computation

We introduce a trust score \(\psi_t \in [0, 1]\) that modulates the allocation coefficient:

\[\alpha_{\text{effective}} = \pi_\alpha(s_t) \times \psi_t\] (8)

The trust score is computed from multiple anomaly signals:

\[\psi_t = f\left( \text{reward\_collapse}_t, \text{conflict\_rate}_t, \text{calibration\_error}_t, \text{ood\_score}_t \right)\] (9)
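The paper leaves the aggregation function \(f\) in Equation (9) unspecified; one simple instantiation decays trust exponentially in a weighted sum of the anomaly signals (assumed normalized to \([0, 1]\)), which keeps \(\psi \in (0, 1]\):

```python
import math

def trust_score(reward_collapse, conflict_rate, calibration_error, ood_score,
                weights=(1.0, 1.0, 1.0, 1.0)):
    """One possible f for Eq. (9): exponential decay in the weighted
    anomaly total, so psi -> 1 when no anomalies are present."""
    signals = (reward_collapse, conflict_rate, calibration_error, ood_score)
    z = sum(w * s for w, s in zip(weights, signals))
    return math.exp(-z)

def effective_alpha(alpha, psi):
    """Eq. (8): trust multiplicatively modulates the allocation."""
    return alpha * psi
```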
Figure 3: The Trust Monitor computes epistemic confidence \(\psi\) from anomaly signals. Low trust automatically shifts the system toward retrieval mode (epistemic humility), while high trust permits aggressive consolidation.

Humility Gradient

We define a continuous mapping from trust levels to behavioral adaptations:

Table 1: Humility gradient mapping trust scores to system behavior
Trust Level α Range Behavior Human Analogy
0.0 – 0.2 0.0 – 0.1 Pure retrieval, verify everything "I was completely wrong, starting fresh"
0.2 – 0.4 0.1 – 0.3 Heavy retrieval, minimal cached beliefs "I'm being very careful now"
0.4 – 0.6 0.3 – 0.5 Balanced, cross-check important claims "Trust but verify"
0.6 – 0.8 0.5 – 0.7 Mostly trust consolidation, spot-check "Confident but not cocky"
0.8 – 1.0 0.7 – 1.0 Full trust, aggressive consolidation "I know this domain well"
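The bands of Table 1 can be read as a piecewise-constant ceiling on \(\alpha\). The band edges below follow the table; exact boundary behavior is an interpretive choice, since the paper does not pin it down:

```python
def alpha_ceiling(psi):
    """Map trust psi to the upper end of the allowed alpha range,
    per the humility gradient of Table 1 (boundary handling assumed)."""
    bands = [(0.2, 0.1),   # "completely wrong, starting fresh"
             (0.4, 0.3),   # "being very careful now"
             (0.6, 0.5),   # "trust but verify"
             (0.8, 0.7),   # "confident but not cocky"
             (1.0, 1.0)]   # "I know this domain well"
    for psi_hi, a_hi in bands:
        if psi <= psi_hi:
            return a_hi
    return 1.0
```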

Self-Healing Protocol

When trust drops below a threshold \(\tau_{\text{alert}}\), the system initiates a recovery protocol:

Algorithm 1: Epistemic Recovery Protocol
Input: Memory bank ℳ, trust score ψ, threshold τ_alert
Output: Healed memory bank ℳ'
 1: if ψ < τ_alert then
 2:     // QUARANTINE: Flag suspect memories
 3:     S ← {m ∈ ℳ : recently_used(m) ∧ high_Q(m)}
 4:     for m ∈ S do
 5:         m.quarantined ← True
 6:     end for
 7:
 8:     // VERIFY: Cross-check against fresh retrieval
 9:     for m ∈ S do
10:         evidence ← Retrieve_fresh(m.context)
11:         if contradicts(evidence, m.content) then
12:             m.trust_weight ← m.trust_weight × decay_factor
13:         else
14:             m.trust_weight ← min(1, m.trust_weight × boost_factor)
15:             m.quarantined ← False
16:         end if
17:     end for
18:
19:     // PRUNE: Remove consistently contradicted memories
20:     ℳ' ← {m ∈ ℳ : m.trust_weight > τ_prune}
21:
22:     // RECALIBRATE: Gradually restore trust ceiling
23:     α_max ← α_max + recovery_rate × (1 − α_max)
24: end if
25: return ℳ'
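Algorithm 1 admits a compact Python rendering. The `retrieve_fresh` and `contradicts` callables are hypothetical hooks onto the retrieval stack, memories are plain dictionaries, the high-Q threshold is illustrative, and the RECALIBRATE step on \(\alpha_{\max}\) is omitted for brevity:

```python
def epistemic_recovery(memory, psi, retrieve_fresh, contradicts,
                       tau_alert=0.3, tau_prune=0.2,
                       decay_factor=0.5, boost_factor=1.1):
    """Sketch of Algorithm 1 (quarantine -> verify -> prune)."""
    if psi >= tau_alert:
        return memory
    # QUARANTINE: recently used, high-Q memories are the suspects
    suspects = [m for m in memory if m["recently_used"] and m["q"] > 0.5]
    for m in suspects:
        m["quarantined"] = True
        # VERIFY: cross-check the suspect against fresh retrieval
        evidence = retrieve_fresh(m["context"])
        if contradicts(evidence, m["content"]):
            m["trust_weight"] *= decay_factor
        else:
            m["trust_weight"] = min(1.0, m["trust_weight"] * boost_factor)
            m["quarantined"] = False
    # PRUNE: drop consistently contradicted memories
    return [m for m in memory if m["trust_weight"] > tau_prune]

memory = [
    {"recently_used": True, "q": 0.9, "context": "c1",
     "content": "bad", "trust_weight": 0.3, "quarantined": False},
    {"recently_used": True, "q": 0.9, "context": "c2",
     "content": "good", "trust_weight": 0.9, "quarantined": False},
]
healed = epistemic_recovery(memory, psi=0.1,
                            retrieve_fresh=lambda ctx: "evidence",
                            contradicts=lambda ev, content: content == "bad")
```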

Theoretical Analysis

Complementary Learning Systems Correspondence

MetaMemRL operationalizes the Complementary Learning Systems (CLS) theory (McClelland et al., 1995). Table 2 shows the correspondence between brain systems and our architecture:

Table 2: Correspondence between brain systems and MetaMemRL components
Brain System MetaMemRL Component Function
Hippocampus Memory Bank ℳ Fast binding, episodic storage
Neocortex θ_adapter Slow learning, semantic abstraction
Prefrontal Cortex π_α (meta-controller) Executive control, strategy selection
Dopamine System Reward r Learning signal for both loops
Sleep/Replay Consolidation pass Offline transfer of patterns

Stability Analysis

Theorem 1 (Convergence of Two-Timescale Learning)
Under standard assumptions on learning rates (\(\beta \gg \eta\)), bounded rewards, and Lipschitz continuous policies, MetaMemRL converges to a stationary point \((\phi^*, Q^*)\) where:
  1. Inner Q-values satisfy Bellman optimality for fixed \(\alpha\)
  2. Outer policy achieves local optimum of the meta-objective

Proof sketch: The two-timescale stochastic approximation framework (Borkar, 2008) applies since the inner loop (Q-updates) converges faster than the outer loop (policy updates). The inner loop inherits Bellman contraction from MemRL. The outer loop is a standard policy gradient with bounded variance due to the trust modulation. Full proof in Appendix A.

Theorem 2 (Robustness via Humility)
Let \(\epsilon\) be the fraction of poisoned memories. If the trust monitor detects anomalies with probability \(p_{\text{detect}}\) and the recovery protocol removes poisoned entries with probability \(p_{\text{remove}}\), then the expected damage from poisoning is bounded by: \[D(\epsilon) \leq \epsilon \cdot (1 - p_{\text{detect}} \cdot p_{\text{remove}}) \cdot \max_m |Q(m)|\]

This shows that even imperfect detection provides meaningful protection, as long as the system slides toward retrieval (reducing reliance on potentially poisoned consolidation).
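The bound is easy to check on concrete numbers; for instance, with 10% poisoned memories, 80% detection probability, and 90% removal probability, expected damage is capped at 2.8% of the largest memory utility:

```python
def poison_damage_bound(eps, p_detect, p_remove, q_max):
    """Worked instance of the Theorem 2 bound:
    D(eps) <= eps * (1 - p_detect * p_remove) * max_m |Q(m)|."""
    return eps * (1 - p_detect * p_remove) * q_max

bound = poison_damage_bound(eps=0.10, p_detect=0.8, p_remove=0.9, q_max=1.0)
```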


Experimental Design

We propose evaluation across three dimensions: adaptive efficiency, adversarial robustness, and biological plausibility.

Benchmarks

Mixed-Volatility Environment

Tasks are drawn from a distribution with varying information stability. Some domains have stable facts (geography, mathematics) while others change frequently (news, social media). The optimal strategy requires dynamic allocation.

Adversarial Memory Injection

Following the threat model of Zou et al. (2023), we inject poisoned memories that appear semantically relevant but contain misleading information. We measure both task performance degradation and recovery time.

Continual Learning Benchmark

Sequential task learning where information from earlier tasks may become obsolete or contradict later information. We measure forward transfer, backward transfer, and forgetting.

Baselines

Hypotheses

H1: Adaptive Efficiency
MetaMemRL outperforms fixed-α baselines across mixed-volatility benchmarks by learning to match allocation strategy to information properties.
H2: Adversarial Robustness
MetaMemRL + Humility recovers faster from memory poisoning attacks than systems without trust monitoring, without requiring explicit poison detection.
H3: Biological Correspondence
Learned α trajectories correlate with human behavioral patterns in analogous memory consolidation experiments.
Figure 4: Expected experimental results. (a) MetaMemRL maintains high performance across volatility levels by adapting allocation strategy, while fixed approaches excel only in their preferred regime. (b) The epistemic humility mechanism enables faster recovery from memory poisoning by automatically shifting to retrieval mode when anomalies are detected.

Discussion

Implications for AI Safety

The epistemic humility mechanism represents a novel approach to adversarial robustness. Rather than trying to detect specific attacks—an arms race that defenders typically lose—we endow the system with a general-purpose "immune response" that activates whenever something seems wrong. This mirrors biological immune systems, which don't need to know every pathogen in advance but instead detect anomalies and mount responses.

Limitations

Several limitations warrant discussion:

Future Directions

Scaling Implications: Toward General Intelligence

A natural question arises: what happens as this architecture scales? We argue that MetaMemRL captures computational primitives that map directly onto the functional components identified by Complementary Learning Systems theory, and that scaling along spatial and temporal dimensions may yield increasingly general cognitive capabilities.

Mapping to Neural Substrates

The correspondences between MetaMemRL components and biological neural systems are striking:

Table 3: Extended correspondence between MetaMemRL components and biological neural substrates
MetaMemRL Component Biological Analogue Functional Role
Memory bank \(\mathcal{M}\) Hippocampus Fast episodic binding, pattern separation
Adapter parameters \(\theta_{\text{adapter}}\) Neocortex Slow statistical learning, consolidation
Meta-controller \(\pi_\alpha\) Prefrontal cortex Executive control, strategy selection
Q-learning over memories Basal ganglia Action selection, reward-based learning
Trust score \(\psi\) Neuromodulatory systems (ACh, NE) Gain control, uncertainty signaling
LLM forward passes Cortical hierarchies Pattern completion, representation
Temporal iterations Cerebellar circuits Prediction, timing, refinement

Scaling Dimensions

We identify two orthogonal scaling dimensions that parallel biological neural development:

Spatial scaling (width/depth): Increasing the capacity of the LLM backbone, memory bank size \(|\mathcal{M}|\), and meta-controller complexity corresponds to expanding representational capacity. In biological terms, this parallels the expansion of cortical surface area and neuronal count across species. Larger models can represent more complex patterns, maintain more memories, and learn more nuanced allocation policies.

Temporal scaling (iterations/experience): Extending training duration and interaction history enables refinement of Q-values, consolidation of stable knowledge into parameters, and calibration of the trust mechanism. This parallels developmental learning and the role of sleep in memory consolidation. Crucially, the cerebellar analogue—iterative prediction and correction—operates along this dimension.

Scaling Hypothesis
If the MetaMemRL primitive captures the essential arbitration mechanism between fast episodic and slow parametric learning, then sufficient spatial scaling (model capacity) combined with sufficient temporal scaling (experience and iteration) should yield increasingly general cognitive capabilities—provided the system receives sufficiently rich and diverse environmental interaction.

The Role of Epistemic Humility at Scale

The trust modulation mechanism \(\psi\) becomes increasingly important at scale. As systems become more capable, the failure modes of overconfident cached knowledge become more consequential. A scaled MetaMemRL system that maintains appropriate epistemic humility—falling back to information-seeking when uncertain—would exhibit a key property of robust general intelligence: knowing what it doesn't know.

This suggests that the epistemic humility mechanism is not merely a safety feature but a core architectural requirement for general intelligence. Systems that cannot modulate their reliance on cached knowledge based on confidence will exhibit brittle behavior in novel domains—precisely the failure mode that distinguishes narrow AI from general intelligence.

Limitations of the Scaling Argument

We acknowledge significant uncertainty in extrapolating from the current framework to general intelligence:

Nevertheless, the architectural completeness suggested by the biological correspondences—and the absence of obvious missing components—motivates serious investigation of scaled MetaMemRL systems.

Conclusion

We presented MetaMemRL, a framework that learns to dynamically position agents on the memory-parameter spectrum. By introducing a meta-controller that adapts allocation strategy based on task demands and epistemic confidence, we achieve adaptive efficiency while maintaining robustness to adversarial attacks. The epistemic humility mechanism—automatically becoming more information-seeking when cached knowledge seems unreliable—provides a biologically inspired approach to AI safety that does not require explicit attack detection.

The key insight is that the position on the fine-tuning ↔ RAG spectrum should not be a fixed architectural choice but a learned, dynamic quantity that responds to the agent's confidence in its own knowledge. Just as humans naturally become humble when they realize they've been wrong, AI systems can learn to do the same.

References

Borkar, V. S. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge University Press.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016). RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
Grossberg, S. (1982). Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control. Springer.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., ... & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
Kumaran, D., Hassabis, D., & McClelland, J. L. (2016). What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7), 512-534.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS.
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419.
Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560.
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., ... & Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671.
Shin, H., Lee, J. K., Kim, J., & Kim, J. (2017). Continual learning with deep generative replay. NeurIPS.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., ... & Botvinick, M. (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
Wang, Y., & Chen, X. (2025). MIRIX: Multi-component memory architecture for LLM agents. arXiv preprint.
Zhang, S., et al. (2026). MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192.
Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Appendix A: Convergence Proof Sketch

We provide a sketch of the convergence analysis for the two-timescale learning framework.

Setup: Let \(\beta_t\) be the inner loop learning rate and \(\eta_t\) the outer loop learning rate. We require:

\[\sum_t \beta_t = \sum_t \eta_t = \infty, \quad \sum_t \beta_t^2 < \infty, \quad \sum_t \eta_t^2 < \infty, \quad \lim_{t \to \infty} \frac{\eta_t}{\beta_t} = 0\]

Under these conditions, the inner loop converges to its fixed point \(Q^*(\phi)\) for any fixed \(\phi\), and the outer loop sees an approximately stationary inner loop. By the two-timescale stochastic approximation theorem (Borkar, 2008), the joint system converges to \((\phi^*, Q^*(\phi^*))\).
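Power-law step sizes are one standard choice satisfying all four conditions: both exponents lie in \((0.5, 1]\) so the sums diverge while the squared sums converge, and the outer exponent is strictly larger so that \(\eta_t / \beta_t \to 0\). The specific exponents below are illustrative:

```python
def beta(t):
    """Inner-loop (fast) step size: sum diverges, squared sum converges."""
    return t ** -0.6

def eta(t):
    """Outer-loop (slow) step size: larger exponent makes eta/beta -> 0."""
    return t ** -0.9

# The ratio eta(t)/beta(t) = t^{-0.3} shrinks as t grows,
# separating the two timescales as required by Borkar (2008).
```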

The key insight is that trust modulation \(\psi\) acts as a regularizer on the effective \(\alpha\), preventing the outer loop from making large changes when the system is uncertain. This additional stability term helps ensure convergence even under non-stationary memory distributions (e.g., during poisoning attacks).

Appendix B: Trust Manipulation Defenses

A sophisticated adversary might attempt to manipulate the trust score \(\psi\) rather than the memories directly. We consider several attack vectors and defenses:

Attack 1: Trust inflation. Adversary injects memories that artificially boost trust signals.
Defense: Trust signals are computed from task outcomes, not memory content. The adversary would need to improve actual task performance to inflate trust.

Attack 2: Trust oscillation. Adversary causes rapid trust fluctuations to prevent stable learning.
Defense: Trust updates are smoothed with exponential moving average: \(\psi_{t+1} = \gamma \psi_t + (1-\gamma) \hat{\psi}_t\).

Attack 3: Gradual poisoning. Adversary injects poison slo