Abstract:
Reinforcement Learning (RL) has proven effective at sequential decision-making tasks, yet model-based methods such as Dyna-Q struggle in complex or deceptive environments, where partial observability and misleading transitions interfere with learning. To overcome these limitations, we introduce BEDQ+ (Belief-Enhanced Dyna-Q with Entropy-Guided Prioritized Planning), a hybrid architecture that unifies three key innovations: (1) Bayesian belief-state estimation to filter noisy observations, (2) entropy-guided action selection to favor exploratory actions, and (3) prioritized planning to selectively replay high-impact transitions. The approach extends and generalizes the baseline Dyna-Q framework with a principled mechanism for learning under uncertainty. We evaluated BEDQ+ on normal and deceptive 4×4 and 6×6 gridworld environments of increasing difficulty. Results indicate that BEDQ+ consistently outperforms Dyna-Q across all environments. On the Normal 6×6 configuration, BEDQ+ achieved a final average reward of +49.76 and a 100% goal-reach ratio, outperforming Dyna-Q's +40.00. Even in the most challenging Deceptive 6×6 setting, BEDQ+ achieved a reward of +29.07, surpassing Dyna-Q's +20.53, while incurring only 8 trap hits (versus Dyna-Q's 16) and reaching the goal 90% of the time.
An ablation study confirmed the necessity of each module, as removing any one of them degraded performance. A statistical significance test over 10 runs yielded a t-statistic of 54.29 and a p-value < 0.0001, confirming that BEDQ+'s performance improvement is statistically significant. In summary, BEDQ+ exhibits strong scalability, stability, and robustness to deceptive conditions, and thus represents a promising advance toward real-world reinforcement learning.