Abstract:
Reinforcement Learning (RL) has proven effective at sequential decision-making tasks, yet model-based methods such as Dyna-Q struggle in complex or deceptive environments, where partial observability and misleading transitions interfere with learning. To overcome these limitations, we introduce BEDQ+ (Belief-Enhanced Dyna-Q with Entropy-Guided Prioritized Planning), a hybrid architecture that unifies three key innovations: (1) Bayesian belief-state estimation to filter noisy observations, (2) entropy-guided action selection to favor exploratory actions, and (3) prioritized planning to selectively replay high-impact transitions. The approach extends and generalizes the baseline Dyna-Q framework with a principled mechanism for learning under uncertainty. We evaluated BEDQ+ on normal and deceptive 4×4 and 6×6 gridworld environments of increasing difficulty. Results indicate that BEDQ+ consistently outperforms Dyna-Q across all environments. On the Normal 6×6 configuration, BEDQ+ achieved a final average reward of +49.76 and a 100% goal-reach ratio, outperforming Dyna-Q's +40.00. Even in the most challenging Deceptive 6×6 setting, BEDQ+ achieved a reward of +29.07, surpassing Dyna-Q's +20.53, while incurring only 8 trap hits (versus Dyna-Q's 16) and reaching the goal 90% of the time.
An ablation study confirmed the necessity of each module, as removing any one of them degraded performance. A statistical significance test over 10 runs yielded a t-statistic of 54.29 and a p-value < 0.0001, confirming that BEDQ+'s performance improvement is statistically significant. In summary, BEDQ+ exhibits strong scalability, stability, and robustness to deceptive conditions, and thus represents a promising advance toward real-world reinforcement learning.