The text added in the re-submission is in red. The word count of the .tex file was ~9000, which is below the target of 12000 words for a submission.

I thank reviewer 1 for the positive comments. The replies to the reviewer's suggestions are listed below.

The values of the parameters \tau and \gamma and the neural network architecture were kept the same as in the TD3 (https://github.com/sfujim/TD3/tree/master) and SAC (https://github.com/haarnoja/sac/tree/master) reference implementations. The scaling factor \lambda adjusts the step size of the agent, and it was this combination of the scaling factor and the number of steps in an episode that led to the best performance of the agent: the agent reaches the neighbourhood of the terminal state in just the required number of sufficiently small steps to end the episode. A larger value of \lambda resulted in the agent taking longer steps across higher-energy regions of the potential energy surface, giving an incorrect estimate of the barrier height. Smaller values of \lambda led to steps so small that the agent did not leave the local minimum to explore other regions of the potential energy surface and was unsuccessful at its assigned task (a minimal sketch of how \lambda enters the episode loop is given after the list of added citations below). It is difficult to visualize these effects in a plot of the rewards against the number of validation steps such as those in Figure 6.

The captions of all the figures, and of Figures 1, 3 and 4 in particular, have been shortened. The typos in the caption of Figure 2 have been corrected. A sentence introducing the actor and critic functions has been added (page 5, lines 14-18). The three pathways are labeled, and the labels are used to identify the energy profiles of the respective pathways.

I thank reviewer 2 for the constructive comments and suggestions. The replies to the reviewer's suggestions are listed below.

Introduction: More citations have been added while introducing reinforcement learning in Section 1:
- Reinforcement Learning: A Survey (https://doi.org/10.48550/arXiv.cs/9605103)
- P. Maes, "Modeling Adaptive Autonomous Agents," Artificial Life, vol. 1, no. 1_2, pp. 135-162, Oct. 1993, doi: 10.1162/artl.1993.1.1_2.135.
- Reinforcement Learning: An Introduction by Sutton and Barto was cited in the previous submission.
- Reward is enough (https://doi.org/10.1016/j.artint.2021.103535)
- T. Mannucci and E.-J. van Kampen, "A hierarchical maze navigation algorithm with Reinforcement Learning and mapping," 2016 IEEE Symposium Series on Computational Intelligence (SSCI), 2016, pp. 1-8, doi: 10.1109/SSCI.2016.7849365, was cited in the previous submission.
- D. Osmanković and S. Konjicija, "Implementation of Q-Learning algorithm for solving maze problem," 2011 Proceedings of the 34th International Convention MIPRO, 2011, pp. 1619-1622.
- M. A. Wiering and H. van Hasselt, "Ensemble Algorithms in Reinforcement Learning," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 930-936, Aug. 2008, doi: 10.1109/TSMCB.2008.920231.
- Neural Map: Structured Memory for Deep Reinforcement Learning (https://doi.org/10.48550/arXiv.1702.08360)
- Brunner, G., Richter, O., Wang, Y., & Wattenhofer, R. (2018). Teaching a Machine to Read Maps With Deep Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11645

Figure 1 has been downsized to occupy less space and its source is acknowledged.
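Returning to reviewer 1's question about the scaling factor: the sketch below illustrates how \lambda shrinks the raw policy action into a small displacement on the potential energy surface and how the episode is truncated. It is a minimal illustration only; the names policy, potential_energy, MAX_STEPS and the numeric values are assumptions for the sketch, not the code used in the manuscript.

```python
import numpy as np

# Minimal sketch (not the manuscript's code) of how a scaling factor "lam"
# turns the raw policy action into a small displacement on the potential
# energy surface, and how the episode is truncated after a fixed number
# of steps.

MAX_STEPS = 500   # episode truncated after this many steps
lam = 0.01        # scaling factor for the actions (assumed value)

def run_episode(policy, potential_energy, start_state, goal_state, tol=0.05):
    state = np.asarray(start_state, dtype=float)
    goal = np.asarray(goal_state, dtype=float)
    trajectory = [state.copy()]
    for _ in range(MAX_STEPS):
        action = policy(state)            # raw action, e.g. in [-1, 1]^2
        state = state + lam * action      # lam controls the step length
        trajectory.append(state.copy())
        if np.linalg.norm(state - goal) < tol:
            break                         # reached the target minimum
    # barrier estimate: highest energy along the path relative to the start
    energies = [potential_energy(s) for s in trajectory]
    return trajectory, max(energies) - energies[0]
```

A larger lam lets the same policy output jump across high-energy ridges between consecutive energy evaluations, while a very small lam requires many more steps (and a larger MAX_STEPS) before the agent leaves the starting minimum; this is the trade-off described above.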
While it was a little difficult to draw an analogy between the actions and rewards (especially the rewards, because the agent has different objectives in the two environments), it has been added at lines 5-8 and 11-16 on page 3. The typo in the caption of Figure 2 was removed, and most of the caption has been incorporated into the main text.

Methods: The typo in the equation for R_t (an extra comma) (Pg. 6, Ln. 43) was corrected. The missing \gamma in the state-action value function (Pg. 8, Ln. 6) was added. A sentence introducing the actor and critic functions has been added on page 5, lines 14-18. Target policy smoothing is introduced on page 8, lines 26-34.

The author acknowledges that the formulated MDP suffers from the problem raised by the reviewer. The number of steps in an episode, the scaling factor of the actions and the number of training epochs were varied to find a set of values for these three parameters that minimizes the problems caused by the imperfect formulation of the MDP. To discourage the agent from sitting in some state \delta away from the target state and collecting rewards for the remainder of the episode, the episode is truncated after 500 steps. A small scaling factor \lambda was used for the actions, so that the agent does not make long jumps through higher-energy states to reach a lower-energy state faster. Decreasing \lambda would require increasing the maximum number of steps in an episode, so that the agent explores regions away from the starting point, but not by so much that the trajectory passes through regions of higher energy. The lowest reward in the episode (corresponding to the highest energy along the pathway, plotted in the added Figure 5b) was monitored to decide when the agent stops improving at its intended task. Using the model after 1000 validation steps indeed led to a higher estimate of the energy barrier.

Experiments: Figure 3b was moved to become part of Figure 5, as suggested by the reviewer. I would like to acknowledge my mistake of not multiplying by the discount factor \gamma while calculating the returns from the episodes. It has been corrected and leads to a much flatter learning curve (a minimal sketch of the corrected calculation is given below). Some text has been added on pages 10 and 11 to elaborate on the plots in Figure 4. The table in Figure 6 was updated after multiplying by the discount factor while calculating the average rewards. Text was added on page 16, lines 33 onward, to elaborate on the results from Figure 7. However, it was kept as part of the conclusions because it demonstrates a conclusion: a reinforcement learning based approach has an advantage over the existing gradient-based algorithms. The section has been split: the discussion contains the comparison of the current work with previous work, while the conclusions focus only on this work.
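For clarity, a minimal sketch of the discount-factor correction mentioned above (placeholder names and an assumed value of \gamma; this is not the manuscript's code):

```python
# Sketch of the discount-factor correction: the return of an episode weights
# each reward by gamma**t rather than summing the raw rewards, which is what
# the previous submission did.

def episode_return(rewards, gamma=0.99):
    # previous submission (incorrect): sum(rewards)
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, with gamma = 0.99 a constant reward of 1 over 500 steps gives a return of roughly 99 instead of 500, which compresses the scale of the learning curve.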
"Reinforcement learning optimization of reaction routes on the basis of large, hybrid organic chemistry–synthetic biological, reaction network data." Reaction Chemistry & Engineering 8.10 (2023): 2491-2504. This work uses reinforcement learning to minimize the cost of a sequence of reactions (called synthesis plans) with respect to the price of the starting molecules and atom economy of individual reactions. Figure 2b suggests that the state space is discrete (albeit large), which allows the use to tabular learning algorithms, which cannot be used for continuous state spaces. - Lan, Tian, and Qi An. "Discovering catalytic reaction networks using deep reinforcement learning from first-principles." Journal of the American Chemical Society 143.40 (2021): 16804-16812. and Lan, T., Wang, H. & An, Q. Enabling high throughput deep reinforcement learning with first principles to investigate catalytic reaction mechanisms. Nat Commun 15, 6281 (2024). https://doi.org/10.1038/s41467-024-50531-6 These works use deep reinforcement learning in a 23 dimensional discrete state space to determine the best pathway consisting of a sequence of reactions (as shown in Figure 4). The energy barriers for individual reactions is determined using a software called VASP. The objective of the current work is to determine the energy barrier for a single reaction using deep reinforcement learning. - Zhang, Jun, et al. "Deep reinforcement learning of transition states." Physical Chemistry Chemical Physics 23.11 (2021): 6888-6895. This work was already cited in the previous submission. This work formulates the search for transition states as a shooting game from some configuration in the state space with randomized momenta for the two trajectories in opposite directions, expecting them to converge at the two minima representing the two sides of the reaction. It demonstrates the method on 4 two dimensional environments (in the three higher dimensional environments, two dimensions of interest have been chosen). The current work starts from a local minima and tries to learn a trajectory to another minima using reinforcement learning and read off the energy barrier of the transition from the energies along the generated trajectory. - Zhou, Zhenpeng, Xiaocheng Li, and Richard N. Zare. "Optimizing chemical reactions with deep reinforcement learning." ACS central science 3.12 (2017): 1337-1344. This work tries to optimize chemical reactions by perturbing the experimental conditions to achieve a better measure of selectivity, purity or cost for the reaction, using deep reinforcement learning, which has application is the laboratory. It does not estimate the energy barrier for a reaction. - Alexis W. Mills, et al. "Exploring Potential Energy Surfaces Using Reinforcement Machine Learning" Journal of Chemical Information and Modeling 2022 62 (13), 3169-3179, DOI: 10.1021/acs.jcim.2c00373 This work demonstrates the use of a modified DDPG algorithm to determine the minima on a potential energy surface. The current work assumes that the local minima are already known and attempts to estimate the energy barrier between the transition between two minima. I acknowledge that two-dimensional (simpler) environments have been as examples. Higher dimensional state spaces would require more computational resources and longer training-times for the agent to learn. I would like to point out that most state-of-the art works also use two-dimensional models. One of the references suggested bu the reviewer, Zhang, Jun, et al. 
One of the references suggested by the reviewer, Zhang, Jun, et al., "Deep reinforcement learning of transition states," Physical Chemistry Chemical Physics 23.11 (2021): 6888-6895, uses four two-dimensional models, all with two potential wells. For systems with more dimensions, two dimensions were chosen (by expert knowledge, as so-called order parameters), and the agent used only those two dimensions. To avoid this (human) choice, only two-dimensional environments were used in the current work. In works where the state space has a higher number of dimensions, the state space is discrete.

While I admit that the Mueller–Brown potential is a constructed, artificial potential, it has nevertheless been used to show the effectiveness of algorithms for determining minimum energy pathways (a sketch of its standard functional form is given at the end of this reply):

- Growing string methods: Wolfgang Quapp, "A growing string method for the reaction pathway defined by a Newton trajectory." J. Chem. Phys. 1 May 2005; 122 (17): 174106. (https://doi.org/10.1063/1.1885467)
- Nudged elastic band: Graeme Henkelman, Hannes Jónsson, "Improved tangent estimate in the nudged elastic band method for finding minimum energy paths and saddle points." J. Chem. Phys. 8 December 2000; 113 (22): 9978-9985. (https://doi.org/10.1063/1.1323224) used a two-dimensional LEPS model potential with only two minima (the Mueller–Brown potential has an intermediate third minimum).
- Baron Peters, Andreas Heyden et al., "A growing string method for determining transition states: Comparison to the nudged elastic band and string methods." J. Chem. Phys. 1 May 2004; 120 (17): 7877-7886. (https://doi.org/10.1063/1.1691018)
- Accelerated molecular dynamics: "Adaptively Accelerating Reactive Molecular Dynamics Using Boxed Molecular Dynamics in Energy Space," Robin J. Shannon, Silvia Amabilino et al., Journal of Chemical Theory and Computation 2018, 14 (9), 4541-4552. (https://doi.org/10.1021/acs.jctc.8b00515)
- Artificial force induced reaction: Quapp W, Bofill JM, "Mechanochemistry on the Mueller–Brown surface by Newton trajectories." Int J Quantum Chem. 2018;118:e25522. (https://doi.org/10.1002/qua.25522)
- Reinforcement learning: "Exploring Potential Energy Surfaces Using Reinforcement Machine Learning," Alexis W. Mills, Joshua J. Goings et al., Journal of Chemical Information and Modeling 2022, 62 (13), 3169-3179 (https://doi.org/10.1021/acs.jcim.2c00373) uses an RL agent to explore the potential energy surface.

The reply to the reviewers has also been attached as the last 6 pages of the submission.
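For reference, the Mueller–Brown surface discussed above has a standard analytic form as a sum of four anisotropic Gaussian terms. The sketch below uses the parameter values commonly quoted in the literature (reproduced here from the literature as an assumption, not taken from the manuscript's code).

```python
import numpy as np

# Standard analytic form of the Mueller-Brown potential (Mueller & Brown, 1979)
# with the commonly quoted parameter values; included only as a reference for
# the surface discussed above.
A  = np.array([-200.0, -100.0, -170.0, 15.0])
a  = np.array([  -1.0,   -1.0,   -6.5,  0.7])
b  = np.array([   0.0,    0.0,   11.0,  0.6])
c  = np.array([ -10.0,  -10.0,   -6.5,  0.7])
x0 = np.array([   1.0,    0.0,   -0.5, -1.0])
y0 = np.array([   0.0,    0.5,    1.5,  1.0])

def mueller_brown(x, y):
    """Potential energy at (x, y): a sum of four anisotropic Gaussian terms."""
    return float(np.sum(A * np.exp(a * (x - x0) ** 2
                                   + b * (x - x0) * (y - y0)
                                   + c * (y - y0) ** 2)))
```

The surface has two deep minima and a shallow intermediate minimum separated by two saddle points, which is why it is a common test case for minimum-energy-path algorithms.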
meta-review by editor
Submitted by Tobias Kuhn on
Both reviewers agree that the revised manuscript is substantially improved from the initial submission. From these reviews and my reading of the author response to the initial reviews, I am happy to accept the manuscript for publication, subject to one remaining minor revision. Please include in the manuscript itself an overview of the effect of varying the scaling parameter lambda; as mentioned here by reviewer #2, this belongs in the manuscript as well as in the response to reviewers. I invite you also to consider addressing the remaining comment by reviewer #1 regarding presenting the results of applying the algorithm to other Müller-Brown surfaces. Please add results as suggested if this can be done straightforwardly and if you agree that it would improve the paper, but my recommendation of acceptance does not depend on this.
Richard Mann (https://orcid.org/0000-0003-0701-1274)