Estimating Reaction Barriers with Deep Reinforcement Learning

Tracking #: 878-1858

Authors:

	Name	ORCID
	Adittya Pal	https://orcid.org/0009-0005-0705-7768

Responsible editor:

Richard Mann

Submission Type:

Research Paper

Abstract:

Stable states in complex systems correspond to local minima on the associated potential energy surface. Transitions between these local minima govern the dynamics of such systems. Precisely determining the transition pathways in complex and high-dimensional systems is challenging because these transitions are rare events, and isolating the relevant species in experiments is difficult. Most of the time, the system remains near a local minimum, with rare, large fluctuations leading to transitions between minima. The probability of such transitions decreases exponentially with the height of the energy barrier, making the system's dynamics highly sensitive to the calculated energy barriers. This work aims to formulate the problem of finding the minimum energy barrier between two stable states in the system's state space as a cost-minimization problem. It is proposed to solve this problem using reinforcement learning algorithms. The exploratory nature of reinforcement learning agents enables efficient sampling and determination of the minimum energy barrier for transitions.

Manuscript:

ds-paper-878.pdf

Previous Version:

Estimating Reaction Barriers with Deep Reinforcement Learning

Data repository URLs:

GitHub repository for the code: https://github.com/AdittyaPal/energy_barrier_rl

.ipynb file for figures: https://github.com/AdittyaPal/energy_barrier_rl/blob/main/figures.ipynb

Zenodo repository for trajectories and plot data: Pal, A. (2024). Supporting Data for the submission Estimating Reaction Barriers using Deep Reinforcement Learning. Zenodo. https://doi.org/10.5281/zenodo.12783976

Date of Submission:

Wednesday, September 4, 2024

Date of Decision:

Monday, September 23, 2024

Nanopublication URLs:
http://ds.kpxl.org/RAZdadPk2Lz1Bo6jvZxCeFME9-Yr-NXwBxOJuOxgWfVSk

Decision:

Solicited Reviews:

Review #1 submitted on 11/Sep/2024

By Chris Beeler ORCID logo

https://orcid.org/0000-0002-8603-3027

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Good
Presentation: Good
Reviewer`s confidence: High
Significance: Moderate significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

This paper aims to estimate reaction barrier energies via a TD3-SAC hybrid reinforcement learning method. They first do this by constructing the potential energy landscapes as OpenAI gymnasium style environments where the agent can navigate through this space like a maze, creating some reaction pathway. The goal they set for the agent to learn is minimizing the sum of energies for each state visited in this pathway. They show that their algorithm is able to estimate near optimal reaction barriers in two examples.

Reasons to accept:

The problem this paper focuses on is interesting and a solution would have a clear application.
A new RL algorithm is presented that attempts to improve upon TD3 and SAC.
The author has clearly outlined the limitations of the work presented.
The author discusses the results in great detail.

Reasons to reject:

Results are only presented for two 2D examples of the problem here.

Nanopublication comments:

Further comments:

While the author has outlined that other state-of-the-art studies mostly only focus on 2D Müller–Brown potential surfaces, one would need to apply this algorithm to more than two examples to extract any meaningful performance metrics on it. in my opinion.

Overall the manuscript has been significantly improved from the initial version. Except the one listed above, all my previous concerns have been addressed in detail. While it is just the one comment I have left, I consider it quite a big concern, hence why I have remained undecided. If the author were to add additional results (just a single figure would do), summarizing the performance of this algorithm trained on several different other Müller–Brown surfaces, this would shift me to suggest acceptance.

Alternatively, if the editor or other reviewer feel that the results presented are sufficient, then I will not press the issue any further.

Review #2 submitted on 19/Sep/2024

By Stephen Gow ORCID logo

https://orcid.org/0000-0003-0121-1697

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Average
Presentation: Good
Reviewer`s confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The reviews paper includes several changes from the original submission. Additional background content on reinforcement learning, discussion of relevant prior work and references have been added. The presentation and ordering of figures have been revised, and the discussion around them expanded and moved from captions to the body of the text. Additional consideration of future directions for the developed method and its strengths and weaknesses is included in the conclusions section, which has been split from the discussion of related work. A number of typos and small errors have been corrected.

Reasons to accept:

As before, the paper provides a useful and novel contribution to a problem with real significance. I appreciate the efforts of the author to improve the paper since the first review, which have substantially improved its clarity and links to previous work.

Reasons to reject:

I remain concerned that the implementation choices in the reinforcement learning algorithm, and their effect on the behaviour of the agent, are not sufficiently well described to be of use to researchers who may wish to use this method in future. In particular, the importance of the scaling parameter λ is highlighted twice in the response to reviewers, under both my point 1 and my fellow reviewer's point 9. This is important content which should be discussed in the paper itself - at present it is not obvious from the paper that varying this parameter was even considered.

Nanopublication comments:

Further comments:

The expansion of the third paragraph of the introduction is broadly welcome, but it is now too long to read comfortably and covers a wide range of somewhat disparate content - it should be split into two paragraphs for ease of understanding.

RESPONSE TO REVIEWERS

The text added in the re-submission is in red. The count of works in the .tex file was ~9000, which was below the target of 12000 words for a submission. I thank reviewer 1 for the positive comments. The reply to the reviewer's suggestions are listed below. The choice of the values for the parameters \tau and \gamma, and the neural network architecture were kept the same as the TD3 (https://github.com/sfujim/TD3/tree/master) and SAC (https://github.com/haarnoja/sac/tree/master) algorithm implementations. The scaling factor \lambda adjusts the step size for the agent, and it was this combination of the scaling factor and number of steps in an episode which led to the best performance of the agent. The agent reaches near the terminal state with just the number of required small enough steps to end the episode. A larger value of \lambda resulted in the agent taking longer steps over regions of the potential energy surface with a higher energy to give an incorrect estimate of the barrier height. Smaller values of \lambda led to smaller steps, and the agent did not leave the local minima to explore other regions of the potential energy surface, and is unsuccessful a its assigned task. It is difficult to visualizing these effects as a plot of the rewards against the number of validation steps as those in Figure 6. The captions of all the figures, and Figure 1, 3 and 4 in particular have been shortened. The typos in the caption of Figure 2 has been corrected. A sentence introducing the actor and critic functions has been added (Page 5, lines 14-18). The three pathways are labeled and the labels are used to identify the energy profiles of the respective pathways. I thank reviewer 2 for the constructive comments and suggestions. The reply to the reviewer's suggestions are listed below. Introduction: More citations have been added while introduction reinforcement learning in Section 1, which are: - Reinforcement Learning: A Survey (https://doi.org/10.48550/arXiv.cs/9605103) - P. Maes, "Modeling Adaptive Autonomous Agents," in Artificial Life, vol. 1, no. 1_2, pp. 135-162, Oct. 1993, doi: 10.1162/artl.1993.1.1_2.135. - Reinforcement Learning: An Introduction by Sutton and Barto was cited in the previous submission. - Reward is enough (https://doi.org/10.1016/j.artint.2021.103535) - T. Mannucci and E. -J. van Kampen, "A hierarchical maze navigation algorithm with Reinforcement Learning and mapping," 2016 IEEE Symposium Series on Computational Intelligence (SSCI), 2016, pp. 1-8, doi: 10.1109/SSCI.2016.7849365, was cited in the previous submission. - D. Osmanković and S. Konjicija, "Implementation of Q — Learning algorithm for solving maze problem," 2011 Proceedings of the 34th International Convention MIPRO, 2011, pp. 1619-1622. - M. A. Wiering and H. van Hasselt, "Ensemble Algorithms in Reinforcement Learning," in IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 930-936, Aug. 2008, doi: 10.1109/TSMCB.2008.920231. - Neural Map: Structured Memory for Deep Reinforcement Learning (https://doi.org/10.48550/arXiv.1702.08360) - Brunner, G., Richter, O., Wang, Y., & Wattenhofer, R. (2018). Teaching a Machine to Read Maps With Deep Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11645 Figure 1 has been downsized to occupy less space and its source is acknowledged. While it was a little difficult to make an analogy between the actions and rewards (especially rewards, because the agent has different objectives in both environments), it has been added at lines 5-8 and 11-16 on page 3. The typo in the caption of Figure 2 was removed and most of the caption is incorporated into the main text. Methods: The typo in the equation for Rt (extra comma). (Pg. 6 Ln. 43) was corrected. The missing \gamma in the state-action function (Pg. 8 Ln. 6) was added. A sentence introducing the actor and critic functions has been added on page 5, lines 14-18. Target policy smoothing is introduced on page 8 lines 26-34. The author acknowledges that formulated MDP suffers from the problem raised by the reviewer. The number of steps in an episode, the scaling factor of actions and the number of training epochs were varied to come up with a set of values for these three parameters which minimize the problems cause due to the imperfect formulation of the MDP. To discourage the agent from sitting in some state \delta away from the target state, and collect rewards for the remainder of the episode, the episode is truncated after 500 steps. A small scaling factor \lambda was used for the actions, so that the agent does not make too many long jumps through higher energy states to reach a state with lower energy faster. Decreasing \lambda would require increasing the maximum number of steps in an episode, so that the agent explores regions away from the starting point, but not too much that the trajectory passes through regions with higher energy. The lowest reward in the episode (corresponding to the highest energy along the pathway, plotted in added Figure 5b, was monitored to decide when the agent stops improving at its intended task. Using the model after 1000 validation steps indeed led to a higher estimate of the energy barrier. Experiments: Figure 3b was shifted to be a part of Figure 5, as suggested by the reviewer. I would like to acknowledge my mistake of not multiplying the discount factor \gamma while calculating the returns from the episodes. It has been corrected and leads to a much flatter learning curve. Some text is added on Pages 10 and 11 to elaborate on the plots in Figure 4. The table in Figure 6 was updated after multiplying the discount factor while calculating the average rewards. Text was added to page 16, lines 33 onward, to elaborate on the results from Figure 7. However, it was kept a part of conclusions because it demonstrates a conclusion: using a reinforcement learning based approach has an advantage compared to the existing gradient based algorithms. The section has been split. Discussions contain comparison of the current work with previous work, while conclusions focus only on this work. I would like to thank out the reviewer for pointing out relevant references. A discussion of these in the context of the current work is added in Section 4 and summarized below: - Khan, Ahmad, and Alexei Lapkin. "Searching for optimal process routes: A reinforcement learning approach." Computers & Chemical Engineering 141 (2020): 107027. This work focuses on maximizing the profit (as understood from Figure 5) of a sequence of reactions, in a discrete thermodynamic state space represented by a thermodynamic graph. The current work focuses on determining an estimate of energy barrier for a single reaction in a continuous state space. - Zhang, Chonghuan, and Alexei A. Lapkin. "Reinforcement learning optimization of reaction routes on the basis of large, hybrid organic chemistry–synthetic biological, reaction network data." Reaction Chemistry & Engineering 8.10 (2023): 2491-2504. This work uses reinforcement learning to minimize the cost of a sequence of reactions (called synthesis plans) with respect to the price of the starting molecules and atom economy of individual reactions. Figure 2b suggests that the state space is discrete (albeit large), which allows the use to tabular learning algorithms, which cannot be used for continuous state spaces. - Lan, Tian, and Qi An. "Discovering catalytic reaction networks using deep reinforcement learning from first-principles." Journal of the American Chemical Society 143.40 (2021): 16804-16812. and Lan, T., Wang, H. & An, Q. Enabling high throughput deep reinforcement learning with first principles to investigate catalytic reaction mechanisms. Nat Commun 15, 6281 (2024). https://doi.org/10.1038/s41467-024-50531-6 These works use deep reinforcement learning in a 23 dimensional discrete state space to determine the best pathway consisting of a sequence of reactions (as shown in Figure 4). The energy barriers for individual reactions is determined using a software called VASP. The objective of the current work is to determine the energy barrier for a single reaction using deep reinforcement learning. - Zhang, Jun, et al. "Deep reinforcement learning of transition states." Physical Chemistry Chemical Physics 23.11 (2021): 6888-6895. This work was already cited in the previous submission. This work formulates the search for transition states as a shooting game from some configuration in the state space with randomized momenta for the two trajectories in opposite directions, expecting them to converge at the two minima representing the two sides of the reaction. It demonstrates the method on 4 two dimensional environments (in the three higher dimensional environments, two dimensions of interest have been chosen). The current work starts from a local minima and tries to learn a trajectory to another minima using reinforcement learning and read off the energy barrier of the transition from the energies along the generated trajectory. - Zhou, Zhenpeng, Xiaocheng Li, and Richard N. Zare. "Optimizing chemical reactions with deep reinforcement learning." ACS central science 3.12 (2017): 1337-1344. This work tries to optimize chemical reactions by perturbing the experimental conditions to achieve a better measure of selectivity, purity or cost for the reaction, using deep reinforcement learning, which has application is the laboratory. It does not estimate the energy barrier for a reaction. - Alexis W. Mills, et al. "Exploring Potential Energy Surfaces Using Reinforcement Machine Learning" Journal of Chemical Information and Modeling 2022 62 (13), 3169-3179, DOI: 10.1021/acs.jcim.2c00373 This work demonstrates the use of a modified DDPG algorithm to determine the minima on a potential energy surface. The current work assumes that the local minima are already known and attempts to estimate the energy barrier between the transition between two minima. I acknowledge that two-dimensional (simpler) environments have been as examples. Higher dimensional state spaces would require more computational resources and longer training-times for the agent to learn. I would like to point out that most state-of-the art works also use two-dimensional models. One of the references suggested bu the reviewer, Zhang, Jun, et al. Deep reinforcement learning of transition states." Physical Chemistry Chemical Physics 23.11 (2021): 6888-6895, uses four two dimensional models, all with two potential wells. For systems with multiple dimensions, two dimensions have been chosen (by expert knowledge, called order parameters), and the agent used only those two dimensions. To avoid this (human) choice, environments with only two-dimensions were used in the current work. In works where the state space has a higher number of dimensions, the state space is discrete. While I admit that the Mueller–Brown potential is a constructed artificial potential, nevertheless it had been used to show the effectiveness of the algorithms is use to determine minimum energy pathways: - Growing String Methods: Wolfgang Quapp; A growing string method for the reaction pathway defined by a Newton trajectory. J. Chem. Phys. 1 May 2005; 122 (17): 174106. (https://doi.org/10.1063/1.1885467) - Nudged Elastic Band: Graeme Henkelman, Hannes Jónsson; Improved tangent estimate in the nudged elastic band method for finding minimum energy paths and saddle points. J. Chem. Phys. 8 December 2000; 113 (22): 9978–9985. (https://doi.org/10.1063/1.1323224) used a two dimensional LEPS model potential with two minima only (Mueller–Brown potential has an intermediate third minima). - Baron Peters, Andreas Heyden et. al, A growing string method for determining transition states: Comparison to the nudged elastic band and string methods. J. Chem. Phys. 1 May 2004; 120 (17): 7877–7886. (https://doi.org/10.1063/1.1691018) - Accelerated Molecular Dynamics: Adaptively Accelerating Reactive Molecular Dynamics Using Boxed Molecular Dynamics in Energy Space, Robin J. Shannon, Silvia Amabilino et. al, Journal of Chemical Theory and Computation 2018 14 (9), 4541-4552. (https://doi.org/10.1021/acs.jctc.8b00515) - Artificial Force Induced Reaction: Quapp W, Bofill JM, Mechanochemistry on the Mueller–Brown surface by Newton trajectories. Int J Quantum Chem. 2018;118:e25522. (https://doi.org/10.1002/qua.25522) - Reinforcement Learning: Exploring Potential Energy Surfaces Using Reinforcement Machine Learning, Alexis W. Mills, Joshua J. Goings et. al, Journal of Chemical Information and Modeling 2022 62 (13), 3169-3179, (https://doi.org/10.1021/acs.jcim.2c00373) uses a RL agent to explore the potential energy surface. The reply to the reviewers has also been attached as the last 6 pages of the submission.

1 Comment

meta-review by editor

Submitted by Tobias Kuhn on Mon, 09/23/2024 - 02:49

Both reviewers agree that the revised manuscript is substatially improved from the initial submission. From these reviews and my reading of the author response to the initial reviews I am happy to accept the manuscript for publication, subject to one remaining minor revision. Please include in the manuscript itself an overview of the effect of varying the scaling parameter lambda; as mentioned here by reviewer #2 this belongs in the manuscript as well as the response to reviewers. I invite you also to consider addressing the remaining comment by reviewer #1 regarding presenting the result of applying the algorithm to other Müller-Brown surfaces. Please add results as suggested if this can be done straightforwardly and if you agree that it would improve the paper, but my recommendation of acceptance does not depend on this.

Richard Mann (https://orcid.org/0000-0003-0701-1274)

Data Science

Estimating Reaction Barriers with Deep Reinforcement Learning

Tracking #: 878-1858

Authors:

Responsible editor:

Submission Type:

Abstract:

Manuscript:

Previous Version:

Tags:

Data repository URLs:

Date of Submission:

Date of Decision:

Decision:

1 Comment

meta-review by editor