Reviewer has chosen not to be Anonymous
Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Average
Presentation: Average
Reviewer's confidence: High
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: This manuscript is too long for what it presents and should therefore be considerably shortened (below the general length limit)
Summary of paper in a few sentences:
This paper aims to estimate reaction barrier energies via a TD3-SAC hybrid reinforcement learning method. The authors do this by constructing potential energy landscapes as OpenAI Gymnasium-style environments in which the agent navigates the space like a maze, tracing out a reaction pathway. The goal set for the agent is to minimize the sum of energies of the states visited along this pathway. The authors show that their algorithm estimates near-optimal reaction barriers in two examples.
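For concreteness, my reading of this setup corresponds to something like the minimal Gymnasium-style environment sketched below. This is only a sketch, not the authors' code; the class and parameter names (PESEnv, potential, start, target, delta) are hypothetical.

```python
# Minimal sketch of a Gymnasium-style environment for navigating a 2-D
# potential energy surface, as I understand the paper's setup (illustrative only).
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PESEnv(gym.Env):
    def __init__(self, potential, start, target, delta=0.05, max_step=0.1):
        self.potential = potential                # callable: (x, y) -> energy
        self.start = np.asarray(start, dtype=np.float32)
        self.target = np.asarray(target, dtype=np.float32)
        self.delta = delta                        # termination tolerance
        self.action_space = spaces.Box(-max_step, max_step, shape=(2,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.start.copy()
        return self.state, {}

    def step(self, action):
        self.state = self.state + action
        # Low energy at the visited state gives high reward, so the agent is
        # encouraged to minimize the sum of energies along its pathway.
        reward = -float(self.potential(*self.state))
        terminated = bool(np.linalg.norm(self.state - self.target) < self.delta)
        return self.state, reward, terminated, False, {}
```

Under this reading, the per-step reward is the negative of the energy of the occupied state, and the episode terminates once the agent is within δ of the target.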
Reasons to accept:
The problem this paper focuses on is interesting and a solution would have a clear application.
A new RL algorithm is presented that attempts to improve upon TD3 and SAC.
Reasons to reject:
A solution is only presented for a (likely unrealistically) simple version of the problem here.
While a new RL algorithm is presented, there is only limited discussion comparing it with existing state-of-the-art methods.
The paper itself feels rushed and unrefined.
Nanopublication comments:
Further comments:
Below are extra comments/suggestions for improving the paper.
Introduction:
- Could use more citations when discussing what RL is. (Pg. 2 Ln. 3-18)
- Figure 1 doesn’t really add any value or insight (especially given how much space it takes up and that it is not a figure original to this manuscript). The introduction would benefit from just including the figure caption in the main text instead.
- The comparison between navigating a maze and navigating a potential energy landscape is useful for making the problem accessible to readers in either field, but I feel it could be taken further. Currently, the author connects the start/end states in the two scenarios; the analogy would benefit from connecting the other components as well (such as actions and rewards) to complete it. (Pg. 2 Ln. 19-29)
- Typo in Figure 2 caption “ans solving then solving it using standard reinforcement”. (Pg. 4 Ln. 46)
Methods:
- There seems to be a typo in the equation for R_t (extra comma). (Pg. 6 Ln. 30)
- The state-action value function is missing a gamma. (Pg. 6 Ln. 38) For reference, the standard forms of both definitions are sketched at the end of this section's comments.
- In the methods section (2), there is a mix of presenting the basic ideas of RL in detail (which assumes readers are not familiar with RL) and glossing over more advanced concepts such as actor-critic methods and target policy smoothing (which assumes readers are extremely familiar with RL). The author defines these concepts much later, but they should be defined in this section (rather than in Experiments).
- There are some clear issues with how the MDP is defined and with what the author intends the optimal solution to be. The episode terminates when the current state and the target state are equal (within some tolerance δ), giving the agent an immediate (and final) reward of 146.7. The reward function encourages the agent to minimize the sum of energies over the states it occupies at each time step. The agent could therefore sit in some state δ away from the target state and receive rewards of 146.7 − ε for infinitely many steps (the results shown in Figure 5a support this). While the point of this study is to determine reaction barriers (and not necessarily the entire optimal reaction pathway), the reward function does not necessarily encourage the agent to find the minimum energy barrier. With this reward function, one could imagine a chemical landscape where it is more beneficial for the agent to pass through states with much higher energies than the optimal reaction barrier, because this lets it reach the minimum-energy states much faster (see the toy calculation sketched after this list). This setup therefore guarantees neither the optimal reaction barrier energy nor the optimal reaction pathway.
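For reference, the standard forms that the two comments on R_t and the state-action value function compare against are sketched below (standard Sutton and Barto notation; the manuscript's own symbols may differ).

```latex
\[
  R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
  \qquad
  Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
                 \;\middle|\; s_t = s,\ a_t = a \right].
\]
```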
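The following toy calculation (my own numbers, not the paper's) illustrates the concern: with a per-step reward of −E(s_t) and γ = 0.99, a short path over a high barrier that reaches (and then loiters in) the low-energy basin early can out-score a longer path that crosses the true minimum-energy barrier.

```python
# Toy illustration: hypothetical energy profiles in arbitrary units.
GAMMA = 0.99
HORIZON = 30
BASIN = -146.0  # hypothetical: just outside the delta-ball around the -146.7 minimum

def discounted_return(energies, gamma=GAMMA):
    """Sum of gamma^t * (-E_t) over a sequence of visited-state energies."""
    return sum(gamma**t * (-e) for t, e in enumerate(energies))

# Longer path over the low barrier (~15), then loitering in the basin.
low_barrier_path = [0, 5, 10, 15, 10, 5, 0, -50, -100] + [BASIN] * (HORIZON - 9)
# Shorter path over a much higher barrier (~60) that reaches the basin sooner.
high_barrier_path = [0, 60, -100] + [BASIN] * (HORIZON - 3)

print(discounted_return(low_barrier_path))   # ~2.6e3
print(discounted_return(high_barrier_path))  # ~3.4e3 -- larger, despite the worse barrier
```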
Experiments:
- Figure 3b is not referenced in the text and seems unrelated to Figure 3a; it would pair much better with the results shown in Figure 5. This figure claims that the agent achieves an average return of ~55,000; however, based on how the author defined the return in this environment, this is impossible. Even if the agent started in the state with the global minimum energy of −146.7 (which it does not), using the discount factor provided in Table 1 (γ = 1 − 10⁻²), the maximum possible theoretical return would be ∑ 146.7 · 0.99^n ≤ 14670 (see the bound worked out after this list). This does not take into account that the episode ends if the agent reaches this state, which would lower the theoretical maximum even further.
- Figure 4 is not referenced in the text at all. This figure suggests the author ran further experiments comparing TD3, SAC, and their hybrid of the two; however, there are no other indications of this work.
- I appreciate the experiments shown in Figure 6. They provide some justification for the modifications made to SAC but are not sufficient to justify the introduction of a new RL algorithm.
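For reference, a sketch of the bound referenced in the Figure 3b comment above, assuming the per-step reward magnitude is capped by the global-minimum energy magnitude of 146.7 and γ = 1 − 10⁻²:

```latex
\[
  G_0 = \sum_{n=0}^{\infty} \gamma^{\,n} r_n
      \;\le\; \sum_{n=0}^{\infty} (0.99)^{n} \cdot 146.7
      \;=\; \frac{146.7}{1 - 0.99}
      \;=\; 14670,
\]
```

which is well below the ~55,000 average return reported in Figure 3b.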
Discussions and Conclusion:
- Figure 7 is only offhandedly mentioned in the Conclusions, despite being arguably the most interesting result of the paper.
- This section should be split. There should be a discussion on the results presented separate from the conclusions of the paper.
General Comments:
- There are several highly relevant studies at the intersection of chemical reactions and RL that are not cited in this work and need to be addressed. At the very least, a discussion should be added that compares these works with the work presented here and explains its relative novelty. Some of these works include:
Khan, Ahmad, and Alexei Lapkin. "Searching for optimal process routes: A reinforcement learning approach." Computers & Chemical Engineering 141 (2020): 107027.
Zhang, Chonghuan, and Alexei A. Lapkin. "Reinforcement learning optimization of reaction routes on the basis of large, hybrid organic chemistry–synthetic biological, reaction network data." Reaction Chemistry & Engineering 8.10 (2023): 2491-2504.
Lan, Tian, and Qi An. "Discovering catalytic reaction networks using deep reinforcement learning from first-principles." Journal of the American Chemical Society 143.40 (2021): 16804-16812.
Zhang, Jun, et al. "Deep reinforcement learning of transition states." Physical Chemistry Chemical Physics 23.11 (2021): 6888-6895.
Zhou, Zhenpeng, Xiaocheng Li, and Richard N. Zare. "Optimizing chemical reactions with deep reinforcement learning." ACS Central Science 3.12 (2017): 1337-1344.
- There are only two examples of potential energy surfaces shown in this paper and both are 2-dimensional. While these are convenient for visualization purposes, I would imagine one of the major advantages of this approach is that it could be applied to higher dimensional systems with little modification.
- The paper feels rushed and disjointed. There is no clear flow to the study, as methods are explained in Experiments and results are shown in Discussions and Conclusions. While the problem and algorithm are worth presenting, not enough results are shown for either.
meta-review by editor
Submitted by Tobias Kuhn on
Two expert reviewers have provided their assessments. Both consider the research to be addressing an interesting problem, but also identify substantial weaknesses. In particular, the reviewers have indicated a substantial lack of references to important prior work in the area, which must be addressed. This should not simply add the missing citations, but should also clearly place the work in this manuscript in the context of earlier work. Please also note and address the reviewers' comments regarding the clarity of technical explanations.
Richard Mann (https://orcid.org/0000-0003-0701-1274)