Reviewer has chosen not to be Anonymous
Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Average
Presentation: Average
Reviewer's confidence: High
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: This manuscript is too long for what it presents and should therefore be considerably shortened (below the general length limit)
Summary of paper in a few sentences:
This paper aims to estimate reaction barrier energies via a TD3-SAC hybrid reinforcement learning method. The authors do this by constructing potential energy landscapes as OpenAI Gymnasium-style environments in which the agent navigates the space like a maze, tracing out a reaction pathway. The goal set for the agent is to minimize the sum of energies of the states visited along this pathway. The authors show that their algorithm estimates near-optimal reaction barriers in two examples.
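For concreteness, my reading of this setup corresponds to something like the minimal Gymnasium-style environment sketched below. This is only a sketch, not the authors' code; the class and parameter names (PESEnv, potential, start, target, delta) are hypothetical.

```python
# Minimal sketch of a Gymnasium-style environment for navigating a 2-D
# potential energy surface, as I understand the paper's setup (illustrative only).
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PESEnv(gym.Env):
    def __init__(self, potential, start, target, delta=0.05, max_step=0.1):
        self.potential = potential                # callable: (x, y) -> energy
        self.start = np.asarray(start, dtype=np.float32)
        self.target = np.asarray(target, dtype=np.float32)
        self.delta = delta                        # termination tolerance
        self.action_space = spaces.Box(-max_step, max_step, shape=(2,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.start.copy()
        return self.state, {}

    def step(self, action):
        self.state = self.state + action
        # Low energy at the visited state gives high reward, so the agent is
        # encouraged to minimize the sum of energies along its pathway.
        reward = -float(self.potential(*self.state))
        terminated = bool(np.linalg.norm(self.state - self.target) < self.delta)
        return self.state, reward, terminated, False, {}
```

Under this reading, the per-step reward is the negative of the energy of the occupied state, and the episode terminates once the agent is within δ of the target.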
Reasons to accept:
The problem this paper focuses on is interesting and a solution would have a clear application.
A new RL algorithm is presented that attempts to improve upon TD3 and SAC.
Reasons to reject:
A solution is only presented for a (likely unrealistically) simple version of the problem here.
While a new RL algorithm is presented, there is only limited discussion comparing it with existing state-of-the-art methods.
The paper itself feels rushed and unrefined.
Nanopublication comments:
Further comments:
Below are extra comments/suggestions for improving the paper.
Introduction:
- Could use more citations when discussing what RL is. (Pg. 2 Ln. 3-18)
- Figure 1 doesn’t really add any value or insight (especially given how much space it takes up and that it is not a figure original to this manuscript). The introduction would benefit from just including the figure caption in the main text instead.
- The comparison between navigating a maze and navigating a potential energy landscape is useful for making the problem accessible to readers in either field, but I feel it could be taken further. Currently, the author connects the start/end states in the two scenarios; the analogy would benefit from connecting the other components as well (such as actions and rewards) to complete it. (Pg. 2 Ln. 19-29)
- Typo in Figure 2 caption “ans solving then solving it using standard reinforcement”. (Pg. 4 Ln. 46)
Methods:
- There seems to be a typo in the equation for R_t (extra comma). (Pg. 6 Ln. 30)
- The state-action value function is missing a gamma. (Pg. 6 Ln. 38) For reference, the standard forms of both definitions are sketched at the end of this section's comments.
- In the methods section (2), there is a mix of presenting the basic ideas of RL in detail (which assumes readers are not familiar with RL) and glossing over more advanced concepts such as actor-critic methods and target policy smoothing (which assumes readers are extremely familiar with RL). The author defines these concepts much later, but they should be defined in this section (rather than in Experiments).
- There are some clear issues with how the MDP is defined and with what the author intends the optimal solution to be. The episode terminates when the current state and the target state are equal (within some tolerance δ), giving the agent an immediate (and final) reward of 146.7. The reward function encourages the agent to minimize the sum of energies over the states it occupies at each time step. The agent could therefore sit in some state δ away from the target state and receive rewards of 146.7 − ε for infinitely many steps (the results shown in Figure 5a support this). While the point of this study is to determine reaction barriers (and not necessarily the entire optimal reaction pathway), the reward function does not necessarily encourage the agent to find the minimum energy barrier. With this reward function, one could imagine a chemical landscape where it is more beneficial for the agent to pass through states with much higher energies than the optimal reaction barrier, because this lets it reach the minimum-energy states much faster (see the toy calculation sketched after this list). This setup therefore guarantees neither the optimal reaction barrier energy nor the optimal reaction pathway.
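For reference, the standard forms that the two comments on R_t and the state-action value function compare against are sketched below (standard Sutton and Barto notation; the manuscript's own symbols may differ).

```latex
\[
  R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
  \qquad
  Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
                 \;\middle|\; s_t = s,\ a_t = a \right].
\]
```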
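The following toy calculation (my own numbers, not the paper's) illustrates the concern: with a per-step reward of −E(s_t) and γ = 0.99, a short path over a high barrier that reaches (and then loiters in) the low-energy basin early can out-score a longer path that crosses the true minimum-energy barrier.

```python
# Toy illustration: hypothetical energy profiles in arbitrary units.
GAMMA = 0.99
HORIZON = 30
BASIN = -146.0  # hypothetical: just outside the delta-ball around the -146.7 minimum

def discounted_return(energies, gamma=GAMMA):
    """Sum of gamma^t * (-E_t) over a sequence of visited-state energies."""
    return sum(gamma**t * (-e) for t, e in enumerate(energies))

# Longer path over the low barrier (~15), then loitering in the basin.
low_barrier_path = [0, 5, 10, 15, 10, 5, 0, -50, -100] + [BASIN] * (HORIZON - 9)
# Shorter path over a much higher barrier (~60) that reaches the basin sooner.
high_barrier_path = [0, 60, -100] + [BASIN] * (HORIZON - 3)

print(discounted_return(low_barrier_path))   # ~2.6e3
print(discounted_return(high_barrier_path))  # ~3.4e3 -- larger, despite the worse barrier
```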
Experiments:
- Figure 3b is not referenced in the text and seems unrelated to Figure 3a; it would pair much better with the results shown in Figure 5. This figure claims that the agent achieves an average return of ~55,000; however, based on how the author defined the return in this environment, this is impossible. Even if the agent started in the state with the global minimum energy of −146.7 (which it does not), using the discount factor provided in Table 1 (γ = 1 − 10⁻²), the maximum possible theoretical return would be ∑ 146.7 · 0.99^n ≤ 14670 (see the bound worked out after this list). This does not take into account that the episode ends if the agent reaches this state, which would lower the theoretical maximum even further.
- Figure 4 is not referenced in the text at all. This figure suggests the author ran further experiments comparing TD3, SAC, and their hybrid of the two; however, there are no other indications of this work.
- I appreciate the experiments shown in Figure 6. They provide some justification for the modifications made to SAC but are not sufficient to justify the introduction of a new RL algorithm.
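For reference, a sketch of the bound referenced in the Figure 3b comment above, assuming the per-step reward magnitude is capped by the global-minimum energy magnitude of 146.7 and γ = 1 − 10⁻²:

```latex
\[
  G_0 = \sum_{n=0}^{\infty} \gamma^{\,n} r_n
      \;\le\; \sum_{n=0}^{\infty} (0.99)^{n} \cdot 146.7
      \;=\; \frac{146.7}{1 - 0.99}
      \;=\; 14670,
\]
```

which is well below the ~55,000 average return reported in Figure 3b.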
Discussions and Conclusion:
- Figure 7 is only offhandedly mentioned in the Conclusions, despite being arguably the most interesting result of the paper.
- This section should be split. There should be a discussion on the results presented separate from the conclusions of the paper.
General Comments:
- There are several highly relevant studies at the intersection of chemical reactions and RL that are not cited in this work and need to be addressed. At the very least, a discussion should be added that compares these works with the work presented here and explains its relative novelty. Some of these works include:
Khan, Ahmad, and Alexei Lapkin. "Searching for optimal process routes: A reinforcement learning approach." Computers & Chemical Engineering 141 (2020): 107027.
Zhang, Chonghuan, and Alexei A. Lapkin. "Reinforcement learning optimization of reaction routes on the basis of large, hybrid organic chemistry–synthetic biological, reaction network data." Reaction Chemistry & Engineering 8.10 (2023): 2491-2504.
Lan, Tian, and Qi An. "Discovering catalytic reaction networks using deep reinforcement learning from first-principles." Journal of the American Chemical Society 143.40 (2021): 16804-16812.
Zhang, Jun, et al. "Deep reinforcement learning of transition states." Physical Chemistry Chemical Physics 23.11 (2021): 6888-6895.
Zhou, Zhenpeng, Xiaocheng Li, and Richard N. Zare. "Optimizing chemical reactions with deep reinforcement learning." ACS Central Science 3.12 (2017): 1337-1344.
- There are only two examples of potential energy surfaces shown in this paper and both are 2-dimensional. While these are convenient for visualization purposes, I would imagine one of the major advantages of this approach is that it could be applied to higher dimensional systems with little modification.
- The paper feels rushed and disjointed. There is no clear flow to the study, as methods are explained in Experiments and results are shown in Discussions and Conclusions. While the problem and algorithm are worth presenting, not enough results are shown for either.
meta-review by editor
Submitted by Tobias Kuhn on
Two expert reviewers have provided their assessments. Both consider the research to be addressing an interesting problem, but also identify substantial weaknesses. In particular, the reviewers have indicated a substantial lack of references to important prior work in the area, which must be addressed. This should not simply add the missing citations, but should also clearly place the work in this manuscript in the context of earlier work. Please also note and address the reviewers' comments regarding the clarity of technical explanations.
Richard Mann (https://orcid.org/0000-0003-0701-1274)