In this notebook we rely on IBM qiskit [1], OpenAI gym [2] and the stable-baselines library [3] to set up a quantum game and have artificial reinforcement learning agents play and learn it.
In a previous notebook we ran a very simple game, qcircuit-v0. We now try out a more challenging version, qcircuit-v1, and we compare the performance of different agents playing it.
As before, it is necessary to set up the packages required for this simulation, as explained in Setup.ipynb.
Next, we import some basic libraries.
import numpy as np
import gym
from IPython.display import display
The game we will run is provided in gym-qcircuit [4], and it is implemented complying with the standard OpenAI gym interface.
The game is a simple quantum circuit building game: given a fixed number of qubits and a desired final state for these qubits, the objective is to design a quantum circuit that takes the given qubits to the desired final state.
import qcircuit
The module qcircuit offers two versions of the game: qcircuit-v0, the simpler game used in the previous notebook, and qcircuit-v1, the more challenging version we consider here.
Details on the implementation of these games are available at https://github.com/FMZennaro/gym-qcircuit/blob/master/qcircuit/envs/qcircuit_env.py.
We start by loading this scenario and running agents on it.
env = gym.make('qcircuit-v1')
The game qcircuit-v1 is fully observable, and both its state space and its action space are described below.
Remember that a two-qubit state is described by $\alpha\left|00\right\rangle +\beta\left|01\right\rangle +\gamma\left|10\right\rangle +\delta\left|11\right\rangle$, where $\alpha, \beta, \gamma, \delta$ are complex amplitudes and $\left|00\right\rangle, \left|01\right\rangle, \left|10\right\rangle, \left|11\right\rangle$ are the basis states. The state space is then described by eight real numbers between -1 and 1, representing the real and imaginary parts of $\alpha, \beta, \gamma, \delta$.
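As an illustration (this is not the environment's own code, and the actual ordering of the eight entries may differ), the observation vector can be computed from a qiskit statevector as follows:
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

# Build a Bell state (|00> + |11>)/sqrt(2) as an example two-qubit state
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)

# The four complex amplitudes [alpha, beta, gamma, delta] of the state...
amplitudes = Statevector.from_instruction(qc).data
# ...flattened into eight real numbers in [-1, 1]
observation = np.concatenate([amplitudes.real, amplitudes.imag])
print(observation)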
An agent plays the game by interacting with a quantum circuit, adding and removing standard gates. In this version of the game there are seven actions available: add an X gate on the first or on the second qubit, add a Hadamard gate on the first or on the second qubit, add a CNOT gate with control on the first or on the second qubit, or remove the last inserted gate.
Again, details on the implementation of the state space and the action space are available at https://github.com/FMZennaro/gym-qcircuit/blob/master/qcircuit/envs/qcircuit_env.py.
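We can also inspect these spaces directly through the standard gym attributes; given the description above, we expect a discrete space of seven actions and a box of eight real values:
print(env.action_space)       # expect Discrete(7)
print(env.observation_space)  # expect a Box of 8 values in [-1, 1]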
First, we simply run a random agent. This allows us to test out the game and see its evolution.
A random agent selects a possible action from the action space at random and executes it. Since six of the seven actions add a gate and only one removes the last gate, the circuit length follows a random walk biased towards growing (probability $\frac{6}{7}$ of adding versus $\frac{1}{7}$ of removing), so the random agent is likely to run for a very long time, developing a long and complex circuit before stumbling upon the correct solution.
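A quick sanity check of this argument, independent of the environment: the circuit length drifts upwards by $\frac{6}{7}-\frac{1}{7}=\frac{5}{7}$ gates per step on average.
import numpy as np
rng = np.random.default_rng(0)
length = 0
for _ in range(1000):
    length += 1 if rng.random() < 6/7 else -1  # add a gate w.p. 6/7, remove one w.p. 1/7
    length = max(length, 0)  # a circuit cannot have negative length
print(length)  # close to 1000 * 5/7: the circuit keeps growing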
env.reset()
display(env.render())
done = False
while not done:
    obs, _, done, info = env.step(env.action_space.sample())
    display(info['circuit_img'])
env.close()
We now run a PPO2 agent, a more sophisticated agent from the stable-baselines library.
First we import the agent.
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
Then we train it.
env = DummyVecEnv([lambda: env])
modelPPO2 = PPO2(MlpPolicy, env, verbose=1)
modelPPO2.learn(total_timesteps=10000)
[PPO2 training output truncated: TensorFlow deprecation warnings followed by progress tables for 78 updates of 128 timesteps each (~10000 timesteps). Over training, policy_entropy falls from roughly 1.95 to 1.50, approxkl stays below 6e-4, clipfrac is 0.0 throughout, and value_loss fluctuates between roughly 160 and 1900.]
<stable_baselines.ppo2.ppo2.PPO2 at 0x7f8ecee40210>
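Optionally, the trained model can be saved to disk and reloaded later with PPO2.load (this uses the standard stable-baselines save/load API; the filename here is arbitrary).
modelPPO2.save('ppo2_qcircuit')  # reload later with modelPPO2 = PPO2.load('ppo2_qcircuit')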
Last, we test it by letting it play the game; we run ten steps of the game (notice, though, that the agent could reach the solution before the tenth step, which would cause the game to restart).
obs = env.reset()
display(env.render())
for _ in range(10):
    action, _states = modelPPO2.predict(obs)
    obs, _, done, info = env.step(action)
    display(info[0]['circuit_img'])
env.close()
As expected, the agent easily learned the optimal circuit.
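Note that predict() samples from the stochastic policy by default; stable-baselines also allows picking the greedy action, which is often preferable when evaluating a trained agent:
action, _states = modelPPO2.predict(obs, deterministic=True)  # greedy action selection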
For comparison, we now run an A2C agent, another agent from the stable-baselines library.
First we import the agent.
from stable_baselines import A2C
We train it.
modelA2C = A2C(MlpPolicy, env, verbose=1)
modelA2C.learn(total_timesteps=10000)
[A2C training output truncated: TensorFlow deprecation warnings followed by a series of progress tables over 2000 updates (~10000 timesteps). Over training, policy_entropy falls from roughly 1.95 to 1.32, while value_loss alternates between single-digit values and spikes of order 1e3 to 1e4.]
<stable_baselines.a2c.a2c.A2C at 0x7f8d5816b110>
And we test it by letting it play ten steps of the game (as before, the agent may reach a solution before the tenth step).
obs = env.reset()
display(env.render())
for _ in range(10):
    action, _states = modelA2C.predict(obs)
    obs, _, done, info = env.step(action)
    display(info[0]['circuit_img'])
env.close()
Finally, we compare the agents quantitatively by contrasting their average rewards computed over 1000 episodes of the game. We rely on the evaluation module, which provides simple, standard routines to evaluate the agents.
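Conceptually, such a routine just accumulates rewards episode by episode; a minimal sketch of what it might look like (an assumption on our part, written for a plain non-vectorized gym environment; the actual implementation is in evaluation.py) is the following.
def evaluate(model, env, num_steps=1000):
    # Average per-episode reward accumulated over num_steps environment steps
    episode_rewards, total = [], 0.0
    obs = env.reset()
    for _ in range(num_steps):
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            episode_rewards.append(total)
            total = 0.0
            obs = env.reset()
    return np.mean(episode_rewards)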
import evaluation
n_episodes = 1000
PPO2_perf, _ = evaluation.evaluate_model(modelPPO2, env, num_steps=n_episodes)
A2C_perf, _ = evaluation.evaluate_model(modelA2C, env, num_steps=n_episodes)
env = gym.make('qcircuit-v1')
rand_perf, _ = evaluation.evaluate_random(env, num_steps=n_episodes)
print('Mean performance of random agent (out of {0} episodes): {1}'.format(n_episodes,rand_perf))
print('Mean performance of PPO2 agent (out of {0} episodes): {1}'.format(n_episodes,PPO2_perf))
print('Mean performance of A2C agent (out of {0} episodes): {1}'.format(n_episodes,A2C_perf))
Mean performance of random agent (out of 1000 episodes): 19.333
Mean performance of PPO2 agent (out of 1000 episodes): 72.541
Mean performance of A2C agent (out of 1000 episodes): 90.753
The reinforcement learning agents (PPO2 and A2C) learned to play the game to different degrees. By contrast, the random agent performed very poorly, showing that even with this limited state and action space a random policy rarely finds the right solution.
[1] IBM qiskit, https://qiskit.org/
[2] OpenAI gym, http://gym.openai.com/docs/
[3] stable-baselines, https://github.com/hill-a/stable-baselines
[4] gym-qcircuit, https://github.com/FMZennaro/gym-qcircuit