# Grid2Op integration with existing frameworks¶

Objectives: This notebook briefly explains how to use grid2op with commonly used RL frameworks. It also explains the main methods / classes of the grid2op.gym_compat module that ease grid2op integration with these frameworks.

The structure is always very similar:

1. Create a grid2op environment
2. Convert it to a gym environment
3. (optional) Customize the action space and observation space
4. Use the framework to train an agent
5. Embed the trained agent into a grid2op Agent to take valid grid2op actions.

In this notebook, we will demonstrate its usage with 3 different frameworks. The code provided here is given as an example and we make no claim about its performance or fitness for use. More detailed examples will be provided in the l2rpn-baselines repository in due time (work in progress at the time of writing this notebook). The 3 frameworks we will demonstrate are rllib (part of ray), stable baselines and tf agents; each one has its own section below.

Other RL frameworks are not covered here. If you already use them, let us know!

Note also that it is still possible to use past code from the l2rpn-baselines repository: https://github.com/rte-france/l2rpn-baselines . This repository contains code snippets that can be reused to make really nice agents for the l2rpn competitions. You can try it out :-)

If you use google colab, execute the cell below after removing the # characters!

The cell will look like:

import sys
!$sys.executable -m pip install grid2op[optional]  # for use with google colab (grid2Op is not installed by default)
!$sys.executable -m pip install tensorflow pytorch stable-baselines3 'ray[rllib]' tf_agents


It might take a while.

In [ ]:
import sys
# !$sys.executable -m pip install grid2op[optional]  # for use with google colab (grid2Op is not installed by default)
# !$sys.executable -m pip install stable-baselines3 'ray[rllib]' tf_agents

In [ ]:
# because this notebook is part of some tests, we train the agent for only a small number of steps
nb_step_train = 0


## Organisation of this notebook¶

This notebook first details the features closest to grid2op, then moves on to "higher level" features that are closer to the "standard" gym representation (e.g. Box and Discrete spaces).

Note that the closer you stay to grid2op, the more grid2op features you can use. For example, in a gym environment, it is not possible to use the "simulate" function at all (remember, this function allows you to use a simulator whose behaviour is close to that of the environment). Also, grid2op observations and actions come with many features (the capacity to add them, to retrieve the graph of the grid, etc.) that cannot be used directly in gym.

That being said, this notebook is organized as follows:

• Convert it to a gym environment: basic use of the grid2op gym_compat module, which converts a grid2op environment to a gym environment.
• Action space: basic usage of the action space, by removing redundant features (gym_env.action_space.ignore_attr) or transforming features from a continuous space to a discrete space (ContinuousToDiscreteConverter).
• Observation space: basic usage of the observation space, by removing redundant features (keep_only_attr) or scaling the data to a given range (ScalerAttrConverter).
• Making the grid2op agent: explains how to make a grid2op agent once trained. Note that a more "agent focused" view is provided in the notebook 04_TrainingAnAgent!
• 1) RLLIB: more advanced usage for customizing the observation space (gym_env.observation_space.reencode_space and gym_env.observation_space.add_key) or modifying the type of a gym attribute (MultiToTupleConverter), as well as an example of how to use the RLLIB framework.
• 2) Stable baselines: even more advanced usage for customizing the observation space by concatenating it into a single "Box" (instead of a dictionary) thanks to BoxGymObsSpace, and for using BoxGymActSpace if you focus on continuous actions or MultiDiscreteActSpace for discrete actions (NB in both cases there will be a loss of information compared to regular grid2op actions! For example, it will be harder to have a representation of the graph of the grid there).
• 3) Tf Agents: explains how to convert the action space into a "Discrete" gym space thanks to DiscreteActSpace.

In each section, we also explain concisely how to train the agent. Note that we did not spend any time on customizing the default agents and training scheme. It is therefore unlikely that these agents perform well.

### Create a grid2op environment¶

This is a rather standard step, with lots of inspiration drawn from the openAI gym framework. Nothing here is specific to this notebook.

In [ ]:
import grid2op
env_name = "l2rpn_case14_sandbox"
env_glop = grid2op.make(env_name, test=True)  # NOTE: do not set the flag "test=True" for a real usage !
# This flag is here for testing purpose !!!
obs_glop = env_glop.reset()
obs_glop


### Convert it to a gym environment¶

To that end, we recommend using the "gym_compat" module. More information is given in the official grid2op documentation.

In [ ]:
import gym
import numpy as np
from grid2op.gym_compat import GymEnv
env_gym = GymEnv(env_glop)
print(f"The \"env_gym\" is a gym environment: {isinstance(env_gym, gym.Env)}")
obs_gym = env_gym.reset()
# obs_gym


### Customize the action space and observation space¶

This step is optional, but highly recommended.

By default, grid2op actions and observations are huge. Even for this very simplistic example, the sizes are substantial:

In [ ]:
dim_act_space = np.sum([np.sum(env_gym.action_space[el].shape) for el in env_gym.action_space.spaces])
print(f"The size of the action space is : "
f"{dim_act_space}")
dim_obs_space = np.sum([np.sum(env_gym.observation_space[el].shape).astype(int)
for el in env_gym.observation_space.spaces])
print(f"The size of the observation space is : "
f"{dim_obs_space}")
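As a rough illustration of what the cell above counts, the sketch below totals the number of scalar entries of a dictionary of shapes. The `shapes` dictionary and the `total_size` helper are made up for this example (they are not read from a real environment and are not part of grid2op). `math.prod` is used so that a multi-dimensional entry contributes all of its elements.

```python
import math

# hypothetical shapes, standing in for the shapes of the sub-spaces of a gym Dict space
shapes = {
    "change_bus": (57,),
    "change_line_status": (20,),
    "redispatch": (6,),
    "connectivity_matrix": (57, 57),
}


def total_size(shapes_by_key):
    """Total number of scalar entries over all the sub-spaces."""
    return sum(math.prod(shape) for shape in shapes_by_key.values())


print(f"Total size: {total_size(shapes)}")
```

Only one-dimensional entries appear in the default spaces, in which case summing or multiplying the shape gives the same number.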


#### Action space¶

This is partly because, in grid2op, you can represent the same concept (e.g. reconnecting a powerline) in different manners. In this case, either you "toggle a switch" (if the powerline was connected, it is disconnected, otherwise it is reconnected) or you say "I want this line connected whatever its original state". This behaviour is detailed in the official grid2op documentation.

To reduce the action space (in general by a factor of 2), you can represent these actions using only the "change" method, for example. You can do that with:

In [ ]:
# example: ignore the "set_line_status" and "set_bus" types of actions, which are covered by
# "change_line_status" and "change_bus"

env_gym.action_space = env_gym.action_space.ignore_attr("set_bus").ignore_attr("set_line_status")

new_dim_act_space = np.sum([np.sum(env_gym.action_space[el].shape) for el in env_gym.action_space.spaces])
print(f"The new size of the action space is : {new_dim_act_space}")


Grid2op environments allow for both continuous and discrete actions. For the sake of the example, let's "convert" the continuous actions into discrete ones (this is done by "binning" the values, as explained in more detail in the documentation).

In [ ]:
# example: convert the continuous action type "redispatch" to a discrete action type
from grid2op.gym_compat import ContinuousToDiscreteConverter
env_gym.action_space = env_gym.action_space.reencode_space("redispatch",
ContinuousToDiscreteConverter(nb_bins=11)
)
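To give an idea of what "binning" means, here is a minimal sketch in plain Python. It is not the code of ContinuousToDiscreteConverter: the function names, the evenly spaced bins and the clipping of out-of-range values are all assumptions made for the illustration.

```python
def make_bins(low, high, nb_bins):
    """Return nb_bins evenly spaced values between low and high (both included)."""
    step = (high - low) / (nb_bins - 1)
    return [low + i * step for i in range(nb_bins)]


def to_discrete(value, low, high, nb_bins):
    """Index of the bin closest to `value` (values outside [low, high] are clipped)."""
    clipped = min(max(value, low), high)
    step = (high - low) / (nb_bins - 1)
    return round((clipped - low) / step)


def to_continuous(index, low, high, nb_bins):
    """Continuous value represented by the bin number `index`."""
    return make_bins(low, high, nb_bins)[index]


# hypothetical generator that can be redispatched between -5 MW and +5 MW
bin_id = to_discrete(0.3, low=-5.0, high=5.0, nb_bins=11)
print(bin_id, to_continuous(bin_id, low=-5.0, high=5.0, nb_bins=11))
```

With this kind of encoding the agent only chooses an integer per generator, at the price of a coarser control: the more bins, the finer the control and the larger the action space.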

In [ ]:
# And now our action space looks like:
env_gym.action_space


#### Observation space¶

For the observation space, we will remove lots of useless attributes (remember, this is for the sake of the example) and rescale some others so that their values lie roughly between 0. and 1., which stabilizes the learning process.

In [ ]:
# first let's see which attributes are in the observation space:
env_gym.observation_space


Let's keep only the information about the flow on the powerlines (rho), the generation (gen_p), the load (load_p), the representation of the topology (topo_vect) and the redispatching currently applied (actual_dispatch), for the sake of the example, once again.

In [ ]:
env_gym.observation_space = env_gym.observation_space.keep_only_attr(["rho", "gen_p", "load_p", "topo_vect",
"actual_dispatch"])
new_dim_obs_space = np.sum([np.sum(env_gym.observation_space[el].shape).astype(int)
for el in env_gym.observation_space.spaces])
print(f"The new size of the observation space is : "
f"{new_dim_obs_space} (it was {dim_obs_space} before!)")


One other detail here: the generation and loads are not scaled (they are given in MW). We recommend scaling them so that their values lie roughly between 0 and 1, for stability during learning.

This can be done pretty easily with the code below:

In [ ]:
from grid2op.gym_compat import ScalerAttrConverter
from gym.spaces import Box
ob_space = env_gym.observation_space
ob_space = ob_space.reencode_space("actual_dispatch",
                                   ScalerAttrConverter(substract=0.,
                                                       divide=env_glop.gen_pmax
                                                       )
                                   )
ob_space = ob_space.reencode_space("gen_p",
                                   ScalerAttrConverter(substract=0.,
                                                       divide=env_glop.gen_pmax
                                                       )
                                   )
# NB: the values used to rescale "load_p" (taken from the first observation) are an assumption
ob_space = ob_space.reencode_space("load_p",
                                   ScalerAttrConverter(substract=obs_gym["load_p"],
                                                       divide=0.5 * obs_gym["load_p"]
                                                       )
                                   )

# for even more customization, you can use any functions you want !
shape_ = (env_glop.dim_topo, env_glop.dim_topo)
ob_space = ob_space.add_key("connectivity_matrix",
                            lambda obs: obs.connectivity_matrix(),  # can be any function returning a gym space
                            Box(shape=shape_,
                                low=np.zeros(shape_),
                                high=np.ones(shape_),
                                )  # this "Box" should represent the return type of the above function
                            )
env_gym.observation_space = ob_space
env_gym.observation_space
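The rescaling requested in the cell above boils down to an element-wise "(value - substract) / divide". The sketch below reproduces this arithmetic in plain Python. The `scale` helper and the numbers are invented for the illustration and do not come from the environment or from grid2op.

```python
def scale(values, subtract, divide):
    """Element-wise (value - subtract) / divide, with one offset and one scale per entry."""
    return [(v - s) / d for v, s, d in zip(values, subtract, divide)]


# made-up productions and maximum capacities, in MW
gen_p = [40.0, 75.0, 0.0]
gen_pmax = [80.0, 100.0, 50.0]

scaled_gen_p = scale(gen_p, subtract=[0.0, 0.0, 0.0], divide=gen_pmax)
print(scaled_gen_p)  # [0.5, 0.75, 0.0]
```

Dividing the production by the maximum capacity of each generator keeps every entry between 0 and 1, whatever the size of the generator.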


## Making the grid2op agent¶

In this subsection we briefly explain how to wrap the trained agent (see below for training methods depending on the framework you want to use). The goal is to make this "tutorial" complete, in the sense that you will be able to use the trained agent in the regular grid2op framework, for example using the Runner.

This subsection is compatible with all the code explained in this notebook, even though we demonstrate it with the env created above.

The basic idea is really simple: you create a grid2op agent, initialize it with the gym_env (obtained from the gym_compat module) and use the "gym_env.action_space.from_gym" and "gym_env.observation_space.to_gym" functions to convert the action and the observation.

In [ ]:
from grid2op.Agent import BaseAgent

class AgentFromGym(BaseAgent):
    def __init__(self, gym_env, trained_agent):
        self.gym_env = gym_env
        BaseAgent.__init__(self, gym_env.init_env.action_space)
        self.trained_agent = trained_agent

    def act(self, obs, reward, done):
        # convert the grid2op observation to a gym observation
        gym_obs = self.gym_env.observation_space.to_gym(obs)
        # ask the trained agent for a gym action
        gym_act = self.trained_agent.act(gym_obs, reward, done)
        # convert this gym action back to a valid grid2op action
        grid2op_act = self.gym_env.action_space.from_gym(gym_act)
        return grid2op_act
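The flow "grid2op observation, gym observation, gym action, grid2op action" can be tried without any environment. In the sketch below every class is a stand-in written for the illustration (FakeObsSpace, FakeActSpace, FakeGymEnv, ThresholdPolicy and WrappedAgent are hypothetical names, none of them exists in grid2op), and observations and actions are plain dictionaries.

```python
class FakeObsSpace:
    """Stand-in for gym_env.observation_space: turns an observation into a list."""
    def to_gym(self, obs):
        return [obs["rho"]]


class FakeActSpace:
    """Stand-in for gym_env.action_space: turns a gym action into a readable action."""
    def from_gym(self, gym_act):
        return {"change_line_status": gym_act}


class FakeGymEnv:
    def __init__(self):
        self.observation_space = FakeObsSpace()
        self.action_space = FakeActSpace()


class ThresholdPolicy:
    """Hypothetical trained agent: outputs 1 when the line is loaded above 90%."""
    def act(self, gym_obs, reward, done):
        return 1 if gym_obs[0] > 0.9 else 0


class WrappedAgent:
    def __init__(self, gym_env, trained_agent):
        self.gym_env = gym_env
        self.trained_agent = trained_agent

    def act(self, obs, reward, done):
        gym_obs = self.gym_env.observation_space.to_gym(obs)
        gym_act = self.trained_agent.act(gym_obs, reward, done)
        return self.gym_env.action_space.from_gym(gym_act)


agent = WrappedAgent(FakeGymEnv(), ThresholdPolicy())
print(agent.act({"rho": 0.95}, reward=0.0, done=False))
```

The wrapper never needs to know how the policy was trained: it only relies on the two conversion functions, which is why the same pattern works for all the frameworks of this notebook.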


And this is it. You are done ;-)

## 1) RLLIB¶

This part is not a tutorial on how to use rllib. Please refer to their documentation for more detailed information.

As explained in the header of this notebook, we will follow the recommended usage:

1. Create a grid2op environment (see the section "Create a grid2op environment" above)
2. Convert it to a gym environment (see the section "Convert it to a gym environment" above)
3. (optional) Customize the action space and observation space (see the section "Customize the action space and observation space" above)
4. Use the framework to train an agent (only this part is framework specific)

The issue with rllib is that it supports neither MultiBinary nor MultiDiscrete action spaces (see https://github.com/ray-project/ray/issues/1519), so we need some way to encode these types of actions. This can be done automatically with the MultiToTupleConverter provided in grid2op (as always, more information in the documentation).

We will then use this to customize our environment previously defined:

In [ ]:
import copy
env_rllib = copy.deepcopy(env_gym)
from grid2op.gym_compat import MultiToTupleConverter
env_rllib.action_space = env_rllib.action_space.reencode_space("change_bus", MultiToTupleConverter())
env_rllib.action_space = env_rllib.action_space.reencode_space("change_line_status", MultiToTupleConverter())
env_rllib.action_space = env_rllib.action_space.reencode_space("redispatch", MultiToTupleConverter())
env_rllib.action_space
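The idea behind this conversion is that a vector of n independent choices can equivalently be described as a tuple of n single choices. The sketch below illustrates this with plain Python data. It does not use gym or grid2op, and the helper names are invented for the example.

```python
def multidiscrete_to_tuple_spec(nvec):
    """MultiDiscrete([n1, n2, ...]) described as a tuple of Discrete(n1), Discrete(n2), ..."""
    return tuple(("Discrete", int(n)) for n in nvec)


def multibinary_to_tuple_spec(n):
    """MultiBinary(n) described as a tuple of n Discrete(2) spaces."""
    return multidiscrete_to_tuple_spec([2] * n)


def encode(vector):
    """Vector of integers -> tuple of single choices."""
    return tuple(int(v) for v in vector)


def decode(choices):
    """Tuple of single choices -> vector of integers."""
    return [int(c) for c in choices]


spec = multibinary_to_tuple_spec(3)  # e.g. "change_line_status" for 3 hypothetical lines
sample = encode([0, 1, 1])
print(spec, sample, decode(sample))
```

No information is lost in the process: only the container changes, which is what makes the resulting space acceptable for a framework that handles Tuple but not MultiBinary or MultiDiscrete.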


Another specificity of RLLIB is that it handles the creation of environments "on its own". This implies that you need to provide a custom class representing an environment, rather than a python object.

And finally, you ask it to use this class, and learn a specific agent. This is really well explained in their documentation: https://docs.ray.io/en/master/rllib-env.html#configuring-environments.

In [ ]:
# gym specific, we simply do a copy paste of what we did in the previous cells, wrapping it in the
# MyEnv class, and train a Proximal Policy Optimisation based agent
import gym
import ray
import numpy as np

class MyEnv(gym.Env):
    def __init__(self, env_config):
        import grid2op
        from grid2op.gym_compat import GymEnv
        from grid2op.gym_compat import ScalerAttrConverter, ContinuousToDiscreteConverter, MultiToTupleConverter

        # 1. create the grid2op environment
        if "env_name" not in env_config:
            raise RuntimeError("The configuration for RLLIB should provide the env name")
        nm_env = env_config["env_name"]
        del env_config["env_name"]
        self.env_glop = grid2op.make(nm_env, **env_config)

        # 2. create the gym environment
        self.env_gym = GymEnv(self.env_glop)
        obs_gym = self.env_gym.reset()

        ## customize action space
        self.env_gym.action_space = self.env_gym.action_space.ignore_attr("set_bus").ignore_attr("set_line_status")
        self.env_gym.action_space = self.env_gym.action_space.reencode_space("redispatch",
                                                                             ContinuousToDiscreteConverter(nb_bins=11)
                                                                             )
        self.env_gym.action_space = self.env_gym.action_space.reencode_space("change_bus", MultiToTupleConverter())
        self.env_gym.action_space = self.env_gym.action_space.reencode_space("change_line_status",
                                                                             MultiToTupleConverter())
        self.env_gym.action_space = self.env_gym.action_space.reencode_space("redispatch", MultiToTupleConverter())
        ## customize observation space
        ob_space = self.env_gym.observation_space
        ob_space = ob_space.keep_only_attr(["rho", "gen_p", "load_p", "topo_vect", "actual_dispatch"])
        ob_space = ob_space.reencode_space("actual_dispatch",
                                           ScalerAttrConverter(substract=0.,
                                                               divide=self.env_glop.gen_pmax
                                                               )
                                           )
        ob_space = ob_space.reencode_space("gen_p",
                                           ScalerAttrConverter(substract=0.,
                                                               divide=self.env_glop.gen_pmax
                                                               )
                                           )
        # NB: the values used to rescale "load_p" (taken from the first observation) are an assumption
        ob_space = ob_space.reencode_space("load_p",
                                           ScalerAttrConverter(substract=obs_gym["load_p"],
                                                               divide=0.5 * obs_gym["load_p"]
                                                               )
                                           )
        self.env_gym.observation_space = ob_space

        # 4. specific to rllib
        self.action_space = self.env_gym.action_space
        self.observation_space = self.env_gym.observation_space

    def reset(self):
        obs = self.env_gym.reset()
        return obs

    def step(self, action):
        obs, reward, done, info = self.env_gym.step(action)
        return obs, reward, done, info

In [ ]:
test = MyEnv({"env_name": "l2rpn_case14_sandbox"})
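One detail worth noting when an environment is built from a configuration dictionary: deleting a key from that dictionary modifies the object given by the caller, so building a second environment from the same dictionary would fail. A possible way to avoid this, sketched below with a hypothetical helper (split_env_config is not part of grid2op nor of rllib), is to separate the environment name from the other options without touching the original dictionary.

```python
def split_env_config(env_config):
    """Return (env_name, other_options) and leave `env_config` untouched."""
    if "env_name" not in env_config:
        raise RuntimeError("The configuration should provide the env name")
    other_options = {key: val for key, val in env_config.items() if key != "env_name"}
    return env_config["env_name"], other_options


config = {"env_name": "l2rpn_case14_sandbox", "test": True}
name, options = split_env_config(config)
print(name, options)

# the same dictionary can be used again, for example to build a second environment
name_again, _ = split_env_config(config)
```

Inside a class such as the one above, the two returned values could then be used as `grid2op.make(name, **options)`.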


And now you can train it :

In [ ]:
if nb_step_train:  # remember: don't forget to change this number to perform an actual training !
    from ray.rllib.agents import ppo  # import the type of agents
    # first initialize ray
    ray.init()
    try:
        # then define a "trainer"
        trainer = ppo.PPOTrainer(env=MyEnv, config={
            "env_config": {"env_name":"l2rpn_case14_sandbox"},  # config to pass to env class
        })
        # and then train it for a given number of iteration
        for step in range(nb_step_train):
            trainer.train()
    finally:
        # shutdown ray
        ray.shutdown()


NB We want to emphasize here that:

• This encoding is far from being suitable here. It is shown as an example, mainly to demonstrate the use of some parts of the gym_compat module.
• The actions in particular are not really suited here. Actions in grid2op are relatively complex and encoding them this way does not seem like a great idea. For example, with this encoding, the agent will have to learn that it cannot act on more than 2 lines or two substations at the same time...
• The "PPO" agent shown here, with default parameters, is unlikely to lead to a good agent. You might want to read the literature on past L2RPN agents or draw some inspiration from the L2RPN baselines package for more information.

## 2) Stable baselines¶

This part is not a tutorial on how to use stable baselines. Please refer to their documentation for more detailed information.

As explained in the header of this notebook, we will follow the recommended usage:

1. Create a grid2op environment (see the section "Create a grid2op environment" above)
2. Convert it to a gym environment (see the section "Convert it to a gym environment" above)
3. (optional) Customize the action space and observation space (see the section "Customize the action space and observation space" above)
4. Use the framework to train an agent (only this part is framework specific)

The issue with stable baselines 3 is that it expects standard action / observation types, as explained there: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html#rl-algorithms

Non-array spaces such as Dict or Tuple are not currently supported by any algorithm.

Unfortunately, it is not possible to convert an action space of dictionary type to a vector without any "loss of information".

It is possible to use the grid2op framework in such cases, and in this section, we will explain how.

First, as always, we convert the grid2op environment in a gym environment.

In [ ]:
env_sb = GymEnv(env_glop)  # sb for "stable baselines"
glop_obs = env_glop.reset()


Then, we need to convert everything into a "Box", as it is the only thing that stable baselines seems to digest at the time of writing (March 2021).

### Observation Space¶

We explain here how to convert an observation into a single Box. This step is rather easy: you just need to specify which attributes of the observation you want to keep and whether you want to scale them (with the keywords subtract and divide).

In [ ]:
from grid2op.gym_compat import BoxGymObsSpace
env_sb.observation_space = BoxGymObsSpace(env_sb.init_env.observation_space,
                                          attr_to_keep=["gen_p", "topo_vect",
                                                        "rho", "actual_dispatch", "connectivity_matrix"],
                                          divide={"gen_p": env_glop.gen_pmax,
                                                  "actual_dispatch": env_glop.gen_pmax},
                                          functs={"connectivity_matrix": (
                                                      lambda grid2obs: grid2obs.connectivity_matrix().flatten(),
                                                      0., 1., None, None,
                                                      )
                                                  }
                                          )
obs_gym = env_sb.reset()

In [ ]:
obs_gym in env_sb.observation_space


NB: the above code is equivalent to something like:

import numpy as np
from gym.spaces import Box
from grid2op.dtypes import dt_float

class BoxGymObsSpaceExample(Box):
    def __init__(self, observation_space):
        ob_sp = observation_space
        shape = (ob_sp.n_gen +          # dimension of gen_p
                 ob_sp.dim_topo +       # topo_vect
                 ob_sp.n_line +         # rho
                 ob_sp.n_gen +          # actual_dispatch
                 ob_sp.dim_topo ** 2)   # connectivity_matrix

        # lowest value the attribute can take (see doc for more information)
        low = np.concatenate((np.full(shape=(ob_sp.n_gen,), fill_value=0., dtype=dt_float),  # gen_p
                              np.full(shape=(ob_sp.dim_topo,), fill_value=-1., dtype=dt_float),  # topo_vect
                              np.full(shape=(ob_sp.n_line,), fill_value=0., dtype=dt_float),  # rho
                              np.full(shape=(ob_sp.n_gen,), fill_value=-ob_sp.gen_pmax, dtype=dt_float),  # actual_dispatch
                              np.full(shape=(ob_sp.dim_topo**2,), fill_value=0., dtype=dt_float),  #  connectivity_matrix
                              ))

        # highest value the attribute can take
        high = np.concatenate((np.full(shape=(ob_sp.n_gen,), fill_value=np.inf, dtype=dt_float),  # gen_p
                               np.full(shape=(ob_sp.dim_topo,), fill_value=2., dtype=dt_float),  # topo_vect
                               np.full(shape=(ob_sp.n_line,), fill_value=np.inf, dtype=dt_float),  # rho
                               np.full(shape=(ob_sp.n_gen,), fill_value=ob_sp.gen_pmax, dtype=dt_float),  # actual_dispatch
                               np.full(shape=(ob_sp.dim_topo**2,), fill_value=1., dtype=dt_float),  #  connectivity_matrix
                               ))
        Box.__init__(self, low=low, high=high, shape=(shape,), dtype=dt_float)

    def to_gym(self, observation):
        # "observation" is a grid2op observation, the result is a flat numpy array
        res = np.concatenate((observation.gen_p / observation.gen_pmax,
                              observation.topo_vect.astype(float),
                              observation.rho,
                              observation.actual_dispatch / observation.gen_pmax,
                              observation.connectivity_matrix().flatten()
                              )).astype(dt_float)
        return res


So if you want more customization, at the cost of less generic code (the BoxGymObsSpace works for all the attributes of the observation), you can adapt the snippet above or read the documentation here (TODO).

Only the "to_gym" function, with this exact signature, is important in this case. It should take an observation in the grid2op format and return this same observation in a form compatible with the gym Box (so a numpy array with the right shape and in the right range).
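To make the role of "to_gym" more concrete, here is a small sketch with plain Python lists: the kept attributes are concatenated, in a fixed order, into one flat vector, and some of them are divided by a scale. The `to_box` helper, the alphabetical ordering and the numbers are assumptions made for this illustration. They are not taken from BoxGymObsSpace.

```python
def to_box(obs, attr_to_keep, divide=None):
    """Concatenate the kept attributes (sorted by name) into one flat list of floats."""
    divide = divide or {}
    flat = []
    for name in sorted(attr_to_keep):
        values = obs[name]
        scales = divide.get(name, [1.0] * len(values))
        flat.extend(float(v) / s for v, s in zip(values, scales))
    return flat


# made-up observation with 2 generators, 2 powerlines and 3 elements in the topology vector
fake_obs = {"rho": [0.5, 0.9], "gen_p": [40.0, 75.0], "topo_vect": [1, 1, 2]}
vector = to_box(fake_obs,
                attr_to_keep=["gen_p", "topo_vect", "rho"],
                divide={"gen_p": [80.0, 100.0]})
print(vector)  # [0.5, 0.75, 0.5, 0.9, 1.0, 1.0, 2.0]
```

What matters is that the order of the attributes never changes from one observation to the next, so that each position of the vector always has the same meaning for the neural network.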

### Action space¶

We now need to convert the grid2op actions into something that is neither a Tuple nor a Dict. The main restriction in these frameworks is that they do not allow for easy integration of environments where both discrete and continuous actions are possible.

#### Using a BoxGymActSpace¶

We can use the same kind of method explained above with the use of the class BoxGymActSpace. In this case, you need to provide a way to convert a numpy array (an element of a gym Box) into a grid2op action.

NB This method is particularly suited if you want to focus on CONTINUOUS part of the action space, for example redispatching, curtailment or action on storage unit.

Though we made it possible to also use discrete action, we do not recommend to use it. Prefer using the MultiDiscreteActSpace for such purpose.

In [ ]:
from grid2op.gym_compat import BoxGymActSpace
scale_gen =  env_sb.init_env.gen_max_ramp_up + env_sb.init_env.gen_max_ramp_down
scale_gen[~env_sb.init_env.gen_redispatchable] = 1.0
env_sb.action_space = BoxGymActSpace(env_sb.init_env.action_space,
attr_to_keep=["redispatch"],
multiply={"redispatch": scale_gen},
)
obs_gym = env_sb.reset()


NB: the above code is equivalent to something like:

import numpy as np
from gym.spaces import Box
from grid2op.dtypes import dt_float

class BoxGymActSpaceExample(Box):
    def __init__(self, action_space):
        shape = (action_space.n_gen,)  # redispatch

        # lowest value the attribute can take (see doc for more information)
        low = np.full(shape=shape, fill_value=-1., dtype=dt_float)

        # highest value the attribute can take
        high = np.full(shape=shape, fill_value=1., dtype=dt_float)

        Box.__init__(self, low=low, high=high, shape=shape, dtype=dt_float)

        self.action_space = action_space

    def from_gym(self, gym_action):
        # "gym_action" is a numpy array sampled from the Box, the result is a grid2op action
        res = self.action_space()
        res.redispatch = gym_action * scale_gen  # "scale_gen" is the vector defined in the cell above
        return res


So if you want more customization, at the cost of less generic code (the BoxGymActSpace works for all the attributes of the action), you can adapt the snippet above or read the documentation here (TODO). The only important method you need to code is "from_gym", which should take an action as sampled by the gym Box and return a grid2op action.
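The arithmetic behind the "multiply" keyword can be sketched as follows, again with plain Python lists. The ramps and the `from_gym` helper below are invented for the example, and clipping the gym action to [-1, 1] is an assumption of this sketch, not a statement about what BoxGymActSpace does.

```python
# made-up ramps (in MW per step) for 3 generators, the second one cannot be redispatched
max_ramp_up = [5.0, 0.0, 10.0]
max_ramp_down = [5.0, 0.0, 10.0]
redispatchable = [True, False, True]

# same construction as "scale_gen" above: sum of the ramps, and 1.0 when not redispatchable
scale_gen = [(up + down) if ok else 1.0
             for up, down, ok in zip(max_ramp_up, max_ramp_down, redispatchable)]


def from_gym(gym_action, scale):
    """Convert a normalized action (entries in [-1, 1]) to a redispatch in MW."""
    clipped = [min(max(a, -1.0), 1.0) for a in gym_action]
    return [a * s for a, s in zip(clipped, scale)]


print(scale_gen)                                # [10.0, 1.0, 20.0]
print(from_gym([0.5, 0.0, -1.5], scale_gen))    # [5.0, 0.0, -20.0]
```

With this convention, the neural network always outputs numbers of the same order of magnitude, and the conversion to MW is left to the action space.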

#### Using a MultiDiscreteActSpace¶

We can use the same kind of method as explained above with the class MultiDiscreteActSpace, which is more suited to discrete types of actions.

In this case, you need to provide a way to convert a numpy array of integers (an element of a gym MultiDiscrete) into a grid2op action.

NB This method is particularly suited if you want to focus on the DISCRETE part of the action space, for example set_bus or change_line_status.

In [ ]:
from grid2op.gym_compat import MultiDiscreteActSpace
reencoded_act_space = MultiDiscreteActSpace(env_sb.init_env.action_space,
attr_to_keep=["set_line_status", "set_bus", "redispatch"])
env_sb.action_space = reencoded_act_space
obs_gym = env_sb.reset()


### Wrapping all up and starting the training¶

First, let's make sure our environment is compatible with stable baselines, thanks to their helper function.

In [ ]:
from stable_baselines3.common.env_checker import check_env
check_env(env_sb)


So as we see, the environment seems to be compatible with stable baselines. Now we can start the training.

In [ ]:
from stable_baselines3 import PPO
model = PPO("MlpPolicy", env_sb, verbose=1)
if nb_step_train:
    model.learn(total_timesteps=nb_step_train)
    # model.save("ppo_stable_baselines3")


Again, the goal of this section was not to demonstrate how to train a state of the art algorithm, but rather to demonstrate how to use grid2op with the stable baselines repository.

Most importantly, the neural networks there are not customized for the environment, default parameters are used. This is unlikely to work at all !

For more information and to use tips and tricks to get started with RL agents, the devs of "stable baselines" have done a really nice job. You can have some tips for training RL agents here https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html and consult any of the resources listed there https://stable-baselines3.readthedocs.io/en/master/guide/rl.html

## 3) Tf Agents¶

Lastly, the RL framework we will use is tf agents.

Compared to the previous ones, this framework is more verbose. In this notebook we will mimic what is done in https://github.com/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb

To that end, we will introduce the last "gym transformer" available in grid2op at the time of writing. It transforms the action space into a Discrete one. With this modeling, the agent can take an action on a substation, or act on a powerline, or perform redispatching. But, as opposed to what was done previously, it cannot act on, say, a substation and a powerline at the same time.

This limitation does not come from tf agents, but it is necessary to run the DQN tutorial provided with tf agents.

First we will build the observation space as for the stable baselines repository. See the section "2) Stable baselines" for more information.

### Observation space¶

In [ ]:
# create the gym environment
env_tfa = GymEnv(env_glop)  # tfa for "tf agents"
glop_obs = env_glop.reset()

In [ ]:
# customize the observation space
env_tfa.observation_space = BoxGymObsSpace(env_tfa.init_env.observation_space,
"rho", "actual_dispatch", "connectivity_matrix"],
divide={"gen_p": env_glop.gen_pmax,
"actual_dispatch": env_glop.gen_pmax},
functs={"connectivity_matrix": (
lambda grid2obs: grid2obs.connectivity_matrix().flatten(),
0., 1., None, None,
)
}
)
obs_gym = env_tfa.reset()


Again, the observation space might need to be customized. We do not assume that everything in it is relevant, nor that it contains all the information an agent would need.

This example is only here to demonstrate how to use grid2op with openai gym framework.

### Action space¶

As opposed to the previous action spaces, to follow the tf agents tutorial we need to customize the action space so that it outputs a single number (the id of the action you want to take).

This can be done with the DiscreteActSpace gym converter, which behaves approximately the same way as MultiDiscreteActSpace does.

In [ ]:
from grid2op.gym_compat import DiscreteActSpace
reencoded_act_space = DiscreteActSpace(env_tfa.init_env.action_space,
                                       attr_to_keep=["set_line_status", "set_bus", "redispatch"])
env_tfa.action_space = reencoded_act_space
obs_gym = env_tfa.reset()
print(env_tfa.action_space.from_gym(env_tfa.action_space.sample()))

In [ ]:
print(env_tfa.action_space.from_gym(env_tfa.action_space.sample()))
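The principle of a "Discrete" action space is that every possible action receives an integer id, and the agent only outputs this id. The sketch below enumerates such a list for a tiny made-up grid. The `build_action_list` helper, the order of the actions and the redispatching levels are assumptions chosen for the illustration. The list built by DiscreteActSpace is different (for example, this sketch leaves out the set_bus actions).

```python
def build_action_list(n_line, n_gen, redispatch_levels):
    """List all the unitary actions: doing nothing, acting on one line or on one generator."""
    actions = [("do_nothing",)]
    for line_id in range(n_line):
        for status in (-1, 1):  # -1: disconnect the line, +1: reconnect it
            actions.append(("set_line_status", line_id, status))
    for gen_id in range(n_gen):
        for amount in redispatch_levels:
            actions.append(("redispatch", gen_id, amount))
    return actions


all_actions = build_action_list(n_line=3, n_gen=2, redispatch_levels=(-5.0, 5.0))
print(f"Number of discrete actions: {len(all_actions)}")
print(f"Action with id 1: {all_actions[1]}")
```

Each id stands for exactly one modification, which is why, with this encoding, the agent cannot act on a powerline and a generator at the same time.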


### Wrapping up and start training¶

And that is it. All the rest is done thanks to tf agents.

tf agents is a lot more verbose than ray and stable baselines, but it allows for more control over what you want to do. For the sake of the example, we only show the steps without detailing them.

The notebook that inspired this one is available at: https://colab.research.google.com/github/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb

Note: the code below, once again, only aims at showing how to integrate grid2op with tf agents. Its aim is not to showcase the best use of tensorflow, tf agents or grid2op.

It is only an example for demonstration purposes and does not aim at providing an interesting agent at all. For that you might want to use something different than DQN, tune the hyper parameters (including the size of each neural network, the number of steps for which you train, the learning rate, etc.), define the action space and observation space in a better fashion, etc.

In [ ]:
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

# initialize the environment
from tf_agents.environments.gym_wrapper import GymWrapper
tf_env_train = tf_py_environment.TFPyEnvironment(GymWrapper(env_tfa))
eval_env = tf_py_environment.TFPyEnvironment(GymWrapper(copy.deepcopy(env_tfa)))

# meta parameters
num_iterations = nb_step_train

initial_collect_steps = 100
collect_steps_per_iteration = 1
replay_buffer_max_length = 100000
batch_size = 64
learning_rate = 1e-3
log_interval = 200
num_eval_episodes = 10
eval_interval = 1000

# neural nets (for the agents)
fc_layer_params = (100, 50)
action_tensor_spec = tensor_spec.from_spec(tf_env_train.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
    return tf.keras.layers.Dense(
        num_units,
        activation=tf.keras.activations.relu,
        kernel_initializer=tf.keras.initializers.VarianceScaling(
            scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with num_actions units to generate one q_value per available action as
# it's output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]
q_values_layer = tf.keras.layers.Dense(
num_actions,
activation=None,
kernel_initializer=tf.keras.initializers.RandomUniform(
minval=-0.03, maxval=0.03),
bias_initializer=tf.keras.initializers.Constant(-0.2))
q_net = sequential.Sequential(dense_layers + [q_values_layer])

# optimizer (for training)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

# just a variable to count the number of "env.step" performed
train_step_counter = tf.Variable(0)

# create the agent
agent = dqn_agent.DqnAgent(
tf_env_train.time_step_spec(),
tf_env_train.action_spec(),
q_network=q_net,
optimizer=optimizer,
td_errors_loss_fn=common.element_wise_squared_loss,
train_step_counter=train_step_counter)
agent.initialize()

# for exploration
random_policy = random_tf_policy.RandomTFPolicy(tf_env_train.time_step_spec(),
tf_env_train.action_spec())

# replay buffer (to store the past actions / states / rewards)
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
data_spec=agent.collect_data_spec,
batch_size=tf_env_train.batch_size,
max_length=replay_buffer_max_length)

def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    # Add trajectory to the replay buffer
    buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
    for _ in range(steps):
        collect_step(env, policy, buffer)

collect_data(tf_env_train, random_policy, replay_buffer, initial_collect_steps)

# generate the datasets
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
num_parallel_calls=3,
sample_batch_size=batch_size,
num_steps=2).prefetch(3)
iterator = iter(dataset)

# train it
# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0

        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics

avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):
    # Collect a few steps using collect_policy and save to the replay buffer.
    collect_data(tf_env_train, agent.collect_policy, replay_buffer, collect_steps_per_iteration)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    trainer = agent.train(experience)
    train_loss = trainer.loss

    step = agent.train_step_counter.numpy()

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)

# evaluate the agent one last time, once the training is over
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)

if num_iterations:
    print('Final Average return after training for {} steps: {}'.format(step, avg_return))
returns.append(avg_return)
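For readers not familiar with DQN, the replay buffer used in the cell above can be pictured as a bounded memory of past transitions from which training batches are drawn at random. The class below is a minimal sketch in plain Python. It is not the tf agents implementation, and its name and methods are invented for the illustration.

```python
import random
from collections import deque


class UniformReplayBuffer:
    """Keeps the `max_length` most recent transitions and samples them uniformly."""

    def __init__(self, max_length):
        self._storage = deque(maxlen=max_length)  # the oldest items are dropped automatically

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size, rng=random):
        """Draw `batch_size` transitions uniformly at random, with replacement."""
        return [rng.choice(self._storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self._storage)


buffer = UniformReplayBuffer(max_length=3)
for step_id in range(5):
    # a transition is reduced here to (observation, action, reward, next observation)
    buffer.add((step_id, 0, 1.0, step_id + 1))

print(len(buffer))       # 3: only the three most recent transitions are kept
print(buffer.sample(2))  # two of them, drawn at random
```

Sampling at random from this memory, instead of learning from consecutive steps, reduces the correlation between the examples seen by the neural network.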