Agent, RL and MultiEnvironment

Try this notebook out interactively with: Binder

It is recommended to have a look at the 00_Introduction, 01_Grid2opFramework and 03_Action and especially 04_TrainingAnAgent notebooks before getting into this one.

Objectives

In this notebook we will introduce :

  • what is a "SingleEnvMultiProcess" (previously MultiEnvironment)
  • how can it be used with an agent
  • how can it be used to train a agent that uses different environments (using the l2rpn-baselines python package)

Execute the cell below by removing the # character if you use google colab !

Cell will look like:

!pip install grid2op[optional]  # for use with google colab (grid2Op is not installed by default)

In [ ]:
# !pip install grid2op[optional]  # for use with google colab (grid2Op is not installed by default)
In [ ]:
res = None
try:
    from jyquickhelper import add_notebook_menu
    res = add_notebook_menu()
except ModuleNotFoundError:
    print("Impossible to automatically add a menu / table of content to this notebook.\nYou can download \"jyquickhelper\" package with: \n\"pip install jyquickhelper\"")
res
In [ ]:
import grid2op
from grid2op.Reward import ConstantReward, FlatReward
from tqdm.notebook import tqdm
from grid2op.Runner import Runner
import sys
import os
import numpy as np
TRAINING_STEP = 100

I) Make a regular environment and agent

By default we use the test environment. But by passing test=False in the following function will automatically download approximately 300MB from the internet and give you 1000 chronics instead of 2 used for this example.

In [ ]:
env = grid2op.make("rte_case14_realistic", test=True)

A lot of data have been made available for the default "rte_case14_realistic" environment. Including this data in the package is not convenient.

We chose instead to release them and make them easily available with a utility. To download them in the default directory ("~/data_grid2op/case14_redisp") just pass the argument "test=False" (or don't pass anything else) as local=False is the default value. It will download approximately 300Mo of data.

II) Train a standard RL Agent

Make sure you are using a computer with at least 4 cores if you want to notice some speed-ups.

In [ ]:
from grid2op.Environment import SingleEnvMultiProcess
from grid2op.Agent import DoNothingAgent
NUM_CORE = 2

IIIa) Using the standard open AI gym loop

Here we demonstrate how to use the multi environment class. First let's create a multi environment.

In [ ]:
# create a simple agent
agent = DoNothingAgent(env.action_space)

# create the multi environment class
multi_envs = SingleEnvMultiProcess(env=env, nb_env=NUM_CORE)

A multi environment is just like using a regular environment but instead of dealing with one action, and one observation, it requires to be sent multiple actions, and returns a list of observations as well.

It requires a grid2op environment to be initialized and creates some specific "workers", each is a replication of the initial environment. None of the "worker"s can be accessed directly. Supported methods are:

  • multi_env.reset
  • multi_env.step
  • multi_env.close

These methods have similar behaviour as "env.step", "env.close" or "env.reset".

It can be used the following way:

In [ ]:
# initiliaze some variable with the proper dimension
obss = multi_envs.reset()
rews = [env.reward_range[0] for i in range(NUM_CORE)]
dones = [False for i in range(NUM_CORE)]
obss
In [ ]:
dones

As you can see, obs is not a single obervation, but a list (numpy nd array to be precise) of 4 observations (if 4 cores were used, otherwise the length of the array is the number of cores), each one being an observation of a given "worker" environment.

Worker environments are always called in the same order. It means the first observation of this vector will always correspond to the first worker environment.

Similarly to Observation, the "step" function of a multi_environment takes as input a list of multiple actions, each action will be implemented in its own environment. It returns a list of observations, a list of rewards, and boolean list of whether or not the worker environment suffers from a game over (in that case this worker environment is automatically restarted using the "reset" method.)

Because orker environments are always called in the same order, the first action sent to the "multi_env.step" function will also be applied on this first environment.

It is possible to use it as follow:

In [ ]:
# initialize the vector of actions that will be processed by each worker environment.
acts = [None for _ in range(NUM_CORE)]
for env_act_id in range(NUM_CORE):
    acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])
    
# feed them to the multi_env
obss, rews, dones, infos = multi_envs.step(acts)

# as explained, this is a vector of Observation (as many as NUM_CORE in this example)
obss

The multi environment loop is really close to the "gym" loop:

In [ ]:
# performs the appropriated steps
for i in range(10):
    acts = [None for _ in range(NUM_CORE)]
    for env_act_id in range(NUM_CORE):
        acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])
    obss, rews, dones, infos = multi_envs.step(acts)

    # DO SOMETHING WITH THE AGENT IF YOU WANT
    ## agent.train(obss, rews, dones)
    

# close the environments created by the multi_env
multi_envs.close()

On the above example, TRAINING_STEP steps are performed on NUM_CORE environments in parrallel. The agent has then acted TRAINING_STEP * NUM_CORE (=10 * 4 = 40 by default) times on NUM_CORE different environments.

III.b) Practical example

We reuse the code of the Notebook 04_TrainingAnAgent to train a new agent we strongly recommend you to have a look at it if it is not done already.

In this notebook, we focus on how to make your agent interact with multiple environments at the same time (eg it means that the batch of data the agent receives comes from different instance of the same environment, this is a technique used in Asynchronous Actor Critic - A3C type of models for example).

You will see with this method you can learn the same baseline with minimal (actually without any) changes in the above way.

Note on the implementation

The most common code for A3C agent can be summarized in the image below (image credit this blog post):

In this image you see different version of the agent that interacts with different versions of the (same) environment (represented here in the different workers). And, from time to time, each "worker" will send its weights to the "global network" and an update procedure will be run there. Though this framework can perfectly be implemented in grid2op, the SingleEnvMultiProcess class works a bit differently.

Actually, in this SingleEnvMultiProcess class, there exists only one copy of the "global network" that is kept into the main "thread" (which to be precise is a process) and that interacts (send actions to and receives observations from) with different independant environments. It is really similar to the SubprocVecEnv of open ai gym.

This is especially suited in the case of powersystem operations, as it can be quite computationnally expensive to solve for the powerflow equations at each nodes of the grid (also called Kirchhoff's laws)

What you have to do

We recall here the code that we used in the relevant notebook to train the agent:

# create an environment
env = make(env_name, test=True)  
# don't forget to set "test=False" (or remove it, as False is the default value) for "real" training

# import the train function and train your agent
from l2rpn_baselines.DuelQSimple import train
from l2rpn_baselines.utils import NNParam, TrainingParam
agent_name = "test_agent"
save_path = "saved_agent_DDDQN_{}".format(train_iter)
logs_dir="tf_logs_DDDQN"


# we then define the neural network we want to make (you may change this at will)
## 1. first we choose what "part" of the observation we want as input, 
## here for example only the generator and load information
## see https://grid2op.readthedocs.io/en/latest/observation.html#main-observation-attributes
## for the detailed about all the observation attributes you want to have
li_attr_obs_X = ["gen_p", "gen_v", "load_p", "load_q"]
# this automatically computes the size of the resulting vector
observation_size = NNParam.get_obs_size(env, li_attr_obs_X) 

## 2. then we define its architecture
sizes = [300, 300, 300]  # 3 hidden layers, of 300 units each, why not...
activs =  ["relu" for _ in sizes]  # all followed by relu activation, because... why not
## 4. you put it all on a dictionnary like that (specific to this baseline)
kwargs_archi = {'observation_size': observation_size,
                'sizes': sizes,
                'activs': activs,
                "list_attr_obs": li_attr_obs_X}

# you can also change the training parameters you are using
# more information at https://l2rpn-baselines.readthedocs.io/en/latest/utils.html#l2rpn_baselines.utils.TrainingParam
tp = TrainingParam()
tp.batch_size = 32  # for example...
tp.update_tensorboard_freq = int(train_iter / 10)
tp.save_model_each = int(train_iter / 3)
tp.min_observation = int(train_iter / 5)
train(env,
      name=agent_name,
      iterations=train_iter,
      save_path=save_path,
      load_path=None, # put something else if you want to reload an agent instead of creating a new one
      logs_dir=logs_dir,
      kwargs_archi=kwargs_archi,
      training_param=tp)

Here, you will see in the next cell how to (not really) change it to train a agent on different environments:

In [ ]:
train_iter = TRAINING_STEP
# import the train function and train your agent
from l2rpn_baselines.DuelQSimple import train
from l2rpn_baselines.utils import NNParam, TrainingParam, make_multi_env
agent_name = "test_agent_multi"
save_path = "saved_agent_DDDQN_{}_multi".format(train_iter)
logs_dir="tf_logs_DDDQN"

# just add the relevant import (see above) and this line
my_envs = make_multi_env(env_init=env, nb_env=NUM_CORE)
# and that's it !


# we then define the neural network we want to make (you may change this at will)
## 1. first we choose what "part" of the observation we want as input, 
## here for example only the generator and load information
## see https://grid2op.readthedocs.io/en/latest/observation.html#main-observation-attributes
## for the detailed about all the observation attributes you want to have
li_attr_obs_X = ["gen_p", "gen_v", "load_p", "load_q"]
# this automatically computes the size of the resulting vector
observation_size = NNParam.get_obs_size(env, li_attr_obs_X) 

## 2. then we define its architecture
sizes = [300, 300, 300]  # 3 hidden layers, of 300 units each, why not...
activs =  ["relu" for _ in sizes]  # all followed by relu activation, because... why not
## 4. you put it all on a dictionnary like that (specific to this baseline)
kwargs_archi = {'observation_size': observation_size,
                'sizes': sizes,
                'activs': activs,
                "list_attr_obs": li_attr_obs_X}

# you can also change the training parameters you are using
# more information at https://l2rpn-baselines.readthedocs.io/en/latest/utils.html#l2rpn_baselines.utils.TrainingParam
tp = TrainingParam()
tp.batch_size = 32  # for example...
tp.update_tensorboard_freq = int(train_iter / 10)
tp.save_model_each = int(train_iter / 3)
tp.min_observation = int(train_iter / 5)
train(my_envs,
      name=agent_name,
      iterations=train_iter,
      save_path=save_path,
      load_path=None, # put something else if you want to reload an agent instead of creating a new one
      logs_dir=logs_dir,
      kwargs_archi=kwargs_archi,
      training_param=tp)

II c) Assess the performance of the trained agent

Nothing is changing... Like really, it's the same code as in notebook 4 (we told you to have a look ;-) )

In [ ]:
from l2rpn_baselines.DuelQSimple import evaluate
path_save_results = "{}_results_multi".format(save_path)

evaluated_agent, res_runner = evaluate(env,
                                       name=agent_name,
                                       load_path=save_path,
                                       logs_path=path_save_results,
                                       nb_episode=2,
                                       nb_process=1,
                                       max_steps=100,
                                       verbose=True,
                                       save_gif=False)

II d) That is it ?

Yes there is nothing more to say about it. As long as you use one of the compatible baselines (at the date of writing):

You do not have anything more to do :-)

If you want to use another baseline that does not support this feature, feel free to add an issue in the l2rpn-baselines official github at this adress https://github.com/rte-france/l2rpn-baselines/issues