Facebook’s Open-Source Reinforcement Learning Platform — A Deep Dive

Opening up the run_gym.py file located in ml/rl/test/gym/ shows us the following in the main() method:

```python
def main(args):
    parser = argparse.ArgumentParser(
        description="Train a RL net to play in an OpenAI Gym environment."
    )
    parser.add_argument("-p", "--parameters", help="Path to JSON parameters file.")
```

This shows that when we run `python ml/rl/test/gym/run_gym.py` without arguments, the script prints its usage to the console:

```
Traceback (most recent call last):
  File "ml/rl/test/gym/run_gym.py", line 611, in <module>
    + " [-s <score_bar>] [-g <gpu_id>] [-l <log_level>] [-f <filename>]"
Exception: Usage: python run_gym.py -p <parameters_file> [-s <score_bar>] [-g <gpu_id>] [-l <log_level>] [-f <filename>]
```

In other words, the JSON file given through the -p parameter is loaded into a variable called params, and if we also pass the -f parameter, the collected samples are saved as an RLDataset to the provided file.

main() Method

The main method then continues by doing a couple of things:

```python
# Load our parameters from the json
with open(args.parameters, "r") as f:
    params = json.load(f)

# Initialize a dataset variable of type `RLDataset` if the `file_path` parameter is set
# `file_path`: If set, save all collected samples as an RLDataset to this file.
dataset = RLDataset(args.file_path) if args.file_path else None

# Call the method `run_gym` with the parameters and arguments provided
reward_history, timestep_history, trainer, predictor = run_gym(
    params, args.score_bar, args.gpu_id, dataset, args.start_saving_from_episode
)

# Save our dataset if provided through the -f parameter
if dataset:
    dataset.save()

# Save the results to a csv if the `results_file_path` parameter is set
if args.results_file_path:
    write_lists_to_csv(args.results_file_path, reward_history, timestep_history)

# Return our reward history
return reward_history
```

After running the command shown in the usage message, `python ml/rl/test/gym/run_gym.py -p ml/rl/test/gym/discrete_dqn_cartpole_v0_100_eps.json -f cartpole_discrete/training_data.json`, we can see the following structure in the training_data.json file that we specified with the -f parameter:

```json
{
  "ds": "2019-01-01",
  "mdp_id": "0",
  "sequence_number": 10,
  "state_features": {
    "0": -0.032091656679586175,
    "1": -0.016310561477682117,
    "2": -0.01312794549150956,
    "3": -0.04438365281404494
  },
  "action": "1",
  "reward": 1.0,
  "action_probability": 1.0,
  "possible_actions": ["0", "1"],
  "metrics": { "reward": 1.0 }
}
```

RLDataset Class

This file is generated because the -f parameter saves the collected results to the given file in the format defined by the RLDataset class.
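To make that format concrete, here is a minimal, hypothetical sketch of a logger that accumulates rows with the same shape and writes them out as one JSON object per line. The class name SimpleRLDataset and its methods are illustrative only; the real RLDataset lives in ml/rl/training/rl_dataset.py.

```python
import json


class SimpleRLDataset:
    """Hypothetical stand-in that collects rows shaped like the pre-timeline format above."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.rows = []

    def insert(self, mdp_id, sequence_number, state, action, reward,
               action_probability, possible_actions):
        # One row per timestep, mirroring the training_data.json structure shown above
        self.rows.append({
            "ds": "2019-01-01",
            "mdp_id": str(mdp_id),
            "sequence_number": sequence_number,
            "state_features": {str(i): float(v) for i, v in enumerate(state)},
            "action": str(action),
            "reward": float(reward),
            "action_probability": float(action_probability),
            "possible_actions": [str(a) for a in possible_actions],
            "metrics": {"reward": float(reward)},
        })

    def save(self):
        # Write one JSON object per line, which is the layout the later steps expect
        with open(self.file_path, "w") as f:
            for row in self.rows:
                f.write(json.dumps(row) + "\n")
```

The real class takes additional arguments such as terminal and time_diff (see its insert() signature further below), but the row layout matches the file we just inspected.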
Let's first take a look at the run_gym() method in general:

```python
env_type = params["env"]

# Initialize the OpenAI Gym Environment
env = OpenAIGymEnvironment(
    env_type,
    rl_parameters.epsilon,
    rl_parameters.softmax_policy,
    rl_parameters.gamma,
)
replay_buffer = OpenAIGymMemoryPool(params["max_replay_memory_size"])
model_type = params["model_type"]
use_gpu = gpu_id != USE_CPU

# Use the "training" {} parameters and "model_type": "<MODEL>" model_type
# to create a trainer as the ones listed in /ml/rl/training/*_trainer.py
# The model_type is defined in /ml/rl/test/gym/open_ai_gym_environment.py
trainer = create_trainer(params["model_type"], params, rl_parameters, use_gpu, env)

# Create a GymDQNPredictor based on the ModelType and Trainer above
# This is located in /ml/rl/test/gym/gym_predictor.py
predictor = create_predictor(trainer, model_type, use_gpu)

c2_device = core.DeviceOption(
    caffe2_pb2.CUDA if use_gpu else caffe2_pb2.CPU, int(gpu_id)
)

# Train using SGD (stochastic gradient descent)
# This just passes the given parameters to a method called train_gym_online_rl which will train our algorithm
return train_sgd(
    c2_device,
    env,
    replay_buffer,
    model_type,
    trainer,
    predictor,
    "{} test run".format(env_type),
    score_bar,
    **params["run_details"],
    save_timesteps_to_dataset=save_timesteps_to_dataset,
    start_saving_from_episode=start_saving_from_episode,
)
```

The run_gym method uses the parameters we loaded from our JSON file to initialize the OpenAI Gym environment. It also initializes a replay_buffer, creates the trainer and the predictor, and then runs the train_sgd method.

Now that we know what run_gym() does, let's look at how our dataset variable is passed along:

- run_gym() receives the dataset created in main() as the save_timesteps_to_dataset parameter
- run_gym() passes it to the train_sgd() method
- train_sgd() passes it to the train_gym_online_rl() method

train_gym_online_rl() Method

When this parameter is defined, the train_gym_online_rl() method saves several variables through the insert() method defined in the RLDataset class. Remember that the RLDataset class is defined in ml/rl/training/rl_dataset.py; its insert method is defined as RLDataset::insert(mdp_id, sequence_number, state, action, reward, terminal, possible_actions, time_diff, action_probability). Source: run_gym.py#L208

Some of the inserted values map as follows:

| Source variable in run_gym.py | Output variable | Type | Description |
| --- | --- | --- | --- |
| i | mdp_id | string | A unique ID for the episode (e.g. an entire playthrough of a game) |
| ep_timesteps - 1 | sequence_number | integer | Defines the ordering of states in an MDP (e.g. the timestamp of an event) |
| state.tolist() | state_features | map<integer, float> | A set of features describing the state |
|  | ds | string | A unique ID for this dataset |

Our run_gym.py will now run for the number of episodes specified in the -p file (e.g. ml/rl/test/gym/discrete_dqn_cartpole_v0_100_eps.json), where (while it is able to) it uses the train_gym_online_rl() method to (a simplified sketch follows the list):

- Get the possible actions
- Take an action (based on whether the action_type is DISCRETE or not)
- Step through the Gym environment and retrieve the next_state, reward and terminal variables
- Define the next_action to take based on the policy in the gym_env.policy variable
- Accumulate the reward received
- Insert the observed behaviour into the replay buffer
- Every train_every_ts timesteps, take num_train_batches from the replay_buffer and train the trainer with these (note: this trainer is created in the create_trainer() method, which creates a DDPGTrainer, SACTrainer, ParametricDQNTrainer or DQNTrainer)
- Every test_every_ts timesteps, log how the model is performing to the logger, avg_reward_history and timestep_history
- Log when the episode ended
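To see how these pieces fit together, here is a heavily simplified, hypothetical sketch of such an online training loop. The names gym_env, trainer, replay_buffer and dataset mirror the prose above, but helper calls like possible_actions(), sample() and the exact step() return values are assumptions; the real train_gym_online_rl() handles many more details (model types, GPU placement, logging).

```python
def train_online_rl_sketch(gym_env, trainer, replay_buffer, dataset,
                           num_episodes, train_every_ts, num_train_batches):
    """Illustrative simplification of an online RL training loop (not Horizon's actual code)."""
    total_ts = 0
    for i in range(num_episodes):          # the episode index i is used as the mdp_id
        state = gym_env.reset()
        terminal = False
        ep_timesteps = 0
        while not terminal:
            possible_actions = gym_env.possible_actions(state)  # assumed helper
            action = gym_env.policy(state)                      # assumed callable policy
            next_state, reward, terminal, _ = gym_env.step(action)
            ep_timesteps += 1
            total_ts += 1

            # Store the observed transition in the replay buffer (assumed API)
            replay_buffer.insert(state, action, reward, next_state, terminal)

            # Log the timestep in the pre-timeline format when a dataset was passed in
            if dataset is not None:
                dataset.insert(
                    mdp_id=i,
                    sequence_number=ep_timesteps - 1,
                    state=state,
                    action=action,
                    reward=reward,
                    terminal=terminal,
                    possible_actions=possible_actions,
                    time_diff=1,
                    action_probability=1.0,  # simplified; the real loop logs the true propensity
                )

            # Every train_every_ts timesteps, sample batches and train the trainer
            if total_ts % train_every_ts == 0:
                for _ in range(num_train_batches):
                    batch = replay_buffer.sample()               # assumed helper
                    trainer.train(batch)

            state = next_state
```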
2. Converting our Training data to the Timeline format

In step 2 we convert the training data we saved earlier (the pre-timeline format shown above) into what Horizon calls the timeline format. Given a table with the columns (state, action, mdp_id, sequence_number, reward, possible_next_actions), the timeline operator, defined in Timeline.scala, produces the table needed for reinforcement learning: (mdp_id, state_features, action, reward, next_state_features, next_action, sequence_number, sequence_number_ordinal, time_diff, possible_next_actions).

This step executes a Spark job, runs a query through Hive and returns the results in a different file:

```bash
# Build timeline package (only needed the first time)
mvn -f preprocessing/pom.xml clean package

# Clear the last run's Spark data (in case of interruption)
rm -Rf spark-warehouse derby.log metastore_db preprocessing/spark-warehouse preprocessing/metastore_db preprocessing/derby.log

# Run the timeline operator on the pre-timeline data
/usr/local/spark/bin/spark-submit --class com.facebook.spark.rl.Preprocessor preprocessing/target/rl-preprocessing-1.1.jar "`cat ml/rl/workflow/sample_configs/discrete_action/timeline.json`"

# Merge output data into a single file
mkdir training_data
mv cartpole_discrete_timeline/part* training_data/cartpole_training_data.json

# Remove the output data folder
rm -Rf cartpole_discrete_timeline
```

After execution we can inspect the created file by running `head -n1 training_data/cartpole_training_data.json`:

```json
{
  "mdp_id": "31",
  "sequence_number": 5,
  "propensity": 1.0,
  "state_features": {
    "0": -0.029825548651835395,
    "1": 0.19730168855281788,
    "2": 0.013065490574540607,
    "3": -0.29148843030554333
  },
  "action": 0,
  "reward": 1.0,
  "next_state_features": {
    "0": -0.02587951488077904,
    "1": 0.0019959027899765502,
    "2": 0.00723572196842974,
    "3": 0.005286388581067669
  },
  "time_diff": 1,
  "possible_next_actions": [1, 1],
  "metrics": { "reward": 1.0 }
}
```

What is interesting here is that the Spark engine allows this step to run on a distributed cluster using CPU operations only.
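To make the transformation itself easier to grasp without running Spark, the sketch below reproduces the core idea in plain Python: group rows by mdp_id, sort them by sequence_number, and join each row with its successor to fill next_state_features, next_action and time_diff. This is an illustration only, not the actual Timeline.scala logic; for instance, the real job encodes possible_next_actions differently, as the output above shows.

```python
import json
from collections import defaultdict


def to_timeline(pre_timeline_rows):
    """Sketch of the timeline transformation: pair each timestep with its successor."""
    episodes = defaultdict(list)
    for row in pre_timeline_rows:
        episodes[row["mdp_id"]].append(row)

    timeline_rows = []
    for mdp_id, rows in episodes.items():
        rows.sort(key=lambda r: r["sequence_number"])
        for current, nxt in zip(rows, rows[1:]):
            timeline_rows.append({
                "mdp_id": mdp_id,
                "sequence_number": current["sequence_number"],
                "propensity": current["action_probability"],
                "state_features": current["state_features"],
                "action": current["action"],
                "reward": current["reward"],
                "next_state_features": nxt["state_features"],
                "next_action": nxt["action"],
                "time_diff": nxt["sequence_number"] - current["sequence_number"],
                "possible_next_actions": nxt["possible_actions"],
                "metrics": current["metrics"],
            })
    return timeline_rows


# Example usage: read the pre-timeline file (one JSON object per line) and convert it
# with open("cartpole_discrete/training_data.json") as f:
#     rows = [json.loads(line) for line in f]
# print(to_timeline(rows)[0])
```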
Horizon also includes a tool that automatically analyzes the training dataset and determines the best transformation function and corresponding normalization parameters for each feature. It is run with the following command:

```bash
python ml/rl/workflow/create_normalization_metadata.py -p ml/rl/workflow/sample_configs/discrete_action/dqn_example.json
```

Opening up the ml/rl/workflow/sample_configs/discrete_action/dqn_example.json file, we can see a config file similar to the one passed to the main function of our Gym environment:

```json
{
  "training_data_path": "training_data/cartpole_training_data.json",
  "state_norm_data_path": "training_data/state_features_norm.json",
  "model_output_path": "outputs/",
  "use_gpu": true,
  "use_all_avail_gpus": true,
  "norm_params": {
    "output_dir": "training_data/",
    "cols_to_norm": ["state_features"],
    "num_samples": 1000
  },
  "actions": ["0", "1"],
  "epochs": 100,
  "rl": {
    "gamma": 0.99,
    "target_update_rate": 0.2,
    "reward_burnin": 1,
    "maxq_learning": 1,
    "epsilon": 0.2,
    "temperature": 0.35,
    "softmax_policy": 0
  },
  "rainbow": {
    "double_q_learning": true,
    "dueling_architecture": false
  },
  "training": {
    "layers": [-1, 128, 64, -1],
    "activations": ["relu", "relu", "linear"],
    "minibatch_size": 256,
    "learning_rate": 0.001,
    "optimizer": "ADAM",
    "lr_decay": 0.999,
    "warm_start_model_path": null,
    "l2_decay": 0,
    "use_noisy_linear_layers": false
  },
  "in_training_cpe": null
}
```

So let's open up the ml/rl/workflow/create_normalization_metadata.py file, where we can see right away that its main method starts with a function called create_norm_table(). The create_norm_table() method takes in the parameters (the JSON above) and uses the norm_params, training_data_path, cols_to_norm and output_dir settings to create the normalization table. This normalization table is built by going over the columns to normalize (in the JSON above, the single column state_features) and gathering their metadata through the get_norm_metadata() function.
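As a rough picture of what such normalization metadata contains, the sketch below samples rows from the timeline file and computes simple per-feature statistics. This is an assumption-laden illustration: Horizon's get_norm_metadata() also chooses a transformation per feature (the "best transformation function" mentioned above), which is not reproduced here.

```python
import json
import math
from collections import defaultdict


def sketch_norm_metadata(training_data_path, col_to_norm="state_features", num_samples=1000):
    """Illustrative only: collect per-feature mean/stddev from a sample of rows."""
    values = defaultdict(list)
    with open(training_data_path) as f:
        for i, line in enumerate(f):
            if i >= num_samples:
                break
            row = json.loads(line)
            for feature_id, value in row[col_to_norm].items():
                values[feature_id].append(value)

    metadata = {}
    for feature_id, vals in values.items():
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)
        metadata[feature_id] = {"mean": mean, "stddev": math.sqrt(variance)}
    return metadata


# Example usage (hypothetical): sketch_norm_metadata("training_data/cartpole_training_data.json")
```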
