Affordable Intelligence

Machine Learning | Michael Amundson

Reinforcement Learning

Udacity Project: Training a Smartcab to Drive.

Environment:

In this project, I constructed an optimized Q-Learning driving agent that navigated a Smartcab through a virtual environment towards a goal. A snapshot from the environment is included below. The white car is the car controlled by my Q-learning algorithm. The blue U is the destination.

At each intersection the cab had to choose one of four actions (Forward, Left, Right, or Stop, which appears as None in the reward table). The choice of action was informed by the state the cab was in.

The state information available in the environment consisted of the following inputs.

  • ‘waypoint’, which is the direction the Smartcab should drive to head toward the destination, relative to the Smartcab’s heading.
  • ‘inputs’, which is the sensor data from the Smartcab. It includes
    • ‘light’, the color of the traffic light.
    • ‘left’, the intended direction of travel for a vehicle to the Smartcab’s left. Returns None if no vehicle is present.
    • ‘right’, the intended direction of travel for a vehicle to the Smartcab’s right. Returns None if no vehicle is present.
    • ‘oncoming’, the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
  • ‘deadline’, which is the number of actions remaining for the Smartcab to reach the destination before running out of time.

The waypoint feature is most relevant for efficiency: if the smartcab doesn’t know which direction it is supposed to go, it cannot arrive at its destination efficiently. All of the input features are relevant for safety, since avoiding accidents requires knowing which direction the surrounding cars intend to move and the color of the traffic light. However, in this environment cars always stop at red lights, which means oncoming traffic is the only other vehicle direction the smartcab needs to consider. I experimented with also including the left direction so that the cab could learn to turn right on red, but the resulting increase in the size of the state space made the number of training trials needed to achieve high safety and reliability prohibitively large, so I dropped it.

I also did not include the ‘deadline’ feature because it didn’t provide useful information to the cab. We want the cab to take the most direct route while maintaining safety. This is true regardless of what the deadline is.

This leaves us with a state space of waypoint (3) x light (2) x oncoming (4) = 24 states. Multiplying by the 4 possible actions in each state, we have 96 Q-values to learn.
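A quick sketch of how such a state could be assembled from the environment inputs (the function name build_state and the variable names here are illustrative, not the project’s actual API):

import itertools

def build_state(waypoint, inputs):
    # Keep only the features discussed above: the planner's waypoint,
    # the traffic light color, and the oncoming vehicle's direction.
    return (waypoint, inputs['light'], inputs['oncoming'])

# 3 waypoints x 2 light colors x 4 oncoming values = 24 states,
# and 24 states x 4 actions = 96 Q-values to learn.
waypoints = ['forward', 'left', 'right']
lights = ['red', 'green']
oncoming = [None, 'forward', 'left', 'right']
states = list(itertools.product(waypoints, lights, oncoming))
print(len(states))      # 24
print(len(states) * 4)  # 96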

Training the smartcab.

When the smartcab takes an action it receives a reward. It receives positive rewards for moving towards the destination in a safe manner, small negative rewards for minor traffic violations and large negative rewards for accidents and major traffic violations.

The reward for taking a certain action in a given state is added to the reward table. The reward table is a dictionary of possible states, where each state contains a nested dictionary of the four possible actions at that state and the Q-value, i.e. the cumulative reward the agent has received for taking that action while in the given state.

{ 'state-1': {
    'action-1': Qvalue-1,
    'action-2': Qvalue-2,
    ...
    },
  'state-2': {
    'action-1': Qvalue-1,
    ...
    },
  ...
}
Initially the table is blank, so to begin training we pick random actions for the agent. As training progresses and the agent learns, we want to shift from random actions to learned actions. To do this we use a decaying exploration factor, for example ε = 1/t^a, where t is the trial number. The exploration factor is the probability that the agent will pick a random action instead of the action with the highest Q-value. After sufficient training we have a filled-in reward table we can use as a policy, which tells us what to do in any given state.
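Roughly, the pieces fit together like the sketch below. The learning rate alpha and the exact form of the update rule are my own simplifications for illustration, not necessarily what the project code does:

import random
from collections import defaultdict

ACTIONS = [None, 'forward', 'left', 'right']

# Reward table: each newly visited state starts with all four actions at 0.0.
Q = defaultdict(lambda: {action: 0.0 for action in ACTIONS})

def exploration_factor(t, a=0.8):
    # Decaying exploration factor, e.g. epsilon = 1 / t**a, so random
    # actions become rarer as the number of completed trials t grows.
    return 1.0 / t ** a

def choose_action(state, epsilon):
    # With probability epsilon explore (pick a random action); otherwise
    # exploit the action with the highest Q-value learned so far.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)

def learn(state, action, reward, alpha=0.5):
    # Fold the reward just received into the stored Q-value for (state, action).
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * reward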

Example states from the reward table after training, with their Q-values and the resulting policy actions:

State (waypoint, light, oncoming) | Q-values | Policy action
('left', 'red', 'right') | forward: -8.46, None: 2.10, right: 0.77, left: -20.09 | Stop at red light
('forward', 'green', 'left') | forward: 1.61, None: 0.11, right: 0.62, left: 0.69 | Proceed forward to waypoint
('right', 'green', 'forward') | forward: 0.63, None: -3.77, right: 1.56, left: -15.18 | Proceed right to waypoint
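Reading the policy out of the trained table is just a matter of picking the highest-valued action for a state. For example, with the Q-values from the first row above:

# Q-values for the state (waypoint='left', light='red', oncoming='right'),
# taken from the first row of the table above.
q_values = {'forward': -8.46, None: 2.10, 'right': 0.77, 'left': -20.09}

# The policy action is the action with the largest Q-value.
policy_action = max(q_values, key=q_values.get)
print(policy_action)  # None, i.e. stop at the red light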

Metrics

The driving agent was evaluated on two metrics: Safety and Reliability.

Safety and Reliability are measured using a letter-grade system as follows:

Grade | Safety | Reliability
A+ | Agent commits no traffic violations, and always chooses the correct action. | Agent reaches the destination in time for 100% of trips.
A | Agent commits few minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 90% of trips.
B | Agent commits frequent minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 80% of trips.
C | Agent commits at least one major traffic violation, such as driving through a red light. | Agent reaches the destination on time for at least 70% of trips.
D | Agent causes at least one minor accident, such as turning left on green with oncoming traffic. | Agent reaches the destination on time for at least 60% of trips.
F | Agent causes at least one major accident, such as driving through a red light with cross-traffic. | Agent fails to reach the destination on time for at least 60% of trips.
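As a small illustration, the reliability thresholds above can be turned into a grading helper like this (a sketch of the thresholds only; the actual project ships its own scoring code):

def reliability_grade(on_time_fraction):
    # Map the fraction of trips completed on time to the letter grades above.
    if on_time_fraction >= 1.0:
        return 'A+'
    if on_time_fraction >= 0.9:
        return 'A'
    if on_time_fraction >= 0.8:
        return 'B'
    if on_time_fraction >= 0.7:
        return 'C'
    if on_time_fraction >= 0.6:
        return 'D'
    return 'F'

print(reliability_grade(0.92))  # A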

I experimented with several exploration factor decay functions; the results reported below use the decay function ε = 1/t^0.8.

Results:

I’m not ready to hop in a cab driven solely by Q-learning, but in situations where the state space is limited, the Q-learning algorithm was very effective at learning the rules in a short amount of time.
