📑   Reflections. Introduction To Reinforcement Learning. Lessons 1-2

🤖   Links

🤖   Lesson 1. Smoov & Curly's Bogus Journey

Modeling for decision making involves two distinct parties: the decision-maker and the model-builder.

Systems are formed by putting parts together in a particular manner in order to pursue an objective. The relationship between the parts determines what the system does and how it functions as a whole.

A system that does not change is a static (i.e., deterministic) system. Many of the systems we are part of are dynamic systems, that is, they change over time.

In deterministic models, a good decision is judged by the outcome alone. However, in probabilistic models, the decision-maker is concerned not only with the outcome value but also with the amount of risk each decision carries.

Strategy of the Decision-Making Process:

Identify the decision situation and understand objectives → Identify alternatives → Decompose and model the problem → Choose the best alternative → Sensitivity analysis → Is further analysis needed? → (if no) Implement the chosen alternative

Probabilistic Modeling: from Data to Information, from Information to Facts, and finally, from Facts to Knowledge.

Sources of errors in decision making: false assumptions, inaccurate estimates of the probabilities, relying on expectations, difficulties in measuring the utility function, and forecast errors.

The deficiencies in our knowledge of the future may be divided into three domains, each with rather murky boundaries:

Risk: One might be able to enumerate the outcomes and figure the probabilities. However, one must look out for non-normal distributions, especially those with “fat tails”, as illustrated by rare events in the stock market.

Uncertainty: One might be able to enumerate the outcomes but the probabilities are murky. Most of the time, the best one can do is to give a rank order to possible outcomes and then be careful that one has not omitted one of significance.

Uncertainty is a fact of processes and business; probability is the guide toward “good” outcomes in them.

Black Swans: The name comes from the discovery of black swans in Australia, which upended the old assumption that all swans are white. This is the domain of events which are either “extremely unlikely” or “inconceivable” but which, when they happen (and they do happen), have serious consequences, usually bad.

Two fundamental problems in sequential decision making

Reinforcement Learning:

  • The environment is initially unknown
  • The agent interacts with the environment
  • The agent improves its policy

Planning:

  • A model of the environment is known
  • The agent performs computations with its model (without any external interaction)
  • The agent improves its policy

Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behaviour within a specific context in order to maximize their performance. Simple reward feedback is required for the agent to learn its behaviour; this is known as the reinforcement signal.

There are three fundamental problems that RL must tackle: the exploration-exploitation tradeoff, the problem of delayed reward (credit assignment), and the need to generalize.
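
The exploration-exploitation tradeoff is often handled with something as simple as ε-greedy action selection. A minimal sketch, assuming a dictionary Q-table; the state/action names, Q-values, and ε = 0.1 are made-up for illustration:

In [ ]:
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore: random action
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit: best estimate

# Illustrative usage with a made-up Q-table.
Q = {("s0", "left"): 0.4, ("s0", "right"): 0.7}
print(epsilon_greedy(Q, "s0", ["left", "right"]))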

A Markov Decision Process (MDP) is just like a Markov chain, except the transition matrix depends on the action taken by the decision maker (agent) at each time step. The agent receives a reward, which depends on the action and the state. The goal is to find a function, called a policy, which specifies which action to take in each state, so as to maximize some function (e.g., the mean or expected discounted sum) of the sequence of rewards. One can formalize this in terms of Bellman's equation, which can be solved iteratively using policy iteration. The unique fixed point of this equation is the optimal value function, from which an optimal policy can be read off.

A Markov Decision Process (MDP) model contains:

  • A set of possible world states S
  • A set of possible actions A
  • A real valued reward function R(s,a)
  • A description T of each action’s effects in each state.

Markov Property: the effects of an action taken in a state depend only on that state and not on the prior history.

Deterministic Actions: For each state and action we specify a new state.

Stochastic Actions: For each state and action we specify a probability distribution over next states.

Let's define the transition matrix and reward functions. We are assuming states, actions and time are discrete.

T(s,a,s') = Pr[S(t+1)=s' | S(t)=s, A(t)=a]

R(s,a,s') = E[ R(t+1) | S(t)=s, A(t)=a, S(t+1)=s' ]

Continuous MDPs can also be defined, but are usually solved by discretization.

We define the value of performing action a in state s as follows:

Q(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + g V(s') ]

where 0 < g <= 1 is the amount by which we discount future rewards, and V(s) is overall value of state s, given by Bellman's equation:

V(s) = max_a Q(s,a) = max_a sum_s' T(s,a,s') [ R(s,a,s') + g V(s') ]

In words, the value of a state is the maximum, over actions, of the expected immediate reward plus the expected discounted value of the successor state s'. If we define

R(s,a) = E[ R(s,a,s') ] = sum_{s'} T(s,a,s') R(s,a,s')

the above equation simplifies to the more common form

V(s) = max_a [ R(s,a) + g sum_s' T(s,a,s') V(s') ]

which, for a fixed policy and a tabular (non-parametric) representation of the V/Q/T/R functions, can be rewritten in matrix-vector form as V = R + g T V.

Solving these n simultaneous equations is called value determination (n is the number of states).
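
A minimal sketch of value determination in NumPy, assuming a made-up two-state example (the policy-induced transition matrix T_pi, reward vector R_pi, and γ = 0.9 are illustrative, not from the lecture); the n equations are handled by a single linear solve:

In [ ]:
import numpy as np

gamma = 0.9
T_pi = np.array([[0.8, 0.2],        # T_pi[s, s'] = Pr(s' | s, pi(s)) under a fixed policy pi
                 [0.1, 0.9]])
R_pi = np.array([1.0, 0.0])         # R_pi[s] = expected immediate reward in s under pi

# Value determination: V = R_pi + gamma * T_pi V, i.e. (I - gamma * T_pi) V = R_pi
V = np.linalg.solve(np.eye(len(R_pi)) - gamma * T_pi, R_pi)
print(V)                            # value of each state under the fixed policy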

If V/Q satisfies the Bellman equation, then the greedy policy

p(s) = argmax_a Q(s,a)

is optimal. If not, we can set p(s) to argmax_a Q(s,a), re-evaluate V (and hence Q), and repeat. This is called policy iteration, and it is guaranteed to converge to an optimal policy.

The best theoretical upper bound on the number of iterations needed by policy iteration is exponential in n, but in practice, the number of steps is O(n). By formulating the problem as a linear program, it can be proved that one can find the optimal policy in polynomial time.
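
A compact policy iteration sketch over a tiny tabular MDP, assuming NumPy; the transition tensor T[s,a,s'], the reward table R[s,a], and γ are made-up illustrative numbers, and evaluation reuses the exact linear solve shown above:

In [ ]:
import numpy as np

gamma = 0.9
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[s, a, s'] = Pr(s' | s, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                   # R[s, a] = expected immediate reward
              [0.0, 2.0]])

n_states = R.shape[0]
policy = np.zeros(n_states, dtype=int)      # start from an arbitrary policy

while True:
    # Policy evaluation (value determination for the current policy)
    T_pi = T[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

    # Policy improvement: act greedily with respect to Q(s, a)
    Q = R + gamma * T @ V                   # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):  # no change: we are at the fixed point
        break
    policy = new_policy

print("optimal policy:", policy, "state values:", V)

Each improvement step can only make the policy better, and there are finitely many deterministic policies, so the loop must terminate.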

http://stats.stackexchange.com/questions/145122/real-life-examples-of-markov-decision-processes

MDP=⟨S,A,T,R,γ⟩

where S is the set of states, A the actions, T the transition probabilities (i.e., the probabilities Pr(s′|s,a) of going from one state to another given an action), R the rewards (given a certain state, and possibly action), and γ a discount factor that is used to reduce the importance of future rewards.

So in order to use it, you need to have predefined:

States: these can refer to, for example, grid cells in robotics, or to conditions such as a door being open or closed.

Actions: a fixed set of actions, such as going north, south, east, or west for a robot, or opening and closing a door.

Transition probabilities: the probability of going from one state to another given an action. For example, what is the probability of the door being open if the action is “open”? In a perfect world this would be 1.0, but a robot could fail in handling the door knob correctly. Another example, in the case of a moving robot, is the action “north”, which in most cases brings it to the grid cell just north of it, but in some cases could overshoot and end up in the cell beyond.

Rewards: these are used to guide the planning. In the grid example we might want to reach a certain cell, and the reward becomes higher as we get closer. In the door example, an open door might give a high reward.
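
As a concrete, made-up instance of these four ingredients, the door example above might be written down like this (the 0.8 success probability for an imperfect robot gripper and the reward values are illustrative assumptions):

In [ ]:
# Every component of the MDP is predefined up front.
states  = ["door_closed", "door_open"]
actions = ["open", "close", "do_nothing"]

# T[(s, a)] is a probability distribution over next states; an imperfect
# robot succeeds at "open" only 80% of the time (made-up number).
T = {
    ("door_closed", "open"):       {"door_open": 0.8, "door_closed": 0.2},
    ("door_closed", "close"):      {"door_closed": 1.0},
    ("door_closed", "do_nothing"): {"door_closed": 1.0},
    ("door_open",   "open"):       {"door_open": 1.0},
    ("door_open",   "close"):      {"door_closed": 1.0},
    ("door_open",   "do_nothing"): {"door_open": 1.0},
}

# R[(s, a, s')]: ending up with an open door is rewarded.
R = {(s, a, s2): (1.0 if s2 == "door_open" else 0.0)
     for (s, a), dist in T.items() for s2 in dist}

gamma = 0.9  # discount factor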

Examples of Applications of MDPs

  • Harvesting: how many members of a population have to be left for breeding.
  • Agriculture: how much to plant based on weather and soil state.
  • Water resources: keep the correct water level at reservoirs.
  • Inspection, maintenance and repair: when to replace/inspect based on age, condition, etc.
  • Purchase and production: how much to produce based on demand.
  • Queues: reduce waiting time. ...
  • Finance: deciding how much to invest in stock.
  • Robotics:

    • A dialogue system to interact with people.
    • Robot bartender.
    • Robot exploration for navigation. ...

Exact Methods:

  • Value Iteration (sketched in code after this list)
  • Policy Iteration
  • Linear Programming
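
Here is a sketch of the first exact method, value iteration, using the same made-up two-state conventions as the policy iteration sketch above (T[s,a,s'], R[s,a]); the 1e-6 stopping threshold is an arbitrary choice:

In [ ]:
import numpy as np

gamma = 0.9
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                   # R[s, a]
              [0.0, 2.0]])

V = np.zeros(R.shape[0])
while True:
    # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
    Q = R + gamma * T @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-6:    # stop when the values have converged
        break
    V = V_new

policy = Q.argmax(axis=1)                   # greedy policy read off the converged values
print("v*:", V, "pi*:", policy)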

Policy iteration consists of two steps: policy evaluation and policy improvement.

Evaluate a given policy π by iterative application of Bellman expectation backup.

v1 → v2 → ... → vπ

Using synchronous backups:

  • At each iteration k + 1
  • For all states s ∈ S
  • Update vk+1(s) from vk(s') where s' is a successor state of s

Improve the policy by acting greedily with respect to vπ: π' = greedy(vπ).

Any optimal policy can be subdivided into two components:

  • An optimal first action A∗
  • Followed by an optimal policy from successor state S'

Most Markov reward and decision processes are discounted.

  • Mathematically convenient to discount rewards
  • Avoids infinite returns in cyclic Markov processes (see the worked sum after this list)
  • Uncertainty about the future may not be fully represented
  • If the reward is financial, immediate rewards may earn more interest than delayed rewards
  • Animal/human behaviour shows preference for immediate reward
  • It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
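
To see why discounting avoids infinite returns: for a constant reward r received forever (e.g., around a cycle of the process), the return is a geometric series that stays finite exactly when γ < 1,

$G_t = \sum_{k=0}^{\infty} \gamma^k r = \frac{r}{1-\gamma}, \qquad 0 \le \gamma < 1,$

whereas with γ = 1 the same sum diverges, which is why undiscounted returns are only safe when every sequence terminates.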

Dynamic Programming is a very general solution method for problems which have two properties:

  • Optimal substructure
    • Principle of optimality applies
    • Optimal solution can be decomposed into subproblems
  • Overlapping subproblems
    • Subproblems recur many times
    • Solutions can be cached and reused

Markov decision processes satisfy both properties

  • Bellman equation gives recursive decomposition
  • Value function stores and reuses solutions

Dynamic programming is used for planning in an MDP

  • For prediction:
    • Input: MDP (S, A, P, R, γ) and policy π
    • or: MRP (S, Pπ, Rπ, γ)
    • Output: value function vπ
  • Or for control:
    • Input: MDP (S, A, P, R, γ)
    • Output: optimal value function v∗
    • and: optimal policy π∗

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d data)
  • Agent’s actions affect the subsequent data it receives

Canonical Example: Grid World

  • The agent lives in a grid
  • Walls block the agent’s path
  • The agent’s actions do not always go as planned (see the sketch after this list):
    • 80% of the time, the action North takes the agent North (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have been taken, the agent stays put
  • Big rewards come at the end
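
The 80/10/10 action-noise model above can be sketched as a small transition-distribution function; the grid size, the wall location, and the is_wall helper are assumptions made purely for illustration:

In [ ]:
# Noisy grid-world moves: intended direction 80%, each perpendicular direction 10%.
NORTH, SOUTH, EAST, WEST = (-1, 0), (1, 0), (0, 1), (0, -1)
PERPENDICULAR = {NORTH: (WEST, EAST), SOUTH: (EAST, WEST),
                 EAST: (NORTH, SOUTH), WEST: (SOUTH, NORTH)}

def transition_distribution(state, action, is_wall):
    """Return {next_state: probability} for taking `action` in `state`;
    `is_wall(cell)` is an assumed helper saying whether a cell is blocked."""
    left, right = PERPENDICULAR[action]
    dist = {}
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        target = (state[0] + direction[0], state[1] + direction[1])
        if is_wall(target):              # bumping into a wall leaves the agent in place
            target = state
        dist[target] = dist.get(target, 0.0) + p
    return dist

# Illustrative usage: a 3x4 grid with one interior wall at (1, 1).
def is_wall(cell):
    r, c = cell
    return cell == (1, 1) or not (0 <= r < 3 and 0 <= c < 4)

print(transition_distribution((2, 0), NORTH, is_wall))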

Quizzes

In [1]:
import math
# Actions: U - up, R - right, L - left, D - down
# U-U-R-R-R: 5 moves go "the right way", each with p = 0.8
# R-U-U-R-R: 1 move goes "the right way" (p = 0.8) and 4 moves go "the possible way" (p = 0.1 each)
p_r, p_p = 0.8, 0.1
math.pow(p_r, 5) + math.pow(p_p, 4)*p_r
Out[1]:
0.3277600000000001
In [12]:
# Finding Policies
# Three possible moves at this point: R, L, D with probabilities 0.8, 0.1, 0.1
gamma, r_s, u, end, p_r, p_p = 0.5, -0.04, 0, 1, 0.8, 0.1
u1 = r_s + gamma*(u*p_p + u*p_p + end*p_r)
print (gamma, r_s, u, end, p_r, p_p)
print (u1)
u2 = r_s + gamma*(u1*p_p + r_s*p_p + end*p_r)
print (u2)
0.5 -0.04 0 1 0.8 0.1
0.36000000000000004
0.37600000000000006
In [3]:
from IPython.display import Image
Image('01.png')
Out[3]:

🤖   Lesson 1.T Introduction To BURLAP

A transition graph with state and action nodes is a useful way to summarize the dynamics of a finite MDP. There is a state node for each possible state (usually a circle labeled by the name of the state), and an action node for each state-action pair (usually a circle labeled by the name of the action and connected by a line to the state node).

In [4]:
Image('02.png')
Out[4]:
In [ ]:
// 1 Generating A Domain Object
import burlap.domain.singleagent.graphdefined.GraphDefinedDomain;
import burlap.oomdp.auxiliary.DomainGenerator;
import burlap.oomdp.core.*;

public class FirstMDP {

    DomainGenerator gd;
    Domain domain;

    public FirstMDP() {
        int numStates = 6;
        
        this.gd = new GraphDefinedDomain(numStates);

        this.domain = this.gd.generateDomain();
    }
}
In [ ]:
// 2 Defining The Initial State
import burlap.domain.singleagent.graphdefined.GraphDefinedDomain;
import burlap.oomdp.auxiliary.DomainGenerator;
import burlap.oomdp.core.*;

public class FirstMDP {

    DomainGenerator gd;
    Domain domain;
    State initialState;
    
    public FirstMDP() {
        int numStates = 6;
        
        this.gd = new GraphDefinedDomain(numStates);

        this.domain = this.gd.generateDomain();
        
        this.initialState = GraphDefinedDomain.getState(this.domain, 0);
    }
}
In [ ]:
// 3 Adding Transitions
import burlap.domain.singleagent.graphdefined.GraphDefinedDomain;
import burlap.oomdp.auxiliary.DomainGenerator;
import burlap.oomdp.core.*;

public class FirstMDP {

    DomainGenerator dg;
    Domain domain;
    State initialState;

    public FirstMDP() {
        int numStates = 6;
        this.dg = new GraphDefinedDomain(numStates);
        
        //actions from initial state 0
        ((GraphDefinedDomain) this.dg).setTransition(0, 0, 1, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(0, 1, 2, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(0, 2, 3, 1.);
        
        // Please enter transitions in the space below to correspond to the 
        // state diagram earlier. Make sure to use action ID 0 for each transition 
        // in order for the grader to check them correctly.
        
        ((GraphDefinedDomain) this.dg).setTransition(1, 0, 1, 1.);

        ((GraphDefinedDomain) this.dg).setTransition(2, 0, 4, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(4, 0, 2, 1.);

        ((GraphDefinedDomain) this.dg).setTransition(3, 0, 5, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(5, 0, 5, 1.);

        // Please add your transitions above this line.
        
        this.domain = this.dg.generateDomain();
        this.initialState = GraphDefinedDomain.getState(this.domain,0);
    }
    
    // This method has only been added for the purpose of grading this quiz. DO NOT include 
    // it in an actual project, since it violates the principle of encapsulation by exposing
    // the private member domain.
    public Domain getDomain() {
        return this.domain;
    }
    
    public static void main(String[] args) {
        // You can add test code here that will be executed when you click "Test Run".
        FirstMDP fmdp = new FirstMDP();
    }
}
In [ ]:
// 4 Adding The Reward Function
import burlap.domain.singleagent.graphdefined.GraphDefinedDomain; 
import burlap.oomdp.core.*;
import burlap.oomdp.singleagent.GroundedAction;
import burlap.oomdp.singleagent.RewardFunction;

public class FirstMDP {

    GraphDefinedDomain gdd;
    Domain domain;
    State initialState;
    RewardFunction rf;

    public FirstMDP(double p1, double p2, double p3, double p4) {
        int numStates = 6;
        this.gdd = new GraphDefinedDomain(numStates);
        
        //actions from initial state 0
        this.gdd.setTransition(0, 0, 1, 1.);
        this.gdd.setTransition(0, 1, 2, 1.);
        this.gdd.setTransition(0, 2, 3, 1.);
        
        //transitions from action "a" outcome state
        this.gdd.setTransition(1, 0, 1, 1.);

        //transitions from action "b" outcome state
        this.gdd.setTransition(2, 0, 4, 1.);
        this.gdd.setTransition(4, 0, 2, 1.);

        //transitions from action "c" outcome state
        this.gdd.setTransition(3, 0, 5, 1.);
        this.gdd.setTransition(5, 0, 5, 1.);
        
        this.domain = this.gdd.generateDomain();
        this.initialState = GraphDefinedDomain.getState(this.domain,0);
        this.rf = new FourParamRF(p1,p2,p3,p4);
    }
    
    public static class FourParamRF implements RewardFunction {
        double p1;
        double p2;
        double p3;
        double p4;
 
        public FourParamRF(double p1, double p2, double p3, double p4) {
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
            this.p4 = p4;
        }
 
        @Override
        // TODO:
        // Override the reward method to match the reward scheme from the state diagram.
        // See the documentation for the RewardFunction interface for the proper signature.
        // You may find the getNodeId method from GraphDefinedDomain class helpful.
        
        public double reward(State s, GroundedAction a, State sprime) {
            int sid = GraphDefinedDomain.getNodeId(s);
            double r;

            if(sid == 0 || sid == 3) { r = 0; } 
            else if(sid == 1) { r = this.p1; } 
            else if(sid == 2) { r = this.p2; } 
            else if(sid == 4) { r = this.p3; } 
            else if(sid == 5) { r = this.p4; } 
            else { throw new RuntimeException("State error: " + sid);}

            return r;
        }
    }
    
    // This method has only been added for the purpose of grading this quiz. DO NOT include 
    // it in an actual project, since it violates the principle of encapsulation by exposing
    // the private member domain.
    public Domain getDomain() {
        return this.domain;
    }
    
    public static void main(String[] args) {
        // You can add test code here that will be executed when you click "Test Run".
        FirstMDP fmdp = new FirstMDP(5,3,6,7);
    }
}
In [ ]:
// 5 Adding The Terminal Function
import burlap.domain.singleagent.graphdefined.GraphDefinedDomain;
import burlap.oomdp.auxiliary.common.NullTermination;
import burlap.oomdp.core.*;
import burlap.oomdp.singleagent.GroundedAction;
import burlap.oomdp.singleagent.RewardFunction;

public class FirstMDP {

    GraphDefinedDomain gdd;
    Domain domain;
    State initialState;
    RewardFunction rf;
    TerminalFunction tf;

    public FirstMDP(double p1, double p2, double p3, double p4) {
        int numStates = 6;
        this.gdd = new GraphDefinedDomain(numStates);
        
        //actions from initial state 0
        this.gdd.setTransition(0, 0, 1, 1.);
        this.gdd.setTransition(0, 1, 2, 1.);
        this.gdd.setTransition(0, 2, 3, 1.);
        
        //transitions from action "a" outcome state
        this.gdd.setTransition(1, 0, 1, 1.);

        //transitions from action "b" outcome state
        this.gdd.setTransition(2, 0, 4, 1.);
        this.gdd.setTransition(4, 0, 2, 1.);

        //transitions from action "c" outcome state
        this.gdd.setTransition(3, 0, 5, 1.);
        this.gdd.setTransition(5, 0, 5, 1.);
        
        this.domain = this.gdd.generateDomain();
        this.initialState = GraphDefinedDomain.getState(this.domain,0);
        this.rf = new FourParamRF(p1,p2,p3,p4);
        this.tf = new NullTermination();
    }
    
    public static class FourParamRF implements RewardFunction {
        double p1;
        double p2;
        double p3;
        double p4;
 
        public FourParamRF(double p1, double p2, double p3, double p4) {
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
            this.p4 = p4;
        }
 
        @Override
        // TODO:
        // Override the reward method to match the reward scheme from the state diagram.
        // See the documentation for the RewardFunction interface for the proper signature.
        // You may find the getNodeId method from GraphDefinedDomain class helpful.
        
        public double reward(State s, GroundedAction a, State sprime) {
            int sid = GraphDefinedDomain.getNodeId(s);
            double r;

            if(sid == 0 || sid == 3) { r = 0; } 
            else if(sid == 1) { r = this.p1; } 
            else if(sid == 2) { r = this.p2; } 
            else if(sid == 4) { r = this.p3; } 
            else if(sid == 5) { r = this.p4; } 
            else { throw new RuntimeException("State error: " + sid);}

            return r;
        }        
    }
    
    // This method has only been added for the purpose of grading this quiz. DO NOT include 
    // it in an actual project, since it violates the principle of encapsulation by exposing
    // the private member domain.
    public Domain getDomain() {
        return this.domain;
    }
    
    public static void main(String[] args) {
        // You can add test code here that will be executed when you click "Test Run".
        FirstMDP fmdp = new FirstMDP(5,3,6,7);
    }
}
In [ ]:
// 6 Performing Value Iteration

// vi is a ValueIteration planner (the full class in the next cell shows how it is
// constructed); planning is started from the initial state like this:
vi.planFromState(this.initialState);
In [ ]:
// 7 Finding The Optimal Action

import burlap.behavior.singleagent.planning.stochastic.valueiteration.ValueIteration;
import burlap.behavior.statehashing.DiscreteStateHashFactory;
import burlap.domain.singleagent.graphdefined.GraphDefinedDomain;
import burlap.oomdp.auxiliary.DomainGenerator;
import burlap.oomdp.auxiliary.common.NullTermination;
import burlap.oomdp.core.*;
import burlap.oomdp.singleagent.GroundedAction;
import burlap.oomdp.singleagent.RewardFunction;

public class FirstMDP {

    DomainGenerator dg;
    Domain domain;
    State initState;
    RewardFunction rf;
    TerminalFunction tf;
    DiscreteStateHashFactory hashFactory;
    int numStates;

    public FirstMDP(double p1, double p2, double p3, double p4) {
        this.numStates = 6;
        this.dg = new GraphDefinedDomain(numStates);
        
        //actions from initial state 0
        ((GraphDefinedDomain) this.dg).setTransition(0, 0, 1, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(0, 1, 2, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(0, 2, 3, 1.);
        
        //transitions from action "a" outcome state
        ((GraphDefinedDomain) this.dg).setTransition(1, 0, 1, 1.);

        //transitions from action "b" outcome state
        ((GraphDefinedDomain) this.dg).setTransition(2, 0, 4, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(4, 0, 2, 1.);

        //transitions from action "c" outcome state
        ((GraphDefinedDomain) this.dg).setTransition(3, 0, 5, 1.);
        ((GraphDefinedDomain) this.dg).setTransition(5, 0, 5, 1.);
        
        this.domain = this.dg.generateDomain();
        this.initState = GraphDefinedDomain.getState(this.domain,0);
        this.rf = new FourParamRF(p1,p2,p3,p4);
        this.tf = new NullTermination();
        this.hashFactory = new DiscreteStateHashFactory();
    }
    
    public static class FourParamRF implements RewardFunction {
        double p1;
        double p2;
        double p3;
        double p4;
 
        public FourParamRF(double p1, double p2, double p3, double p4) {
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
            this.p4 = p4;
        }
 
        @Override
        public double reward(State s, GroundedAction a, State sprime) { 
            int sid = GraphDefinedDomain.getNodeId(s);
            double r;
   
            if( sid == 0 || sid == 3 ) { // initial state or c1
                r = 0;
            }
            else if( sid == 1 ) { // a
                r = this.p1;
            }
            else if( sid == 2 ) { // b1
                r = this.p2;
            }
            else if( sid == 4 ) { // b2
                r = this.p3;
            }
            else if( sid == 5 ) { // c2
                r = this.p4;
            }
            else {
                throw new RuntimeException("Unknown state: " + sid);
            }

            return r;
        }
    }
    
    private ValueIteration computeValue(double gamma) {
        double maxDelta = 0.0001;
        int maxIterations = 1000;
        ValueIteration vi = new ValueIteration(this.domain, this.rf, this.tf, gamma, 
                                               this.hashFactory, maxDelta, maxIterations);
        vi.planFromState(this.initState);
        return vi;
    }
    
    public String bestFirstAction(double gamma) {
        // Return "action a" if a is the best action based on the discount factor given.
        // Return "action b" if b is the best action based on the discount factor given.
        // Return "action c" if c is the best action based on the discount factor given.
        // If there is a tie between actions, give preference to the earlier action in the alphabet:
        //   e.g., if action a and action c are equally good, return "action a".
        ValueIteration vi = computeValue(gamma);

        double[] V = new double[this.numStates];
        for(int i = 0; i < this.numStates; i++) {
            State s = GraphDefinedDomain.getState(this.domain, i);
            V[i] = vi.value(s);
        }

        String actionName = null;
        if(V[1] >= V[2] && V[1] >= V[3]) actionName = "action a";
        else if(V[2] >= V[1] && V[2] >= V[3]) actionName = "action b";
        else if(V[3] >= V[1] && V[3] >= V[2]) actionName = "action c";
        return actionName;
        
    }
 
    public static void main(String[] args) {
        double p1 = 5.;
        double p2 = 6.;
        double p3 = 3.;
        double p4 = 7.;
        FirstMDP mdp = new FirstMDP(p1,p2,p3,p4);
 
        double gamma = 0.6;
        System.out.println("Best initial action: " + mdp.bestFirstAction(gamma));
    }
}
In [ ]:
// 8 TODO: Second MDP Problem
public class SecondMDP {

    public SecondMDP(double p1, double p2) {
        
    }
    
    public String bestActions(double gamma) {
        // Return one of the following Strings
        // "a,c"
        // "a,d"
        // "b,c"
        // "b,d"
        // based on the optimal actions at states S0 and S1. If
        // there is a tie, break it in favor of the letter earlier in
        // the alphabet (so if "a,c" and "a,d" would both be optimal,
        // return "a,c").
        return null; // TODO: solve the MDP and return the optimal action pair
    }
   
    public static void main(String[] args) {
        double p1 = 0.5;
        double p2 = 0.5;
        SecondMDP mdp = new SecondMDP(p1,p2);
 
        double gamma = 0.5;
        System.out.println("Best actions: " + mdp.bestActions(gamma));
    }
}

# 9 TODO: Self-Assessment: MDPSolver

Before working on this assessment, first watch the Introduction to BURLAP tutorial lesson to learn how to build and solve an MDP using the Java BURLAP framework.

For this assessment, you will write a Java class called MDPSolver. The constructor for the MDPSolver class should accept the following arguments:

  • An int numStates. The states in the MDP are identified by an int from 0 to numStates-1, inclusive.
  • An int numActions. The actions in the MDP are identified by an int from 0 to numActions-1, inclusive.
  • A double[][][] probabilitiesOfTransitions. probabilitiesOfTransitions[i][j][k] is the probability of ending in state k by taking action j in state i. For all i and j, summing probabilitiesOfTransitions[i][j][k] over the index k will equal 1.
  • A double[][][] rewards. rewards[i][j][k] is the reward obtained by ending in state k by taking action j in state i.

These all represent elements of an MDP. Note that there will be no explicitly defined terminal states.

MDPSolver should also contain a method called valueOfState that accepts two arguments, an int state that represents a state in the MDP, and a double gamma that represents the discount factor to apply. It should output a double giving the value of the appropriate state using the given discount factor. The value given should be accurate within a tolerance of 0.001, inclusive.

We recommend that you use the BURLAP framework to build an MDP object out of the information in the arguments and then use value iteration to compute the value of the states in the MDP.
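
The assessment itself expects a Java class built on BURLAP, but the computation valueOfState must perform can be sketched independently. Below is a hedged NumPy stand-in (the class name MDPSolverSketch, the toy T/R arrays, and the usage values are assumptions; only the argument layout and the 0.001 tolerance come from the prompt):

In [ ]:
import numpy as np

class MDPSolverSketch:
    """Illustrative Python stand-in for the MDPSolver described above;
    the real assessment asks for a Java/BURLAP implementation."""

    def __init__(self, num_states, num_actions, probabilities_of_transitions, rewards):
        self.num_states = num_states
        self.num_actions = num_actions
        self.T = np.asarray(probabilities_of_transitions)  # T[i][j][k] = Pr(k | state i, action j)
        self.R = np.asarray(rewards)                        # R[i][j][k] = reward for i --j--> k

    def value_of_state(self, state, gamma, tol=0.001):
        V = np.zeros(self.num_states)
        while True:
            # Q[i, j] = sum_k T[i, j, k] * (R[i, j, k] + gamma * V[k])
            Q = np.einsum('ijk,ijk->ij', self.T, self.R + gamma * V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new[state]
            V = V_new

# Illustrative usage with made-up numbers (2 states, 2 actions).
T = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 2.0]]]
print(MDPSolverSketch(2, 2, T, R).value_of_state(0, gamma=0.5))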

In [ ]:
// FirstMDPGraderEx4.java
import burlap.behavior.statehashing.DiscreteStateHashFactory;
import burlap.domain.singleagent.graphdefined.GraphDefinedDomain;
import burlap.oomdp.core.*;
import burlap.oomdp.singleagent.Action;

import java.util.List;

public class FirstMDPGraderEx4 {

	public static void main(String[] args) {
        // get the domain
        FirstMDP fmdp = new FirstMDP();
        Domain d = fmdp.getDomain();
        boolean isCorrect = true;
        
        Attribute attNode = d.getAttribute("node");
        if( attNode.lowerLim != 0 || attNode.upperLim != 5 ) {
        	System.out.println("There do not appear to be 6 states in the Domain object. Note that further problems may be\n" + 
                              "present, but you will need to correct this error before we can check further.");
        	isCorrect = false;
            return;
        }
        
		String[] correctActions = { "action0 from state 0 to state 1 with probability 1.0",
				"action0 from state 1 to state 1 with probability 1.0",
				"action0 from state 2 to state 4 with probability 1.0",
				"action0 from state 3 to state 5 with probability 1.0",
				"action0 from state 4 to state 2 with probability 1.0",
				"action0 from state 5 to state 5 with probability 1.0",
				"action1 from state 0 to state 2 with probability 1.0",
				"action2 from state 0 to state 3 with probability 1.0"};
		boolean[] actionFound = { false, false, false, false, false, false, false, false }; 

        List<Action> actions = d.getActions();
    	for( Action a : actions ) {
    		int maxSid = -1;
    		String actionName = a.getName();
    		
    		// only state0 will have action1 or action2; all other states have just action0
    		if( actionName.equals("action1") || actionName.equals("action2") ) {
    			maxSid = 1;
    		}
    		else {
    			maxSid = (int)attNode.upperLim + 1;
    		}
    		
    		// check the applicable states for the action
    		for( int sid = 0; sid < maxSid; sid++ ) {
    			try {
	    			State s = GraphDefinedDomain.getState(d, sid);
	        		List<TransitionProbability> tpr = a.getTransitions(s, "");
	        		for( TransitionProbability pr : tpr ) {
	        			State t = pr.s;
	    				int tid = GraphDefinedDomain.getNodeId(t);
	    				String actionStr = a.getName() + " from state " + sid + " to state " + tid + 
							" with probability " + pr.p;
	    			
	        			// check that actionStr is in correctActions
	        			boolean isValidAction = false;
	        			for( int i = 0; i < correctActions.length; i++ ) {
	        				if( correctActions[i].equals(actionStr) ) {
	        					actionFound[i] = true;
	        					isValidAction = true;
	        				}
	        			}
	        			if( !isValidAction ) {
	        				System.out.println(actionStr + " was found in your domain, but it is not\n" +
	        						"one of the actions in the state diagram.");
	        				isCorrect = false;
	        			}
	        		}
    			}
	        	catch (NullPointerException npe) {
	        		// do nothing
	        	}

    		}
    	}
        
    	// check that all actions have been found
    	for( int i = 0; i < correctActions.length; i++ ) {
    		if( !actionFound[i] ) {
    			System.out.println(correctActions[i] + " was not found in your domain, but it is\n" + 
    					"one of the actions in the state diagram.");
    			isCorrect = false;
    		}
    	}
        
    	if( isCorrect ) {
    		System.out.println("Good job! Your code generates a domain with six states, all " +
                               "the deterministic actions given in the state diagram are present, " +
                               "and no additional actions are present.");
    	}
    	
    }
}

🤖   Lesson 2. Reinforcement Learning

Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

In apprenticeship learning, one assumes that an expert demonstrates the ideal behavior, and one tries to recover the policy directly from observations of the expert.

Reinforcement learning: http://www.cis.upenn.edu/~cis519/fall2015/lectures/14_ReinforcementLearning.pdf

  • Still assume an MDP:
    • A set of states s ∈ S
    • A set of actions (per state) A
    • A model T(s,a,s’)
    • A reward function R(s,a,s’)
  • Still looking for a policy π(s)

  • New twist: don’t know T or R

    • i.e. don’t know which states are good or what the actions do
    • Must actually try actions and states out to learn

passive learner: watches the world go by (follows a fixed policy and learns how good it is)

active learner: acts using the information learned so far, and uses a problem generator to explore the environment
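
A passive learner can be sketched as a TD(0) value estimator: it follows a fixed policy, never chooses actions itself, and updates state values from observed transitions without knowing T or R. The toy four-state chain, its 0.8 slip probability, the step size α = 0.1, γ = 0.9, and the episode count are all illustrative assumptions:

In [ ]:
import random

# Toy 4-state chain: the fixed policy always tries to move right, succeeding
# 80% of the time; reaching state 3 pays +1 and ends the episode.
def step(state):
    next_state = state + 1 if random.random() < 0.8 else state
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3

V = {s: 0.0 for s in range(4)}
alpha, gamma = 0.1, 0.9                  # step size and discount (illustrative choices)

for _ in range(5000):                    # the passive learner just watches episodes unfold
    s, done = 0, False
    while not done:
        s2, r, done = step(s)
        # TD(0) update: nudge V(s) toward the observed reward-plus-bootstrap target.
        V[s] += alpha * (r + gamma * (0.0 if done else V[s2]) - V[s])
        s = s2

print({s: round(v, 2) for s, v in V.items()})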

In [22]:
# Quiz: Evaluating A Policy
######################################
# p(o)=0.8
# 0.6 | o -> o+1 -> o -> o -> o
# 0.1 | o -> o -> o -> o+1 -> o
# 0.3 | o -> o+1 -> o -> o -> o-0.2
######################################
0.6*0.8 + 0.1*math.pow(0.8,3) + 0.3*(0.8 + math.pow(0.8,4)*(-0.2))
Out[22]:
0.746624