Building a POMDP model from scratch
It is recommended to read the tutorial on active inference before reading this page.
All active inference models rely upon a partially observable Markov decision process, or POMDP. A POMDP is a special kind of probabilistic model with several components required for executing active inference. This tutorial will explain the basic structure of the POMDP and describe how to think about building POMDP models for use in Genius.
Components of POMDP models
As explained in the POMDP tutorial, the structure of the POMDP model is based on the following assumptions:
Partial observability - system states are unobservable
Agents can take actions to affect the environment, or at least plan possible actions they could take
States are dynamic - they can change over time
The Markov property - agents assume that all the information they need to predict future states is contained in the most recent past state
The POMDP model is a simplification or "representation" of the process being modeled that relies on these specific assumptions. As with any model, the assumptions will greatly affect the model's performance. The POMDP assumptions are very general, which means they enable good performance in a wide variety of modeling settings, provided that the variables, actions, and factor probabilities are well suited to the process being modeled.
Like other probabilistic models, POMDP models are represented in Genius as factor graphs which consist of a set of variable nodes and factor nodes connected by edges. POMDP models must contain the following variable and factor nodes:
Variable nodes:
States - Unobserved states of the environment or real world process being modeled that the agent does not have specific data about. These states are assumed to change over time because POMDP models are dynamic. States are latent variables in this probabilistic model. The model assumes that observations are directly related to specific states.
Observations - Also known as data or evidence. Observations represent a collection of all data available to the agent. Observations are used to infer states because states are assumed to be unobserved or unknowable to the agent. POMDP models assume observations are directly dependent on states. That is, if the system being modeled is in a particular state, then a particular observation is generated. Therefore, it should be possible to reason in reverse through Bayesian inference to infer the probability that the system is in a particular state given a particular observation.
Actions - Also known as controls, the action variable encodes a set of possible actions that the agent can take to affect its environment. Alternatively, if the agent cannot directly control the environment, it may still be able to imagine possible action sequences it could take to affect the environment (e.g. planning or decision-making).
Factor nodes:
Likelihood - The likelihood captures the agent's assumptions about how states are connected to observations. In other words, the likelihood contains the probabilities in the agent's model for which observations it expects when the environment is in a particular state. If the states of the environment were rain=yes or rain=no, then the resulting observations could be wet_grass=yes or wet_grass=no. The agent must enumerate the probabilities associated with each combination of categories. With two states and two observations, this is a total of four possible combinations. Although the agent may not have access to the true state of the world (raining or not raining in this example), the power of Bayesian inference allows the agent to go backward from the data (grass is wet or not) to determine a belief about which state of the world is most likely.
State-transition - The state transition captures the agent's assumptions about how states will change over time. For example, suppose we have four states: cloudy, rainy, sunny, and snowy. Weather is dynamic and will change over time. The state-transition captures the probability of the next state based on the current one. For example, if it is currently cloudy, there may be a high probability that the next state, at the next time step, is rainy and a low probability that it will be sunny. When we build this state-transition factor, we are responsible for selecting probabilities that accurately reflect the process we are attempting to model, or for learning these probabilities from data. In this example, there are four possible states for the present and the next time step, so there are 4×4=16 possible probabilities to fill in.
Initial state prior - The initial state prior represents the agent's belief about which state is most likely at the beginning of a simulation (time step 0). For example, in our example of four states - cloudy, rainy, sunny, and snowy - the agent needs to have a probability associated with each state at the start of the simulation. In the absence of any other information, one could use equal (uniform) probabilities for each (0.25 for each state in the case of four states, following the principle of indifference).
Preference - The agent's goal, encoded as the observations it prefers. If the agent takes an action, it will affect its environment and change the state of the environment. This new state will result in a new observation that the agent receives. The preference factor should encode the agent's preference for each observation as a probability. In practice, one can assign negative and positive values to preferences, which Genius will convert internally to probabilities that sum to one. For example, if an agent prefers to observe rain and is very averse to it not raining, we could assign a preference of 6 to rain=yes and a preference of -8 to rain=no. Technically, these values represent unnormalized log-probabilities, known as logits, which are normalized into a probability distribution.
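The four factor nodes above can be sketched as arrays. This is a minimal numpy illustration of the rain/wet-grass example, not Genius code; all probability values here are invented assumptions for the sake of the example.

```python
import numpy as np

# States: rain=yes, rain=no. Observations: wet_grass=yes, wet_grass=no.
# All numbers below are illustrative assumptions.

# Likelihood P(observation | state): columns are states, rows are observations.
A = np.array([[0.9, 0.2],   # P(wet_grass=yes | rain=yes), P(wet_grass=yes | rain=no)
              [0.1, 0.8]])  # P(wet_grass=no  | rain=yes), P(wet_grass=no  | rain=no)

# State transition P(next state | current state): columns are current states.
B = np.array([[0.7, 0.3],
              [0.3, 0.7]])

# Initial state prior P(state at time step 0): uniform here.
D = np.array([0.5, 0.5])

# Preferences over observations as raw unnormalized values (logits).
C = np.array([6.0, -8.0])   # prefers wet grass, averse to dry grass

# Sanity checks: every column of A and B, and D itself, must sum to 1.
assert np.allclose(A.sum(axis=0), 1.0)
assert np.allclose(B.sum(axis=0), 1.0)
assert np.isclose(D.sum(), 1.0)
```

Note that the likelihood and transition factors are column-stochastic: each column is a full conditional distribution, which is why the sanity checks sum over axis 0.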
Steps to building a POMDP model
Designing a POMDP model requires choosing states, observations, actions, preferences, and factor probabilities. Once these model components are defined in a factor graph, Genius can be used to run action selection to determine the optimal sequence of actions from a particular time step given how we have defined the agent's preferences.
As emphasized in the section on building probabilistic models, it is important to choose model variable names (states, observations, and actions) in a way that is clearly defined and documented. It is also important to carefully choose the right categories that apply to each model variable. Since the POMDP structure is already defined, the step described previously on choosing model structure no longer applies.
Choosing states
One strategy for choosing states is to put yourself in the position of the agent and ask: "What, specifically, are all the details I would need to know about this dynamic process in order to make a decision about how to act?" This question should be answered as explicitly and as detailed as possible.
Select states based on what is most relevant to the goal the agent must accomplish: Ultimately, the reason we design active inference agents is that we need to make decisions about some kind of dynamic process. This dynamic process is something in the real world we care about that changes over time. That is, at one point in time the process exists in a particular state and then at another point in time the state is different. If knowing this information is relevant for an agent to make a decision then it would be important to include these states in the model.
Below are some examples of dynamic processes and their associated goals. In all these cases, the system exists in a particular state at a particular time point, and this state changes to a new state at the next time point.
| State variable | Categories | Dynamic process | Agent's goal |
| --- | --- | --- | --- |
| location | plot_0, plot_1, plot_2, plot_3, plot_4, plot_5 | The agent's physical location in plots of an irrigation system | Move to a plot that needs to be watered |
| position | shelf, shelf2, shelf3, shelf4, shelf5 | The position of a moving object in distinct shelf locations in a warehouse | Move to the correct shelf to stack an item |
| weather | sunny, rainy, cloudy, snowy | Weather changes from day to day | Determine the most likely sequence of weather over a 5 day forecast |
| disease_cases | high, low, same | Long-term trends in levels of disease cases | Determine the most likely sequence of case number changes over a multiple day forecast |
| population | high, low, same | Population fluctuations of an animal species | Determine the most likely population level changes over a multiple season forecast |
Note that in some cases the agent does not need to physically act on the environment but instead forecast its belief about the right decision. This can be useful in cases where an agent is built to give recommendations to a human rather than an agent that physically acts in the real world.
Select states whose fluctuations are relatively stable over time: Since POMDP models assume categorical data, we may not be able to capture trends that occur on very fast time scales. For example, models of wind turbulence or second-by-second trends in financial markets may not be well suited to POMDP models because they are best represented by continuous, real-valued data rather than categories. States that fit well into categories are relatively stable for longer periods of time and tend to change from one state to the next without much of a smooth transition in between. For example, a light switch is either in the state of ON or OFF; there are no graded states in between, assuming the switch does not operate on a dimmer. However, POMDP models can still be used effectively with binned continuous data, as demonstrated in the insulin pump example.
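Binning continuous data into categories can be sketched in a few lines. The bin edges, labels, and glucose readings below are invented for illustration and are not taken from the insulin pump example itself.

```python
import numpy as np

# Hypothetical continuous glucose readings (mg/dL).
glucose_mg_dl = np.array([62.0, 95.0, 130.0, 210.0])

# Three categorical states: low (< 70), normal (70-180), high (> 180).
edges = np.array([70.0, 180.0])
labels = ["low", "normal", "high"]

# np.digitize maps each reading to the index of its bin.
state_indices = np.digitize(glucose_mg_dl, edges)
states = [labels[i] for i in state_indices]
print(states)   # ['low', 'normal', 'normal', 'high']
```

The choice of bin edges is itself a modeling decision: too few bins loses information the agent needs, too many bins reintroduces the granularity problems of the continuous domain.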
Select state categories that are relatively distinct from one another: It is important that the states captured in the model represent distinct categories that do not overlap in definition. In the weather example in the table above, the four state categories are arguably distinct processes with distinct qualities. However, there may be cases where it is hard to distinguish between rainy and cloudy. This might suggest the need for more categories. However, if you find yourself needing to add an excessive number of categories, it may indicate that the problem is not well suited to the discrete domain and would be better modeled in the continuous domain.
Choosing observations
Often, the states we wish to measure and track over time in the real world are not directly observable. This means that we cannot gather data about the state itself in order to make a prediction or forecast. However, states are usually associated with other kinds of measurable quantities that a physical robot could sense or for which data could be gathered. Here are some example observations that might pair with the state examples listed above:
| State variable | Observation variable | Categories |
| --- | --- | --- |
| weather | conditions | cold, wet, hot |
| marine_population | eggs | none, few, many |
In the case of weather, the agent may not know directly what the actual weather is, but if it has sensors that tell it whether it is cold, wet, or hot, it may be able to infer the weather. Likewise, if we are measuring population fluctuations in marine animals, we might not be able to count the actual number of individuals in the population. However, there is often associated evidence, such as the number of eggs, which an ecologist could measure.
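The "reasoning in reverse" from observation to state is one application of Bayes' rule. Here is a minimal numpy sketch for the weather example; the likelihood probabilities are invented assumptions, not values from any real Genius model.

```python
import numpy as np

states = ["sunny", "rainy", "cloudy", "snowy"]
observations = ["cold", "wet", "hot"]

# Likelihood P(observation | state): rows are observations, columns are
# states. All numbers are illustrative assumptions; columns sum to 1.
A = np.array([[0.1, 0.3, 0.4, 0.9],   # cold
              [0.1, 0.6, 0.3, 0.1],   # wet
              [0.8, 0.1, 0.3, 0.0]])  # hot

prior = np.full(4, 0.25)   # uniform prior belief over weather states

# The agent's sensor reports "wet"; invert the likelihood with Bayes' rule.
obs = observations.index("wet")
posterior = A[obs] * prior
posterior /= posterior.sum()   # normalize to get P(state | observation)

# "rainy" is now the most probable hidden state.
print(states[int(posterior.argmax())])   # rainy
```

Because the prior is uniform here, the posterior is simply the normalized row of the likelihood; with a non-uniform prior the same two lines weigh the sensor evidence against prior belief.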
As with probabilistic models, it is important to choose the right granularity and have precise definitions for what each category refers to. More importantly, we should choose observations that are uniquely associated with the states in the model to reduce ambiguity. In the agent warehouse navigation example, shelf is used as an observation of what the agent is likely to observe if it is in the storage position (state). However, shelf could also indicate the office position (state), since both the storage and office areas have shelves. It may not always be possible to define unique observations in the real world, but it is always preferred when possible.
Note that observation choice is also directly linked to preferences as explained in the section below.
Choosing actions
When choosing an action consider whether or not the agent has the ability to perform these actions. If the agent is a robot or has ways to interact with the world, such as a control system in a power plant, it may have a limited number of operations it can perform. This will directly limit the types of actions you can choose for your agent.
In cases where the agent does not physically alter the world but only suggests possible courses of action, it is important to select actions that could actually be performed were they to be carried out in the real world. If the agent is planning or informing business decisions, the actions it recommends should not be merely theoretical but genuinely actionable.
Choosing preferences
Preferences are directly defined in terms of observations that the agent prefers to receive as a result of an action or sequence of actions. When choosing preferences, it is important to consider each observation that has been defined and the relative preference for that observation. Since preferences are assigned relative to one another, ensure that the values you pick are consistent with the scale you have chosen.
For example, if you chose a scale of -10 (aversive) to 10 (rewarding) then 0 would represent a neutral preference. On the other hand, one could just as easily pick a scale of 0 (aversive) to 10 (rewarding) with 5 representing a neutral preference. With this in mind, consider the following:
The first scale spans 21 integer values while the second is more compressed and can represent only 11. The level of granularity depends on how specific you would like to be when assigning preferences.
Using negative values in a scale is entirely a matter of personal preference: some model builders find it intuitive to represent aversive preferences as negative values. Ultimately, the scale is converted into a probability distribution, so these values are purely a construction convenience for the modeler and make no difference to the agent.
Finally, since preferences and observations go hand-in-hand, it is important to take preferences into account when choosing observations. The observations chosen must be indicative of data the agent could observe with its sensors (whether real or simulated) that it prefers to receive from different environment states.
Choosing factor probabilities
In many cases, we may wish to use the data available to the agent to learn factor probabilities in the POMDP setting. However, we are not required to do so and sometimes adding our own factor probabilities using domain knowledge can be very helpful to get the model started in early time steps.
Choosing factor probabilities relies on the same types of heuristics mentioned in the tutorial on building probabilistic models from scratch. In addition to these heuristics, consider that POMDP models are dynamic in nature so the past influences the future and some states are more likely to generate particular observations than other states. With this in mind, consider the following:
The initial prior probabilities should be chosen based on either the modeler's experience with the process being modeled or as equal probabilities across all states (once again conforming to the principle of indifference). For example, if you are building a weather model in a hot climate, then the state of "sunny" is likely to be more probable than other states like "cloudy" or "rainy".
The state transition probabilities should be chosen based on the modeler's experience with how certain states follow other states over time. For example, if you observe that cloudy days are almost always followed by rainy days, then there would be a much higher probability of transitioning from cloudy to rainy than to other states. Likewise, if rainy days are usually followed by sunny days, then this transition is most likely.
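A state-transition factor like the one just described can be written as a column-stochastic matrix, with one step of belief propagation given by a matrix-vector product. The specific probabilities below are invented for illustration.

```python
import numpy as np

states = ["cloudy", "rainy", "sunny", "snowy"]

# B[i, j] = P(next state i | current state j); each column sums to 1.
# Illustrative assumptions: cloudy usually turns rainy, rainy usually
# turns sunny, matching the example in the text.
B = np.array([[0.2, 0.1, 0.3, 0.3],   # -> cloudy
              [0.6, 0.2, 0.1, 0.2],   # -> rainy
              [0.1, 0.6, 0.5, 0.1],   # -> sunny
              [0.1, 0.1, 0.1, 0.4]])  # -> snowy
assert np.allclose(B.sum(axis=0), 1.0)

belief_today = np.array([1.0, 0.0, 0.0, 0.0])   # certain it is cloudy
belief_tomorrow = B @ belief_today              # predicted belief for tomorrow

print(states[int(belief_tomorrow.argmax())])    # rainy
```

Applying `B` repeatedly rolls the belief forward over multiple time steps, which is the mechanism behind the multi-day forecasts in the state examples above.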
The likelihood probabilities should be chosen on the basis of the modeler's experience with how often certain observations are associated with specific states. For example, in the agent warehouse navigation example, we associate the "convey" state (the agent's position is the conveyor belt room) with the following observations: conveyor belts, lights, and boxes. We could assign a high probability to boxes and conveyor belts because when the agent is in this room it would be most likely to pick up the presence of these objects with its sensors. On the other hand, lights are still possible but much less likely. Likewise, the other observations in this example, such as doors, shelves, and pallets, are not likely to be found in this room of the warehouse at all.
Multiple independent types of states and observations
In more complex active inference examples, like the multi-armed bandit example, it may be possible to include multiple types of independent states (state factors) or observations (observation modalities). The most important principle to keep in mind is to ensure that these factors or modalities are actually independent from one another.
For example, in the multi-armed bandit (dual slot machine) example, there are two state factors:
Context factor - Includes categories of "left slot machine is better" or "right slot machine is better".
Choice factor - Includes categories for "the simulation has started", "the agent took a hint", "the agent activated the left machine", and "the agent activated the right machine".
As you can see, "context" and "choice" are essentially independent types of states whose transitions do not directly influence each other. The agent keeps track of the current context - which machine it thinks is better at a given time step - and choice - which choice it has just observed itself making - and these states may change over time without affecting one another.
In this example there are also three observation modalities:
Hint modality - Includes categories for "no hint was used", "hint indicates left machine is probably better", and "hint indicates right machine is probably better".
Reward modality - Includes categories for "no reward received", "a loss was received", and "a reward was received".
Choice modality - Includes categories for "agent starts the simulation", "the agent took a hint", "the agent activated the left machine", and "the agent activated the right machine".
Here, "hint", "reward", and "choice" are all separate observations that do not influence one another. In fact, each depends only on the state factors, not on the other modalities.
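The independence of state factors can be made concrete: each factor gets its own transition matrix, and the joint transition over the combined state space is their Kronecker product. This is a loose numpy sketch of the two bandit factors above; the matrices and the action convention are assumptions for illustration, not the actual multi-armed bandit model.

```python
import numpy as np

context = ["left_better", "right_better"]              # state factor 1
choice = ["start", "hint", "play_left", "play_right"]  # state factor 2

# Context does not change over time: its transition is the identity.
B_context = np.eye(2)

# The choice factor is fully controllable: taking an action
# deterministically moves the choice state to the chosen category.
def B_choice(action_index):
    B = np.zeros((4, 4))
    B[action_index, :] = 1.0   # every current state maps to the action's state
    return B

# Because the factors are independent, the joint transition over the
# combined (2 x 4 = 8) state space is just the Kronecker product.
B_joint = np.kron(B_context, B_choice(2))   # action: play the left machine
assert B_joint.shape == (8, 8)
assert np.allclose(B_joint.sum(axis=0), 1.0)
```

Keeping the factors separate means storing two small matrices (2×2 and 4×4) instead of one 8×8 matrix, and the saving grows quickly as factors are added; this is the practical payoff of ensuring factors really are independent.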