Building probabilistic models from scratch

Building a good probabilistic model can be challenging because of the ambiguities and messiness of the real world. This tutorial provides some general tips and advice on how to think about building probabilistic models for your particular domain of interest. If good-quality data is already available, then one can proceed with parameter learning. Otherwise, the data must be gathered first, where that is feasible, or the model must be constructed by hand. The following flowchart outlines the possibilities in more detail.

Flowchart for creating a Genius model

In this tutorial we focus on the situation of building a model by hand. We divide the process into four steps and describe some heuristics that will enable successful model design.

Choosing model variables

The first step is to pick model variables of interest. This is perhaps the most important step because all other steps depend on the choice of variables.

Use clearly defined and documented variables: It is important to select variables with precise and unambiguous definitions. For example, in the sprinkler dataset (see the Bayesian network tutorial), the "cloudy" variable could have many different interpretations depending on what it means for it to be "cloudy". Often, Bayesian networks are created after some data has been collected, so the definition of "cloudy" should have been specified during the data collection stage, perhaps by some measurement that creates the decision boundary between whether or not it is cloudy. "Rain" may also be ambiguous without further description. If it is sprinkling, does this count as rain? Having a quantitative measurement associated with a variable makes its definition more precise. For example, rainfall is usually measured in either millimeters or inches, and a defined cutoff between raining and sprinkling can make the variable's definition unambiguous.
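The rain example can be made concrete by committing to an explicit cutoff. A minimal sketch, where the 2.5 mm threshold is an illustrative choice rather than a meteorological standard:

```python
def rain_category(rain_mm: float) -> str:
    """Map a quantitative rainfall measurement (mm over the observation
    window) to a discrete, unambiguously defined category.

    The 2.5 mm cutoff between "sprinkling" and "raining" is an
    illustrative choice, not a meteorological standard.
    """
    if rain_mm <= 0.0:
        return "dry"
    elif rain_mm < 2.5:
        return "sprinkling"
    return "raining"
```

With a function like this, the variable's definition lives in code that can be documented and reviewed, rather than in each data collector's head.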

In many cases, the level of specificity depends on what the model is being used for. In the simple sprinkler example the purpose is to teach readers to understand how Bayesian networks work and so the precise definitions do not matter. But if the variables of interest were part of a medical diagnosis tool then the variable names and description should be documented thoroughly and precisely defined based on quantitative measurements.

It is especially important to get the description and specificity of a variable correct because the dependency relationships to its child variables may hinge upon how the parent variable is defined. Furthermore, when we reason with our model, the end result and its interpretation can be affected by poorly defined variables. If the variables come from a pre-existing dataset, then it is important to document and understand what these variables mean before building the network.

Models are a simplification of reality: A good model simplifies as much as possible (lower model complexity) while still preserving predictive accuracy. Simpler models are easier to interpret and also less prone to overfitting (i.e. a model that fails to generalize correctly beyond the data it was built from). In practice this may be difficult to accomplish because of the subjective nature of selecting variables. When picking which variables to use, you should ask yourself the following questions:

  • What variables are the most relevant, impactful, and influential for the types of predictions you wish to perform?

  • Which variables cover the broadest set of possible outcomes?

  • Is any information in one variable already covered by another variable?

These three questions highlight the general rule that you should include the fewest variables you think you will need in your model. You can always add more variables if it turns out your model does not have the appropriate predictive power for the modeling task at hand. The questions also emphasize that you do not need to include variables to cover every edge case.
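The third question, whether one variable's information is already covered by another, can be checked empirically when data exists. One rough approach is to estimate the mutual information between two candidate variables from co-occurrence counts; a sketch (the function name is our own):

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) observations.

    Values near 0 suggest the variables carry independent information;
    values near the entropy of X suggest one variable is largely
    redundant given the other.
    """
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts over (x, y)
    px = Counter(x for x, _ in pairs)    # marginal counts over x
    py = Counter(y for _, y in pairs)    # marginal counts over y
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

For two perfectly correlated binary variables this estimate is 1 bit, and for independent ones it is 0, so a value close to the variable's own entropy is a hint that it may be safely dropped.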

Stick to what you can observe (usually): In many cases we want to choose variables that we can directly observe and gather data on to help learn the model's probabilities. This is usually the best place to start when selecting variables of interest.

However, there are cases where we may want to include latent variables in our model: unobserved quantities for which we have no data but which we believe play a significant role. These may be intervening factors that we suspect influence the process even though we cannot directly gather data on them. Latent variable inference allows us to effectively treat these variables as "unobserved data" whose values can be inferred from the observed data. While latent variables may give our models more explanatory power, using them requires some careful consideration.
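As a toy illustration of treating a latent variable as unobserved data, the posterior over a binary latent state given a single observation follows directly from Bayes' rule. The numbers and state names below are made up for illustration:

```python
def latent_posterior(prior, likelihood):
    """Posterior P(z | x) over latent states z after observing x.

    prior:      dict mapping each latent state z to P(z)
    likelihood: dict mapping each latent state z to P(x | z)
    """
    unnormalized = {z: prior[z] * likelihood[z] for z in prior}
    total = sum(unnormalized.values())
    return {z: v / total for z, v in unnormalized.items()}

# A 50/50 prior over a hidden factor, updated by an observation that is
# nine times more likely under the "present" state.
posterior = latent_posterior({"present": 0.5, "absent": 0.5},
                             {"present": 0.9, "absent": 0.1})
```

This is the simplest case; with many latent variables or missing data, algorithms such as expectation-maximization perform this kind of inference at scale.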

Choosing categories within each variable

Each variable we choose must be broken down into categories which are assigned probabilities. However, it can sometimes be difficult to determine how to do this properly and at what granularity. Granularity also matters for more technical reasons, such as conditional independence assumptions we have built into our model. In many cases we may need to bin a continuous random variable into a discrete random variable.

For example, blood sugar is often measured in milligrams per deciliter (mg/dL) (see the insulin pump example). Normal blood sugar ranges from 60 mg/dL (fasting) to 140 mg/dL (shortly after a meal). How should we bin this measurement? To answer this we need to know what question the model is meant to answer and what level of granularity is needed to capture the relevant information for prediction. If we are interested in the nature of the blood sugar spike that typically follows a meal, perhaps we want to bin in increments of five mg/dL to capture these details in the model. On the other hand, this means there are more probabilities to represent in the model, which can be more computationally expensive. What if the patient is diabetic? Now the normal blood sugar range may actually extend to 200 mg/dL or more, and we may decide we need a different level of granularity to capture the effect we are studying.
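Both granularities can be sketched in a few lines. The bin edges and labels below are illustrative choices for the example, not clinical guidance:

```python
def coarse_glucose_bin(mg_dl: float) -> str:
    """Bin a blood-sugar reading (mg/dL) into a few broad categories.

    The edges (70/100/140/200) are illustrative, not clinical guidance.
    """
    edges = [(70, "low"), (100, "normal-fasting"),
             (140, "normal-postprandial"), (200, "elevated")]
    for edge, label in edges:
        if mg_dl < edge:
            return label
    return "high"

def fine_glucose_bin(mg_dl: float) -> int:
    """Index of a 5 mg/dL-wide bin starting at 60 mg/dL, for modeling
    the shape of a post-meal spike at finer granularity."""
    return int((mg_dl - 60) // 5)
```

The coarse version yields five categories (five probabilities per parent configuration), while the fine version yields dozens, which is exactly the granularity-versus-cost trade-off described above.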

Another example concerns discrete variables like time of day. Should the categories be morning, evening, night? What are the cutoffs for these categories? Should the categories instead be divided by hour? AM/PM? Answering this question again depends on the level of granularity needed for the task at hand.
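One possible discretization on a 24-hour clock, using the three categories mentioned above; the cutoffs are an arbitrary illustrative choice, and a different task might call for hourly bins instead:

```python
def time_of_day(hour: int) -> str:
    """Map an hour (0-23) to one of three coarse categories.

    The cutoffs (06:00, 12:00, 20:00) are illustrative; any deployment
    would need to document and justify its own boundaries.
    """
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 20:
        return "evening"
    return "night"
```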

Choosing model structure

Since the world is itself causal, it can be helpful to build our causal assumptions into the model. The easiest way to accomplish this is to start with a variable one is interested in inferring and work backward to determine which other variables may cause it. Preferably, the included variables are those for which it is possible to collect data so the probabilities can be learned. Once this relationship is established, one can continue in this manner for each new variable and determine what other variables might cause it. When picking variables this way, try to select variables that you think would be the primary cause of the effect under consideration to avoid overcomplicating the model.
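For the sprinkler network, this backward-from-effect process yields a structure that could be recorded, for instance, as a mapping from each variable to its direct causes. The representation below is a sketch, not any particular library's API:

```python
# Direct causes (parents) of each variable, built by starting from
# wet_grass and repeatedly asking "what could cause this?"
structure = {
    "cloudy":    [],
    "sprinkler": ["cloudy"],
    "rain":      ["cloudy"],
    "wet_grass": ["sprinkler", "rain"],
}

def ancestors(var, parents):
    """All variables that can causally influence `var` (directly or
    through intermediate variables)."""
    seen = set()
    stack = list(parents[var])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen
```

A helper like `ancestors` makes it easy to sanity-check the structure: everything upstream of `wet_grass` should be a plausible cause of wet grass.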

Choosing probabilities

When the model variables and their structure are set, the next step is to assign probabilities to each factor in the graph. This process can be difficult since we need to take subjective beliefs and/or expert knowledge (which may be in our own heads or in the head of a subject matter expert) and translate them into probabilities. There are no hard and fast rules for this process, except perhaps to avoid zero probabilities unless you are absolutely certain that a particular event is impossible. Since zero probabilities cannot be updated on the basis of evidence, assigning one incorrectly will affect the rest of the model. Unless you are absolutely positive a zero probability is warranted, choose a small, near-zero value instead.
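A minimal sketch of this fix: replace accidental zeros in a hand-assigned distribution with a small epsilon and renormalize so the probabilities still sum to one. The epsilon value is an arbitrary illustrative choice:

```python
def smooth(probs, eps=1e-3):
    """Replace exact zeros in a discrete distribution with a small
    epsilon, then renormalize so the result still sums to 1."""
    smoothed = [p if p > 0 else eps for p in probs]
    total = sum(smoothed)
    return [p / total for p in smoothed]
```

After smoothing, no outcome is treated as strictly impossible, so evidence can still shift belief toward it.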

Sometimes assigning probabilities is tricky because we may not know how variables we did not include in our model could affect a probability. For example, in the sprinkler example, if the sprinkler is off and there is no rain, then the probability of wet grass is 0.0. While this does make sense in the context of the example, the grass could be wet because of other variables we did not include in our model. Therefore perhaps this is not really an impossible event with zero probability but an improbable event with near-zero probability.

Note

In some cases we may not know the right probabilities to include in our model. However, if we are able to gather data we could use the data itself to learn the probabilities. This process is known as parameter learning.
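In the discrete case, parameter learning amounts to counting. A sketch with additive (Laplace) smoothing, which also avoids zero probabilities; the function and record layout are our own illustration:

```python
from collections import Counter

def learn_cpt(records, child, parents, child_values, alpha=1.0):
    """Estimate P(child | parents) from a list of dict records by
    counting, with additive smoothing `alpha` (alpha=0 gives the
    plain maximum-likelihood estimate)."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
    totals = Counter(tuple(r[p] for p in parents) for r in records)
    return {
        pa: {v: (joint[(pa, v)] + alpha) / (n + alpha * len(child_values))
             for v in child_values}
        for pa, n in totals.items()
    }
```

For example, given ten cloudy days of which eight were rainy, the maximum-likelihood estimate of P(rain | cloudy) is 0.8.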

Sanity checking

After a model is fully created, the final step is to do some sanity checking to make sure the queries make sense. For example, taking the sprinkler example from the previous notebooks, we can see that if it is cloudy, then rain has a probability of 0.9, which makes sense. Likewise, if the sprinkler is on and there is rain, then the probability of wet grass (from the variables in our model alone) is 0.99. These probabilities match how we understand the world to operate, and the Bayesian network captures this understanding in a model that we can query.
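Such checks can be automated with brute-force enumeration queries over the network. In the sketch below, P(rain | cloudy) = 0.9 and P(wet | sprinkler, rain) = 0.99 are the values quoted above; every other number is an illustrative placeholder:

```python
# Conditional probability tables for the sprinkler network.  The 0.9
# and 0.99 entries come from the tutorial; the rest are placeholders.
p_sprinkler = {True: 0.1, False: 0.5}            # P(sprinkler=T | cloudy)
p_rain = {True: 0.9, False: 0.2}                 # P(rain=T | cloudy)
p_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.0}  # P(wet=T | sprinkler, rain)

def p_wet_given_cloudy(cloudy):
    """P(wet=T | cloudy) by summing over sprinkler and rain states."""
    total = 0.0
    for s in (True, False):
        for r in (True, False):
            ps = p_sprinkler[cloudy] if s else 1 - p_sprinkler[cloudy]
            pr = p_rain[cloudy] if r else 1 - p_rain[cloudy]
            total += ps * pr * p_wet[(s, r)]
    return total
```

A natural sanity check on these numbers is that wet grass should be more likely on cloudy days than on clear ones, which the enumeration confirms.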
