Binning data

Computation time for queries made with Genius models can increase significantly when a variable has a large number of categories. One strategy to overcome this is to bin your data so that each variable has fewer categories. Binning also lets you represent continuous data in your model. This page explains the basic ideas behind binning for both the discrete and continuous cases, along with the binning strategies offered in the model editor.


Binning discrete data

Discrete data may be binned by combining two or more categories into a single category. This is possible when the categories share a common parent category. For example, suppose we have a discrete variable called "Product Type". The diagram below shows how we could bin its categories, reducing the variable from eight categories to three:

Example of discrete data binning
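This kind of parent-category mapping can be sketched as a simple lookup table. The product types and parent categories below are hypothetical, chosen only to illustrate collapsing eight categories into three:

```python
# Hypothetical mapping from eight fine-grained product types
# to three parent categories.
PARENT = {
    "Laptop": "Electronics", "Phone": "Electronics", "Tablet": "Electronics",
    "Shirt": "Apparel", "Jacket": "Apparel", "Shoes": "Apparel",
    "Sofa": "Furniture", "Table": "Furniture",
}

def bin_product_type(value):
    """Replace a fine-grained category with its parent category."""
    return PARENT[value]

print(bin_product_type("Phone"))  # -> Electronics
```

Because the mapping encodes domain knowledge (which products belong together), it is typically written by hand rather than derived automatically.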

Note that this approach requires specific domain knowledge and may not be easily, or correctly, automated. For a complete list of automated binning strategies for discrete data offered in Genius, see Supported binning strategies.

Drawbacks: Discretizing, or binning, data reduces granularity as a tradeoff for the reduction in computation time. If the categories are only loosely related to one another, binning the data will introduce simplifications into the model that may have non-trivial effects on your ability to interpret the results of inference.

Binning continuous data

If you have continuous measurements in your data (e.g. weight, height, or other measurements that do not fit into categories), you must bin your data into discrete categories before using Genius. At this time, Genius does not support continuous variables in Bayesian networks.

Binning continuous data proceeds by dividing a continuous distribution into specific bins, or categories, of interest. For example, if you are measuring temperature on a scale from 0 to 100 degrees Fahrenheit, you could divide these continuous measurements into "Cold" (0–45), "Mild" (45–70), and "Hot" (70–100), with a convention (such as half-open intervals) for values that fall exactly on a bin boundary.
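The temperature example can be sketched in a few lines of Python. This is an illustration, not the model editor's implementation; it treats the bins as half-open intervals [0, 45), [45, 70), [70, 100] so each boundary value lands in exactly one bin:

```python
import bisect

EDGES = [45, 70]                 # cut points separating the bins
LABELS = ["Cold", "Mild", "Hot"]

def bin_temperature(temp_f):
    """Map a temperature in [0, 100] to a bin label.
    bisect_right counts how many cut points lie at or below the value,
    which is exactly the index of its half-open bin."""
    return LABELS[bisect.bisect_right(EDGES, temp_f)]

print(bin_temperature(44.9))  # Cold
print(bin_temperature(45))    # Mild
print(bin_temperature(98.6))  # Hot
```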

Drawbacks: As with discrete data, binning always reduces granularity, which may or may not have consequences for the inference problem at hand. If the effect you wish to measure is very precise and depends on degree-to-degree changes, then binning the data in this way may miss such effects. The benefit, however, is a decrease in computation time.

Supported binning strategies

The model editor provides a few binning strategies. This section will help you choose between the different options and understand how they work.

Binning strategies for continuous data

Fixed-width binning: This is the simplest binning strategy for continuous data. It divides the range of the data into bins of equal width. This strategy works best when the data is already roughly uniformly distributed and you want every bin to have the same width. If the data is not uniformly distributed, this strategy may produce bins that contain no data.
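A minimal fixed-width binning sketch (again, an illustration rather than the model editor's implementation) makes the empty-bin drawback visible: with skewed data, some bins receive no observations at all.

```python
def fixed_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 using equal-width
    bins spanning [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    def index(v):
        i = int((v - lo) / width)
        return min(i, n_bins - 1)  # keep the maximum value in the last bin
    return [index(v) for v in values]

# Skewed data: values cluster at both ends of the range.
data = [1, 2, 3, 50, 97, 98, 100]
print(fixed_width_bins(data, 4))  # -> [0, 0, 0, 1, 3, 3, 3]; bin 2 is empty
```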

Quantile binning: This equal-frequency strategy places bin edges at quantiles of the data, so that each bin contains roughly the same number of observations.
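A quantile-binning sketch using only the Python standard library (one possible approach, not necessarily the one the model editor uses) chooses cut points at sample quantiles and then assigns each value to the interval it falls in:

```python
import bisect
import statistics

def quantile_bins(values, n_bins):
    """Assign bin indices so each bin holds roughly the same number of
    observations, using sample quantiles as cut points."""
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect.bisect_left(cuts, v) for v in values]

data = [1, 2, 3, 50, 97, 98, 100]
print(quantile_bins(data, 3))  # bins of sizes 2, 3, and 2
```

Unlike fixed-width binning, the skew in the data no longer produces empty bins; instead, the bin widths vary.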

K-means clustering: This strategy groups data points with similar values into bins, called clusters, by assigning each point to the cluster whose mean it is closest to. It may be useful when your distribution has a complex shape with distinct peaks.
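A minimal one-dimensional k-means sketch illustrates the idea (initialization details are simplified; a production implementation would randomize and restart):

```python
def kmeans_1d(values, k, iters=20):
    """Return a bin (cluster) index for each value using 1-D k-means.
    Centers start evenly spread across the data range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [min(range(k), key=lambda i: abs(v - centers[i])) for v in values]

# Data with distinct peaks near 2 and near 98.
data = [1, 2, 3, 50, 97, 98, 100]
print(kmeans_1d(data, 3))  # -> [0, 0, 0, 1, 2, 2, 2]
```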

Binning strategies for discrete data

K-modes clustering: This strategy groups categorical data points by similarity, replacing the cluster mean with the cluster mode (the most frequent category of each attribute). Data points that share many category values will be binned together.
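A minimal k-modes sketch (illustrative only; a real implementation would randomize initialization and handle ties more carefully) measures distance as the number of mismatched attributes and uses per-attribute modes as cluster centers:

```python
from collections import Counter

def kmodes(rows, k, iters=10):
    """Cluster rows of categorical attributes with k-modes.
    Initial centers are the first k distinct rows."""
    centers = []
    for r in rows:
        if r not in centers:
            centers.append(r)
        if len(centers) == k:
            break

    def dist(a, b):
        # Matching dissimilarity: count of mismatched attributes.
        return sum(x != y for x, y in zip(a, b))

    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda i: dist(r, centers[i]))
                  for r in rows]
        for i in range(k):
            members = [r for r, a in zip(rows, assign) if a == i]
            if members:
                # New center: the mode of each attribute column.
                centers[i] = tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members))
    return assign

rows = [("red", "small"), ("red", "medium"), ("blue", "large"),
        ("blue", "large"), ("green", "large")]
print(kmodes(rows, 2))  # -> [1, 1, 0, 0, 0]
```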
