Frequently asked questions, tips and tricks

Designing models can be challenging. This page collects strategies for building high-quality models, dos and don'ts, and other items that may be useful for troubleshooting.

Imported data

This section covers best practices for data imported from CSV. Note that at this time it is not possible to import a CSV with any missing entries. Rows with missing entries must be removed manually, or the missing values imputed, before import.

See also: CSV data format specifications

Data cleaning

CSV data is used to learn the probabilities for models used in Genius. The success of this learning process relies heavily on the quality of the input data. Follow these best practices to avoid errors and to ensure Genius can produce the best model possible given the data:

Ensure that non-empty headers exist in the first row for all columns.
Remove any invalid characters.
Remove any rows with null or empty values (i.e. missing data).
Be aware that although columns containing dates can be included in models, they are not the best fit for current Genius models and may produce unexpected results.
Avoid mixing columns with numeric and categorical data.
Bin continuous values into discrete categories (see section on binning data below).
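
Several of the checks above can be automated before import. The following is a minimal sketch using only the Python standard library; it is not part of Genius or its SDK. The function names are hypothetical, and the sketch reads the item on mixing numeric and categorical data as referring to values within a single column, which is an assumption.

```python
import csv
import io


def is_number(value):
    """Return True if the text parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def find_csv_problems(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    # Every column needs a non-empty header in the first row.
    for index, name in enumerate(header, start=1):
        if not name.strip():
            problems.append(f"column {index}: empty header")
    # Every data row needs one non-empty value per column.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number}: expected {len(header)} values, found {len(row)}"
            )
        elif any(not value.strip() for value in row):
            problems.append(f"row {row_number}: missing value")
    # Assumption: a column should not hold both numbers and text labels.
    for index, name in enumerate(header):
        values = [
            row[index].strip()
            for row in rows[1:]
            if index < len(row) and row[index].strip()
        ]
        if len({is_number(value) for value in values}) == 2:
            problems.append(f"column {name!r}: mixes numeric and text values")
    return problems


sample = "weather,traffic,temp\nsunny,light,21\nrainy,,warm\n"
for problem in find_csv_problems(sample):
    print(problem)
```

An empty result means none of these particular checks failed; it does not guarantee that the import will succeed.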

Samples per category

A model is only as good as its data. The data-gathering step is therefore perhaps the most important step in the modeling process, and care is needed to ensure that the imported data is free of sampling bias and representative of the distribution you wish to model. In practical terms, this means importing data in which every category is covered by a sufficient number of samples. As a general rule, the more samples you have, the more accurate your model will be and the better the results produced by Genius.

See the example on continuous learning for a demonstration of the bias that results when there is too little data or unrepresentative data and how the situation can be improved with a larger sample size.

While Genius can handle several thousand rows of data, we recommend that you do not exceed 25 categories per variable and that you limit the number of variables (CSV columns) used for learning to no more than 15.
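
As a rough illustration, the recommendations above can be checked on the rows of a CSV before import. This sketch is not part of Genius; the function names are hypothetical, and the `min_samples` threshold of 5 is an arbitrary placeholder, since the text does not state how many samples per category are enough.

```python
from collections import Counter

MAX_CATEGORIES = 25  # recommended ceiling per variable, from the text above
MAX_VARIABLES = 15   # recommended ceiling on columns used for learning


def summarize_columns(header, rows):
    """Map each column name to a Counter of how often each category occurs."""
    return {
        name: Counter(row[index] for row in rows)
        for index, name in enumerate(header)
    }


def check_limits(header, rows, min_samples=5):
    """Return warnings; min_samples is an illustrative threshold only."""
    warnings = []
    if len(header) > MAX_VARIABLES:
        warnings.append(
            f"{len(header)} variables exceeds the recommended {MAX_VARIABLES}"
        )
    for name, counts in summarize_columns(header, rows).items():
        if len(counts) > MAX_CATEGORIES:
            warnings.append(
                f"{name}: {len(counts)} categories exceeds the recommended {MAX_CATEGORIES}"
            )
        for category, count in counts.items():
            if count < min_samples:
                warnings.append(f"{name}={category}: only {count} sample(s)")
    return warnings


header = ["weather", "traffic"]
rows = [["sunny", "light"]] * 6 + [["rainy", "heavy"]] * 2
for warning in check_limits(header, rows):
    print(warning)
```

Categories flagged as sparse are candidates for collecting more data or for merging with related categories.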

Inspecting model probabilities after import

After you import the dataset, inspect the probabilities in the model yourself as a sanity check. As a domain expert, you may be able to spot probabilities in a particular category that do not make sense, which suggests that the data may not represent the true distribution.
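
One possible reference point for this check is the raw relative frequency of each category in the CSV itself. The sketch below computes conditional frequencies for one parent and child column pair. It is an assumption that the probabilities shown in Genius will be broadly similar to these frequencies; an exact match should not be expected, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def conditional_frequencies(pairs):
    """Given (parent_value, child_value) pairs, return {parent: {child: frequency}}."""
    counts = defaultdict(Counter)
    for parent, child in pairs:
        counts[parent][child] += 1
    return {
        parent: {
            child: count / sum(child_counts.values())
            for child, count in child_counts.items()
        }
        for parent, child_counts in counts.items()
    }


pairs = (
    [("rainy", "heavy")] * 3
    + [("rainy", "light")] * 1
    + [("sunny", "light")] * 4
)
for parent, frequencies in conditional_frequencies(pairs).items():
    print(parent, frequencies)
```

In this toy data "sunny" never occurs with "heavy", so the frequency of "light" given "sunny" is 1. A value that extreme is often a sign of too few samples rather than a true certainty.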

Model building

This section outlines some basic heuristics to follow when designing models. For more specific details and advice on this topic see the building models from scratch page and the example for a medical diagnosis model.

Limit the number of variables

Runtime will increase significantly if the number of variables exceeds 256. In general, Bayesian networks are easier to use and understand the fewer variables they contain.

Fewer connections are better

While it is possible to have a fully connected Bayesian network, this is not recommended. In general, you should create a model with the fewest relationships needed to make an accurate inference. Simpler models are also more adaptable and easier to interpret. Try to limit the in-degree (the number of directed edges pointing to a variable) to fewer than 8, both for interpretability and to reduce model complexity. Genius can handle higher numbers of connections, even beyond 100, but the performance hit may be noticeable for certain types of inference operations.
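
The in-degree guideline is easy to check when a model's structure is written down as a list of (parent, child) edges. This is a minimal sketch, not a Genius API; the edge-list representation and the function name are assumptions made for illustration.

```python
from collections import Counter

MAX_IN_DEGREE = 7  # "fewer than 8" incoming edges, per the guideline above


def high_in_degree(edges, limit=MAX_IN_DEGREE):
    """Return {variable: in_degree} for variables with more than `limit` parents."""
    in_degree = Counter(child for _parent, child in edges)
    return {node: count for node, count in in_degree.items() if count > limit}


# Hypothetical structure: nine parents all pointing at one variable.
edges = [(f"p{i}", "outcome") for i in range(9)] + [("p0", "p1")]
print(high_in_degree(edges))
```

Variables reported here are candidates for restructuring, for example by introducing an intermediate variable that summarizes several parents.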

Avoid cyclical relationships

Technically speaking, Bayesian networks are directed acyclic graphs (DAGs). As the name implies, cyclical relationships are not permitted, because they imply circular causality, which is problematic to calculate in a Bayesian network. Genius models do not support any cyclic dependencies between variables. As an example, consider a cyclic relationship between three variables: content popularity, user engagement, and a boosting algorithm.

Here we can imagine that the popularity of some content leads to an increased probability of user engagement. This in turn could cause the boosting algorithm to boost the content for greater visibility because it is trending, which feeds back into its popularity. The result is a feedback loop among the variables. Such cycles should not be included in any Genius model, and their inclusion will lead to an error message.
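
A proposed structure can be checked for cycles before it is built. The sketch below uses Kahn's algorithm for topological sorting: if some variables can never be processed, they lie on or behind a cycle. It is an illustration only and does not reflect how Genius detects cycles internally; the variable labels and the function name are hypothetical.

```python
from collections import defaultdict, deque


def has_cycle(edges):
    """Return True if the directed graph given as (parent, child) edges has a cycle."""
    children = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1
        nodes.update((parent, child))
    # Start from variables with no parents and peel the graph layer by layer.
    queue = deque(node for node in nodes if in_degree[node] == 0)
    processed = 0
    while queue:
        node = queue.popleft()
        processed += 1
        for child in children[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    # Variables that were never processed are part of, or downstream of, a cycle.
    return processed < len(nodes)


feedback_loop = [
    ("popularity", "engagement"),
    ("engagement", "boosting"),
    ("boosting", "popularity"),
]
print(has_cycle(feedback_loop))      # True
print(has_cycle(feedback_loop[:2]))  # False
```

Dropping the edge from boosting back to popularity turns the loop into a chain, which is a valid DAG.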

In some cases, it is possible to rewrite a model with a cyclical relationship as a dynamic Bayesian network; in other cases, a Markov random field model may be a better choice. Neither type of model is currently supported in the Darwin release of Genius.

Number of categories per variable

Although there is no specific limitation for the number of categories in Genius, we highly recommend that you have no more than 15-20 categories to ensure optimal performance. Computation time can increase significantly as more categories are added. Furthermore, models with a large number of categories may become difficult to interpret.
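
To see why computation grows quickly, consider the conditional probability table of a discrete variable, which holds one entry for each combination of its own categories and its parents' categories. The sketch below does that arithmetic. It assumes a full table representation, which is standard for discrete Bayesian networks but is not confirmed here as Genius's internal storage; the function name is hypothetical.

```python
from math import prod


def cpt_entries(child_categories, parent_categories):
    """Entries in a full conditional probability table for one variable."""
    return child_categories * prod(parent_categories)


# A 3-category variable with two 3-category parents.
print(cpt_entries(3, [3, 3]))     # 27
# The same structure with 20 categories everywhere.
print(cpt_entries(20, [20, 20]))  # 8000
```

Both the number of categories and the number of parents multiply into the table size, so reducing either one helps.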

If you have continuous data, or many discrete categories, we recommend binning the data before proceeding further. You can read more about this process in the section below as well as the Knowledge Center page on binning.

Adding latent (hidden, unobserved) variables

In many cases a model includes variables for which we have no data but whose probabilities we still wish to infer. As long as we believe such variables are connected to data that we can observe, we can build these latent variables into our model and perform Bayesian inference to approximately determine the probability distribution over them. There are other advantages and uses for latent variables, which are described in more detail on the page for probabilistic inference.
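
The idea can be illustrated with the smallest possible case: one latent variable with one observed child, updated with Bayes' rule. All numbers and names below are made up for illustration, and the exact calculation shown works only because the example is tiny; it is not a description of the inference method Genius uses.

```python
def posterior(prior, likelihood, observation):
    """Return P(latent state | observation) using Bayes' rule.

    prior:      {state: P(state)}
    likelihood: {state: {observation: P(observation | state)}}
    """
    unnormalized = {
        state: prior[state] * likelihood[state][observation] for state in prior
    }
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


# Hypothetical latent variable "condition", never recorded in the data,
# and an observed variable "test" that depends on it.
prior = {"ill": 0.1, "healthy": 0.9}
likelihood = {
    "ill": {"positive": 0.8, "negative": 0.2},
    "healthy": {"positive": 0.1, "negative": 0.9},
}
result = posterior(prior, likelihood, "positive")
print({state: round(p, 3) for state, p in result.items()})
```

Observing a positive test raises the probability of "ill" from the prior of 0.1 to roughly 0.47, even though the condition itself was never observed.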

Binning categorical or discrete data

As mentioned above, Bayesian networks can become slow when there are too many categories per variable. Data can be binned to avoid this issue. For discrete data, you can group a large number of categories under a smaller number of parent categories if you believe the categories are related. For continuous variables, you can bin the data over equally or differently sized intervals of interest. Since Genius does not yet support Bayesian networks with continuous variables, binning is the only way to use continuous data in the modeling process. For more information see the Knowledge Center article on binning data.
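
Both kinds of binning can be done in a few lines before the CSV is imported. The following is a minimal sketch with hypothetical function names, category names, and ranges: equal-width bins for a continuous value, and a lookup table that maps detailed categories to parent categories.

```python
def equal_width_bin(value, low, high, n_bins):
    """Return an interval label such as '20-40' for a value in [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the range {low}-{high}")
    width = (high - low) / n_bins
    # The maximum value falls into the last bin rather than a new one.
    index = min(int((value - low) / width), n_bins - 1)
    return f"{low + index * width:g}-{low + (index + 1) * width:g}"


def group_category(value, mapping, default="other"):
    """Map a detailed category to its parent category."""
    return mapping.get(value, default)


# Continuous data: ages from 0 to 100 in five equal-width bins.
print([equal_width_bin(age, 0, 100, 5) for age in (0, 37, 100)])

# Discrete data: many drink names grouped under two parent categories.
parent_of = {"espresso": "coffee", "latte": "coffee", "oolong": "tea"}
print([group_category(drink, parent_of) for drink in ("latte", "oolong", "cola")])
```

Unequal intervals can be handled the same way by replacing the width calculation with an explicit list of bin edges chosen for the domain.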
