What makes them unique?
- They can deal with missing values: variables you haven't observed are simply marginalized out at inference time, so no imputation step is needed
- They can be used for "abduction", searching for "the best possible explanation" of an outcome
- They explicitly model each feature's effect on potentially all other features (in the undirected, Markov network case), or at least on the features it has a causal connection with (in the Bayesian network case). This creates a combinatorial explosion of conditional probability distributions to estimate, which makes sufficiently sophisticated models either really slow or impossible to train
- It also makes them more versatile models: they don't just learn what is strictly needed for the specific task you're training them for. That makes them somewhat "portable"?
- For continuous variables, you need to specify the distribution those values come from, or transform your data to be Gaussian (and then choose a Gaussian network)
- In the continuous case, it's not straightforward to train models that can capture non-linear relationships (I haven't found any implementation of that in Python)
- Because of that, it's very common to quantize your data, which brings back the combinatorial explosion, as well as some potentially problematic distribution assumptions
- So it makes a lot of sense that people use them in contexts with little data and/or a strong need for explainability
- But I assume they'll always underperform (or simply be inappropriately costly to train) in contexts where lots of data is available
- Unless someone finds the "backprop" learning method for Graphical Models
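
To make a few of the points above concrete, here is a minimal sketch in plain Python (no graphical-model library). The network (Rain -> WetGrass <- Sprinkler), all probabilities, and the helper names `posterior`, `best_explanation` and `cpt_rows` are made up for illustration. It uses brute-force enumeration, which only works for toy networks, to show inference with a missing value, abduction, and how quickly the number of table rows grows after quantization.

```python
from itertools import product

# Hypothetical toy Bayesian network: Rain -> WetGrass <- Sprinkler (all binary).
# Every number below is invented for illustration.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
# P(WetGrass=True | Rain, Sprinkler), keyed by (rain, sprinkler)
P_WET = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

VARIABLES = ("rain", "sprinkler", "wet")


def joint(rain, sprinkler, wet):
    """Joint probability of one full assignment, via the network factorization."""
    p_wet = P_WET[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1 - p_wet)


def consistent_worlds(evidence):
    """Yield every full assignment that agrees with the observed variables."""
    for values in product([True, False], repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        if all(world[name] == value for name, value in evidence.items()):
            yield world


def posterior(query, evidence):
    """P(query=True | evidence). Variables missing from `evidence` are summed out."""
    total = 0.0
    hit = 0.0
    for world in consistent_worlds(evidence):
        p = joint(**world)
        total += p
        if world[query]:
            hit += p
    return hit / total


def best_explanation(evidence):
    """Abduction: the single most probable full assignment given the evidence."""
    return max(consistent_worlds(evidence), key=lambda world: joint(**world))


def cpt_rows(n_parents, n_states):
    """Rows in one node's conditional table if each parent has `n_states` bins."""
    return n_states ** n_parents


# Missing value: the sprinkler is unobserved, yet the query still works.
p_rain = posterior("rain", {"wet": True})
# Abduction: which complete scenario best explains wet grass?
explanation = best_explanation({"wet": True})
# Combinatorial explosion: one node with 10 parents, each quantized into 5 bins.
rows = cpt_rows(n_parents=10, n_states=5)
```

With these made-up numbers, observing wet grass raises the probability of rain well above its prior of 0.2, and the most probable explanation is "rain, no sprinkler". The last line is the quantization problem in miniature: a single node with 10 parents at 5 bins each already needs 5**10 = 9,765,625 table rows, each of which has to be estimated from data.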