Can Deep Learning’s success be explained by embeddings alone?

There’s a phenomenon that’s well documented in the “tabular data ML” field: deep learning models usually don’t outperform gradient-boosted trees (say, XGBoost). Well, sometimes they do: when you’re using embeddings on your tabular data.

But fundamentally, why would you “embed”/transform data that’s already in numeric format into… another numeric format?

You can think of an embedding as a learnable vector database (it is a hash table/dictionary, after all!).
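To make that concrete, here’s a minimal PyTorch sketch (the sizes are arbitrary) showing that `nn.Embedding` really is just an index lookup into a learnable weight matrix:

```python
import torch
import torch.nn as nn

# An embedding layer is a learnable lookup table:
# row i of the weight matrix is the vector stored under key i.
vocab_size, dim = 10, 4
emb = nn.Embedding(vocab_size, dim)

ids = torch.tensor([3, 7])
looked_up = emb(ids)        # forward pass: an index lookup
manual = emb.weight[ids]    # the exact same rows, fetched by hand

assert torch.equal(looked_up, manual)
```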

That gives you a couple hundred new dimensions to work with: room to store what that single float (in the case of tabular data) really means.
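One common way to do this (just a hedged sketch; the bin count, boundaries, and dimension below are arbitrary choices of mine) is to bucketize the float into discrete keys, then embed those keys:

```python
import torch
import torch.nn as nn

# Hypothetical example: give a single float (say, "age") a learnable,
# multi-dimensional representation by binning it and embedding the bin.
n_bins, dim = 32, 256                        # arbitrary choices
boundaries = torch.linspace(0, 100, n_bins - 1)
age_emb = nn.Embedding(n_bins, dim)

ages = torch.tensor([23.0, 61.5, 88.0])
bin_ids = torch.bucketize(ages, boundaries)  # float -> discrete key
vectors = age_emb(bin_ids)                   # key -> 256 learnable dims
print(vectors.shape)                         # torch.Size([3, 256])
```

Now the model has 256 learnable numbers per bucket to encode what “age ≈ 60” implies, instead of a single raw float.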

NLP is a field that, I think, was thriving because of (sub-)word embeddings. Word2vec was a breakthrough, and although we’re now embedding smaller and smaller pieces of text [link], the mechanism has remained the same. It’s true that the QKV attention mechanism [link] now takes up more parameters than the (sub-)word embeddings, so they may matter less than they used to. (Thanks to Roland Szabo for pointing this out.)
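A rough back-of-envelope supports this, though it depends on scale. The configs below are approximate numbers I picked for illustration, and I’m counting attention as just the four Q/K/V/output projections per layer:

```python
# Back-of-envelope: (sub-)word-embedding parameters vs
# attention-projection parameters. Configs are rough, illustrative
# stand-ins for a GPT-2-small-sized and a 7B-class model.
def embedding_params(vocab, d_model):
    return vocab * d_model

def attention_params(n_layers, d_model):
    # Q, K, V and output projections: 4 square matrices per layer
    return n_layers * 4 * d_model * d_model

for name, vocab, d_model, n_layers in [
    ("GPT-2-small-ish", 50_257, 768, 12),
    ("7B-class model", 32_000, 4_096, 32),
]:
    e = embedding_params(vocab, d_model)
    a = attention_params(n_layers, d_model)
    print(f"{name}: embeddings {e/1e6:.0f}M vs attention {a/1e6:.0f}M")
```

In the small config the two are comparable (~39M vs ~28M); in the 7B-class config attention dwarfs the embedding table (~131M vs ~2,147M).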

With Computer Vision, embeddings in the lookup-table sense are completely missing, even with Vision Transformers: the ViT “patch embedding” is a linear projection of raw pixels, not a lookup over discrete keys.
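To illustrate the distinction (a sketch using the standard ViT-Base patch size and width): the patch “embedding” is typically implemented as a strided convolution, so there’s no table and no key anywhere:

```python
import torch
import torch.nn as nn

# In a ViT, the "patch embedding" is a linear projection of pixels,
# usually implemented as a strided Conv2d -- not an index lookup.
patch, d_model = 16, 768
patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = patchify(img).flatten(2).transpose(1, 2)
print(tokens.shape)   # torch.Size([1, 196, 768]) -- 14x14 patches
```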

But if I’m completely honest, what makes DL so powerful is… backprop, and its ability to connect various subsystems (like the embedding layer) together, in an all-encompassing, learnable manner.
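Here’s a toy sketch of what I mean (the model, sizes, and data are all made up): a single `loss.backward()` sends gradients through the linear head and into the embedding table, so the “vector database” is learned jointly with everything downstream of it:

```python
import torch
import torch.nn as nn

# Minimal sketch: backprop pushes gradients through every subsystem,
# including the embedding table, so the lookup table is trained
# jointly with the rest of the network. All sizes are arbitrary.
class TabularNet(nn.Module):
    def __init__(self, n_categories=10, emb_dim=8, n_numeric=3):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_dim)
        self.head = nn.Linear(emb_dim + n_numeric, 1)

    def forward(self, cat_ids, numeric):
        x = torch.cat([self.emb(cat_ids), numeric], dim=-1)
        return self.head(x)

model = TabularNet()
cat_ids = torch.randint(0, 10, (4,))
numeric = torch.randn(4, 3)
target = torch.randn(4, 1)

loss = nn.functional.mse_loss(model(cat_ids, numeric), target)
loss.backward()
# The lookup table itself received gradients:
print(model.emb.weight.grad.abs().sum() > 0)   # tensor(True)
```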