Embedding
A simple blog on embedding can be found here and Google’s tutorial is here.
An embedding is a mapping of a discrete - categorical - variable to a vector of continuous numbers. Embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space.
Embeddings have three primary purposes:
- Finding nearest neighbors in the embedding space. These can be used to make recommendations based on user interests or cluster categories.
- As input to a machine learning model for a supervised task.
- For visualization of concepts and relations between categories.
One-hot encoding can be considered as a naive embedding method, and it has two obvious drawbacks: dimensions and uninformed.
Learning Embedding
Embedding can be learned with a supervised neural network, where we create some loss function to make the similar entities’ representations close in the embedded space.
t-Distributed Stochastic Neighbor Embedding (TSNE) is an embedding method for visualization.
One hyperparameter is the dimensions of the embedding, an empirical rule-of-thumb is
\[dimensions\approx \sqrt[4]{possible\ values}\]