Sentencepiece Algorithm
As shown by u/narsilouu and u/fasttosmile, SentencePiece contains BPE, WordPiece and unigram (with unigram as the default), and provides optimized versions of each. Unigram starts from all candidate substrings of the corpus, then repeatedly removes the pieces whose removal lowers the likelihood of the corpus the least. Note that the BPE algorithm used in WordPiece is slightly different from the original BPE. At a high level, SentencePiece differs from other implementations in that it implements subword units such as BPE and the unigram language model (Kudo) with the extension of direct training from raw sentences.
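The unigram pruning idea described above can be sketched in a few lines of plain Python. This is a toy illustration, not SentencePiece's actual implementation: the function names (`candidate_pieces`, `best_segmentation`, `prune`), the `max_len` and `vocab_size` parameters, and the tiny word list are all assumptions made for this example. It also simplifies in two ways: piece probabilities come from raw substring counts rather than EM re-estimation, and it drops one piece per round instead of a fraction of the vocabulary.

```python
import math
from collections import Counter


def candidate_pieces(words, max_len=4):
    """Count every substring of up to max_len characters (the seed vocabulary)."""
    counts = Counter()
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(len(w), i + max_len) + 1):
                counts[w[i:j]] += 1
    return counts


def to_log_probs(counts):
    """Turn raw counts into log probabilities that sum to one."""
    total = sum(counts.values())
    return {p: math.log(c / total) for p, c in counts.items()}


def best_segmentation(word, log_probs):
    """Viterbi search for the most likely split of word into known pieces."""
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)  # (score, start of last piece)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the back-pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return best[n][0], pieces[::-1]


def corpus_log_likelihood(words, log_probs):
    return sum(best_segmentation(w, log_probs)[0] for w in words)


def prune(words, vocab_size, max_len=4):
    """Drop pieces one at a time, always the one whose loss hurts likelihood least."""
    counts = candidate_pieces(words, max_len)
    while len(counts) > vocab_size:
        # Single characters are never removed, so every word stays segmentable.
        removable = [p for p in counts if len(p) > 1]
        if not removable:
            break

        def likelihood_without(piece):
            rest = {p: c for p, c in counts.items() if p != piece}
            return corpus_log_likelihood(words, to_log_probs(rest))

        victim = max(removable, key=likelihood_without)
        del counts[victim]
    return to_log_probs(counts)


words = ["low", "lower", "lowest", "newer", "newest"]
vocab = prune(words, vocab_size=12)
score, pieces = best_segmentation("lowest", vocab)
print(sorted(vocab))
print(pieces)
```

Keeping single characters in the vocabulary is a deliberate choice in this sketch: it guarantees that any word built from seen characters can still be segmented after pruning.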