Word Embeddings

  • Word2Vec and FastText Turkish word embeddings, together with a SentencePiece Unigram Tokenizer and its associated Word2Vec embeddings, are trained on a 32 GB corpus.

  • In terms of tokenization, there are two groups of word embeddings: those tokenized with nltk.tokenize.TreebankWordTokenizer and those tokenized with the SentencePiece Unigram Tokenizer.
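
  • As a minimal illustration of the difference (the example sentence and an installed nltk are assumptions of this sketch, not part of the release), TreebankWordTokenizer yields whole-word tokens, while the SentencePiece model shown under Usage yields subword pieces:

    >>> # Word-level tokenization with NLTK's TreebankWordTokenizer
    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> TreebankWordTokenizer().tokenize('yamaçlardan aşağı indik')
    ['yamaçlardan', 'aşağı', 'indik']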

SentencePiece Unigram Tokenizer and Word Embeddings
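
  • This group pairs the SentencePiece Unigram Tokenizer with Word2Vec embeddings trained over its output, presumably the subword pieces it produces. The sketch below shows how the two would be combined; the Word2Vec file name is a hypothetical placeholder, not an actual download name.

    >>> # Sketch: SentencePiece pieces looked up in the associated Word2Vec model
    >>> import sentencepiece as spm
    >>> from gensim.models import Word2Vec
    >>>
    >>> sp = spm.SentencePieceProcessor('SentencePiece_16k_Tokenizer.model')
    >>> w2v = Word2Vec.load('Word2Vec_SentencePiece.model')  # hypothetical file name
    >>> pieces = sp.encode_as_pieces('bilemezlerken')        # ['▁bile', 'mez', 'lerken']
    >>> vectors = [w2v.wv[piece] for piece in pieces]        # one vector per subword piece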

TreebankWordTokenizer tokenized Word Embeddings

  • TreebankWordTokenizer tokenized Word2Vec and FastText embeddings come in 3 sizes and can be downloaded from the links below (a quick check of vocabulary and vector size after loading is sketched at the end of this section):

  • Large:
    • Word2Vec : vocabulary size: 128_000, vector size: 256

    • FastText : vocabulary size: 128_000, vector size: 256

  • Medium:
    • Word2Vec : vocabulary size: 64_000, vector size: 128

    • FastText : vocabulary size: 64_000, vector size: 128

  • Small:
    • Word2Vec : vocabulary size: 32_000, vector size: 64

    • FastText : vocabulary size: 32_000, vector size: 64

  • Usage:
    >>> # Word2Vec
    >>> from gensim.models import Word2Vec
    >>>
    >>> model = Word2Vec.load('Word2Vec_large.model')
    >>> model.wv.most_similar('gandalf', topn = 10)
    [('saruman', 0.7291593551635742),
    ('thorin', 0.6473978161811829),
    ('aragorn', 0.6401687264442444),
    ('isengard', 0.6123237013816833),
    ('orklar', 0.59786057472229),
    ('gollum', 0.5905635952949524),
    ('baggins', 0.5837421417236328),
    ('frodo', 0.5819021463394165),
    ('belgarath', 0.5811135172843933),
    ('sauron', 0.5763844847679138)]
    
    >>> # FastText
    >>> from gensim.models import FastText
    >>>
    >>> model = FastText.load('FastText_large.model')
    >>> model.wv.most_similar('yamaçlardan', topn = 10)
    [('kayalardan', 0.8601457476615906),
    ('kayalıklardan', 0.8567330837249756),
    ('tepelerden', 0.8423191905021667),
    ('ormanlardan', 0.8362939357757568),
    ('dağlardan', 0.8140010833740234),
    ('amaçlardan', 0.810560405254364),
    ('bloklardan', 0.803180992603302),
    ('otlardan', 0.8026642203330994),
    ('kısımlardan', 0.7993910312652588),
    ('ağaçlardan', 0.7961613535881042)]
    
    >>> # SentencePiece Unigram Tokenizer
    >>> import sentencepiece as spm
    >>> sp = spm.SentencePieceProcessor('SentencePiece_16k_Tokenizer.model')
    >>> sp.encode_as_pieces('bilemezlerken')
    ['▁bile', 'mez', 'lerken']
    >>> sp.encode_as_ids('bilemezlerken')
    [180, 1200, 8167]
    
  • For more details about the corpora, preprocessing, and training, see the ReadMe.
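
  • As a quick sanity check after downloading, the sketch below inspects a loaded model's vocabulary size and vector size, and shows how the FastText models compose vectors from character n-grams, so even words missing from the stored vocabulary get a vector. File names follow the Usage examples above; the expected numbers and a gensim 4.x API are assumptions of this sketch.

    >>> # Inspect a downloaded model (large: 128_000 keys, 256-dimensional vectors expected)
    >>> from gensim.models import Word2Vec, FastText
    >>>
    >>> w2v = Word2Vec.load('Word2Vec_large.model')
    >>> len(w2v.wv), w2v.wv.vector_size            # expected: (128000, 256)
    >>>
    >>> # FastText builds vectors from character n-grams, so rare or unseen
    >>> # inflections still receive a vector of the same dimensionality
    >>> ft = FastText.load('FastText_large.model')
    >>> ft.wv['yamaçlardakilerden'].shape          # expected: (256,)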