Word Embeddings

  • Word2Vec and FastText Turkish word embeddings, together with a SentencePiece Unigram Tokenizer and its associated Word2Vec embeddings, are trained on a 32 GB corpus.

  • In terms of tokenization, there are two groups of word embeddings: those tokenized with nltk.tokenize.TreebankWordTokenizer and those tokenized with the SentencePiece Unigram Tokenizer.
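
  • As a minimal illustration of the difference (the example sentence and an installed nltk are assumptions of this sketch, not part of the release), TreebankWordTokenizer yields whole-word tokens, while the SentencePiece model shown under Usage yields subword pieces:

    >>> # Word-level tokenization with NLTK's TreebankWordTokenizer
    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> TreebankWordTokenizer().tokenize('yamaçlardan aşağı indik')
    ['yamaçlardan', 'aşağı', 'indik']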

SentencePiece Unigram Tokenizer and Word Embeddings
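
  • This group pairs the SentencePiece Unigram Tokenizer with Word2Vec embeddings trained over its output, presumably the subword pieces it produces. The sketch below shows how the two would be combined; the Word2Vec file name is a hypothetical placeholder, not an actual download name.

    >>> # Sketch: SentencePiece pieces looked up in the associated Word2Vec model
    >>> import sentencepiece as spm
    >>> from gensim.models import Word2Vec
    >>>
    >>> sp = spm.SentencePieceProcessor('SentencePiece_16k_Tokenizer.model')
    >>> w2v = Word2Vec.load('Word2Vec_SentencePiece.model')  # hypothetical file name
    >>> pieces = sp.encode_as_pieces('bilemezlerken')        # ['▁bile', 'mez', 'lerken']
    >>> vectors = [w2v.wv[piece] for piece in pieces]        # one vector per subword piece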

TreebankWordTokenizer tokenized Word Embeddings

  • TreebankWordTokenizer tokenized Word2Vec and FastText embeddings come in 3 sizes and can be downloaded from the links below (a quick check of vocabulary and vector size after loading is sketched at the end of this section):

  • Large:
    • Word2Vec : vocabulary size: 128_000, vector size: 256

    • FastText : vocabulary size: 128_000, vector size: 256

  • Medium:
    • Word2Vec : vocabulary size: 64_000, vector size: 128

    • FastText : vocabulary size: 64_000, vector size: 128

  • Small:
    • Word2Vec : vocabulary size: 32_000, vector size: 64

    • FastText : vocabulary size: 32_000, vector size: 64

  • Usage:
    >>> # Word2Vec
    >>> from gensim.models import Word2Vec
    >>>
    >>> model = Word2Vec.load('Word2Vec_large.model')
    >>> model.wv.most_similar('gandalf', topn = 10)
    [('saruman', 0.7291593551635742),
    ('thorin', 0.6473978161811829),
    ('aragorn', 0.6401687264442444),
    ('isengard', 0.6123237013816833),
    ('orklar', 0.59786057472229),
    ('gollum', 0.5905635952949524),
    ('baggins', 0.5837421417236328),
    ('frodo', 0.5819021463394165),
    ('belgarath', 0.5811135172843933),
    ('sauron', 0.5763844847679138)]
    
    >>> # FastText
    >>> from gensim.models import FastText
    >>>
    >>> model = FastText.load('FastText_large.model')
    >>> model.wv.most_similar('yamaçlardan', topn = 10)
    [('kayalardan', 0.8601457476615906),
    ('kayalıklardan', 0.8567330837249756),
    ('tepelerden', 0.8423191905021667),
    ('ormanlardan', 0.8362939357757568),
    ('dağlardan', 0.8140010833740234),
    ('amaçlardan', 0.810560405254364),
    ('bloklardan', 0.803180992603302),
    ('otlardan', 0.8026642203330994),
    ('kısımlardan', 0.7993910312652588),
    ('ağaçlardan', 0.7961613535881042)]
    
    >>> # SentencePiece Unigram Tokenizer
    >>> import sentencepiece as spm
    >>> sp = spm.SentencePieceProcessor('SentencePiece_16k_Tokenizer.model')
    >>> sp.encode_as_pieces('bilemezlerken')
    ['▁bile', 'mez', 'lerken']
    >>> sp.encode_as_ids('bilemezlerken')
    [180, 1200, 8167]
    
  • For more details about the corpora, preprocessing, and training, see the ReadMe.
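
  • As a quick sanity check after downloading, the sketch below inspects a loaded model's vocabulary size and vector size, and shows how the FastText models compose vectors from character n-grams, so even words missing from the stored vocabulary get a vector. File names follow the Usage examples above; the expected numbers and a gensim 4.x API are assumptions of this sketch.

    >>> # Inspect a downloaded model (large: 128_000 keys, 256-dimensional vectors expected)
    >>> from gensim.models import Word2Vec, FastText
    >>>
    >>> w2v = Word2Vec.load('Word2Vec_large.model')
    >>> len(w2v.wv), w2v.wv.vector_size            # expected: (128000, 256)
    >>>
    >>> # FastText builds vectors from character n-grams, so rare or unseen
    >>> # inflections still receive a vector of the same dimensionality
    >>> ft = FastText.load('FastText_large.model')
    >>> ft.wv['yamaçlardakilerden'].shape          # expected: (256,)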