**Word Embeddings** ---------- - `Word2Vec `_ , `FastText `_ and `SentencePiece Unigram Tokenizer `_ and its associated Word2Vec Turkish word embeddings are trained on a corpora of 32 GBs. - In terms Tokenization, there are two groups of word embeddings: NLTK.tokenize.TreebankWordTokenizer and SentencePiece Unigram Tokenizer. SentencePiece Unigram Tokenizer and Word Embeddings =================================== - Sentence Piece Unigram Tokenizer and its associated Word2Vec embeddings come in 2 sizes for each config and can be downloaded from the links below: - **Medium Tokenizer and its Word2Vec Embeddings**: - `Medium Tokenizer `_ : vocabulary size: 32_000 - `Large Word2Vec embeddings `_ : vector size: 256 - `Medium Word2Vec embeddings `_ : vector size: 128 - **Small Tokenizer and its Word2Vec Embeddings**: - `Small Tokenizer `_ : vocabulary size: 16_000 - `Large Word2Vec embeddings `_ : vector size: 256 - `Medium Word2Vec embeddings `_ : vector size: 128 - Word2Vec and FastText embeddings are trained with `gensim `_ algorithm. - Sentence Piece Unigram tokenizer is trained with `SentencePiece `_ algorithm. TreebankWordTokenizer tokenized Word Embeddings =================================== - TreebankWordTokenizer tokenized Word2Vec and FastText embeddings come in 3 sizes and can be downloaded from the links below: - **Large**: - `Word2Vec `_ : vocabulary size: 128_000, vector size: 256 - `FastText `_ : vocabulary size: 128_000, vector size: 256 - **Medium**: - `Word2Vec `_ : vocabulary size: 64_000, vector size: 128 - `FastText `_ : vocabulary size: 64_000, vector size: 128 - **Small**: - `Word2Vec: `_ : vocabulary size: 32_000, vector size: 64 - `FastText `_ : vocabulary size: 32_000, vector size: 64 - Usage: >>> # Word2Vec >>> from gensim.models import Word2Vec >>> >>> model = Word2Vec.load('Word2Vec_large.model') >>> model.wv.most_similar('gandalf', topn = 10) [('saruman', 0.7291593551635742), ('thorin', 0.6473978161811829), ('aragorn', 0.6401687264442444), ('isengard', 0.6123237013816833), ('orklar', 0.59786057472229), ('gollum', 0.5905635952949524), ('baggins', 0.5837421417236328), ('frodo', 0.5819021463394165), ('belgarath', 0.5811135172843933), ('sauron', 0.5763844847679138)] >>> # FastText >>> from gensim.models import FastText >>> >>> model = FastText.load('FastText_large.model') >>> model.wv.most_similar('yamaçlardan', topn = 10) [('kayalardan', 0.8601457476615906), ('kayalıklardan', 0.8567330837249756), ('tepelerden', 0.8423191905021667), ('ormanlardan', 0.8362939357757568), ('dağlardan', 0.8140010833740234), ('amaçlardan', 0.810560405254364), ('bloklardan', 0.803180992603302), ('otlardan', 0.8026642203330994), ('kısımlardan', 0.7993910312652588), ('ağaçlardan', 0.7961613535881042)] >>> # SentencePiece Unigram Tokenizer >>> import sentencepiece as spm >>> sp = spm.SentencePieceProcessor('SentencePiece_16k_Tokenizer.model') >>> tokenizer.encode_as_pieces('bilemezlerken') ['▁bile', 'mez', 'lerken'] >>> tokenizer.encode_as_ids('bilemezlerken') [180, 1200, 8167] - For more details about corpora, preprocessing and training, see `ReadMe `_.