Stopword Remover
- class vnlp.stopword_remover.stopword_remover.StopwordRemover
Stopword Remover class.
Provides static and dynamic stopword detection methods.
The static stopword list is taken from Zemberek, with some minor improvements.
The dynamic stopword algorithm is based on two papers:
On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter proposes classifying stopwords according to their frequency.
Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior proposes a way to determine the cut-off point automatically.
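The two ideas above (rank tokens by frequency, then pick the cut-off at a knee of the curve) can be sketched as follows. This is a simplified illustration and not the library's implementation: detect_frequent_tokens is a hypothetical helper, and the knee is taken as the point where the normalised frequency curve falls furthest below the straight line joining its first and last points, which is a reduced stand-in for the full Kneedle procedure. The check for at least 3 unique tokens mirrors the constraint documented for dynamically_detect_stop_words. The list it returns may differ from what the library returns for the same input.

```python
from collections import Counter
from typing import List


def detect_frequent_tokens(tokens: List[str]) -> List[str]:
    """Hypothetical helper: return the tokens ranked before the knee
    of the descending frequency curve."""
    counts = Counter(tokens).most_common()  # (token, count), most frequent first
    n = len(counts)
    if n < 3:
        raise ValueError("Number of unique tokens must be at least 3.")

    freqs = [count for _, count in counts]
    f_max, f_min = freqs[0], freqs[-1]
    if f_max == f_min:
        return []  # flat curve: every token is equally frequent, no knee

    # Normalise rank (x) and frequency (y) to [0, 1], then measure how far
    # the curve lies below the line from (0, 1) to (1, 0).
    gaps = []
    for rank, freq in enumerate(freqs):
        x = rank / (n - 1)
        y = (freq - f_min) / (f_max - f_min)
        gaps.append((1 - x) - y)

    # The knee is the rank with the largest gap; tokens before it are frequent.
    knee = max(range(n), key=gaps.__getitem__)
    return [token for token, _ in counts[:knee]]


text = (
    "ben bugün gidip aşı olacağım sonra da eve gelip telefon açacağım "
    "aşı nasıl etkiledi eve gelip anlatırım aşı olmak bu dönemde çok "
    "ama ama ama ama çok önemli"
)
print(detect_frequent_tokens(text.split()))
```

On a corpus this small the knee is sensitive to individual counts, which is why a large corpus is recommended for the real method.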
- add_to_stop_words(novel_stop_words: List[str])
Updates self.stop_words by adding the given novel_stop_words to the existing dictionary.
- Parameters:
novel_stop_words – Tokens to be added to the existing stop_words dictionary.
Example:
from vnlp import StopwordRemover
stopword_remover = StopwordRemover()
stopword_remover.add_to_stop_words(['ama', 'aşı', 'gelip', 'eve'])
- drop_stop_words(list_of_tokens: List[str]) → List[str]
Given a list of tokens, drops the stop words and returns the list of remaining tokens.
- Parameters:
list_of_tokens – List of input tokens.
- Returns:
List of tokens with the stopwords removed.
Example:
from vnlp import StopwordRemover
stopword_remover = StopwordRemover()
stopword_remover.drop_stop_words("acaba bugün kahvaltıda kahve yerine çay mı içsem ya da neyse süt içeyim".split())
['bugün', 'kahvaltıda', 'kahve', 'çay', 'içsem', 'süt', 'içeyim']
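Conceptually, static filtering is a membership test against the stop word collection. The sketch below illustrates that idea without importing the library; drop_tokens is a hypothetical helper, and the stop word set is only the six tokens that the example above removes, not the full Zemberek-based list.

```python
from typing import Iterable, List, Set


def drop_tokens(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Hypothetical helper: keep tokens that are not in stop_words,
    preserving their original order."""
    return [token for token in tokens if token not in stop_words]


# Illustrative subset only: the tokens dropped in the documented example.
sample_stop_words = {"acaba", "yerine", "mı", "ya", "da", "neyse"}

sentence = "acaba bugün kahvaltıda kahve yerine çay mı içsem ya da neyse süt içeyim"
print(drop_tokens(sentence.split(), sample_stop_words))
```

Using a set (or dictionary keys) keeps each lookup at constant time on average, so filtering scales linearly with the number of tokens.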
- dynamically_detect_stop_words(list_of_tokens: List[str], rare_words_freq: int = 0) → List[str]
Dynamically detects stop words and returns them as a list of tokens.
Use a large corpus with at least hundreds of unique tokens for a reasonable result.
- Parameters:
list_of_tokens – List of input tokens.
rare_words_freq – Maximum frequency at which a word is still considered rare. The default is 0, so no rare words are detected by default.
- Returns:
List of dynamically detected stop words.
- Raises:
ValueError – If the number of unique tokens is less than 3, the minimum required for dynamic stop word detection.
Example:
from vnlp import StopwordRemover
stopword_remover = StopwordRemover()
stopword_remover.dynamically_detect_stop_words("ben bugün gidip aşı olacağım sonra da eve gelip telefon açacağım aşı nasıl etkiledi eve gelip anlatırım aşı olmak bu dönemde çok ama ama ama ama çok önemli".split())
['ama', 'aşı', 'gelip', 'eve']
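The role of rare_words_freq can be illustrated with a small sketch. It assumes that a token counts as rare when it occurs at most rare_words_freq times, which is one reading of "maximum frequency" in the parameter description; rare_tokens is a hypothetical helper and not part of the library.

```python
from collections import Counter
from typing import List


def rare_tokens(tokens: List[str], rare_words_freq: int = 0) -> List[str]:
    """Hypothetical helper: return tokens whose count is at most
    rare_words_freq, in order of first appearance."""
    counts = Counter(tokens)
    return [token for token, count in counts.items() if count <= rare_words_freq]


tokens = "a b b c c c".split()
print(rare_tokens(tokens))                     # threshold 0: nothing is rare
print(rare_tokens(tokens, rare_words_freq=1))  # tokens seen once
```

Because every token that appears has a count of at least 1, a threshold of 0 can never match, which is consistent with the documented default detecting no rare words.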