Normalizer
- class vnlp.normalizer.normalizer.Normalizer
Normalizer class
It contains the following functions to process and normalize text:
- Spelling/typo correction
- Deasciification
- Number-to-word conversion
- Lowercasing
- Punctuation removal
- Accent mark removal
For more details about the algorithms and datasets, see Readme.
- convert_numbers_to_words(tokens: List[str], num_dec_digits: int = 6, decimal_seperator: str = ',') -> List[str]
Converts numbers to word form.
- Parameters:
tokens – List of input tokens.
num_dec_digits – Number of decimal digits of precision used for floats.
decimal_seperator – Decimal separator character. Can be either “.” or “,”.
- Returns:
List of converted tokens
- Raises:
ValueError – If decimal_seperator is not a valid decimal separator. Use either “.” or “,”.
Example:
from vnlp import Normalizer
normalizer = Normalizer()
normalizer.convert_numbers_to_words("sabah 3 yumurta yedim ve tartıldığımda 1,15 kilogram aldığımı gördüm".split())
['sabah', 'üç', 'yumurta', 'yedim', 've', 'tartıldığımda', 'bir', 'virgül', 'on', 'beş', 'kilogram', 'aldığımı', 'gördüm']
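To make the token-level behavior concrete, here is a minimal pure-Python sketch of the idea. It is not VNLP's implementation: it only covers integers from 0 to 99, ignores num_dec_digits, and reads the fractional part as a plain whole number. The names `small_int_to_words` and `numbers_to_words_sketch`, the parameter spelling `decimal_separator`, and the use of "nokta" for "." are assumptions made for illustration.

```python
from typing import List

ONES = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]


def small_int_to_words(n: int) -> List[str]:
    """Spell out an integer in the range 0..99 in Turkish (hypothetical helper)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    tens, ones = divmod(n, 10)
    if tens == 0:
        return [ONES[ones]]
    words = [TENS[tens]]
    if ones:
        words.append(ONES[ones])
    return words


def _is_small_number(text: str) -> bool:
    """True for one- or two-digit ASCII digit strings."""
    return text.isascii() and text.isdigit() and len(text) <= 2


def numbers_to_words_sketch(tokens: List[str], decimal_separator: str = ",") -> List[str]:
    """Replace small numeric tokens with Turkish words; leave other tokens unchanged."""
    if decimal_separator not in (".", ","):
        raise ValueError('Use either "." or "," as the decimal separator.')
    # Assumption: "." is read as "nokta"; the source only shows "virgül" for ",".
    separator_word = "virgül" if decimal_separator == "," else "nokta"

    result: List[str] = []
    for token in tokens:
        whole, sep, frac = token.partition(decimal_separator)
        if _is_small_number(whole) and not sep:
            result.extend(small_int_to_words(int(whole)))
        elif _is_small_number(whole) and _is_small_number(frac):
            result.extend(small_int_to_words(int(whole)))
            result.append(separator_word)
            result.extend(small_int_to_words(int(frac)))
        else:
            result.append(token)  # not a number this sketch understands
    return result


print(numbers_to_words_sketch("sabah 3 yumurta yedim ve 1,15 kilogram aldım".split()))
```

Because each number expands into several tokens, the output list can be longer than the input list, which is also visible in the example above.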
- correct_typos(tokens: List[str]) -> List[str]
Detects and corrects spelling mistakes and typos.
This implementation uses StemmerAnalyzer and Hunspell to detect typos. Detected typos are corrected by the Hunspell algorithm using the “tdd-hunspell-tr-1.1.0” dictionary.
- Parameters:
tokens – List of input tokens.
- Returns:
List of corrected tokens.
Example:
from vnlp import Normalizer
normalizer = Normalizer()
normalizer.correct_typos("Kasıtlı yazişm hatasıı ekliyoruum".split())
["Kasıtlı", "yazım", "hatası", "ekliyorum"]
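The detect-then-correct flow described above can be illustrated without Hunspell. The sketch below swaps in a plain vocabulary lookup for detection and Python's standard `difflib.get_close_matches` for correction, so it is only an analogy for the approach, not what VNLP does internally. The function name `correct_typos_sketch` and the tiny vocabulary are hypothetical.

```python
import difflib
from typing import List, Sequence


def correct_typos_sketch(tokens: List[str], vocabulary: Sequence[str]) -> List[str]:
    """Replace tokens missing from the vocabulary with their closest known word."""
    known = set(vocabulary)
    corrected: List[str] = []
    for token in tokens:
        if token in known:
            # Detection step: a known word is treated as correctly spelled.
            corrected.append(token)
            continue
        # Correction step: take the most similar vocabulary entry, if any is close enough.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.6)
        corrected.append(matches[0] if matches else token)
    return corrected


# Hypothetical toy vocabulary; a real system would use a full dictionary.
vocabulary = ["Kasıtlı", "yazım", "hatası", "ekliyorum"]
print(correct_typos_sketch("Kasıtlı yazişm hatasıı ekliyoruum".split(), vocabulary))
```

Tokens with no sufficiently close match are passed through unchanged, so the output always has the same length as the input.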
- static deasciify(tokens: List[str]) -> List[str]
Deasciifies the given text for Turkish.
This function uses Emre Sevinç’s implementation.
- Parameters:
tokens – List of input tokens.
- Returns:
List of deasciified tokens.
Example:
from vnlp import Normalizer
Normalizer.deasciify("dusunuyorum da boyle sey gormedim duymadim".split())
["düşünüyorum", "da", "böyle", "şey", "görmedim", "duymadım"]
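Deasciification is ambiguous because each of the ASCII letters c, g, i, o, s, and u may stand for either itself or its Turkish counterpart. The sketch below shows the problem with a naive dictionary lookup over all candidate spellings. This is a swapped-in technique for illustration only, not Emre Sevinç's algorithm; `deasciify_sketch`, `candidates`, and the toy vocabulary are hypothetical, and only lowercase input is handled.

```python
from itertools import product
from typing import Iterator, List, Set

# Each ambiguous ASCII letter maps to the spellings it may stand for.
AMBIGUOUS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}


def candidates(token: str) -> Iterator[str]:
    """Yield every spelling obtained by optionally restoring Turkish letters."""
    options = [AMBIGUOUS.get(ch, ch) for ch in token]
    return ("".join(combo) for combo in product(*options))


def deasciify_sketch(tokens: List[str], vocabulary: Set[str]) -> List[str]:
    """Pick the first candidate found in the vocabulary, else keep the token."""
    result: List[str] = []
    for token in tokens:
        match = next((c for c in candidates(token) if c in vocabulary), token)
        result.append(match)
    return result


# Hypothetical toy vocabulary containing the intended words.
vocabulary = {"düşünüyorum", "da", "böyle", "şey", "görmedim", "duymadım"}
print(deasciify_sketch("dusunuyorum da boyle sey gormedim duymadim".split(), vocabulary))
```

The number of candidates doubles with every ambiguous letter, which is one reason a practical deasciifier relies on something smarter than brute-force enumeration.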
- static lower_case(text: str) -> str
Converts a string of text to lowercase for Turkish language.
This is needed because Python's built-in str.lower() does not handle all Turkish characters correctly: in Turkish, “İ” lowercases to “i” and “I” lowercases to “ı”.
- Parameters:
text – Input text.
- Returns:
Text in lowercase form.
Example:
from vnlp import Normalizer
Normalizer.lower_case("Test karakterleri: İIĞÜÖŞÇ")
'test karakterleri: iığüöşç'
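The difference from the built-in method can be checked in plain Python. The following is a minimal sketch of one way to get Turkish-aware lowercasing, assuming it is enough to map the two problematic letters before calling str.lower(); `turkish_lower_sketch` is a hypothetical name and may differ from how VNLP implements lower_case.

```python
def turkish_lower_sketch(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i rules."""
    # Handle the two letters str.lower() gets wrong for Turkish, then
    # let the built-in method take care of everything else.
    return text.replace("İ", "i").replace("I", "ı").lower()


# Built-in behavior: "İ" becomes "i" plus a combining dot above (U+0307),
# and "I" becomes "i" rather than the dotless "ı".
print("İ".lower() == "i\u0307")  # True
print("I".lower())               # i

print(turkish_lower_sketch("Test karakterleri: İIĞÜÖŞÇ"))  # test karakterleri: iığüöşç
```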