Normalizer

class vnlp.normalizer.normalizer.Normalizer

Normalizer class

  • It contains the following functions to process and normalize text:

    • Spelling/Typo correction

    • Deasciification

    • Number-to-word conversion

    • Lowercasing

    • Punctuation removal

    • Accent mark removal

  • For more details about the algorithms and datasets, see the README.

convert_numbers_to_words(tokens: List[str], num_dec_digits: int = 6, decimal_seperator: str = ',') → List[str]

Converts numbers to word form.

Parameters
  • tokens – List of input tokens.

  • num_dec_digits – Number of decimal digits (precision) used for floats.

  • decimal_seperator – Decimal separator character. Must be either “.” or “,”.

Returns

List of converted tokens

Raises

ValueError – If the given decimal_seperator is not a valid decimal separator, i.e. neither “.” nor “,”.

Example:

from vnlp import Normalizer
normalizer = Normalizer()
normalizer.convert_numbers_to_words("sabah 3 yumurta yedim ve tartıldığımda 1,15 kilogram aldığımı gördüm".split())

['sabah',
'üç',
'yumurta',
'yedim',
've',
'tartıldığımda',
'bir',
'virgül',
'on',
'beş',
'kilogram',
'aldığımı',
'gördüm']
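
The conversion can be pictured as splitting each numeric token at the decimal separator and spelling out both halves. The following is only a minimal sketch of that idea, not the library's implementation: the names numbers_to_words_sketch, _spell and _is_small_number are hypothetical, each side of the separator is limited to one or two digits, num_dec_digits is ignored, and the separator word is always “virgül”, as in the example above.

```python
from typing import List

UNITS = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]


def _is_small_number(text: str) -> bool:
    """True for strings made of one or two ASCII digits."""
    return text.isascii() and text.isdigit() and len(text) <= 2


def _spell(digits: str) -> List[str]:
    """Spell out a one- or two-digit string in Turkish, one word per item."""
    tens, unit = divmod(int(digits), 10)
    if tens == 0:
        return [UNITS[unit]]
    return [TENS[tens]] + ([UNITS[unit]] if unit else [])


def numbers_to_words_sketch(tokens: List[str], decimal_separator: str = ",") -> List[str]:
    """Hypothetical, simplified stand-in for convert_numbers_to_words."""
    if decimal_separator not in (".", ","):
        raise ValueError("decimal_separator must be either '.' or ','")

    converted: List[str] = []
    for token in tokens:
        whole, sep, frac = token.partition(decimal_separator)
        if _is_small_number(whole) and not sep:
            # Plain integer token such as "3".
            converted.extend(_spell(whole))
        elif _is_small_number(whole) and _is_small_number(frac):
            # Decimal token such as "1,15"; assumption: the fraction is read as one number.
            converted.extend(_spell(whole) + ["virgül"] + _spell(frac))
        else:
            # Anything else is passed through unchanged.
            converted.append(token)
    return converted


print(numbers_to_words_sketch("sabah 3 yumurta 1,15 kilogram".split()))
# ['sabah', 'üç', 'yumurta', 'bir', 'virgül', 'on', 'beş', 'kilogram']
```

Tokens the sketch does not recognize as small numbers are returned untouched, so it is safe to run over mixed text.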

correct_typos(tokens: List[str]) → List[str]

Detects and corrects spelling mistakes and typos.

This implementation uses StemmerAnalyzer and Hunspell to detect typos. Detected typos are corrected by the Hunspell algorithm using the “tdd-hunspell-tr-1.1.0” dictionary.

Parameters

tokens – List of input tokens.

Returns

List of corrected tokens.

Example:

from vnlp import Normalizer
normalizer = Normalizer()
normalizer.correct_typos("Kasıtlı yazişm hatasıı ekliyoruum".split())

["Kasıtlı", "yazım", "hatası", "ekliyorum"]

static deasciify(tokens: List[str]) → List[str]

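Deasciifies the given Turkish text, i.e. restores the Turkish-specific characters (ç, ğ, ı, ö, ş, ü) in text that was typed using only ASCII letters.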

This function uses Emre Sevinç’s implementation.

Parameters

tokens – List of input tokens.

Returns

List of deasciified tokens.

Example:

from vnlp import Normalizer
Normalizer.deasciify("dusunuyorum da boyle sey gormedim duymadim".split())

["düşünüyorum", "da", "böyle", "şey", "görmedim", "duymadım"]

static lower_case(text: str) → str

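Converts a string to lowercase following Turkish casing rules.

This is needed because Python's built-in str.lower() is locale-independent and mishandles the Turkish dotted and dotless i: “I” becomes “i” instead of “ı”, and “İ” becomes “i” followed by a combining dot above (U+0307) instead of a plain “i”.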

Parameters

text – Input text.

Returns

Text in lowercase form.

Example:

from vnlp import Normalizer
Normalizer.lower_case("Test karakterleri: İIĞÜÖŞÇ")

'test karakterleri: iığüöşç'
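
The casing problem and one common workaround can be shown in a few lines. This is a minimal sketch, not the library's implementation; the function name turkish_lower is hypothetical. It maps the two problematic capitals explicitly and leaves the rest to str.lower().

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish İ -> i and I -> ı pairs."""
    # Handle the two special capitals first, then let str.lower() do the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


# Plain str.lower() yields "i" + U+0307 for "İ" and a dotted "i" for "I".
print(ascii("İI".lower()))  # 'i\u0307i'

print(turkish_lower("Test karakterleri: İIĞÜÖŞÇ"))  # test karakterleri: iığüöşç
```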

static remove_accent_marks(text: str) → str

Removes accent marks from the given string.

Parameters

text – Input text.

Returns

Text stripped of accent marks.

Example:

from vnlp import Normalizer
Normalizer.remove_accent_marks("merhâbâ")

'merhaba'
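
One way to approximate this behavior is an explicit character mapping. The sketch below is an assumption about how such a function could work, not the library's code: the name strip_circumflex and the mapping are hypothetical and cover only circumflexed vowels such as the “â” in the example above.

```python
# Hypothetical mapping: circumflexed vowels and their unaccented Turkish counterparts.
CIRCUMFLEX_MAP = str.maketrans("âîûÂÎÛ", "aiuAİU")


def strip_circumflex(text: str) -> str:
    """Replace circumflexed vowels; every other character is kept as-is."""
    return text.translate(CIRCUMFLEX_MAP)


print(strip_circumflex("merhâbâ"))    # merhaba
print(strip_circumflex("görüşürüz"))  # görüşürüz (Turkish letters untouched)
```

An explicit mapping is used here instead of Unicode decomposition (NFD followed by dropping combining marks) because that generic approach would also strip the diacritics of genuine Turkish letters such as ö, ü, ş, ç and ğ.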

static remove_punctuations(text: str) → str

Removes punctuation from the given string.

Parameters

text – Input text.

Returns

Text stripped of punctuation.

Example:

from vnlp import Normalizer
Normalizer.remove_punctuations("merhaba,.!")

'merhaba'
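
A comparable result can be obtained with the standard library alone. This is a minimal sketch under the assumption that “punctuation” means the ASCII characters in string.punctuation; the set of characters the library actually removes may differ, and the name strip_punctuation is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Delete ASCII punctuation; letters, digits and whitespace are kept."""
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("merhaba,.!"))  # merhaba
```

Non-ASCII punctuation such as curly quotes is not covered by string.punctuation and would pass through this sketch unchanged.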