Normalizer

class vnlp.normalizer.normalizer.Normalizer

Normalizer class

It provides the following functions to process and normalize text:
- Spelling/typo correction
- Deasciification
- Converting numbers to word form
- Lowercasing
- Punctuation removal
- Accent mark removal

For more details about the algorithms and datasets, see the Readme.

convert_numbers_to_words(tokens: List[str], num_dec_digits: int = 6, decimal_seperator: str = ',') → List[str]

Converts numbers to word form.

Parameters:

tokens – List of input tokens.
num_dec_digits – Number of decimal digits of precision for floats.
decimal_seperator – Decimal separator character. Can be either “.” or “,”.

Returns:

List of converted tokens.

Raises:

ValueError – The given decimal_seperator is not a valid decimal separator. Use either “.” or “,”.

Example:

from vnlp import Normalizer
normalizer = Normalizer()
normalizer.convert_numbers_to_words("sabah 3 yumurta yedim ve tartıldığımda 1,15 kilogram aldığımı gördüm".split())

['sabah',
'üç',
'yumurta',
'yedim',
've',
'tartıldığımda',
'bir',
'virgül',
'on',
'beş',
'kilogram',
'aldığımı',
'gördüm']
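
The conversion shown above can be approximated in a few lines. The following is a minimal sketch, not VNLP's implementation: the names small_int_to_words and sketch_convert_numbers are hypothetical, only whole and fractional parts in the range 0 to 99 are covered, and reading “.” as “nokta” is an assumption (the documented example only shows “,” read as “virgül”). Tokens outside that coverage are passed through unchanged.

```python
from typing import List

ONES = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]
# Assumption: "." is read as "nokta"; the documented example only shows ",".
SEPARATOR_WORDS = {",": "virgül", ".": "nokta"}


def _in_sketch_range(digits: str) -> bool:
    """True for plain ASCII digit strings 0..99 without a leading zero."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    return len(digits) == 1 or (len(digits) == 2 and digits[0] != "0")


def small_int_to_words(n: int) -> List[str]:
    """Spell out an integer in the range 0..99 in Turkish."""
    tens, ones = divmod(n, 10)
    if tens == 0:
        return [ONES[ones]]
    return [TENS[tens]] + ([ONES[ones]] if ones else [])


def sketch_convert_numbers(tokens: List[str], decimal_seperator: str = ",") -> List[str]:
    """Hypothetical stand-in for convert_numbers_to_words (small numbers only)."""
    if decimal_seperator not in SEPARATOR_WORDS:
        raise ValueError('Use either "." or "," as the decimal separator.')
    converted: List[str] = []
    for token in tokens:
        whole, sep, frac = token.partition(decimal_seperator)
        if _in_sketch_range(whole) and not sep:
            # Plain integer token such as "3".
            converted.extend(small_int_to_words(int(whole)))
        elif _in_sketch_range(whole) and _in_sketch_range(frac):
            # Decimal token such as "1,15": whole part, separator word, fraction.
            converted.extend(small_int_to_words(int(whole)))
            converted.append(SEPARATOR_WORDS[decimal_seperator])
            converted.extend(small_int_to_words(int(frac)))
        else:
            # Not a number, or outside what this sketch covers.
            converted.append(token)
    return converted


print(sketch_convert_numbers("sabah 3 yumurta yedim 1,15 kilogram".split()))
# ['sabah', 'üç', 'yumurta', 'yedim', 'bir', 'virgül', 'on', 'beş', 'kilogram']
```

Validating the separator before touching any token mirrors the documented ValueError behaviour: an unsupported separator fails fast instead of silently leaving every decimal unconverted.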

correct_typos(tokens: List[str]) → List[str]

Detects and corrects spelling mistakes and typos.

This implementation uses StemmerAnalyzer and Hunspell to detect typos. Detected typos are then corrected by the Hunspell algorithm using the “tdd-hunspell-tr-1.1.0” dictionary.

Parameters: tokens – List of input tokens.
Returns: List of corrected tokens.

Example:

from vnlp import Normalizer
normalizer = Normalizer()
normalizer.correct_typos("Kasıtlı yazişm hatasıı ekliyoruum".split())

["Kasıtlı", "yazım", "hatası", "ekliyorum"]
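
The detect-then-correct flow described above can be illustrated without Hunspell. The sketch below swaps in a different technique, plainly: a toy in-memory vocabulary for detection and difflib's similarity matching from the standard library for correction. The function name sketch_correct_typos, the toy vocabulary, and the 0.7 cutoff are all assumptions for illustration and do not reflect how VNLP or Hunspell rank suggestions.

```python
import difflib
from typing import Iterable, List


def sketch_correct_typos(tokens: List[str], vocabulary: Iterable[str]) -> List[str]:
    """Replace out-of-vocabulary tokens with their closest vocabulary entry."""
    vocab = list(vocabulary)
    known = set(vocab)
    corrected: List[str] = []
    for token in tokens:
        if token in known:
            # Detection step: a known word is not treated as a typo.
            corrected.append(token)
            continue
        # Correction step: pick the single most similar known word, if any.
        matches = difflib.get_close_matches(token, vocab, n=1, cutoff=0.7)
        corrected.append(matches[0] if matches else token)
    return corrected


# Hypothetical toy vocabulary; a real system would use a full dictionary.
toy_vocabulary = ["Kasıtlı", "yazım", "hatası", "ekliyorum"]
print(sketch_correct_typos("Kasıtlı yazişm hatasıı ekliyoruum".split(), toy_vocabulary))
```

A pure string-similarity lookup ignores Turkish morphology, which is presumably why the library pairs a morphological analyzer with a dictionary-based corrector.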

static deasciify(tokens: List[str]) → List[str]

Deasciifies the given text for Turkish.

This function uses Emre Sevinç’s implementation.

Parameters: tokens – List of input tokens.
Returns: List of deasciified tokens.

Example:

from vnlp import Normalizer
Normalizer.deasciify("dusunuyorum da boyle sey gormedim duymadim".split())

["düşünüyorum", "da", "böyle", "şey", "görmedim", "duymadım"]

static lower_case(text: str) → str

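Converts a string to lowercase following Turkish casing rules.

This is needed because Python's built-in str.lower() does not handle all Turkish characters properly: it maps “I” to “i” rather than “ı”, and it maps “İ” to “i” followed by a combining dot above rather than to a plain “i”.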

Parameters: text – Input text.
Returns: Text in lowercase form.

Example:

from vnlp import Normalizer
Normalizer.lower_case("Test karakterleri: İIĞÜÖŞÇ")

'test karakterleri: iığüöşç'
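
The casing problem is easy to reproduce with the standard library alone. The sketch below first shows what str.lower() does to the two problematic capitals, then a minimal Turkish-aware alternative. The name turkish_lower is hypothetical, and this is one possible approach, not necessarily the one VNLP takes.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for the dotted and dotless I."""
    # Handle the two special capitals first, then defer to str.lower()
    # for every other character.
    return text.replace("İ", "i").replace("I", "ı").lower()


# Built-in behaviour: "İ" becomes "i" plus U+0307 (combining dot above),
# and "I" becomes a dotted "i" instead of the dotless "ı".
print(len("İ".lower()))   # 2
print("I".lower())        # i

print(turkish_lower("Test karakterleri: İIĞÜÖŞÇ"))
# test karakterleri: iığüöşç
```

The replacements must run before the generic lower() call; otherwise the capitals would already have been converted using the default, non-Turkish mapping.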

static remove_accent_marks(text: str) → str

Removes accent marks from the given string.

Parameters: text – Input text.
Returns: Text stripped of accent marks.

Example:

from vnlp import Normalizer
Normalizer.remove_accent_marks("merhâbâ")

'merhaba'
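
For illustration, accent removal of this kind can be sketched with an explicit translation table. The sketch assumes that only the lowercase circumflex vowels â, î and û need handling; the actual set of characters VNLP covers is not stated here, and the name strip_circumflex is hypothetical.

```python
# Assumption: only lowercase circumflex vowels are treated as accented.
CIRCUMFLEX_MAP = str.maketrans("âîû", "aiu")


def strip_circumflex(text: str) -> str:
    """Replace circumflex vowels with their plain counterparts."""
    return text.translate(CIRCUMFLEX_MAP)


print(strip_circumflex("merhâbâ"))      # merhaba
print(strip_circumflex("düşünüyorum"))  # düşünüyorum (unchanged)
```

An explicit table is used instead of Unicode decomposition followed by dropping combining marks, because that generic approach would also strip the diacritics of genuine Turkish letters such as ö, ü, ş, ç and ğ.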

static remove_punctuations(text: str) → str

Removes punctuation from the given string.

Parameters: text – Input text.
Returns: Text stripped of punctuation.

Example:

from vnlp import Normalizer
Normalizer.remove_punctuations("merhaba,.!")

'merhaba'
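
A comparable effect can be obtained with the standard library. The sketch below assumes that "punctuation" means the ASCII characters in string.punctuation; whether VNLP uses the same character set is not stated here, and the name strip_ascii_punctuation is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_ascii_punctuation(text: str) -> str:
    """Remove ASCII punctuation characters from text."""
    return text.translate(PUNCTUATION_TABLE)


print(strip_ascii_punctuation("merhaba,.!"))  # merhaba
```

Under this assumption, non-ASCII punctuation such as typographic quotes is left in place, and whitespace is never touched.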