Sonority Based Syllabification of Macedonian and Serbian
Journal
Technical editors
Date Issued
2024-11-21
Author(s)
Kuzmanova, Jana
Abstract
Phonetically, syllables are sequences of sounds that contain a single peak of prominence, while
phonologically they are units of stress placement. According to the Sonority Sequencing
Principle, sonority within a syllable rises toward the nucleus and then falls.
So far, there have been several attempts to syllabify Macedonian and Serbian words. The accuracy
of the Macedonian experiment was not evaluated on a specific corpus, while the accuracy of the
Serbian syllabification exceeded 98%. The rule-based approach was rather complex compared to the
sonority-based syllabification that we proposed for Macedonian and extend here to Serbian.
The sonority of Macedonian phonemes depends on their basic classification: vowels (weight
12), sonorants (4), voiced consonants (2) and voiceless consonants (1). When the sonorant р
(Latin transcription: r) stands between two consonants, it becomes a syllable carrier, and
therefore its sonority is raised, initially to 6. Two adjacent vowels are separated by a
fictitious consonant FC.
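The Macedonian weights above can be sketched as a small lookup. This is only an illustration: the weights 12, 4, 2, 1 and 6 come from the abstract, while the assignment of individual letters to classes and the function name `mk_sonority` are assumptions. The fictitious consonant FC is not modelled, since its weight is not stated.

```python
# Phoneme classes for Macedonian Cyrillic.
# ASSUMPTION: the class membership below is illustrative, not taken from the abstract.
VOWELS = set("аеиоу")
SONORANTS = set("јлљмнњр")
VOICED = set("бвгдѓжзѕџ")
VOICELESS = set("кпстќфхцчш")


def mk_sonority(word):
    """Return the list of sonority weights for a Macedonian word."""
    word = word.lower()
    weights = []
    for i, ch in enumerate(word):
        prev_ch = word[i - 1] if i > 0 else ""
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        # True only when both neighbours exist and neither is a vowel.
        between_consonants = bool(
            prev_ch and next_ch
            and prev_ch not in VOWELS and next_ch not in VOWELS
        )
        if ch in VOWELS:
            weights.append(12)
        elif ch == "р" and between_consonants:
            weights.append(6)  # syllabic р: raised weight
        elif ch in SONORANTS:
            weights.append(4)
        elif ch in VOICED:
            weights.append(2)
        elif ch in VOICELESS:
            weights.append(1)
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return weights


print(mk_sonority("крст"))  # [1, 6, 1, 1]
```

In крст, р stands between к and с, so it receives the raised weight and becomes the only sonority peak of the word.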
The sonority scale for Serbian phonemes is more fine-grained and comprises additional classes:
vowels (12), sonorant р (8), other sonorants and voiced plosives (4), voiceless plosives and
voiced fricatives (3), voiceless fricatives and voiced affricates (2), and voiceless affricates (1).
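As an illustration, the Serbian scale can be written as a table from which a per-letter lookup is built. The weights are those listed above. The assignment of Cyrillic letters to classes follows the usual description of Serbian consonants and is an assumption here, as are the names `SR_CLASSES`, `SR_SONORITY` and `sr_sonority`.

```python
# (weight, letters) pairs; the weights are from the abstract.
# ASSUMPTION: which letter belongs to which class is not given in the abstract.
SR_CLASSES = [
    (12, "аеиоу"),             # vowels
    (8, "р"),                  # sonorant р
    (4, "јлљмнњв" + "бдг"),    # other sonorants + voiced plosives
    (3, "птк" + "зж"),         # voiceless plosives + voiced fricatives
    (2, "сшфх" + "џђ"),        # voiceless fricatives + voiced affricates
    (1, "цчћ"),                # voiceless affricates
]
SR_SONORITY = {ch: weight for weight, letters in SR_CLASSES for ch in letters}


def sr_sonority(word, r_weight=8):
    """Return sonority weights for a Serbian word written in Cyrillic.

    r_weight overrides the weight of р, so that other values can be tried.
    """
    return [r_weight if ch == "р" else SR_SONORITY[ch] for ch in word.lower()]


print(sr_sonority("трка"))    # [3, 8, 3, 12]
print(sr_sonority("бицикл"))  # [4, 12, 1, 12, 3, 4]
```

Keeping the weight of р as a parameter makes it easy to compare different values for that one phoneme without rebuilding the table.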
The syllable nuclei in both languages are the five vowels. In Macedonian, a nucleus can also be
the sonorant р appearing within a consonant group (крст, вр-ста, пр-вен-ство) or at the end of
the word (ма-са-кр). In Serbian, apart from the sonorant р (тврд, црв, тр-ка), the
sonorants л and н can also become syllable nuclei (for example, би-ци-кл, Вл-та-ва, Њу-тн).
Nuclei are determined by calculating, for each triplet of phonemes, the differences between the
sonority of the current phoneme and that of its left and right neighbours.
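One possible reading of the triplet comparison is a search for local sonority peaks. The sketch below assumes that a phoneme is a nucleus when both differences (to its left and to its right neighbour) are strictly positive, and that a word boundary counts as sonority 0. Neither detail is stated in the abstract, and `find_nuclei` is a hypothetical name. The input is a plain list of sonority weights.

```python
def find_nuclei(sonority):
    """Return the indices of phonemes that are strict local sonority peaks.

    ASSUMPTIONS: word boundaries are treated as sonority 0, and a nucleus
    must be strictly more sonorous than both of its neighbours.
    """
    padded = [0] + list(sonority) + [0]
    nuclei = []
    for i in range(1, len(padded) - 1):
        left_diff = padded[i] - padded[i - 1]
        right_diff = padded[i] - padded[i + 1]
        if left_diff > 0 and right_diff > 0:
            nuclei.append(i - 1)  # convert back to an index into `sonority`
    return nuclei


# Serbian би-ци-кл with the class weights from the abstract:
# б=4, и=12, ц=1, и=12, к=3, л=4
print(find_nuclei([4, 12, 1, 12, 3, 4]))  # [1, 3, 5]
```

Under these assumptions the final л in бицикл is found as a third nucleus, because it is more sonorous than к and is followed by the word boundary.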
The determination of syllable boundaries depends on whether sonority is monotonically
non-decreasing or decreasing. In Macedonian, whenever the sonority of two consonants is
non-decreasing, they are split between two adjacent syllables. In Serbian, in the same case,
both consonants belong to the second syllable.
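The difference between the two languages for a pair of consonants can be sketched as follows. Only the non-decreasing case is described in the abstract, so the function returns `None` for a decreasing pair rather than guess. The function name, the `"mk"`/`"sr"` language codes and the example word are illustrative assumptions.

```python
def split_two_consonants(word, i, s1, s2, language):
    """Place a syllable boundary around the consonant pair word[i], word[i+1].

    s1 and s2 are the sonority weights of the two consonants.
    language is "mk" (Macedonian) or "sr" (Serbian).
    Returns the hyphenated word, or None when sonority decreases
    (that case is not specified in the abstract).
    """
    if language not in ("mk", "sr"):
        raise ValueError("language must be 'mk' or 'sr'")
    if s1 > s2:
        return None
    # Macedonian: boundary between the consonants.
    # Serbian: both consonants go to the second syllable.
    cut = i + 1 if language == "mk" else i
    return word[:cut] + "-" + word[cut:]


# с and к are both voiceless in the Macedonian scale (1, 1);
# in the Serbian scale с is a voiceless fricative (2) and к a voiceless plosive (3).
print(split_two_consonants("маска", 2, 1, 1, "mk"))  # мас-ка
print(split_two_consonants("маска", 2, 2, 3, "sr"))  # ма-ска
```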
In Macedonian, the accuracy of the baseline algorithm was rather low, mainly because the
suffixes ски, ство and ствен and their inflections, which should remain within one syllable,
were separated. By adjusting this, we achieved an accuracy of 95.60%, evaluated on a corpus
of more than 1000 words. However, the adjustment affected the syllabification of the nouns
гус-ки, мас-ки, прас-ки, in which ски is not a morpheme.
On a sample of more than 3000 syllabified Serbian words, the accuracy of the baseline
algorithm was 97.59%. By changing the sonority of р to 6, the accuracy reached 98.54%,
exceeding the accuracy of the rule-based syllabification.
The approach we proposed is extremely simple and, at the same time, very efficient. We intend
to improve it further by taking into account the PoS tags for the Macedonian language and the
exclusions for Serbian, hoping to reach an accuracy of over 99%.
File(s)
Name
JUDIG-2024-book of abstracts.pdf
Size
2.07 MB
Format
Adobe PDF
Checksum
(MD5):2a74b24284408ac661f4718b8231c41e
