Cultural topic modelling over novel Wikipedia corpora for South-Slavic languages
Date Issued
2021-09
Author(s)
Markoski, Filip
Markoska, Elena
Ljubešić, Nikola
Kocarev, Ljupco
Abstract
There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and to researchers in the social sciences and humanities alike, supporting numerous applications in linguistics, content analysis, and natural language processing. This paper presents a workflow for mining Wikipedia content and processing it into linguistically processed corpora, applied to the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedias. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias broadly reflects the interests in various topics in these Balkan nations. We perform the content comparison using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they differ in their coverage of geography, politics, history and science.
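The abstract mentions comparing Wikipedias through "various distribution comparisons" over topic-model output without naming a specific measure. As an illustration only, the sketch below assumes each Wikipedia has already been summarised as a vector of topic proportions and compares two such vectors with the Jensen-Shannon divergence. The choice of measure, the function name `js_divergence`, and the `topic_shares` values are all assumptions made for this example and are not taken from the paper.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Assumption: p and q are topic-proportion vectors of equal length,
    each summing to 1. The result lies between 0 (identical) and 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")

    # Midpoint distribution between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical topic proportions for three Wikipedias (illustrative values only).
topic_shares = {
    "wiki_a": [0.5, 0.3, 0.2],
    "wiki_b": [0.5, 0.3, 0.2],
    "wiki_c": [0.1, 0.2, 0.7],
}

if __name__ == "__main__":
    names = sorted(topic_shares)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = js_divergence(topic_shares[first], topic_shares[second])
            print(f"{first} vs {second}: {score:.3f}")
```

A lower score indicates two Wikipedias with more similar topic coverage; the measure is symmetric, so the order of the two arguments does not matter.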
File(s)
Name
2021.ranlp-1.104.pdf
Size
247.62 KB
Format
Adobe PDF
Checksum
(MD5):2f9b48349759e44f8b29ca19126081d2
