Cultural topic modelling over novel Wikipedia corpora for South-Slavic languages

Markoski, Filip; Markoska, Elena; Ljubešić, Nikola; Zdravevski, Eftim; Kocarev, Ljupco

Date Issued

2021-09

Author(s)

Markoski, Filip

Markoska, Elena

Ljubešić, Nikola

Zdravevski, Eftim

Kocarev, Ljupco

Abstract

There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and to researchers in the social sciences and humanities alike, supporting numerous linguistic, content analysis, and natural language processing applications. This paper presents a workflow for mining Wikipedia content and processing it into linguistically processed corpora, applied to the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedias. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias broadly reflects the interest in various topics in these Balkan nations. We perform the content comparison using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they differ in their coverage of geography, politics, history and science.

File(s)

Name

2021.ranlp-1.104.pdf

Size

247.62 KB

Format

Adobe PDF

Checksum

(MD5): 2f9b48349759e44f8b29ca19126081d2