Cultural topic modelling over novel Wikipedia corpora for South-Slavic languages

Markoski, Filip; Markoska, Elena; Ljubešić, Nikola; Zdravevski, Eftim; Kocarev, Ljupco

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/22315

Title:	Cultural topic modelling over novel Wikipedia corpora for South-Slavic languages
Authors:	Markoski, Filip; Markoska, Elena; Ljubešić, Nikola; Zdravevski, Eftim; Kocarev, Ljupco
Issue Date:	Sep-2021
Conference:	International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Abstract:	There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and to researchers in the social sciences and humanities alike, supporting numerous linguistic, content analysis, and natural language processing applications. This paper presents a workflow for mining Wikipedia content and processing it into linguistically processed corpora, applied to the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedias. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias broadly reflects the interest in various topics in these Balkan nations. We perform the content comparison using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they differ in their coverage of geography, politics, history, and science.
URI:	http://hdl.handle.net/20.500.12188/22315
Appears in Collections:	Faculty of Computer Science and Engineering: Conference papers

Files in This Item:

File	Description	Size	Format
2021.ranlp-1.104.pdf		247.62 kB	Adobe PDF
