Cultural Topic Modelling over Novel Wikipedia Corpora for South-Slavic Languages

Date Issued
2021-09
Author(s)
Markoski, Filip
Markoska, Elena
Ljubešić, Nikola
Kocarev, Ljupco
Abstract
There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and to researchers in the social sciences and humanities alike, supporting numerous applications in linguistics, content analysis, and natural language processing. This paper presents a workflow for mining Wikipedia content and processing it into linguistically processed corpora, applied to the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedias. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias broadly reflects the interest in various topics in these Balkan nations. We perform the content comparison using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they differ in geography, politics, history and science.
File(s)
Name
2021.ranlp-1.104.pdf
Size
247.62 KB
Format
Adobe PDF
Checksum
(MD5): 2f9b48349759e44f8b29ca19126081d2