Title: Optical character recognition applied on receipts printed in Macedonian Language
Authors: Gjoreski, Martin
Zajkovski, Gorjan
Bogatinov, Aleksandar
Madjarov, Gjorgji
Gjorgjevikj, Dejan
Gjoreski, Hristijan
Keywords: OCR; Receipt digitalization; Tesseract; DTW
Issue Date: Apr-2014
Conference: International Conference on Informatics and Information Technologies (CIIT)
Abstract: The paper presents an approach to Optical Character Recognition (OCR) applied to receipts printed in the Macedonian language. The OCR engine recognizes the characters on the receipt and extracts useful information: the name of the market, the names and prices of the purchased products, the total amount of money spent, and the date and time of the purchase. We used the publicly available OCR framework Tesseract, which was trained on pictures of receipts printed in the Macedonian language. The trained engine recognized characters with 93% accuracy. Additionally, we evaluated a second approach in which the original Tesseract extracts the features from the picture and the final classification is performed by a k-nearest neighbors classifier that uses dynamic time warping as the distance metric. Even though the accuracy achieved with the modified approach was 6 percentage points lower than that of the original approach, it serves as a proof of concept, and we plan to research it further in future publications. Additional analysis of the results showed that accuracy is higher for the words that are prescribed for every receipt, such as the date and the time of the purchase and the total amount of money spent.
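Illustration: the abstract mentions a k-nearest neighbors classifier that uses dynamic time warping (DTW) as the distance metric. The following is a minimal sketch of that general technique, not the authors' implementation. The abstract does not say how the Tesseract features are represented, so the sketch assumes each character is described by a one-dimensional numeric sequence; the function names `dtw_distance` and `knn_dtw_classify`, the labels, and the toy data are all hypothetical.

```python
from collections import Counter


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D numeric sequences.

    Assumption: features are plain lists of numbers; the real feature
    format used in the paper is not given in the abstract.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, or match
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


def knn_dtw_classify(query, training, k=1):
    """Label `query` by majority vote among its k DTW-nearest training samples.

    `training` is a list of (sequence, label) pairs (hypothetical format).
    """
    neighbors = sorted(training,
                       key=lambda item: dtw_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy usage with made-up sequences and labels.
training = [([0, 1, 0], "A"), ([1, 1, 1], "B"), ([0, 2, 0], "A")]
predicted = knn_dtw_classify([0, 1, 1, 0], training, k=1)
```

DTW lets sequences of different lengths be compared by allowing one element to align with several elements of the other sequence, which is why the query of length 4 can be matched against training samples of length 3.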
