Optical character recognition applied on receipts printed in Macedonian Language

Gjoreski, Martin; Zajkovski, Gorjan; Bogatinov, Aleksandar; Madjarov, Gjorgji; Gjorgjevikj, Dejan; Gjoreski, Hristijan

Date Issued

2014-04

Author(s)

Gjoreski, Martin

Zajkovski, Gorjan

Bogatinov, Aleksandar

Madjarov, Gjorgji

Gjorgjevikj, Dejan

Gjoreski, Hristijan

Abstract

The paper presents an approach to Optical
Character Recognition (OCR) applied to receipts printed in the
Macedonian language. The OCR engine recognizes the
characters on the receipt and extracts useful information,
such as the name of the market, the names of the products
purchased, the prices of the products, the total amount of money
spent, and the date and time of the purchase. We used
the publicly available OCR framework Tesseract, which was
trained on pictures of receipts printed in the Macedonian language.
The results showed that it can recognize the characters with 93%
accuracy. Additionally, we tested a modified approach that uses the
original Tesseract to extract features from the picture, with
the final classification performed by a k-nearest neighbors
classifier using dynamic time warping as the distance metric. Even
though the accuracy achieved with the modified approach was
6 percentage points lower than that of the original approach, it is a
proof of concept, and we plan to research it further in future
publications. Additional analysis of the results showed that
the accuracy is higher for the words that are prescribed for
each receipt, such as the date and time of the purchase and
the total amount of money spent.
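
The modified approach described above, a k-nearest neighbors classifier with dynamic time warping (DTW) as the distance, can be illustrated with a minimal sketch. The abstract does not specify what the extracted features look like, so this sketch assumes each character is represented by a one-dimensional numeric sequence. The function names `dtw_distance` and `knn_dtw_predict`, the toy sequences, and the labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import inf


def dtw_distance(a, b):
    """Classic DTW distance between two 1-D numeric sequences.

    Assumption: the local cost is the absolute difference of two values.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again (b stays)
                cost[i][j - 1],      # b[j-1] is matched again (a stays)
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def knn_dtw_predict(train, query, k=1):
    """Predict a label by majority vote among the k DTW-nearest examples.

    train is a list of (sequence, label) pairs; all values are illustrative.
    """
    neighbours = sorted(train, key=lambda item: dtw_distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set with hypothetical feature sequences and labels.
train = [([0, 1, 0], "A"), ([1, 1, 1], "B")]
prediction = knn_dtw_predict(train, [0, 0, 1, 0], k=1)
```

DTW lets sequences of different lengths be compared by stretching one against the other, which is why the four-element query above can still match the three-element training example exactly.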

Subjects

OCR; Receipt digitali...

File(s)

Name

CIIT2014.59.pdf

Size

304.25 KB

Format

Adobe PDF

Checksum

(MD5):72845ae33ff7b644feefcc1c0fc3c0b4