Optical character recognition applied on receipts printed in Macedonian Language
Date Issued
2014-04
Author(s)
Gjoreski, Martin
Zajkovski, Gorjan
Bogatinov, Aleksandar
Madjarov, Gjorgji
Gjorgjevikj, Dejan
Gjoreski, Hristijan
Abstract
The paper presents an approach to Optical
Character Recognition (OCR) applied to receipts printed in
Macedonian. The OCR engine recognizes the
characters on the receipt and extracts useful information:
the name of the market, the names of the products
purchased, their prices, the total amount of money
spent, and the date and time of the purchase. We used
the publicly available OCR framework Tesseract, which was
trained on pictures of receipts printed in Macedonian.
The results showed that it recognizes characters with 93%
accuracy. Additionally, we tested a second approach that uses the
original Tesseract to extract features from the picture and
performs the final classification with a k-nearest neighbors
classifier using dynamic time warping as the distance metric. Even
though the accuracy achieved with the modified approach was
6 percentage points lower than that of the original approach, it is a
proof of concept, and we plan to research it further in future
publications. Additional analysis of the results showed that
accuracy is higher for the words that are prescribed for
every receipt, such as the date and time of the purchase and
the total amount of money spent.
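
The second approach in the abstract, k-nearest neighbors classification with dynamic time warping (DTW) as the distance, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the features extracted by Tesseract are represented, so the sketch assumes each character is described by a one-dimensional numeric sequence. The function names `dtw_distance` and `knn_dtw_classify`, the toy sequences, and the labels "A" and "B" are all hypothetical.

```python
from collections import Counter


def dtw_distance(a, b):
    """Textbook DTW distance between two numeric sequences (assumed feature format)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_dtw_classify(query, training, k=3):
    """Majority vote among the k training sequences closest to `query` under DTW.

    `training` is a list of (sequence, label) pairs.
    """
    neighbors = sorted(training, key=lambda item: dtw_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data: hypothetical feature sequences for two made-up character classes.
training = [
    ([0, 1, 2, 3], "A"),
    ([0, 1, 1, 2, 3], "A"),
    ([3, 2, 1, 0], "B"),
    ([3, 3, 2, 1, 0], "B"),
]
predicted = knn_dtw_classify([0, 1, 2, 2, 3], training, k=3)
```

DTW lets sequences of different lengths be compared by stretching one against the other, which is why the query above matches both rising "A" sequences even though none has the same length and content as the query.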
File(s)
Name
CIIT2014.59.pdf
Size
304.25 KB
Format
Adobe PDF
Checksum
(MD5):72845ae33ff7b644feefcc1c0fc3c0b4
