Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/27399
DC Field | Value | Language
dc.contributor.author | Bajrami, Merxhan | en_US
dc.contributor.author | Kulakov, Andrea | en_US
dc.contributor.author | Gaissmaier, Yvonne | en_US
dc.contributor.author | Lameski, Petre | en_US
dc.date.accessioned | 2023-08-15T08:42:57Z | -
dc.date.available | 2023-08-15T08:42:57Z | -
dc.date.issued | 2023-07 | -
dc.identifier.uri | http://hdl.handle.net/20.500.12188/27399 | -
dc.description.abstract | In many industries, organizations must manage batches of multiple documents merged into a single file. This makes it difficult to identify where each individual document begins and ends, so document processing becomes a time-consuming and error-prone task. For example, many businesses receive invoices in large batches that need to be processed quickly and accurately, and manually splitting a large file containing multiple invoices into individual files is slow and prone to mistakes. In the legal and financial sectors, large volumes of documents such as contracts, invoices, and receipts can be merged together, leading to difficulties in managing and processing them. To address this challenge, we propose a binary classification approach using the Donut [1] model, an OCR-free model that can learn to recognize patterns and features in the data without relying on optical character recognition. Our approach involves fine-tuning the model on a dataset of 5527 files, manually labeled into "new document" and "same document" classes. We developed a new methodology for creating the dataset that ensures a well-balanced distribution of examples for each class, and carefully selected hyperparameters to optimize the performance of the model. Our approach achieved an accuracy of 0.89, an F1 score of 0.92, a precision of 0.87, and a recall of 0.93. These results suggest that the proposed approach is highly effective in identifying individual documents within merged PDFs, which has significant implications for a range of industries. For instance, in the legal sector, our approach could help to automate document separation, making it easier for lawyers to manage and process large volumes of legal documents. In the financial sector, the approach could help to streamline the processing of invoices, receipts, and other financial documents. | en_US
dc.publisher | Ss Cyril and Methodius University in Skopje, Faculty of Computer Science and Engineering, Republic of North Macedonia | en_US
dc.relation.ispartofseries | CIIT 2023 papers;24; | -
dc.subject | Page stream segmentation, section segmentation, Donut, intelligent document processing, deep learning | en_US
dc.title | DocSplitAI: A Deep Learning Approach for Document Segmentation in Large PDFs | en_US
dc.type | Proceeding article | en_US
dc.relation.conference | 20th International Conference on Informatics and Information Technologies - CIIT 2023 | en_US
item.fulltext | With Fulltext | -
item.grantfulltext | open | -
crisitem.author.dept | Faculty of Computer Science and Engineering | -
crisitem.author.dept | Faculty of Computer Science and Engineering | -
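
The abstract frames document splitting as a binary decision between "new document" and "same document". As a rough illustration only, the sketch below shows how such labels could be turned into page groups, assuming one label per page in reading order. The function name, the boolean label encoding, and the per-page granularity are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def split_page_stream(is_new_document: List[bool]) -> List[List[int]]:
    """Group page indices into documents from per-page binary labels.

    Hypothetical helper: True means the page was classified as
    "new document", False means "same document".
    """
    documents: List[List[int]] = []
    for page_index, starts_new in enumerate(is_new_document):
        # The first page always opens a document, whatever its label says.
        if starts_new or not documents:
            documents.append([page_index])
        else:
            documents[-1].append(page_index)
    return documents


# Six pages: a three-page document, a single page, then a two-page document.
labels = [True, False, False, True, True, False]
print(split_page_stream(labels))  # [[0, 1, 2], [3], [4, 5]]
```

Each inner list holds the zero-based page indices of one recovered document, which could then be written out as a separate file.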
Appears in Collections:Faculty of Computer Science and Engineering: Conference papers
Files in This Item:
File | Description | Size | Format
CIIT2023_paper_24.pdf | | 9.18 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.