Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/27399
Title: DocSplitAI: A Deep Learning Approach for Document Segmentation in Large PDFs
Authors: Bajrami, Merxhan; Kulakov, Andrea; Gaissmaier, Yvonne; Lameski, Petre
Keywords: Page stream segmentation, section segmentation, Donut, intelligent document processing, deep learning
Issue Date: Jul-2023
Publisher: Ss Cyril and Methodius University in Skopje, Faculty of Computer Science and Engineering, Republic of North Macedonia
Series/Report no.: CIIT 2023 papers;24;
Conference: 20th International Conference on Informatics and Information Technologies - CIIT 2023
Abstract: In many industries, organizations face the challenge of managing batches of documents merged into a single file. This makes it difficult to identify where each individual document begins and ends, and turns document processing into a time-consuming and error-prone task. For example, many businesses receive invoices in large batches that need to be processed quickly and accurately, and manually splitting a large file containing multiple invoices into individual files is slow and error-prone. In the legal and financial sectors, large volumes of documents such as contracts, invoices, and receipts can be merged together, leading to difficulties in managing and processing them. To address this challenge, we propose a binary classification approach using the Donut [1] model, an OCR-free model that can learn to recognize patterns and features in the data without relying on optical character recognition. Our approach involves fine-tuning the model on a dataset of 5527 files, manually labeled into "new document" and "same document" classes. We developed a new methodology for creating the dataset that ensures a well-balanced distribution of examples for each class, and carefully selected hyperparameters to optimize the performance of the model. Our approach achieved an accuracy of 0.89, an F1 score of 0.92, a precision of 0.87, and a recall of 0.93. These results suggest that the proposed approach is highly effective in identifying individual documents within merged PDFs, which has significant implications for a range of industries. For instance, in the legal sector, our approach could help to automate document separation, making it easier for lawyers to manage and process large volumes of legal documents. In the financial sector, the approach could help to streamline the processing of invoices, receipts, and other financial documents.
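The abstract frames segmentation as a per-page binary decision ("new document" vs. "same document") but does not say how those decisions are turned into separate files. The following is a minimal sketch of that last step, assuming each page of the merged PDF has already received one boolean prediction; the function name `split_page_stream` and the input format are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def split_page_stream(is_new_document: Sequence[bool]) -> List[List[int]]:
    """Group page indices into documents from per-page binary predictions.

    is_new_document[i] is True when page i is predicted to start a new
    document and False when it continues the previous one (assumed format).
    """
    documents: List[List[int]] = []
    for page_index, starts_new in enumerate(is_new_document):
        # The first page always opens a document, whatever the prediction says.
        if starts_new or not documents:
            documents.append([page_index])
        else:
            documents[-1].append(page_index)
    return documents


# Hypothetical predictions for a six-page merged file.
predictions = [True, False, False, True, True, False]
print(split_page_stream(predictions))  # [[0, 1, 2], [3], [4, 5]]
```

Each inner list holds the zero-based page indices of one recovered document, which a PDF library could then write out as an individual file.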
URI: http://hdl.handle.net/20.500.12188/27399
Appears in Collections: Faculty of Computer Science and Engineering: Conference papers

Files in This Item:
File: CIIT2023_paper_24.pdf
Size: 9.18 MB
Format: Adobe PDF

Page view(s): 61 (checked on May 29, 2024)
Download(s): 90 (checked on May 29, 2024)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.