DocSplitAI: A Deep Learning Approach for Document Segmentation in Large PDFs

Bajrami, Merxhan; Kulakov, Andrea; Gaissmaier, Yvonne; Lameski, Petre

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/27399

Title:	DocSplitAI: A Deep Learning Approach for Document Segmentation in Large PDFs
Authors:	Bajrami, Merxhan; Kulakov, Andrea; Gaissmaier, Yvonne; Lameski, Petre
Keywords:	Page stream segmentation, section segmentation, Donut, intelligent document processing, deep learning
Issue Date:	Jul-2023
Publisher:	Ss Cyril and Methodius University in Skopje, Faculty of Computer Science and Engineering, Republic of North Macedonia
Series/Report no.:	CIIT 2023 papers;24;
Conference:	20th International Conference on Informatics and Information Technologies - CIIT 2023
Abstract:	In many industries, organizations face the challenge of managing batches of multiple documents merged into a single file. This makes it difficult to identify where each individual document begins and ends, which turns document processing into a time-consuming and error-prone task. For example, many businesses receive invoices in large batches that need to be processed quickly and accurately, and manually splitting a large file containing multiple invoices into individual files is slow and error-prone. In the legal and financial sectors, large volumes of documents such as contracts, invoices, and receipts can be merged together, leading to difficulties in managing and processing them. To address this challenge, we propose a binary classification approach using the Donut [1] model, an OCR-free model that can learn to recognize patterns and features in the data without relying on optical character recognition. Our approach involves fine-tuning the model on a dataset of 5527 files, manually labeled into "new document" and "same document" classes. We developed a new methodology for creating the dataset that ensures a well-balanced distribution of examples for each class, and carefully selected hyperparameters to optimize the performance of the model. Our approach achieved an accuracy of 0.89, an F1 score of 0.92, a precision of 0.87, and a recall of 0.93. These results suggest that the proposed approach is effective at identifying individual documents within merged PDFs, which has significant implications for a range of industries. For instance, in the legal sector, our approach could help automate document separation, making it easier for lawyers to manage and process large volumes of legal documents. In the financial sector, the approach could help streamline the processing of invoices, receipts, and other financial documents.
URI:	http://hdl.handle.net/20.500.12188/27399
Appears in Collections:	Faculty of Computer Science and Engineering: Conference papers
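
The abstract describes labeling each page as either "new document" or "same document". As a minimal sketch of how such per-page labels could be turned into document boundaries, the Python below groups page indices into ranges. This is an illustration under stated assumptions, not the authors' implementation: the function name split_page_stream and the boolean input format are hypothetical, and the classifier itself (the fine-tuned Donut model) is not reproduced here.

```python
from typing import List, Tuple


def split_page_stream(is_new_document: List[bool]) -> List[Tuple[int, int]]:
    """Group pages into (first_page, last_page) ranges, 0-based and inclusive.

    is_new_document[i] is True when page i was classified as starting a
    new document (hypothetical classifier output). The first page is always
    treated as the start of a document, whatever its label.
    """
    segments: List[Tuple[int, int]] = []
    start = 0
    for page, starts_new in enumerate(is_new_document):
        # A "new document" label on any page after the first closes the
        # segment that was open and begins a new one at this page.
        if page > 0 and starts_new:
            segments.append((start, page - 1))
            start = page
    # Close the final open segment, if the stream had any pages at all.
    if is_new_document:
        segments.append((start, len(is_new_document) - 1))
    return segments


if __name__ == "__main__":
    # Six pages: a 3-page document, a 1-page document, a 2-page document.
    labels = [True, False, False, True, True, False]
    print(split_page_stream(labels))
```

Each returned range could then be passed to a PDF library of choice to write the individual files. Since a single misclassified page merges or splits two documents, the precision and recall reported in the abstract bear directly on how many segments come out wrong.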

Files in This Item:

File	Description	Size	Format
CIIT2023_paper_24.pdf		9.18 MB	Adobe PDF
