DocSplitAI: A Deep Learning Approach for Document Segmentation in Large PDFs

Bajrami, Merxhan; Kulakov, Andrea; Gaissmaier, Yvonne; Lameski, Petre

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/27399

DC Field	Value	Language
dc.contributor.author	Bajrami, Merxhan	en_US
dc.contributor.author	Kulakov, Andrea	en_US
dc.contributor.author	Gaissmaier, Yvonne	en_US
dc.contributor.author	Lameski, Petre	en_US
dc.date.accessioned	2023-08-15T08:42:57Z	-
dc.date.available	2023-08-15T08:42:57Z	-
dc.date.issued	2023-07	-
dc.identifier.uri	http://hdl.handle.net/20.500.12188/27399	-
dc.description.abstract	In many industries, organizations face the challenge of managing batches of multiple documents merged into a single file. This makes it difficult to identify where each individual document begins and ends, and turns document processing into a time-consuming and error-prone task. For example, many businesses receive invoices in large batches that need to be processed quickly and accurately, and manually splitting a large file containing multiple invoices into individual files is slow and error-prone. In the legal and financial sectors, large volumes of documents such as contracts, invoices, and receipts can be merged together, leading to difficulties in managing and processing them. To address this challenge, we propose a binary classification approach using the Donut [1] model, an OCR-free model that can learn to recognize patterns and features in the data without relying on optical character recognition. Our approach involves fine-tuning the model on a dataset of 5527 files, manually labeled into "new document" and "same document" classes. We developed a new methodology for creating the dataset that ensures a well-balanced distribution of examples for each class, and carefully selected hyperparameters to optimize the performance of the model. Our approach achieved an accuracy of 0.89, an F1 score of 0.92, a precision of 0.87, and a recall of 0.93. These results suggest that the proposed approach is effective in identifying individual documents within merged PDFs, which has implications for a range of industries. For instance, in the legal sector, it could help automate document separation, making it easier for lawyers to manage and process large volumes of legal documents. In the financial sector, it could help streamline the processing of invoices, receipts, and other financial documents.	en_US
dc.publisher	Ss Cyril and Methodius University in Skopje, Faculty of Computer Science and Engineering, Republic of North Macedonia	en_US
dc.relation.ispartofseries	CIIT 2023 papers;24;	-
dc.subject	Page stream segmentation, section segmentation, Donut, intelligent document processing, deep learning	en_US
dc.title	DocSplitAI: A Deep Learning Approach for Document Segmentation in Large PDFs	en_US
dc.type	Proceeding article	en_US
dc.relation.conference	20th International Conference on Informatics and Information Technologies - CIIT 2023	en_US
item.fulltext	With Fulltext	-
item.grantfulltext	open	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
Appears in Collections:	Faculty of Computer Science and Engineering: Conference papers
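
The abstract frames segmentation as a binary decision between a "new document" class and a "same document" class. As a minimal sketch of how such decisions could be turned into document boundaries, the Python snippet below assumes one label per page, where True means the page starts a new document. The function name split_page_stream and the per-page label layout are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple


def split_page_stream(is_new_document: List[bool]) -> List[Tuple[int, int]]:
    """Group pages into documents from per-page boundary labels.

    Assumption (not from the paper): is_new_document[i] is True when
    page i is classified as "new document" and False when it is
    classified as "same document". Returns inclusive, 0-based
    (first_page, last_page) ranges, one per detected document.
    """
    ranges: List[Tuple[int, int]] = []
    start = 0
    for page, starts_new in enumerate(is_new_document):
        # The first page always opens a document, whatever its label.
        if page > 0 and starts_new:
            ranges.append((start, page - 1))
            start = page
    # Close the last open document, unless the input was empty.
    if is_new_document:
        ranges.append((start, len(is_new_document) - 1))
    return ranges


# Hypothetical labels for a six-page merged PDF.
labels = [True, False, False, True, True, False]
print(split_page_stream(labels))  # [(0, 2), (3, 3), (4, 5)]
```

Each returned range could then be handed to a PDF library to write the pages out as a separate file; that step is omitted here.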

Files in This Item:

File	Description	Size	Format
CIIT2023_paper_24.pdf		9.18 MB	Adobe PDF	View/Open

Page view(s): 111 (checked on 4.5.2025)

Download(s): 179 (checked on 4.5.2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

UKIM Repository of Papers
