DocSplitAI: A Deep Learning Approach for Document Segmentation in Large PDFs

Date Issued
2023-07
Author(s)
Bajrami, Merxhan
Gaissmaier, Yvonne
Abstract
Organizations in many industries must manage batches of documents
merged into a single file. It is then difficult to identify where
each individual document begins and ends, which makes document
processing time-consuming and error-prone. For example, many
businesses receive invoices in large batches that must be processed
quickly and accurately, and manually splitting a large file
containing multiple invoices into individual files is slow and prone
to mistakes. In the legal and financial sectors, large volumes of
contracts, invoices, and receipts are likewise merged together, which
complicates managing and processing them.
To address this challenge, we propose a binary classification
approach using the Donut [1] model, an OCR-free model that learns
to recognize patterns and features in the data without relying on
optical character recognition. We fine-tune the model on a dataset
of 5527 files, manually labeled into "new document" and "same
document" classes. We developed a new methodology for creating the
dataset that ensures a well-balanced distribution of examples across
the two classes, and we carefully selected hyperparameters to
optimize the performance of the model.
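As a rough illustration of how a page-level binary classifier can drive splitting, the sketch below groups page indices into documents from a sequence of per-page labels. It assumes, which the abstract does not state, that each page receives exactly one label and that the first page always starts a document. The function name `split_by_labels` and the label strings are hypothetical, and no model is invoked.

```python
from typing import List

# Hypothetical label strings mirroring the two classes named in the abstract.
NEW_DOC = "new document"
SAME_DOC = "same document"


def split_by_labels(labels: List[str]) -> List[List[int]]:
    """Group 0-based page indices into documents from per-page labels."""
    documents: List[List[int]] = []
    for page_index, label in enumerate(labels):
        if label not in (NEW_DOC, SAME_DOC):
            raise ValueError(f"unknown label: {label!r}")
        # Assumption: the first page always opens a document,
        # as does any page labeled "new document".
        if page_index == 0 or label == NEW_DOC:
            documents.append([page_index])
        else:
            documents[-1].append(page_index)
    return documents


if __name__ == "__main__":
    labels = [NEW_DOC, SAME_DOC, NEW_DOC, SAME_DOC, SAME_DOC]
    print(split_by_labels(labels))
```

In a real pipeline the labels would come from the fine-tuned classifier, and each group of page indices would then be written out as its own PDF.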
Our approach achieved an accuracy of 0.89, an F1 score of 0.92,
a precision of 0.87, and a recall of 0.93. These results suggest
that the approach is highly effective in identifying individual
documents within merged PDFs, which has significant implications
for a range of industries. For instance, in the legal sector, our
approach could help to automate document separation, making it
easier for lawyers to manage and process large volumes of legal
documents. In the financial sector, it could help to streamline
the processing of invoices, receipts, and other financial
documents.
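For reference, accuracy, precision, recall, and F1 are standard functions of a binary confusion matrix. The sketch below computes them from raw counts. Treating "new document" as the positive class is an assumption, the names `BinaryMetrics` and `binary_metrics` are hypothetical, and the counts in the usage example are invented for illustration and are not the paper's data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> BinaryMetrics:
    """Compute standard binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Guard each ratio against a zero denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics((tp + tn) / total, precision, recall, f1)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(binary_metrics(tp=8, fp=2, fn=2, tn=8))
```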
Subjects

Page stream segmentat...

File(s)
Name
CIIT2023_paper_24.pdf
Size
8.97 MB
Format
Adobe PDF
Checksum
(MD5): ab32bcc4f08739f2b0dfdbc60d6767d5

