Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images

December 11, 2022 · Declared Dead · 🏛 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Hongkuan Zhang, Edward Whittaker, Ikuo Kitagishi
arXiv ID: 2212.05525
Category: cs.CL: Computation & Language
Cross-listed: cs.CV
Citations: 10
Venue: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Last Checked: 4 months ago

Abstract

Digitization of scanned receipts aims to extract text from receipt images and save it into structured documents. This is usually split into two sub-tasks: text localization and optical character recognition (OCR). Most existing OCR models focus only on cropped text instance images, which require bounding box information provided by a text region detection model. Introducing an additional detector to identify the text instance images in advance adds complexity. However, instance-level OCR models have very low accuracy when processing the whole image for document-level OCR, such as receipt images containing multiple text lines arranged in various layouts. To this end, we propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end. Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks, and gradually increase the image chunk size to generalize the recognition ability from instance images to full-page images. In our experiments on the SROIE receipt OCR dataset, the model finetuned with our strategy achieved an F1-score of 64.4 and a character error rate (CER) of 22.8%, outperforming the baseline results of 48.5 F1-score and 50.6% CER. The best model, which splits the full image into 15 equally sized chunks, gives 87.8 F1-score and 4.98% CER with minimal additional pre- or post-processing of the output. Moreover, the characters in the generated document-level sequences are arranged in reading order, which is practical for real-world applications.
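
To make the two quantitative ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "equally sized chunks" means horizontal strips stacked from top to bottom, and that CER follows the common definition of character-level edit distance divided by reference length; the abstract confirms neither detail. The function names `chunk_bounds`, `edit_distance`, and `cer` are hypothetical.

```python
from typing import List, Tuple


def chunk_bounds(height: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split `height` pixel rows into `n_chunks` near-equal (top, bottom) ranges.

    Assumption: chunks are horizontal strips ordered top to bottom.
    """
    if n_chunks <= 0 or height < n_chunks:
        raise ValueError("need 0 < n_chunks <= height")
    # Integer arithmetic keeps the strips contiguous and covers every row.
    edges = [i * height // n_chunks for i in range(n_chunks + 1)]
    return list(zip(edges[:-1], edges[1:]))


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # A 1500-row image split into 15 strips of 100 rows each.
    print(chunk_bounds(1500, 15)[:3])
    # One substituted character out of eleven.
    print(cer("TOTAL 12.50", "TOTAL 12.80"))
```

Under these assumptions, each strip would be transcribed separately and the per-chunk outputs concatenated in top-to-bottom order before scoring against the full-page reference text.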