OCR application using Tesseract and Tess4J

The following instructions describe how to set up an OCR application using Tesseract and Tess4J.

  1. We use Tesseract (Click Here), an open-source OCR library under continuous development by Google and the community.

  2. To call this standalone application from Java applications, we use a client wrapper called Tess4J (Click Here).

Integration steps with Java

1. First, install Tesseract on your target server. Ubuntu is assumed as the server OS here; for other operating systems, please follow the documentation (Click Here).

    sudo apt install tesseract-ocr
    sudo apt install libtesseract-dev
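
To confirm from Java that the installation succeeded, the `tesseract --version` command can be run as a subprocess. The following is a minimal sketch, not part of the original setup: the class name `TesseractVersionCheck` and its helper methods are hypothetical, it assumes the `tesseract` binary is on the `PATH`, and it assumes the first line of the version output has the form `tesseract <version>`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class TesseractVersionCheck {

    // Extracts the version token from a line such as "tesseract 4.1.1".
    // Returns null when the line does not have the assumed shape.
    public static String parseVersion(String firstLine) {
        if (firstLine == null) {
            return null;
        }
        String[] parts = firstLine.trim().split("\\s+");
        boolean looksRight = parts.length >= 2 && parts[0].equalsIgnoreCase("tesseract");
        return looksRight ? parts[1] : null;
    }

    // Compares only the leading (major) component of two dotted version strings.
    public static boolean sameMajor(String a, String b) {
        return a.split("\\.")[0].equals(b.split("\\.")[0]);
    }

    // Runs "tesseract --version" and parses the first line of its output.
    public static String installedVersion() throws IOException, InterruptedException {
        Process process = new ProcessBuilder("tesseract", "--version")
                .redirectErrorStream(true) // some builds print the version on stderr
                .start();
        String firstLine;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = reader.readLine();
            while (reader.readLine() != null) {
                // drain the remaining output so the process can exit
            }
        }
        process.waitFor();
        return parseVersion(firstLine);
    }

    public static void main(String[] args) {
        String tess4jVersion = "4.0.0"; // the Tess4J version used in this guide
        try {
            String installed = installedVersion();
            System.out.println("Installed Tesseract version: " + installed);
            if (installed != null && !sameMajor(installed, tess4jVersion)) {
                System.out.println("Warning: major version differs from Tess4J " + tess4jVersion);
            }
        } catch (IOException | InterruptedException e) {
            System.out.println("Could not run tesseract: " + e.getMessage());
        }
    }
}
```

Comparing only the major component is a simplification chosen for this sketch; adjust the check to whatever compatibility rule your Tess4J release documents.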

2. Download the trained data file for the language you need from the Tesseract GitHub repository (Click Here). English is assumed as the target language here, so I downloaded eng.traineddata from that site and saved it in a folder on my local machine. We will use the path to this folder in the Java program.
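
Before wiring the folder into the OCR code, it can help to check that the trained data file is really where the program will look for it. The sketch below uses only the Java standard library; the class name `TrainedDataCheck`, the method `hasTrainedData`, and the default folder name `tessdata` are hypothetical and chosen for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TrainedDataCheck {

    // Returns true when <tessdataDir>/<language>.traineddata exists as a regular file.
    public static boolean hasTrainedData(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // Folder to check; pass your own tessdata folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String language = "eng"; // English, as assumed in this guide
        if (hasTrainedData(dir, language)) {
            System.out.println("Found " + language + ".traineddata in " + dir.toAbsolutePath());
        } else {
            System.out.println("Missing " + language + ".traineddata in " + dir.toAbsolutePath());
        }
    }
}
```

Running this check at application startup gives a clearer message than an OCR failure later on.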

3. Create a Spring Boot project using Spring Initializr and add the Maven dependency for Tess4J, the client wrapper for Tesseract, to the POM file.

    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>4.0.0</version>
    </dependency>


!!! Summary "Caution"
        The <b>Tesseract</b> version installed in step 1 and the <b>Tess4J</b> version should be the same.

4. Create a demo Java class in the Spring Boot project as below.

    import java.io.File;

    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    public class OcrExample {

        // Builds a Tesseract instance configured for English.
        private static Tesseract getTesseract() {
            Tesseract instance = new Tesseract();
            // Folder that contains the eng.traineddata file
            instance.setDatapath("/home/Documents/work/OCR/tessdata");
            instance.setLanguage("eng");
            //instance.setHocr(true);
            return instance;
        }

        public static void main(String[] args) throws TesseractException {
            Tesseract tesseract = getTesseract();
            // Path to the source image to convert to text
            File file = new File("/home/Documents/work/OCR/images/myimage.jpg");
            String result = tesseract.doOCR(file);
            System.out.println(result);
        }
    }

Accuracy Improvement Tips

  1. OCR accuracy depends directly on the quality of the input image.

  2. If the input image is not of sufficient quality, you can rescale it with image-processing software.

  3. I did this using ImageMagick on Ubuntu; please follow the steps documented in this link (Click Here).
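
As an alternative to an external tool, rescaling can also be done in Java before the image is handed to the OCR engine. This is not the ImageMagick procedure referenced above but a rough stand-in built on the standard `java.awt` classes; the class name `ImageUpscaler`, the method `scaleToGray`, and the factor of 2.0 are assumptions for illustration, and the right factor depends on your images.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ImageUpscaler {

    // Returns a grayscale copy of src, resized by the given factor
    // using bicubic interpolation.
    public static BufferedImage scaleToGray(BufferedImage src, double factor) {
        if (factor <= 0) {
            throw new IllegalArgumentException("factor must be positive");
        }
        int width = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int height = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            // Drawing into the gray target both resizes and converts the colors.
            g.drawImage(src, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a scanned page; a real program would load one with ImageIO.read.
        BufferedImage page = new BufferedImage(300, 200, BufferedImage.TYPE_INT_RGB);
        BufferedImage scaled = scaleToGray(page, 2.0);
        System.out.println(scaled.getWidth() + " x " + scaled.getHeight());
    }
}
```

Tess4J also offers a `doOCR` overload that accepts a `BufferedImage`, so the scaled image can be passed to it directly instead of being written back to disk first.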

Sample image and OCR output

OCRApplication.jpg

Output

DEPARTMENT OF HEALTH AND HUMAN SERVICES
Food and Drug Administration

indications for Use

Form Approved: OMB No. 0910-0120
Expiration Date: 06/30/2020
See PRA Statement below



570¢k) Number (if know)
Kis2714
Device Name
Novidia™ Bulk Fill Flow Composite

Indications for Use (Describe)

The Novidia™ Bulk Fill Flow Composite is indicated for:

|. Base under Class | and II direct restorations

2. Liner under direct restorative materials

3. Pit and fissure sealant

4. Restoration of minimally invasive cavity preparations (including small, non-stress-bearing occlusal restorations)
5. Class II] and V restorations

6. Blocking out of undercuts

7. Repair of small enamel defects

8. Repair of small defects in esthetic indirect restorations

9. Repair of resin and acrylic temporary materials

10. As a core build-up where at least half the coronal tooth structure is remaining to provide structural support for
the crown

Type of Use (Select one or both, as appicatie}
2) Prescription Use (Part 21 CFR 801 Subpart D) _| Over-The-Counter Use (21 CFR 801 Subpart C)

CONTINUE ON A SEPARATE PAGE IF NEEDED.

This section apples only to requirements of the Paperwork Reduction Act of 1996
“DO NOT SEND YOUR COMPLETED FORM TO THE PRA STAFF EMAIL ADDRESS BELOW,

The burden time for thes collection of information is estimated to average 79 hours per response, including the
time to review instructions, search existing date sources, gather and maintain the data needed and complete
and review the collection of information. Send comments regarding thés burden estenate or any other aspect
of this information collection, including suggestions for reducing this burden, to:

Department of Health and Human Services

Food and Drug Administration

Office of Chief Information Officer

Paperwork Reduction Act (PRA) Staff

PRAStafiiida hhs. gov

‘An agaricy may nol conduct or sportsor, and & person is not required to respond fo, a collection of
information unless uf displays 4 currently valid OMB number “

FORM FDA 3881 (7/17) Page 1 of | Webley eave sae tot ef

Usage in KPAI

In the future we may receive patient documents in a format that requires OCR (for example, scanned images), so we can convert them to text using this application before GATE analysis.

Thank you. Document by: Krishna Reddy