User Manual

3.71 Extract Text (OCR)

In this article, you will learn

  • which prerequisites are required for this Fixup, and
  • how to apply this Fixup.

1. Introduction

The Fixup Extract Text (OCR) is available in the Workflow. Its purpose is to extract all text elements of the PDF file in the selected language and store them as a text file (.txt) in the tab "Additional Data".

2. General

Use the Fixup Extract Text (OCR) to extract characters and numbers from the Print Item in the selected language. Increasing the contrast can significantly improve the readability of the text in some cases. This Fixup does not change the Print Item itself.

A typical use case is checking the spelling of text in a file. The Fixup itself does not check spelling, but the extracted text can be checked for spelling errors using a spellchecker of your choice.
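As a minimal sketch of such a check, the following Python snippet lists the words in an extracted text that are missing from a word list. The function name find_unknown_words, the tiny word list, and the sample sentence are illustrative assumptions and not part of the Workflow; a real check would use a full dictionary or a dedicated spellchecker.

```python
import re


def find_unknown_words(text, known_words):
    """Return words in `text` that are not in `known_words`, in order of first appearance."""
    seen = set()
    unknown = []
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    for word in re.findall(r"[^\W\d_]+", text):
        key = word.lower()
        if key not in known_words and key not in seen:
            seen.add(key)
            unknown.append(word)
    return unknown


# Hypothetical word list; a real check would load a complete dictionary.
KNOWN_WORDS = {"the", "quick", "brown", "fox"}

# In practice, read the text file downloaded from the tab "Additional Data"
# instead of using this sample string.
sample = "The quick brwon fox"
print(find_unknown_words(sample, KNOWN_WORDS))  # ['brwon']
```

Keep in mind that OCR errors and genuine spelling errors look the same in the extracted text, so each reported word should be compared against the Print Item.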

You can find this Fixup on the tab Data Preparation of an Article, Order or Production Job by:

  • entering the name of the Fixup in the area Filter,
  • activating the option Text in the area Category,
  • activating the option Create in the area Action,
  • activating the option Text or OCR in the area Property.

3. Description

This Fixup uses OCR to extract text from the file, even if the text has already been converted to an outline. The resulting text file is saved in the tab “Additional Data”. Increasing the contrast can often improve the recognition of letters so that it is even possible to distinguish gray letters on a gray background.
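How the Fixup increases the contrast internally is not described here. The following Python sketch only illustrates the general idea with a simple linear contrast stretch on gray values, showing why gray letters on a gray background become easier to separate. The function name stretch_contrast and the factor 4.0 are assumptions chosen for illustration.

```python
def stretch_contrast(pixels, factor):
    """Scale each gray value's distance from mid-gray (128) by `factor`, clamped to 0..255."""
    return [max(0, min(255, round(128 + (p - 128) * factor))) for p in pixels]


# One row of pixels: gray letters (120) on a slightly lighter gray background (140).
row = [140, 140, 120, 120, 140]

print(stretch_contrast(row, 1.0))  # [140, 140, 120, 120, 140], unchanged
print(stretch_contrast(row, 4.0))  # [176, 176, 96, 96, 176], difference grows from 20 to 80
```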

Note that this Fixup does not change the Print Item.

3.1. Prerequisites and Functionality

For the Fixup to perform as intended, the following requirements must be met:

  • The file must contain letters or numbers.
  • The form in which the letters or numbers appear is irrelevant. They may be present
    • as text objects,
    • as vectors (text converted to outlines), or
    • in images (text that has already been rendered).

Extraction Sequence

Please note that text passages are not always extracted in the correct sequence. This is particularly the case when one text block spans the entire width of the Print Item and further text blocks are arranged below it, e.g., in two columns. Text recognition generally starts at the top left and proceeds to the bottom right, but the exact order depends on the structure of the file. Text that has been converted to outlines and text in images is always added at the end, with outlined text placed before image text.
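The ordering rules above can be modeled roughly as follows. This Python sketch is a simplified illustration, not the Fixup's actual algorithm: the Block class, its fields, and the coordinates are hypothetical, and the real order also depends on the internal structure of the file.

```python
from dataclasses import dataclass

# Source kinds in the order described above: regular text objects first,
# then text converted to outlines, then text contained in images.
KIND_PRIORITY = {"text": 0, "outline": 1, "image": 2}


@dataclass
class Block:
    content: str
    kind: str    # "text", "outline", or "image"
    top: float   # distance from the top edge of the page
    left: float  # distance from the left edge of the page


def extraction_order(blocks):
    """Sort blocks by kind priority, then top to bottom, then left to right."""
    return sorted(blocks, key=lambda b: (KIND_PRIORITY[b.kind], b.top, b.left))


blocks = [
    Block("Heading (image)", "image", top=10, left=10),
    Block("Column 2", "text", top=50, left=300),
    Block("Paragraph (outline)", "outline", top=30, left=10),
    Block("Column 1", "text", top=50, left=10),
]

for block in extraction_order(blocks):
    print(block.content)
# Column 1
# Column 2
# Paragraph (outline)
# Heading (image)
```

In this model the heading ends up last in the text file even though it sits at the top of the page, because it exists only as an image.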

Figure 1: The dialog of the Fixup Extract Text (OCR)

To extract text from the Print Item, select the following options:

  • Language [1] – use the dropdown menu to select the language of the text to be extracted. The following options are available:
    • German – the text to be extracted was written in German.
    • English – the text to be extracted was written in English.
    • Italian – the text to be extracted was written in Italian.
    • French – the text to be extracted was written in French.
    • Spanish – the text to be extracted was written in Spanish.
    • Portuguese – the text to be extracted was written in Portuguese.
    • Swedish – the text to be extracted was written in Swedish.
    • Polish – the text to be extracted was written in Polish.
    • Russian – the text to be extracted was written in Russian.
    • Japanese – the text to be extracted was written in Japanese.
    • Korean – the text to be extracted was written in Korean.
    • Chinese – the text to be extracted was written in Chinese.
    • Variable Content [4] – for Variable Content [4], select the desired placeholder – Database Field or User-defined Field – from which the value for the language should be retrieved. The following values must be found in the selected field for the selection to be executed:
      • deu
      • eng
      • ita
      • fra
      • spa
      • por
      • swe
      • pol
      • rus
      • jpn
      • kor
      • chi_tra
  • Accuracy [2] – use the dropdown menu to select the resolution at which the image should be rendered for OCR recognition. The more ornate the writing, the higher the accuracy should be. However, keep in mind that a higher resolution also takes more time. The following options are available:
    • Fast (large text size) – the file is rendered at a resolution of 150 dpi. This option is suitable for large text sizes.
    • Normal (reading sizes) – the file is rendered at a resolution of 600 dpi. This option is suitable for reading sizes.
    • High Quality (small text size) – the file is rendered at a resolution of 1200 dpi. This option is suitable for small text sizes.
    • Variable Content [5] – for Variable Content [5], select the desired placeholder – Database Field or User-defined Field – from which the value for the accuracy should be retrieved. The following values must be found in the selected field for the selection to be executed:
      • 150
      • 600
      • 1200
  • Contrast [3] – use the dropdown menu to select the text contrast. In many cases, the higher the contrast, the more legible the text. The following options are available:
    • Standard – the text contrast remains unchanged.
    • High – the contrast of the image for OCR recognition is increased.
    • Extreme – the contrast of the image for OCR recognition is extremely amplified.
    • Variable Content [6] – for Variable Content [6], select the desired placeholder – Database Field or User-defined Field – from which the value for the contrast should be retrieved. The following values must be found in the selected field for the selection to be executed:
      • Standard
      • High
      • Extreme
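To get a feel for why a higher Accuracy setting takes more time, the following Python sketch calculates how many pixels a page contains at each of the three resolutions. The A4 page size (210 × 297 mm) and the function name rendered_pixels are assumptions for illustration; the actual processing time also depends on the content of the file.

```python
A4_WIDTH_MM, A4_HEIGHT_MM = 210, 297  # assumed page size for this example
MM_PER_INCH = 25.4


def rendered_pixels(width_mm, height_mm, dpi):
    """Return (width_px, height_px) of a page rendered at `dpi`."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


for dpi in (150, 600, 1200):
    width_px, height_px = rendered_pixels(A4_WIDTH_MM, A4_HEIGHT_MM, dpi)
    megapixels = width_px * height_px / 1e6
    print(f"{dpi:>4} dpi: {width_px} x {height_px} px = {megapixels:.1f} megapixels")
```

Because the pixel count grows with the square of the resolution, a page rendered at 600 dpi contains about 16 times as many pixels as at 150 dpi, and at 1200 dpi about 64 times as many.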

Figure 2: Left: Values present in the selected field for the option Language; Center: Values present in the selected field for the option Accuracy; Right: Values present in the selected field for the option Contrast
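Before a job runs, it can be useful to check that the fields used for Variable Content contain accepted values. The following Python sketch does this for a single record. The field names language, accuracy, and contrast, the function name invalid_fields, and the record layout are hypothetical. The language mapping pairs the language options and codes in the order they are listed above. The sketch compares values exactly, which is an assumption, since this article does not state whether the comparison is case-sensitive.

```python
# Language options and their codes, paired in the order listed in this article.
LANGUAGE_CODES = {
    "German": "deu", "English": "eng", "Italian": "ita", "French": "fra",
    "Spanish": "spa", "Portuguese": "por", "Swedish": "swe", "Polish": "pol",
    "Russian": "rus", "Japanese": "jpn", "Korean": "kor", "Chinese": "chi_tra",
}
ACCURACY_VALUES = {"150", "600", "1200"}
CONTRAST_VALUES = {"Standard", "High", "Extreme"}


def invalid_fields(record):
    """Return the names of fields whose value is missing or not an accepted value."""
    allowed = {
        "language": set(LANGUAGE_CODES.values()),
        "accuracy": ACCURACY_VALUES,
        "contrast": CONTRAST_VALUES,
    }
    # Values are compared as strings, so 600 and "600" are treated alike.
    return [name for name, values in allowed.items()
            if str(record.get(name)) not in values]


# Hypothetical record: "high" is rejected because this sketch compares exactly.
record = {"language": "eng", "accuracy": 600, "contrast": "high"}
print(invalid_fields(record))  # ['contrast']
```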

3.2. Before/After

To experiment with this function, refer to the example file "Sample_Extract text (OCR).pdf". There are three different fonts in the file.

  • The heading was created using the Bradley Hand Bold font, Bold style, and 8 pt font size. The heading was converted to an image.
  • The first paragraph was formatted using the Myriad Pro font, Regular and Italic styles, and a 5 pt font size. This part of the text was converted to paths, which means that it is now a vector.
  • The second paragraph was created using the Snell Roundhand font in Regular style and 6 pt font size. This is a normal text passage.  

After applying the Fixup with the values from Figure 1, the text file "Sample_Extract text_OCR.txt" is created. It can be downloaded in the tab Additional Data.

Figure 3: Left: Original file; Right: The tab Additional Data with the extracted text file

Article update: Workflow 1.20.1 – 04/2025
