Manually adding an OCR layer to scanned PDF

September 12, 2020

(Summary: How to add an invisible text layer to a PDF containing scanned images, using an OCR tool—in this case Google Cloud Vision API—that also gives the position of each recognized word.)

Background

Situation: A printed book has been scanned into images, possibly assembled into a single PDF. As each page is just a picture, no text has been associated with it, and it is not searchable. To get there, either we need the text to be typed in manually by someone, or we need OCR. Either way, many errors are likely, so when we open the PDF we’d still like to see the scanned image, except that it would be nice if the text can be selected and copied (and also searched).

There are many OCR tools that do this automatically (ABBYY FineReader, etc), but they are often not very reliable for the kinds of use-cases I am interested in (old books with Sanskrit text in Devanagari script, of uneven scan quality), and do not (I think?) allow editing the resulting text. One doesn’t feel in control of what’s happening.

Google has a Cloud Vision API (try it out here) that appears to provide not only the detected text, but also the position (bounding box) on the page of each detected word. This is great!

(Actually, as I was typing this, I learned that other OCR tools like tesseract can provide this too: there’s a data format called hOCR that includes this bounding box information. Anyway, I’m interested in the Google Vision API for now, which seems to have its own format.)

So what we’d like to do is to add this text layer to the scanned image, positioning each detected word at its original position on the page. It will still be the scanned image that is visible (good), but the text can be selected/copied/searched.

Exploration

(Just how I arrived at the code in the next “Results” section… can probably skip this section and go there directly….)

Google Search led to this question on Super User about an interesting problem where you want to take a “searchable scan” PDF and combine it with another PDF with higher quality images. The answer links to this question which mentions using ReportLab to write out the results to a PDF. I didn’t explore this further.

Meanwhile, another of the results is this question on Stack Overflow which has a useful discussion. Apparently, as far as the PDF format is concerned, usually OCR is not in another “layer” (despite everyone commonly saying so) but is simply text placed on the page (just like the image), simply in “invisible” rendering mode (mode 3 in section 9.3.6 of the PDF spec).

So all that’s needed for each page is to insert the scanned image, and also place each (invisible) word at a certain known position on the page.
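
To make the "invisible text" idea concrete, here is a minimal sketch of what the content-stream operators for one word might look like if a program wrote them directly instead of going through TeX. The function names, the font resource name `F1` and the default font size are my own assumptions for illustration. A real PDF would also need that font declared in the page resources, and non-Latin text needs a font with a suitable encoding, which this sketch ignores.

```python
def pdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_word_ops(word, x, y, page_height, font_size=12, font_name="F1"):
    """Return PDF content-stream operators that place `word` invisibly.

    (x, y) is given in image-style coordinates: origin at the top left,
    y growing downwards. PDF user space has its origin at the bottom left,
    so y is flipped. `font_name` is a hypothetical resource name.
    """
    y_pdf = page_height - y
    return (
        "BT "                              # begin a text object
        f"/{font_name} {font_size} Tf "    # select font and size
        "3 Tr "                            # text rendering mode 3: invisible
        f"1 0 0 1 {x} {y_pdf} Tm "         # text matrix: translate to (x, y_pdf)
        f"({pdf_escape(word)}) Tj "        # show the string
        "ET"                               # end the text object
    )


print(invisible_word_ops("Book", 172, 129, 1561))
```

The only part that makes the text invisible is `3 Tr`; everything else is ordinary text placement.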

Probably any decent PDF tool can do this: we may even be able to output the PDF format directly from a program! I know for sure that the family of tools from the TeX/LaTeX ecosystem can do things like this, so that’s what I looked at. (Asked a question here.)

Exploring in more detail:

JSON response

I used the image from this book page (specifically, this) — it happens to have two font sizes and some other typographical extranea. When I upload the jpg image to the Try it! page, the JSON response (which I imagine is how the response will look when I actually use the API, which I haven’t yet!) seems to get most things right.

The image dimensions are 1044 × 1561 pixels:

$ identify panchatantracoll00purn_0258.jpg
panchatantracoll00purn_0258.jpg JPEG 1044x1561 1044x1561+0+0 8-bit sRGB 216481B 0.000u 0:00.000

and the height and width somewhere in the response (see later below) agree:

        "height": 1561,
        "width": 1044

Each word has a position in the response, for example, “बहुना” (second Devanagari word on the page) is at:

    {
      "boundingPoly": {
        "vertices": [
          {
            "x": 173,
            "y": 213
          },
          {
            "x": 248,
            "y": 213
          },
          {
            "x": 248,
            "y": 262
          },
          {
            "x": 173,
            "y": 262
          }
        ]
      },
      "description": "बहुना"
    },

This means something like, for just that word:

                  (x=173)            (x=248)
(y=213)          (173, 213)         (248, 213)
(y=262)          (173, 262)         (248, 262)

Sometimes it may be slightly rotated, i.e. the rectangle may not be axis-parallel.

Is it even guaranteed to be a rectangle, i.e. 4 vertices in boundingPoly? Let’s hope so.
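
Rather than just hoping, a small helper can check the vertex count and cope with slightly rotated boxes. This is only a sketch: the function names are mine, and treating a missing "x" or "y" key as 0 is an assumption about how the response may be serialized, not something I have confirmed in the API documentation.

```python
import math


def vertex_xy(vertex):
    # Assumption: a coordinate key may be absent, in which case we treat it as 0.
    return vertex.get("x", 0), vertex.get("y", 0)


def axis_aligned_box(vertices):
    """Smallest axis-parallel rectangle (x, y, width, height) containing the polygon."""
    if len(vertices) != 4:
        raise ValueError("expected 4 vertices, got %d" % len(vertices))
    xs, ys = zip(*(vertex_xy(v) for v in vertices))
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def rotation_degrees(vertices):
    """Angle of the top edge (first vertex to second vertex) relative to the x axis."""
    (x1, y1), (x2, y2) = vertex_xy(vertices[0]), vertex_xy(vertices[1])
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


# The bounding polygon of the example word shown above
vertices = [
    {"x": 173, "y": 213},
    {"x": 248, "y": 213},
    {"x": 248, "y": 262},
    {"x": 173, "y": 262},
]
print(axis_aligned_box(vertices), rotation_degrees(vertices))
```

For the example word this gives a box at (173, 213) that is 75 wide and 49 high, with no rotation.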

TeX side of things

Inserting the scanned image is just:

\usepackage{pdfpages}

\includepdf[fitpaper=true]{⟨filename⟩}

For placing the text, the documentation of textpos (on texdoc) and the (excellent) TUGboat article show that the syntax is:

\usepackage[absolute]{textpos}

% ...

% If image dimensions are `nhoriz × nvert`, set that up:
\TPGrid{nhoriz}{nvert}

% ...

% `howWide` here can be any sufficiently large width,
% or even (with `textblock*`) an absolute dimen like \hsize?
\begin{textblock}{howWide}(hpos,vpos)
(word goes here)
\end{textblock}

(We may not need all these packages, or even TeX in the first place, but we can simplify later.)

With this, I was able to get a sample .tex file (only placed two words manually, using their positions from the JSON response) working:

\documentclass{article}

\usepackage{pdfpages}
\usepackage[absolute]{textpos}

\usepackage{fontspec}
\setmainfont[Color=red,Opacity=0.1]{Chandas}

\begin{document}
% For some reason, package textpos uses \paperheight and \paperwidth
\paperheight=1561bp
\paperwidth=1044bp
\TPGrid{1044}{1561}
\begin{textblock}{54}(172,129)Book \end{textblock}%
\begin{textblock}{75}(173,213)बहुना \end{textblock}%
\includepdf[fitpaper=true]{panchatantracoll00purn_0258.pdf}
\end{document}

I compiled with lualatex. (Am using TeX Live 2020, so this is the LuaHBTeX engine; else compiling with xelatex may be better.)

Results

With all this, writing a Python script is straightforward. As this is just a proof-of-concept and not a full-fledged flexible tool, I have only bothered to deal with a single specific page, and have not bothered setting font size or fine-tuning character position, etc.

#!/usr/bin/env python3

from string import Template
import sys
import json
header = r'''\documentclass{article}
\pagestyle{empty}

\usepackage{pdfpages}
\usepackage[absolute]{textpos}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{sanskrit}
\newfontfamily\devanagarifont[Script=Devanagari,Color=red,Opacity=$opacity]{Chandas}

\begin{document}
\paperwidth=${paperwidth}bp
\paperheight=${paperheight}bp
\pagewidth=${paperwidth}bp
\pageheight=${paperheight}bp
\TPGrid{$paperwidth}{$paperheight}
'''


def place_all_words(response, place_word):
    """Places each word from OCR response.
    `response` is the JSON response.
    `place_word` is a function to call."""
    for wordData in response["textAnnotations"][1:]:
        assert set(wordData.keys()) == set(["boundingPoly", "description"])
        word = wordData["description"]
        poly = wordData["boundingPoly"]
        assert set(poly.keys()) == set(["vertices"]), poly.keys()
        vertices = poly["vertices"]
        assert len(vertices) == 4
        x, y = vertices[0]["x"], vertices[0]["y"]
        xRight = vertices[1]["x"]
        width = xRight - x
        place_word(word, x, y, width)


def tex_place_word_fn(tex_write):
    def tex_place_word(word, x, y, width):
        tex_write(r'\begin{textblock}{%s}(%s,%s)' % (width, x, y))
        tex_write(word)
        # The space here is intentional: it may help PDF reader split word?
        tex_write(r' \end{textblock}')
    return tex_place_word


def tex_write_stdout(s):
    sys.stdout.write(s)


if __name__ == '__main__':
    # Hack, parametrize these filenames later
    filename = 'panchatantracoll00purn_0258'
    response = json.load(open('response.json'))
    # Another hack: [0,1) = text with that opacity; 1 = only text (no image)
    opacity = float(sys.argv[1])
    d = {
        'paperwidth': response["fullTextAnnotation"]["pages"][0]["width"],
        'paperheight': response["fullTextAnnotation"]["pages"][0]["height"],
        'opacity': opacity,
    }
    tex_write_stdout(Template(header).substitute(d))
    place_all_words(response, tex_place_word_fn(tex_write_stdout))
    if opacity < 1:
        tex_write_stdout(r'\includepdf[fitpaper=true]{%s.pdf}' % filename)
    tex_write_stdout('\n' + r'\end{document}' + '\n')

Run it like python3 tryocr.py 0.3 > tryocr.tex && lualatex tryocr.tex or whatever, with response.json as above and panchatantracoll00purn_0258.jpg as above (wrapped into a PDF with convert filename.jpg filename.pdf). We get results about as good as can be expected for a PoC:

Opacity 0.3 (for debugging): Version.A1-Opacity-0.3.pdf
Opacity 0 (the kind of PDF we want): Version.B1-Opacity-0.pdf
Opacity 1 (no scanned image, just text): Version.C1-Opacity-1.pdf

Of course for “real use” one would have to tweak font size and text position so that it’s actually more usable. Also, there are many overfull warnings (a few even at width * 10), which seems worth investigating. More importantly, the text layer also seems to have some Unicode issues, probably related to glyphs' position in their font rather than Unicode codepoint, so for many words searching doesn’t actually work—but I imagine they can be resolved.

Updates:

When the OCR-and-position data is in the hOCR format, there is an existing project (last updated 2011 but was apparently working well then) that can probably be used. Even if it cannot be used directly (e.g. convert data into hOCR format and invoke it), I should look into it; it might have some relevant ideas.
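
As a rough idea of what "convert data into hOCR format" could involve, here is a sketch that turns one word from the Vision response into an hOCR-style word element, where the bounding box goes into the title attribute as "bbox x0 y0 x1 y1". The function name and the id scheme are assumptions of mine, and a complete hOCR file would also need the enclosing page and line elements, which are left out here.

```python
from html import escape


def hocr_word(word, vertices, word_id):
    """Return an hOCR-style <span> for one word and its bounding polygon.

    The (possibly rotated) polygon is reduced to an axis-parallel box,
    given as left, top, right, bottom in image pixels.
    """
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    bbox = "bbox %d %d %d %d" % (min(xs), min(ys), max(xs), max(ys))
    return "<span class='ocrx_word' id='%s' title='%s'>%s</span>" % (
        word_id, bbox, escape(word))


vertices = [
    {"x": 173, "y": 213},
    {"x": 248, "y": 213},
    {"x": 248, "y": 262},
    {"x": 173, "y": 262},
]
print(hocr_word("बहुना", vertices, "word_1_2"))
```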

For the issue with the incorrect ActualText, a solution that normally works is to just use Renderer=HarfBuzz with lualatex, but that doesn’t work with everyshi. Instead, we can set ActualText manually. Things work better now:

Opacity 0.3 (for debugging): Version.A2-Opacity-0.3.pdf
Opacity 0 (the kind of PDF we want): Version.B2-Opacity-0.pdf
Opacity 1 (no scanned image, just text): Version.C2-Opacity-1.pdf
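
To see what setting ActualText manually amounts to, here is a small Python sketch of the same kind of conversion: the word is encoded as UTF-16BE and written out as hex, which is then attached to a marked-content span. The function names are mine, and the leading FEFF byte-order mark reflects my understanding of how PDF text strings in UTF-16BE are marked; treat the exact bytes that accsupp emits as unverified here.

```python
def actual_text_hex(word):
    """Hex-encode `word` as a UTF-16BE string with a leading byte-order mark."""
    return "FEFF" + word.encode("utf-16-be").hex().upper()


def span_with_actual_text(word, inner_ops):
    """Wrap content-stream operators in a marked-content span carrying ActualText.

    `inner_ops` is whatever operators actually show the glyphs; a PDF reader
    that honours ActualText uses `word` for copying and searching instead.
    """
    return "/Span << /ActualText <%s> >> BDC\n%s\nEMC" % (
        actual_text_hex(word), inner_ops)


word = "\u092c\u0939\u0941\u0928\u093e"  # the example word from the page
print(actual_text_hex(word))
print(span_with_actual_text(word, "BT ... ET"))
```

The point is that the replacement text is independent of which glyphs the font happens to use, which is why it sidesteps the glyph-position problem mentioned earlier.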

Code:

#!/usr/bin/env python3

from string import Template
import sys
import json
header = r'''\documentclass{article}
\pagestyle{empty}

\usepackage{pdfpages}
\usepackage[absolute,quiet]{textpos}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{sanskrit}
\newfontfamily\devanagarifont[Renderer=HarfBuzz,Script=Devanagari,Color=red,Opacity=$opacity]{Adishila}
\setlength{\parindent}{0pt}

\usepackage{accsupp}
\usepackage{tikz}
\usepackage{stringenc}
\usepackage{pdfescape}
\makeatletter
\newcommand*{\BeginAccSuppUnicode}[1]{%
  \EdefSanitize\asu@str{#1}%
  \edef\asu@str{%
    \expandafter\expandafter\expandafter\asu@ToSpaceOther
    \expandafter\asu@str\space\@nil
  }%
  \expandafter\let\expandafter\asu@str\expandafter\@empty
  \expandafter\asu@ToHexUC\asu@str\relax
  \EdefUnescapeHex{\asu@str}{\asu@str}%
  \StringEncodingConvert{\asu@str}{\asu@str}{utf32be}{utf16be}%
  \EdefEscapeHex{\asu@str}{\asu@str}%
  \BeginAccSupp{%
    unicode,%
    method=hex,%
    ActualText=\asu@str
  }%
}
\begingroup
  \lccode`\9=`\ %
\lowercase{\endgroup
  \def\asu@SpaceOther{9}%
}
\def\asu@ToSpaceOther#1 #2\@nil{%
  #1%
  \ifx\\#2\\%
    \expandafter\@gobble
  \else
    \asu@SpaceOther
    \expandafter\@firstofone
  \fi
  {\asu@ToSpaceOther#2\@nil}%
}
\def\asu@ToHexUC#1{%
  \ifx#1\relax
  \else
    \pgfmathHex{\the\numexpr`#1+"10000000\relax}%
    \edef\asu@str{%
      \asu@str
      0\expandafter\@gobble\pgfmathresult
    }%
    \expandafter\asu@ToHexUC
  \fi
}
\makeatother

\begin{document}
\paperwidth=${paperwidth}bp
\paperheight=${paperheight}bp
\pagewidth=${paperwidth}bp
\pageheight=${paperheight}bp
\TPGrid{$paperwidth}{$paperheight}
'''

def place_all_words(response, place_word):
    """Places each word from OCR response.
    `response` is the JSON response.
    `place_word` is a function to call."""
    for wordData in response["textAnnotations"][1:]:
        assert set(wordData.keys()) == set(["boundingPoly", "description"])
        word = wordData["description"]
        poly = wordData["boundingPoly"]
        assert set(poly.keys()) == set(["vertices"]), poly.keys()
        vertices = poly["vertices"]
        assert len(vertices) == 4
        topLeft, topRight, bottomRight, bottomLeft = vertices
        xLeft = bottomLeft["x"]
        yTop = topLeft["y"]
        xRight = topRight["x"]
        yBottom = bottomRight["y"]
        width = xRight - xLeft
        height = yBottom - yTop
        place_word(word, xLeft, yBottom, width, height)

def tex_place_word_fn(tex_write):
    def tex_place_word(word, x, y, width, height):
        # tex_write(r'\begin{textblock}{%s}[0,0](%s,%s){\BeginAccSuppUnicode{%s}\resizebox{%s\TPHorizModule}{!}{%s}\EndAccSupp{}}\end{textblock}' % (width, x, y, word, width, word))
        tex_write(r'\begin{textblock}{%s}[0,1](%s,%s){\BeginAccSuppUnicode{%s}\resizebox{%s\TPHorizModule}{%s\TPVertModule}{%s}\EndAccSupp{}}\end{textblock}' % (width, x, y, word, width, height*0.6, word))
    return tex_place_word

def tex_write_stdout(s):
    sys.stdout.write(s)

if __name__ == '__main__':
    # Hack, parametrize these filenames later
    filename = 'panchatantracoll00purn_0258'
    response = json.load(open('response.json'))
    # Another hack: [0,1) = text with that opacity; 1 = only text (no image)
    opacity = float(sys.argv[1])
    d = {
        'paperwidth': response["fullTextAnnotation"]["pages"][0]["width"],
        'paperheight': response["fullTextAnnotation"]["pages"][0]["height"],
        'opacity': opacity,
    }
    tex_write_stdout(Template(header).substitute(d))
    if opacity not in [0, 1]:
        tex_write_stdout(r'\TPoptions{showboxes=true}')
    place_all_words(response, tex_place_word_fn(tex_write_stdout))
    if opacity < 1:
        tex_write_stdout(r'\includepdf[fitpaper=true]{%s.pdf}' % filename)
    tex_write_stdout('\n' + r'\end{document}' + '\n')

We can also draw the original bounding boxes to highlight the regions that the API is returning. There are various ways to do that (see answers here, here, here) and I just picked the first thing I tried. I really should be using version control instead of posting incremental updates, but anyway, the present version of the code is here:

#!/usr/bin/env python3

from string import Template
import sys
import json
header = r'''\documentclass{article}
\pagestyle{empty}

\usepackage{pdfpages}
\usepackage[absolute,quiet]{textpos}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{sanskrit}
\newfontfamily\devanagarifont[Renderer=HarfBuzz,Script=Devanagari,Color=red,Opacity=$opacity]{Adishila}
\setlength{\parindent}{0pt}

\usepackage{accsupp}
\usepackage{tikz}
\usepackage{geometry}
\usetikzlibrary{positioning,calc}
\usepackage{stringenc}
\usepackage{pdfescape}
\makeatletter
\newcommand*{\BeginAccSuppUnicode}[1]{%
  \EdefSanitize\asu@str{#1}%
  \edef\asu@str{%
    \expandafter\expandafter\expandafter\asu@ToSpaceOther
    \expandafter\asu@str\space\@nil
  }%
  \expandafter\let\expandafter\asu@str\expandafter\@empty
  \expandafter\asu@ToHexUC\asu@str\relax
  \EdefUnescapeHex{\asu@str}{\asu@str}%
  \StringEncodingConvert{\asu@str}{\asu@str}{utf32be}{utf16be}%
  \EdefEscapeHex{\asu@str}{\asu@str}%
  \BeginAccSupp{%
    unicode,%
    method=hex,%
    ActualText=\asu@str
  }%
}
\begingroup
  \lccode`\9=`\ %
\lowercase{\endgroup
  \def\asu@SpaceOther{9}%
}
\def\asu@ToSpaceOther#1 #2\@nil{%
  #1%
  \ifx\\#2\\%
    \expandafter\@gobble
  \else
    \asu@SpaceOther
    \expandafter\@firstofone
  \fi
  {\asu@ToSpaceOther#2\@nil}%
}
\def\asu@ToHexUC#1{%
  \ifx#1\relax
  \else
    \pgfmathHex{\the\numexpr`#1+"10000000\relax}%
    \edef\asu@str{%
      \asu@str
      0\expandafter\@gobble\pgfmathresult
    }%
    \expandafter\asu@ToHexUC
  \fi
}
% \makeatother is deliberately left commented out: \g@addto@macro is used in the document body
%\makeatother

\begin{document}
\thispagestyle{empty}
\paperwidth=${paperwidth}bp
\paperheight=${paperheight}bp
\pagewidth=${paperwidth}bp
\pageheight=${paperheight}bp
\TPGrid{$paperwidth}{$paperheight}
\def\mypagecommand{}
'''

def place_all_words(response, place_word):
    """Places each word from OCR response.
    `response` is the JSON response.
    `place_word` is a function to call."""
    for wordData in response["textAnnotations"][1:]:
        assert set(wordData.keys()) == set(["boundingPoly", "description"])
        word = wordData["description"]
        poly = wordData["boundingPoly"]
        assert set(poly.keys()) == set(["vertices"]), poly.keys()
        vertices = poly["vertices"]
        assert len(vertices) == 4
        place_word(word, vertices)

def tex_place_word_fn(tex_write):
    def tex_place_word(word, vertices):
        p1, p2, p3, p4 = vertices
        # (x1, y1)       (x2, y2)
        # (x4, y4)       (x3, y3)
        x1, y1 = p1["x"], p1["y"]
        x2, y2 = p2["x"], p2["y"]
        x3, y3 = p3["x"], p3["y"]
        x4, y4 = p4["x"], p4["y"]
        def avg(a, b): return (a + b) / 2.0
        width = avg(x2 - x1, x3 - x4)
        height = avg(y4 - y1, y3 - y2)
        bbox = r'''%
        \g@addto@macro\mypagecommand{
        \begin{tikzpicture}[remember picture,overlay]
        \draw[blue, thick] ($$(current page.north west)+($x1 bp,-$y1 bp)$$) --
                                   ($$(current page.north west)+($x2 bp,-$y2 bp)$$) --
                                   ($$(current page.north west)+($x3 bp,-$y3 bp)$$) --
                                   ($$(current page.north west)+($x4 bp,-$y4 bp)$$) -- cycle;
        \end{tikzpicture}}'''
        wordbox = r'''
        \begin{textblock}{$width}[0,0]($x1,$y1){%
            \BeginAccSuppUnicode{$word }%
                \resizebox{$width\TPHorizModule}{$height\TPVertModule}{$word}%
            \EndAccSupp{}}%
        \end{textblock}
        '''
        tex_write(Template(bbox).substitute(locals()))
        tex_write(Template(wordbox).substitute(locals()))
    return tex_place_word

def tex_write_stdout(s):
    sys.stdout.write(s)

if __name__ == '__main__':
    # Hack, parametrize these filenames later
    filename = 'panchatantracoll00purn_0258'
    response = json.load(open('response.json'))
    # Another hack: [0,1) = text with that opacity; 1 = only text (no image)
    opacity = float(sys.argv[1])
    d = {
        'paperwidth': response["fullTextAnnotation"]["pages"][0]["width"],
        'paperheight': response["fullTextAnnotation"]["pages"][0]["height"],
        'opacity': opacity,
    }
    tex_write_stdout(Template(header).substitute(d))
    if opacity not in [0, 1]:
        tex_write_stdout(r'\TPoptions{showboxes=true}')
    place_all_words(response, tex_place_word_fn(tex_write_stdout))
    if opacity < 1:
        tex_write_stdout(r'\includepdf[fitpaper=true,pagecommand=\mypagecommand]{%s.pdf}' % filename)
    tex_write_stdout('\n' + r'\end{document}' + '\n')

We can use the bounding boxes to generate an HTML page that uses the scanned images of words in paragraphs! Here’s a demo and here’s a 20-second recording of how it reflows when the browser is resized. Code to extract (shells out to magick i.e. ImageMagick) is this:

#!/usr/bin/env python3

import json
import subprocess

def extract(parts, vertices):
    """Crop the region bounded by `vertices` out of image.jpg into its own file.
    `parts` is the list of indices (block, paragraph, word) identifying the region."""
    output_filename = f"extracted-{len(parts)}-{'.'.join(str(n) for n in parts)}.jpg"
    assert len(vertices) == 4
    # Vertex order: top-left, top-right, bottom-right, bottom-left
    x = [v["x"] for v in vertices]
    y = [v["y"] for v in vertices]
    # Average opposite corners, in case the box is slightly rotated
    smallerX = (x[0] + x[3]) / 2
    smallerY = (y[0] + y[1]) / 2
    width = (x[1] + x[2]) / 2 - smallerX
    height = (y[2] + y[3]) / 2 - smallerY
    args = ["magick", "-extract",
            f"{width}x{height}+{smallerX}+{smallerY}", "image.jpg", output_filename]
    subprocess.run(args)

if __name__ == '__main__':
    # Hack, parametrize these filenames later
    response = json.load(open('response.json'))

    page = response["fullTextAnnotation"]["pages"][0]
    for (b, block) in enumerate(page["blocks"]):
        assert block["blockType"] == "TEXT", block["blockType"]
        extract([b], block["boundingBox"]["vertices"])
        for p, paragraph in enumerate(block["paragraphs"]):
            extract([b, p], paragraph["boundingBox"]["vertices"])
            for w, word in enumerate(paragraph["words"]):
                extract([b, p, w], word["boundingBox"]["vertices"])
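
The HTML side is not shown above, so here is a minimal sketch of how such a page might be generated from the same response, assuming the word images were written with the file naming used by extract(). The function names are mine, and using the word's symbols for the alt text is an assumption about what one would want there. Because the images are inline elements inside a paragraph, the browser reflows them like words.

```python
from html import escape


def word_image_name(b, p, w):
    # Mirrors the naming scheme of extract() for word-level crops (3 indices).
    return f"extracted-3-{b}.{p}.{w}.jpg"


def page_to_html(page):
    """Build an HTML page with one <p> per paragraph and one <img> per word."""
    parts = ["<!DOCTYPE html>", "<html><body>"]
    for b, block in enumerate(page["blocks"]):
        for p, paragraph in enumerate(block["paragraphs"]):
            imgs = []
            for w, word in enumerate(paragraph["words"]):
                # The recognized text of a word is spread over its symbols.
                text = "".join(s["text"] for s in word.get("symbols", []))
                imgs.append('<img src="%s" alt="%s">'
                            % (word_image_name(b, p, w), escape(text)))
            parts.append("<p>" + " ".join(imgs) + "</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Tiny hand-made stand-in for response["fullTextAnnotation"]["pages"][0]
sample_page = {"blocks": [{"paragraphs": [{"words": [
    {"symbols": [{"text": "a"}]},
    {"symbols": [{"text": "b"}, {"text": "c"}]},
]}]}]}
print(page_to_html(sample_page))
```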