OCR to Add Text to File

I have some PDF files that are just scans, with no text in them (each page is just an image, so there is nothing to index, search, or copy).

I thought the wonderful Linux world would have software to OCR such a file and convert the “image” to text (so it can be indexed, searched, and copied like a PDF created by a word processor or by printing a web page to PDF).
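
A quick way to confirm that a PDF really has no text layer, as a rough sketch: it assumes `pdftotext` from poppler is installed, and `pdf_has_text` and `scan.pdf` are just placeholder names for illustration.

```shell
# Report whether a PDF already contains extractable text.
# Assumption: pdftotext (part of poppler) is on PATH.
# pdf_has_text is a hypothetical helper name, not part of any tool.
pdf_has_text() {
    pdf="$1"
    # "-" sends the extracted text to stdout; strip all whitespace
    # (including page-break form feeds) before testing for content.
    text="$(pdftotext "$pdf" - 2>/dev/null | tr -d '[:space:]')"
    [ -n "$text" ]
}

# Usage (scan.pdf is a placeholder file name):
#   pdf_has_text scan.pdf && echo "has a text layer" || echo "image only"
```

A scanned, image-only PDF should come out as "image only"; after a successful OCR pass the same check should report a text layer.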

I searched and found an answer here

The app is OCRmyPDF

I do not know if:

  • this is an efficient OCR
  • it can handle Arabic language.
  • it will write the text to the PDF (as required above)

Of course, I want the OCRed text added to the original/source file rather than written to an extra file.

Anybody using this software? Any other suggestions?

As I am a bit … mmm… I hope that, with this software installed, there might be a way to right-click the file and get the job done!

I would rather ask the experts or someone experienced before … messing things up.

Your feedback is highly appreciated.
Thank you.

Why not try it and see if it works?

It clearly should be able to do this based on the description.

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

Generates a searchable PDF/A file from a regular PDF

Here is a list of the languages it supports. You need to install the correct one for you.

Language packages
tesseract-data-yor
tesseract-data-yid
tesseract-data-vie
tesseract-data-uzb_cyrl
tesseract-data-uzb
tesseract-data-urd
tesseract-data-ukr
tesseract-data-uig
tesseract-data-tur
tesseract-data-ton
tesseract-data-tir
tesseract-data-tha
tesseract-data-tgl
tesseract-data-tgk
tesseract-data-tel
tesseract-data-tat
tesseract-data-tam
tesseract-data-syr
tesseract-data-swe
tesseract-data-swa
tesseract-data-sun
tesseract-data-srp_latn
tesseract-data-srp
tesseract-data-sqi
tesseract-data-spa_old
tesseract-data-spa
tesseract-data-snd
tesseract-data-slv
tesseract-data-slk_frak
tesseract-data-slk
tesseract-data-sin
tesseract-data-san
tesseract-data-rus
tesseract-data-ron
tesseract-data-que
tesseract-data-pus
tesseract-data-por
tesseract-data-pol
tesseract-data-pan
tesseract-data-osd
tesseract-data-ori
tesseract-data-oci
tesseract-data-nor
tesseract-data-nld
tesseract-data-nep
tesseract-data-mya
tesseract-data-msa
tesseract-data-mri
tesseract-data-mon
tesseract-data-mlt
tesseract-data-mkd
tesseract-data-mar
tesseract-data-mal
tesseract-data-ltz
tesseract-data-lit
tesseract-data-lav
tesseract-data-lat
tesseract-data-lao
tesseract-data-kor_vert
tesseract-data-kor
tesseract-data-kmr
tesseract-data-kir
tesseract-data-khm
tesseract-data-kaz
tesseract-data-kat_old
tesseract-data-kat
tesseract-data-kan
tesseract-data-jpn_vert
tesseract-data-jpn
tesseract-data-jav
tesseract-data-ita_old
tesseract-data-ita
tesseract-data-isl
tesseract-data-ind
tesseract-data-iku
tesseract-data-hye
tesseract-data-hun
tesseract-data-hrv
tesseract-data-hin
tesseract-data-heb
tesseract-data-hat
tesseract-data-guj
tesseract-data-grc
tesseract-data-glg
tesseract-data-gle
tesseract-data-gla
tesseract-data-fry
tesseract-data-frm
tesseract-data-frk
tesseract-data-fra
tesseract-data-fin
tesseract-data-fil
tesseract-data-fas
tesseract-data-fao
tesseract-data-eus
tesseract-data-est
tesseract-data-equ
tesseract-data-epo
tesseract-data-enm
tesseract-data-eng
tesseract-data-ell
tesseract-data-dzo
tesseract-data-div
tesseract-data-deu_frak
tesseract-data-deu
tesseract-data-dan_frak
tesseract-data-dan
tesseract-data-cym
tesseract-data-cos
tesseract-data-chr
tesseract-data-chi_tra
tesseract-data-chi_sim
tesseract-data-ces
tesseract-data-ceb
tesseract-data-cat
tesseract-data-bul
tesseract-data-bre
tesseract-data-bos
tesseract-data-bod
tesseract-data-ben
tesseract-data-bel
tesseract-data-aze_cyrl
tesseract-data-aze
tesseract-data-asm
tesseract-data-ara
tesseract-data-amh
tesseract-data-afr
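
To see whether the language you need is actually installed, something like the following might help. It is only a sketch: it assumes the `tesseract` binary is on PATH, and `has_tess_lang` is a made-up helper name.

```shell
# Check whether a Tesseract language pack is installed.
# Assumption: the tesseract binary is on PATH.
# has_tess_lang is a hypothetical helper name.
has_tess_lang() {
    # --list-langs prints a header line followed by one language code
    # per line; grep -x only matches a line that is exactly the code.
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Usage: has_tess_lang ara && echo "Arabic data found"
```

For Arabic, the code to look for is `ara`, which is what the tesseract-data-ara package in the list above provides.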

To check which OCR tools are available,

I ran yay ocr

and found the following relevant results:

4 community/gocr 0.52-2 (223.8 KiB 485.8 KiB) 
    OCR (Optical Character Recognition) program, which converts scanned images of text back to text files

And there is Tesseract OCR (https://github.com/tesseract-ocr), for which the front-ends are:

3 community/gimagereader-qt 3.4.0-4 (551.0 KiB 1.5 MiB) 
    Qt front-end to tesseract-ocr
2 community/gimagereader-gtk 3.4.0-4 (540.4 KiB 1.6 MiB) 
    Gtk front-end to tesseract-ocr

Another is:

6 community/gscan2pdf 2.12.6-2 (425.4 KiB 2.1 MiB) 
    A GUI with OCR capability to produce PDFs or DjVus from scanned documents

The project the OP mentioned is using Tesseract OCR to do the actual OCR.

I was not aware of that; I thought ocrmypdf and tesseract were different applications!

I think what the OP is asking for is more than just plain OCR. If I understand correctly, they are looking for the result of the OCR to be inserted into the original PDF.

Won’t gocr serve the purpose of OCRing documents, as mentioned in its description?

I don’t think that is the goal. The goal is to take the text inside the PDF, OCR it and re-insert it into the PDF again so that the PDF itself can be indexed.

Wow! It is so tough for me to understand!

This is what I understood. For this, I would have converted the PDF into an editable format if its text is already selectable; otherwise I would have OCRed it first, then inserted the text and converted it back to PDF.

This I didn’t understand.

That would damage the PDF.

What the tool the OP referenced does is leave the source alone and add a layer to the PDF containing the text as an overlay.
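
As a rough sketch of how that could look for the OP's case (text added to the original file, Arabic language): this assumes `ocrmypdf` and the matching tesseract language data are installed, and `ocr_in_place` and `scan.pdf` are hypothetical names. It writes to a temporary file first so that a failed run leaves the original untouched.

```shell
# OCR a scanned PDF and replace the original only if OCR succeeds.
# Assumptions: ocrmypdf and the needed tesseract language data are
# installed. ocr_in_place is a hypothetical helper name.
ocr_in_place() {
    src="$1"
    lang="${2:-eng}"                      # tesseract language code, e.g. ara
    tmp="${src%.pdf}.ocr-tmp.$$.pdf"      # temporary output next to the source

    # ocrmypdf takes the input file, then the output file; -l selects the language.
    if ocrmypdf -l "$lang" "$src" "$tmp"; then
        mv -- "$tmp" "$src"               # overwrite the original with the OCRed copy
    else
        rm -f -- "$tmp"                   # clean up and keep the original as-is
        return 1
    fi
}

# Usage (scan.pdf is a placeholder file name):
#   ocr_in_place scan.pdf ara
```

The page images stay as they are; only the searchable text layer is added. Keeping a backup of the scans before the first run is probably still a good idea.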


I was not aware of it, thanks !!

What does this mean?

Take a look at this:


This one ?


Lots of PDF tools are mentioned here :

https://wiki.archlinux.org/title/PDF,_PS_and_DjVu

No, this one:

WOW!!!

An incredible community as usual! All this in just a few minutes!

This is exactly what I am looking for!

I just thought that discussing this here with experts would add to my knowledge. One piece of “added knowledge”: I knew about layers in images only, but now I know that PDFs have “layers” too. This might open more doors for me.

I will try to install it and give you feedback.

Thank you all.


OK! I tried

[limo@limo ~]$ yay -Syyu ocrmypdf

I got lots of things but at the end:

   from jinja2 import environmentfilter
ImportError: cannot import name 'environmentfilter' from 'jinja2' (/usr/lib/python3.10/site-packages/jinja2/__init__.py)
==> ERROR: A failure occurred in build().
    Aborting...
 -> error making: python-coloredlogs

I don’t have a clue what this means!
Though

[limo@limo ~]$ yay -Syyu tesseract-data-ara

went fine I believe.
Trying ocrmypdf filename.pdf gave:

bash: ocrmypdf: command not found

I noticed:

[Aur:2]  python-coloredlogs-15.0.1-3  ocrmypdf-13.6.0-1

  2 python-coloredlogs               (Build Files Exist)
  1 ocrmypdf                         (Build Files Exist)
==> Packages to cleanBuild?
==> [N]one [A]ll [Ab]ort [I]nstalled [No]tInstalled or (1 2 3, 1-3, ^4)

Should I perhaps select [A] instead of the default [N]one?

Because it failed to install python-coloredlogs

You can try that.

EDIT: That package builds and installs fine for me.

I selected [A] one time and [^4] another time…
Still the same error!
I will try installing it separately!

UPDATE:
Tried

[limo@limo ~]$ yay -Syyu python-coloredlogs 

Still

ImportError: cannot import name 'environmentfilter' from 'jinja2' (/usr/lib/python3.10/site-packages/jinja2/__init__.py)
==> ERROR: A failure occurred in build().
    Aborting...
 -> error making: python-coloredlogs
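
One thing worth checking here: the traceback shows jinja2 being found under /usr/lib/python3.10/site-packages, so the module is there and it is the name `environmentfilter` that cannot be imported. Jinja2 3.1 removed `environmentfilter` in favour of `pass_environment`, so a build that still imports the old name will fail against a newer Jinja2. The following is only a diagnostic sketch (it assumes `python3` is on PATH, and `check_environmentfilter` is a made-up helper name); it does not fix the package build.

```shell
# Report the installed Jinja2 version and whether it still exports
# the old `environmentfilter` name.
# Assumption: python3 is on PATH. check_environmentfilter is a
# hypothetical helper name.
check_environmentfilter() {
    python3 - <<'PY'
try:
    import jinja2
except ImportError:
    print("jinja2: not installed")
else:
    version = getattr(jinja2, "__version__", "unknown")
    state = "present" if hasattr(jinja2, "environmentfilter") else "missing"
    print(f"jinja2 {version}: environmentfilter {state}")
PY
}

# Usage: check_environmentfilter
```

If it reports "missing", the installed Jinja2 is newer than what the failing build step expects.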

By default jinja2 is not available. You need jinja2; I don’t know why it’s not in the dependencies.

Edit: 126 providers are available for tessdata.

[pycrk@ovoEnd ~]$ yay -S ocrmypdf
:: There are 126 providers available for tessdata:
:: Repository community
    1) tesseract-data-afr 2) tesseract-data-amh 3) tesseract-data-ara 4) tesseract-data-asm 5) tesseract-data-aze 6) tesseract-data-aze_cyrl 7) tesseract-data-bel 8) tesseract-data-ben 9) tesseract-data-bod 10)...