Batch PDF OCR conversion?

sentinel166 · February 2, 2022, 1:55pm

Hi,
This is not a bug, but I don’t know where I’m supposed to post things like this.
I want to convert PDFs into searchable PDFs, with OCR that does an okay-ish job. I can do it with a GUI, but how do you do it from the CLI, in an automated way? I was surprised to see it’s not obvious, which seems ridiculous… Is there a way other than 1) create a temporary folder, 2) extract all the images, 3) OCR them, 4) fuse them back together?
I’m not very skilled at scripting.
More precisely, I can learn things, but not when they’re needlessly complicated. In the same way, Ada feels much simpler to me than C, even though it’s more complete, because it’s rational.

Sorry if someone has already asked.

HCHR · February 2, 2022, 2:47pm

Not always, but most of the time the pdf has the text in it.

You can test with:
pdftotext -nopgbrk -layout -eol unix ToText.pdf
It will create a file ToText.txt containing whatever text is in the PDF.

You might need to install “poppler” first

sentinel166 · February 2, 2022, 3:15pm

Very often I work with old scanned studies, and these don’t always include the text, so it would be useful to have a command that does it automatically… and doesn’t do it if there’s already some text.
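The "don't do it if there's already some text" check can be sketched with the pdftotext command from the post above. This is only a heuristic: it assumes a PDF needs OCR when pdftotext extracts no letters or digits at all, so a scan that carries a stray stamp or watermark as text would be skipped wrongly. The function name has_text is made up for the example, and the script is written in POSIX sh rather than fish so that it can be saved to a file and run with sh from any shell.

```shell
#!/bin/sh
# has_text FILE: succeed if pdftotext finds at least one letter or digit.
# (has_text is a made-up helper name, not part of poppler.)
has_text() {
    # "-" sends the extracted text to standard output instead of a .txt file
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Print every PDF given as an argument that still seems to need OCR.
for pdf in "$@"; do
    has_text "$pdf" || printf '%s\n' "$pdf"
done
```

Saved as, say, needs-ocr.sh (a hypothetical file name), it could be run from fish as: sh needs-ocr.sh *.pdf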

s4ndm4n · February 2, 2022, 3:20pm

Did you try this: https://www.howtogeek.com/682389/how-to-do-ocr-from-the-linux-command-line-using-tesseract/

sentinel166 · February 2, 2022, 5:37pm

Thank you both,
It would have been simpler with a command that takes an hOCR file and merges it with a pre-existing PDF, but I can’t find one.

It would look like this:

for file in *.pdf
pdftoppm $file -jpeg $file
mv *.jpg essai/
cd essai;
tesseract "$file".jpg $file_ -l eng -c hocr -progress
hocr-pdf . "$file"_hocr.pdf
rm *.jpg
cd ..
end

But it doesn’t work: it does precisely nothing, because I can’t learn shell. Can someone who knows fish tell me what’s wrong?
Last worry on my side:
pdftoppm asks for a resolution, but I want a purely lossless conversion. How do I do that?
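A few things in the loop above look like they would stop it from working, as far as I can tell:
- pdftoppm names its output root-1.jpg, root-2.jpg and so on, so "$file".jpg never exists.
- $file_ is read as a variable named file_, which is empty.
- hocr is a tesseract config name that goes at the end of the command, not a value for -c, and -progress is not an option I know of.
- hocr-pdf, as far as I know, prints the PDF to standard output, so it needs a redirect.
- The essai folder has to exist before anything is moved into it.

Below is a minimal sketch of the same four steps, under some assumptions: it is POSIX sh rather than fish (so it can be saved to a file and run with sh), ocr_pdf is a made-up function name, mktemp is assumed to be available, and 300 DPI is an arbitrary choice. On the resolution question: rendering a page to an image always needs some resolution (pdftoppm defaults to 150 DPI), and JPEG is lossy, so this route is not lossless. poppler's pdfimages with -all extracts the embedded scans as stored, which is closer to lossless, but hocr-pdf expects .jpg files, so that is not used here.

```shell
#!/bin/sh
# ocr_pdf FILE.pdf: write FILE_hocr.pdf, a copy with an OCR text layer.
# (ocr_pdf is a hypothetical helper name chosen for this sketch.)
ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}                  # strip the .pdf extension
    tmp=$(mktemp -d) || return 1      # scratch directory for page images

    # Render pages as $tmp/page-N.jpg. -r is the resolution in DPI;
    # 300 is an assumed value, not something read from the PDF.
    pdftoppm -jpeg -r 300 "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }

    # OCR every page. The output base has no extension, so page-N.jpg
    # gets a matching page-N.hocr; "hocr" is a config name placed last.
    for img in "$tmp"/*.jpg; do
        [ -e "$img" ] || continue
        tesseract "$img" "${img%.jpg}" -l eng hocr
    done

    # hocr-pdf pairs each .jpg with its .hocr and prints the PDF to stdout.
    hocr-pdf "$tmp" > "${base}_hocr.pdf"
    rm -rf "$tmp"
}

# Process every PDF given on the command line.
for f in "$@"; do
    ocr_pdf "$f"
done
```

Saved as, say, ocr-batch.sh (a hypothetical file name), it could be run from fish as: sh ocr-batch.sh *.pdf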