
Making sharp and compact PDFs from scans

I've found that document scanner software rarely generates good PDF files when scanning plain text documents. The PDFs are often large, show compression artifacts, and may not be sharp or high-contrast enough for a clear reprint. The issue is that the scans are stored in the PDF with lossy, JPEG-like compression, which is a poor choice for the text and sharp line art typically found in documents.

The PDF format does offer better ways to compress material like this. In this post I'll compare different PDF generation methods and provide a couple of scripts (Windows batch files) that I use personally to make PDF files from scans.

Background

Some time ago I discovered that the humble tool Simple Scan, available in many Linux distributions, generated more compact and sharper PDFs than most other scanner applications. Digging a little deeper, I realized that instead of encoding the scan bitmap with a JPEG algorithm, Simple Scan uses lossless "flate" compression on a bitmap with a reduced number of bits per pixel. For sharp line art like text, this is a far better encoding method than JPEG.

Simple Scan only does this in the grayscale Text mode, however, and there the image is reduced quite aggressively to only 2 bits per pixel, meaning black, white and two levels of gray. This sacrifices all color information, but the two levels of gray allow the image to keep just a bit of anti-aliasing, which makes it look a lot better than pure black and white.
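To illustrate why a reduced bit depth combined with flate compression works so well on text, here is a minimal Python sketch. It is not Simple Scan's code: the synthetic "scan" data, the simple bit-shift quantization and all function names are assumptions made for the example. It maps noisy 8-bit gray pixels to four levels, packs four pixels into each byte and compares the zlib (flate) output sizes.

```python
import random
import zlib


def quantize_2bpp(pixels):
    """Map 8-bit gray values (0-255) to four levels (0-3) by keeping the top 2 bits."""
    return [p >> 6 for p in pixels]


def pack_2bpp(levels):
    """Pack four 2-bit values into each byte, first pixel in the highest bits."""
    padded = levels + [0] * (-len(levels) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append((a << 6) | (b << 4) | (c << 2) | d)
    return bytes(out)


def synthetic_scan(width=400, height=100, seed=1):
    """Hypothetical test data: noisy light paper with noisy dark vertical strokes."""
    rng = random.Random(seed)
    pixels = []
    for _ in range(height):
        for x in range(width):
            base = 30 if (x // 8) % 5 == 0 else 225
            pixels.append(base + rng.randint(-20, 20))
    return pixels


pixels = synthetic_scan()
raw_8bpp = bytes(pixels)
packed = pack_2bpp(quantize_2bpp(pixels))

print("8 bpp, flate:", len(zlib.compress(raw_8bpp, 9)), "bytes")
print("2 bpp, flate:", len(zlib.compress(packed, 9)), "bytes")
```

The quantization removes the scanner noise from paper and ink areas, so the flate compressor sees long runs of identical values. A real scan has softer edges than this test pattern, which is where the two intermediate gray levels come into play.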

I thought it was a shame that preserving simple colors in documents, such as logos or signatures made with a non-black pen, meant resorting to the Photo mode and getting clumsy, artifact-ridden and blurry PDFs. Why not use the Simple Scan Text method with a few more bits per pixel and get sharp, highly compressed scans that preserve color?

I tested many different scanning programs but was unable to find a good one that created this kind of output, so I went about researching how to make these PDFs myself, and learned that... drum roll... ImageMagick could do it, and with color support!

ScansToPdf script

The result of my efforts is a set of batch file scripts for Windows. The scripts use ImageMagick to process the scans and pdftk to concatenate the pages into one PDF. There are three script variants: High quality, Low quality and Grayscale. They basically do the same thing with different processing settings. The scripts should be used with lossless source images like PNG files.

The process I use is to save the scans in PNG format, select the files that should go into one PDF in File Explorer, and drag and drop them onto the bat file with the desired quality level. This produces a multi-page PDF file with the pages in the same order as the selection.
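The batch files themselves are not reproduced in this post, so the following Python sketch only approximates the flow described above. The option order, the temporary file names and the function name `build_commands` are assumptions; the ImageMagick options used (`-level`, `-colorspace`, `-colors`, `-units`, `-density`, `-compress`) and the pdftk `cat ... output` operation exist, but check them against your installed versions. The sketch only builds the command lines and does not run them.

```python
def build_commands(png_files, output_pdf, colors=64, levels="30%,95%",
                   dpi=300, grayscale=False):
    """Return argument lists: one ImageMagick call per page, then one pdftk merge."""
    commands = []
    page_pdfs = []
    for index, png in enumerate(png_files, start=1):
        page_pdf = f"page_{index:03d}.pdf"  # hypothetical temporary file name
        cmd = ["magick", png, "-level", levels]  # stretch contrast first (assumed order)
        if grayscale:
            cmd += ["-colorspace", "Gray"]
        cmd += [
            "-colors", str(colors),        # reduce to a small palette
            "-units", "PixelsPerInch",
            "-density", str(dpi),          # so the PDF page gets the right paper size
            "-compress", "LZW",            # lossless compression in the PDF
            page_pdf,
        ]
        commands.append(cmd)
        page_pdfs.append(page_pdf)
    # pdftk concatenates the single-page PDFs in the given order.
    commands.append(["pdftk", *page_pdfs, "cat", "output", output_pdf])
    return commands


# Settings taken from the "Low quality" variant in the comparison.
for cmd in build_commands(["scan1.png", "scan2.png"], "document.pdf",
                          colors=12, levels="60%,95%"):
    print(" ".join(cmd))
```

To actually execute the commands, each list could be passed to `subprocess.run(cmd, check=True)`. On older ImageMagick 6 installations the executable is called `convert` rather than `magick`.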

Beware that the generated PDF will only have the correct paper size if the script is configured with the correct scan DPI. I keep everything at 300 DPI.
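The arithmetic behind this is simple: the physical page size is the pixel size divided by the DPI. A small Python sketch, assuming an A4 page scanned at 300 DPI (about 2480 x 3508 pixels, a figure not taken from the scripts), shows what happens when the configured DPI does not match the scan. The function name `page_size` is made up for the example.

```python
def page_size(width_px, height_px, dpi):
    """Return the page size as ((mm_w, mm_h), (pt_w, pt_h)) for a given DPI."""
    inches = (width_px / dpi, height_px / dpi)
    mm = tuple(round(v * 25.4, 1) for v in inches)    # 1 inch = 25.4 mm
    points = tuple(round(v * 72, 1) for v in inches)  # 1 inch = 72 PDF points
    return mm, points


# Correct setting: the scan was made at 300 DPI and the script says 300 DPI.
print(page_size(2480, 3508, 300))

# Wrong setting: same pixels, but the script is configured for 72 DPI,
# so the resulting PDF page comes out more than four times too large.
print(page_size(2480, 3508, 72))
```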

Download here.

PDF size/quality comparison

To compare PDFs created by various scanner software and my ScansToPdf method, I created the test document seen below. It's basically a Danish Windows 7 printer test page with some added manual markup.

The page was printed with a Samsung SL-C480W multifunction color laser printer, which was also used for scanning.

Photo of test document

The page was scanned in both Windows and Linux with various software and settings.

(Close-ups from the PDFs can be found further down in the article.)

Simple Scan, Linux

PDFs generated by Simple Scan v.3.20.0

Scans were done at 300 DPI using either Photo or Text mode:

Photo mode:
  • Download (PDF)
  • Encoding: RGB / DCTDecode (JPEG), PDF v1.3
  • Evaluation: Decent quality, retaining colors nicely. Text slightly soft, artifacts can be seen. Blacks are not fully black.
  • Size: 693 KB
Text mode:
  • Download (PDF)
  • Encoding: 2 bpp grayscale / flate compression, PDF v1.3
  • Evaluation: Great grayscale quality. Text very sharp and clear, artifact free. Blacks are fully black.
  • Size: 176 KB

ScansToPdf script, on Windows

The PDFs were created in Windows 7 using the ScansToPdf bat file scripts downloadable above.

Script input was the reference PNG saved by Easy Document Creator, see below.

High quality
  • Download (PDF)
  • ImageMagick processing: COLORS=64 LEVELS=30%,95%
  • Encoding: 8 bpp, LZW compression, PDF v1.3
  • Evaluation: Very sharp and clear text with well preserved colors. Colors slightly off in the Windows logo. Blacks are fully black.
  • Size: 514 KB
Low quality
  • Download (PDF)
  • ImageMagick processing: COLORS=12 LEVELS=60%,95%
  • Encoding: 8 bpp, LZW compression, PDF v1.3
  • Evaluation: Very sharp and clear text. Handwriting pen colors clearly preserved. Incorrect colors in Windows logo. Blacks are fully black.
  • Size: 261 KB
Grayscale
  • Download (PDF)
  • ImageMagick processing: COLORS=7 -colorspace gray LEVELS=65%,85%
  • Encoding: 8 bpp, LZW compression, PDF v1.3
  • Evaluation: Very sharp and clear text with no color information. Blacks are fully black.
  • Size: 217 KB

Samsung Easy Document Creator, Windows

This is the default scanner software supplied by Samsung.

One scan was done in color at 300 DPI with auto exposure on. The result was saved with different settings:

Lossless
PDF max quality
  • Download (PDF)
  • Encoding: JPEG-like, PDF v1.7
  • Evaluation: Great quality, a bit soft. Blacks are not fully black.
  • Size: 2419 KB
PDF medium quality
  • Download (PDF)
  • Encoding: JPEG-like, PDF v1.7
  • Evaluation: Great quality, a bit soft. Minor artifacts can be seen on close inspection. Blacks are not fully black.
  • Size: 1277 KB
PDF low quality
  • Download (PDF)
  • Encoding: JPEG-like, PDF v1.7
  • Evaluation: Mediocre quality, soft. Artifacts show around text and edges. Blacks are not fully black.
  • Size: 533 KB
Compact PDF low quality
  • Download (PDF)
  • Encoding: ?, PDF v1.7
  • Evaluation: Horrible quality with lots of artifacts, colors show up in a smeary, blurry way.
  • Size: 92 KB

PDF close-ups

Simple Scan - Photo mode - 693 KB: Simple Scan Photo mode PDF sample

Simple Scan - Text mode - 176 KB: Simple Scan Text mode PDF sample

ScansToPdf - High quality - 514 KB: Script high quality PDF sample

ScansToPdf - Low quality - 261 KB: Script low quality PDF sample

ScansToPdf - Grayscale - 217 KB: Script grayscale PDF sample

Easy Document Creator - PDF max quality - 2419 KB: Easy document creator max quality PDF sample

Easy Document Creator - PDF medium quality - 1277 KB: Easy document creator medium quality PDF sample

Easy Document Creator - PDF low quality - 533 KB: Easy document creator low quality PDF sample

Easy Document Creator - Compact PDF low quality - 92 KB: Easy document creator compact low quality PDF sample

