on making and using ebooks

2020-10-05 – I.W.

by now i've put a fair amount of time into hobbyist book digitization and dedrming ebooks. this is my accumulated wisdom: reasons, tools, tricks. updated 2021-10-16.

obligatory note on piracy: it's illegal, which is to say wrong! an offense in the eyes of God, by facilitating it you will bring about the dissolution of orderly society. meditate long and hard on the tears you will bring to me if you use my techniques for evil, and more than that, the tears you will bring to “Mr. Routledge,” the publisher man in a raggedy suit whose loving care produced the most nondescript academic book covers in history. any references to the illicit distribution of ideas in what follows are literary embellishment.

origin of a habit

i once had a prefigurative vision of myself in old age, ravaged by tsundoku, lost in stacks of books which would consume all my funds while mostly never being read. naturally i resolved to avert this dread fate. tsundoku arises from a simple cause, i reasoned: when one feels the impulse of interest in a book, one obtains it with the purest intention to read. but mere possession brings with it enough satisfaction that the impulse to read slackens prematurely, and one listlessly discards the volume in favor of the next prize. the solution i devised was to make possession immaterial to me: switch to reading pirated ebooks rather than paper copies. not only would downloading an epub provide a less addictive pleasure than holding an object in my hands, but if i could just get over the learning curve, i'd save a sizable sum.

also, there was the problem of my decaying focus: by age thirteen i realized it was increasingly difficult for me to keep my eyes on a text for very long without my mind wandering. i'm still not sure whether to think of that as adhd or a vision problem or my being a victim of internet-induced mass attention death. i took to audiobooks to compensate for the issue, since the narrator's insistent voice refused me any chance for distraction. when no audiobook exists, i can use text-to-speech on an ebook for a similar effect, reading the text at the same time as i listen.

as in any fable of fate defied, the result of my efforts was whole new inconveniences. so used to immediately receiving a pdf of any book i wanted, i no longer know how to accept when something is not available. if i can't find something from the usual sources online, i'm overcome with anger, and will go to almost any length to obtain it. if i can get it from the library, i digitize it, and if i don't know the language, i learn to translate it (can't do the latter as often, of course).

my digitization workflow

several times now i've had someone ask me how to digitize a book properly. my method relies on basic command-line navigation skills, and the tools are all things that run on linux. the fact that everything is for linux means my elaborate answers here honestly are probably not useful to the askers. i haven't timed myself at work, but can say vaguely that it takes a couple hours to scan a couple books, and a couple more hours later to turn the scans into a useful format.

to start with, i use a flatbed scanner—there are fancier setups that aim to be easier on a book's spine, but in my experience the strain is negligible and easily justified by the digital immortality that the book will gain by the ordeal.
scan the pages in grayscale mode (not B/W), except where color is necessary. output should be tiff format, whatever high resolution option you have. you need 300dpi+ to get good ocr results, according to conventional wisdom, and erring on the higher side shouldn't hurt since you'll compress it in time for the pdf output. the actual scanning is tedious but tiring in the same way driving is; it's not stimulating but you need to pay a little attention to make sure you're getting every spread and they're not coming out wrong. the scan software should have a preview of your results, but i can't recommend a specific tool because i just use what runs on the scanner workstations at my library and then copy it to a usb key to take home. i pretend i'm back in boarding school and the book is a younger boy who hasn't been paying his dues. dunk his head in the toilet, hold it down, pull up for air, repeat. you get into the rhythm of it, and i fend off boredom by listening on my earbuds to whatever i digitized last week.
post-process the tiffs with scantailor advanced. you can compile it yourself, or i think get a build from a ppa on ubuntu or from nixpkgs (haven't double-checked this). it's pretty straightforward to use and halfway automates most of the steps: fixing image orientation, splitting spreads into pages, deskewing pages, selecting the content region of the page, positioning the content on the output pages, and generating the final output. the main things you have to intervene in by hand are content selection, positioning, identifying regions you don't want converted to black and white, and maybe manually cleaning up stray marks or annotations. the virtue of letting scantailor convert grayscale to b/w instead of the scanner itself is that it can tell undesirable shadows from desirable text and gives you precise control over the line between black and white.
you need to write a few metadata files. i don't put a lot of detail into them. create a metadata.yaml for the epub version in the project directory like this:
```
---
title:
- type: main
  text: "Title"
creator:
- role: author
  text: "Author"
...
```
in the scantailor output directory (i make this a subfolder of the book's project directory), make an equivalent file called metadata for the pdf:
```
~~~
Author: "Author"
Title: "Title"
~~~
```
you'll also need a dummy bookmarks file to generate the pdf. you can make it a proper table of contents later, for now just do this:
```
~~~
"Cover" 1
~~~
```
cd into your scantailor output directory. to generate markdown and pdf output, you'll need tesseract for ocr (probably tesseract-ocr in your distro repos), hocr-combine from hocr-tools (install python from distro repos, then pip install hocr-tools), pandoc (probably in distro repos), imagemagick (ditto), and pdfbeads (install ruby from distro repos, then gem install pdfbeads). might have forgotten some dependencies. i use a script called bind, which looks like this:
```
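#!/bin/bash
# ocr every scantailor page, then build ../book.md and ../book.pdf
pages=(*.tif)
total=${#pages[@]}   # count only the page images, not the metadata files
count=1
for f in "${pages[@]}"; do
  echo "OCRing $f ($count of $total)"
  # writes the hocr next to the image as ${f%.tif}.hocr
  tesseract -l eng "$f" "${f%.tif}" hocr
  # pdfbeads picks up the ocr layer from an .html file named after the image
  mv "${f%.tif}.hocr" "${f%.tif}.html"
  count=$((count + 1))
done
# merge the per-page hocr and flatten it to markdown for the epub
hocr-combine *.html | pandoc -f html-native_divs-native_spans -t markdown+smart -o ../book.md
# bookmarks and metadata are the two files created in the previous steps
pdfbeads -C bookmarks -M metadata > ../book.pdf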
```
change "eng" on the tesseract line to the relevant language code if the book isn't in english.
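since the dependency list above is long and possibly incomplete, a quick check before running bind can save a failed run. this is only a sketch: missing_tools is a made-up helper name, and the binary names are assumptions (imagemagick installs convert or magick depending on version, for instance).

```shell
#!/bin/bash
# print every command from the argument list that is not on PATH
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# binary names are assumptions; adjust them to whatever your distro installs
missing_tools tesseract hocr-combine pandoc convert pdfbeads
```

if it prints nothing, everything on the list was found.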
you'll want to add a real table of contents to the pdf. the more manual way is: open book.pdf and use the page numbers in it to write a better bookmarks file, then run the pdfbeads command from the above script again to make a final pdf. the other way is to use emacs toc-mode. if you can get it installed, it's extremely convenient! it does depend on you knowing your way around emacs, though.
the markdown file book.md is the corpus you can use to generate an epub. you'll want to edit it to fix typos and remove page numbers (regex helps for this), add chapter headings, and if necessary insert images. this step is kinda optional, depending on marginal return. i'm often satisfied with only a pdf for nonfiction books, or i put the minimum of effort into the epub that i need to make it readable by text-to-speech. when you're satisfied, do pandoc book.md -o book.epub --metadata-file=metadata.yaml.
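for the page number cleanup, one possible regex pass is to delete lines that contain nothing but a short number. this is a sketch under assumptions: strip_page_numbers is a made-up name, and the pattern guesses that page numbers sit alone on a line with at most four digits, which will not hold for every book and could also eat a legitimate number that happens to stand alone.

```shell
#!/bin/bash
# drop lines that hold nothing but a 1-4 digit number (a likely page number);
# reads markdown on stdin and writes the cleaned text to stdout
strip_page_numbers() {
  sed -E '/^[[:space:]]*[0-9]{1,4}[[:space:]]*$/d'
}

# small demonstration on inline text rather than a real book.md:
# the lone "42" is removed, the year inside a sentence is kept
printf 'end of a page\n\n42\n\nnext page, written in 1984\n' | strip_page_numbers
```

on a real project it would be something like strip_page_numbers < book.md > book.clean.md, followed by a diff to see what got removed.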

converting and improving existing ebooks

many ebooks are put together badly in one way or another: pdfs that are badly scanned, with a spread per page, skewed pages, a bad or nonexistent ocr layer, or no table of contents. there are a few ways to correct them short of finding and rescanning the book.

pdftk is an all-purpose manipulation tool that lets you do things like combining pages from multiple pdfs, removing or rotating pages, or inserting a table of contents using a text file.
toc-mode for emacs is an astonishingly convenient way to add a table of contents to a pdf. it extracts entries from the in-text toc, lets you clean them up, then makes it easy for you to check they work and offset them correctly. however, it has dependencies that i have been unable to get working since i moved to nixos, so for now i'm stuck using pdftk.
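for the pdftk route, the text file is a list of bookmark records. the sketch below turns a simpler hand-written list into that format; toc_to_pdftk and the "level|page|title" input layout are made up for illustration, while the BookmarkBegin/BookmarkTitle/BookmarkLevel/BookmarkPageNumber keys are the ones pdftk itself reads and writes.

```shell
#!/bin/bash
# turn "level|page|title" lines on stdin into pdftk bookmark records
toc_to_pdftk() {
  while IFS='|' read -r level page title; do
    [ -n "$title" ] || continue   # skip blank or malformed lines
    printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %s\nBookmarkPageNumber: %s\n' \
      "$title" "$level" "$page"
  done
}

# hypothetical entries; the page numbers are pdf pages, not printed ones
printf '1|1|Cover\n1|7|Chapter One\n2|9|A Subsection\n' | toc_to_pdftk
```

save the output to a file (say toc.info) and apply it with pdftk book.pdf update_info toc.info output book-toc.pdf. for titles with non-ascii characters, update_info_utf8 is the variant to use.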
briss is a janky and long-unupdated java tool for cropping pdfs, but the only one that seems to actually remove cropped-out text instead of only cropping the image layer. briss2 is a slightly more recent but also abandoned fork which adds a couple of features. you will need the version from this patch to make it build in CURRENTYEAR, or just use the original briss. briss is useful for splitting apart bad scans where each page contains a whole spread.
pdfsandwich is a sensible automatic reprocessor that deskews, cleans appearance, and redoes ocr using tesseract. a warning: pdfsandwich crashes if you use it on a pdf with more than ~10k pages. this is frustrating, if you've already been waiting on it for hours! for lengthy pdfs, you will want to first break them into parts with pdftk, then run the parts through pdfsandwich, before finally recombining with pdftk.
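the split, reprocess, recombine routine could be scripted roughly like this. it is a sketch, not a tested pipeline: page_ranges and ocr_in_parts are made-up names, the part file names are placeholders, and the chunk size is whatever your patience allows.

```shell
#!/bin/bash
# print "start-end" page ranges covering pages 1..total in chunks of size
page_ranges() {
  local total=$1 size=$2 start=1 end
  while [ "$start" -le "$total" ]; do
    end=$((start + size - 1))
    [ "$end" -gt "$total" ] && end=$total
    echo "$start-$end"
    start=$((end + 1))
  done
}

# split src into chunks, run each through pdfsandwich, rejoin into dst
ocr_in_parts() {
  local src=$1 dst=$2 size=$3 total range
  total=$(pdftk "$src" dump_data | awk '/^NumberOfPages:/ {print $2}')
  set --   # reuse the positional parameters to collect the finished parts
  for range in $(page_ranges "$total" "$size"); do
    pdftk "$src" cat "$range" output "part-$range.pdf"
    pdfsandwich -o "part-$range-ocr.pdf" "part-$range.pdf"
    set -- "$@" "part-$range-ocr.pdf"
  done
  pdftk "$@" cat output "$dst"
}

# the ranges a 1250-page pdf would be cut into at 500 pages per chunk
page_ranges 1250 500
```

something like ocr_in_parts big.pdf big-ocr.pdf 500 would then do the whole run, leaving the part files behind to delete by hand.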
calibre is a beastly piece of software, but its conversion abilities are handy. if you want to turn a pdf into an epub, you can try using it with the “heuristic processing” option. the results will still be bad, because a pdf is a digital representation of a physical artifact rather than a representation of a semantic book. however, they may be good enough to use with text to speech while looking at the actual book, something no one but me has ever done.
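the same conversion can be run without the gui through calibre's ebook-convert command, where the heuristic processing checkbox corresponds to the --enable-heuristics flag. a minimal sketch, with epub_name and pdf_to_epub as made-up helper names:

```shell
#!/bin/bash
# name the epub after the pdf: scans/book.pdf -> scans/book.epub
epub_name() {
  printf '%s\n' "${1%.pdf}.epub"
}

# convert one pdf with calibre's heuristic processing switched on
pdf_to_epub() {
  ebook-convert "$1" "$(epub_name "$1")" --enable-heuristics
}

# shows only the output name; call pdf_to_epub scans/book.pdf to convert
epub_name scans/book.pdf
```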
kindle comic converter is for the edge case of someone who has a big cbz comic that they want to read on their black and white kindle in .mobi form. it's good at what it does.

books that i have put together

many of the things i've scanned i don't read, or not immediately, so i can't always say whether something is worth reading. more or less everything i've scanned or dedrmed is stored on aaarg, as well as in my ebook library.

dedrming books

to remove drm from a book, you will need calibre and the dedrm_tools plugin. this works for the adobe digital editions pdf/epub books you might get from your school library (i use a windows vm for this part), as well as for books from the kindle store. for kindle books, it helps if you have a physical kindle and can use the "download and transfer by usb" option, else you need some particular version of the kindle desktop software. amazon is lenient with refunds, so you can probably buy a book, download and dedrm it, then immediately refund it! however, no guarantees that they won't catch onto you and deny the refund, especially if you do it frequently.

it also used to be possible to sign up for audible with a new account and dummy credit card information, download the free trial audiobook very quickly before the payment verification fails, then decrypt it and remove the drm. however, i haven't done this in a long time, and i remember having trouble the last time. dedrming audible books should still be easy enough using audible-activator and ffmpeg, you just will probably have to pay for it.

tools for reading ebooks

i've ended up trying many ebook readers, and only a few have been particularly worthwhile.

a kindle. you can find these super cheap (i picked up my current one for 30 USD used), since amazon subsidizes them to hell off ebook purchases. just use calibre to stock it with whatever you want. the old ones are as good as the new ones.
voiceover/talkback features on ios/android will let you use text-to-speech on ebooks, but they may require you to keep the phone screen on. this is frustrating when you're trying to do something else, because it drains the battery, doesn't let you pause/play easily, and you will definitely brush the screen by accident in your pocket and mess it up. i use the kindle app on grapheneos to do this and sideload the books. i remember google play books working better, but it's been a very long time.
on ios, marvin 3 is my ideal epub reader (with cbz support too). notably, it has opds support, and integrates text-to-speech so that you can turn off the phone screen while it reads and control it with your headphone buttons, instead of faking it with voiceover. i don't use this anymore because i don't have ios.
koreader is great on android (also supports desktop linux and jailbroken ereaders) for anything but text to speech. it's the first app i've seen make reading pdfs manageable on a phone-size screen, through some smart navigational features and text reflow.
evince, the standard gnome document viewer on linux, is smooth and simple for pdf and cbz.
emacs pdf-tools is the other good linux pdf viewer; it has features like recoloring but it does not have smooth scrolling or continuous pages. with org-noter you can achieve a sophisticated annotation system in emacs, though the only epub mode is nov.el, and it isn't phenomenal.