on making and using ebooks
by now i've put a fair amount of time into hobbyist book digitization and dedrming ebooks. this is my accumulated wisdom: reasons, tools, tricks. updated <2021-10-16 Sat> and <2025-12-07 Sun>.
i have used ebooks almost to the exclusion of physical copies for over a decade, for a few reasons:
- possession of a physical copy creates a psychological illusion that one has already done the work, a barrier to actually reading (hence tsundoku etc)
- it's simply not economical to collect physical copies. if you're doing research, you just need to bite the bullet and get 100% used to ebooks, whatever it takes. the time saved in pulling up whatever texts the chain of mental associations takes you to ultimately creates an altered relationship with texts that previous generations couldn't have imagined.
- in the past i was very adhd and relied on text-to-speech while reading to keep my focus from drifting. i am more medicated now, but still do this to an extent.
tools for scanning ebooks
these are tools i've found useful at some point. however, i no longer have a working pipeline, as it's been a long time since i digitized any books and some tools have become broken. i was never completely happy with my pipeline anyway.
- scantailor-advanced: run all your scans through scantailor for postprocessing before you try to make a pdf! it allows you to split and deskew pages, format margins, and convert grayscale to b/w. you should be scanning in highish dpi grayscale tiff and then using other tools to compress afterwards.
- tesseract: ocr tool. perhaps one of the newer "AI" tools is better for this now (or for some purposes), but i haven't tried them.
- hocr-tools: used for turning hocr (output by tesseract) into pdf
- jbig2enc: compress tiff into jbig2 format
- pdfbeads: now VERY deprecated tool for compressing the scan layer and combining it with ocr data.
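a minimal sketch of just the ocr step, not a working pipeline: it builds one tesseract call per page and asks for hocr output, which hocr-tools can then consume. the directory name `scantailor-out`, the function names, and the default language are hypothetical; the only tesseract behaviour relied on is the usual `tesseract IMAGE OUTPUTBASE -l LANG hocr` form, which writes `OUTPUTBASE.hocr`.

```python
import subprocess
from pathlib import Path


def tesseract_hocr_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the tesseract call that writes <image stem>.hocr next to the image."""
    # tesseract appends ".hocr" to the output base itself
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang, "hocr"]


def plan_ocr(scan_dir: Path, lang: str = "eng") -> list[list[str]]:
    """One command per .tif page, in sorted filename order."""
    return [tesseract_hocr_command(p, lang) for p in sorted(scan_dir.glob("*.tif"))]


def run_ocr(scan_dir: Path, lang: str = "eng") -> None:
    """Run the planned commands; requires tesseract on PATH."""
    for cmd in plan_ocr(scan_dir, lang):
        subprocess.run(cmd, check=True)
```

calling `run_ocr(Path("scantailor-out"))` would ocr every tiff that scantailor wrote to that (hypothetical) folder. building the command lists separately from running them makes it easy to print and check them before committing to a long job.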
improving existing pdfs
many ebooks are put together badly in one way or another: pdfs that are badly scanned, have a spread per page, skewed pages, bad or nonexistent ocr layer, no table of contents. there are a few ways to correct them short of finding and rescanning the book. i often find myself using briss or pdfsandwich to improve somebody else's badly-done scans before reading.
- pdftk is an all-purpose manipulation tool that lets you do things like combining pages from multiple pdfs, removing or rotating pages, or inserting a table of contents using a text file.
- toc-mode for emacs is an astonishingly convenient way to add a table of contents to a pdf. it extracts entries from the in-text toc, lets you clean them up, then makes it easy for you to check they work and offset them correctly. however, it has dependencies that i have been unable to get working since i moved to nixos, so for now i'm stuck using pdftk.
- briss is a janky and long-unupdated java tool for cropping pdfs, but the only one that seems to actually remove cropped-out text instead of only cropping the image layer. briss is useful for splitting apart bad scans where each page contains a whole spread.
- doc-tools-toc for emacs: extract the table of contents text from a pdf's ocr layer, edit it for errors, and make it a proper toc sidebar
- pdfsandwich is a sensible automatic reprocessor that deskews, cleans appearance, and redoes ocr using tesseract. a warning: pdfsandwich crashes if you use it on a pdf with more than ~10k pages. (update <2025-12-07 Sun>: i do not remember what i was doing with a ten thousand page pdf and honestly dread to think.) this is frustrating, if you've already been waiting on it for hours! for lengthy pdfs, you will want to first break them into parts with pdftk, then run the parts through pdfsandwich, before finally recombining with pdftk.
- ocrmypdf looks like a similar tool to pdfsandwich, but i haven't tried it yet.
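the split, ocr, recombine workaround for very long pdfs could be sketched roughly like this. it is an illustration under assumptions, not a tested recipe: the part file names and the chunk size are made up, the page count has to be supplied by you (`pdftk book.pdf dump_data` reports it as `NumberOfPages`), and it relies on pdfsandwich's default of writing its result to `<input>_ocr.pdf`.

```python
import subprocess


def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def plan_commands(src: str, total_pages: int, chunk_size: int, out: str) -> list[list[str]]:
    """Return the pdftk / pdfsandwich calls in the order they should run."""
    commands: list[list[str]] = []
    ocr_parts: list[str] = []
    for i, (first, last) in enumerate(page_ranges(total_pages, chunk_size), start=1):
        part = f"part{i:03d}.pdf"  # hypothetical naming scheme
        # pdftk: copy an inclusive page range into its own file
        commands.append(["pdftk", src, "cat", f"{first}-{last}", "output", part])
        # pdfsandwich: default output name is <input>_ocr.pdf
        commands.append(["pdfsandwich", part])
        ocr_parts.append(f"part{i:03d}_ocr.pdf")
    # pdftk: concatenate the processed parts back into one file
    commands.append(["pdftk", *ocr_parts, "cat", "output", out])
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

for example, `run_all(plan_commands("book.pdf", 1200, 300, "book_ocr.pdf"))` would process a 1200-page file in four 300-page parts. because each part is a separate pdfsandwich run, a crash only costs you that part rather than the whole job.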
converting between formats
it's also often expedient to convert between various formats (epub to pdf, pdf to mobi, etc.) for different purposes, though the process is very lossy.
- calibre is a beastly piece of software, but its conversion abilities are handy. if you want to turn a pdf into an epub, you can try using it with the “heuristic processing” option. the results will still be bad, because a pdf is a digital representation of a physical artifact rather than a representation of a semantic book. however, they may be good enough to use with text to speech while looking at the actual book, something no one but me has ever done.
- pdftohtml from poppler-utils
- kindle comic converter is for the edge case of someone who has a big cbz comic that they want to read on their black and white kindle in .mobi form. it's good at what it does.
- pandoc: converts various formats, e.g. md, org, html, epub, pdf (the latter only for output).
- tectonic: the only LaTeX engine i've found to work well, especially when it comes to chinese characters. useful with pandoc for publishing books or articles you've composed yourself.
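two of the conversions above, written out as command builders. this is a sketch under assumptions: the file names and function names are placeholders, `ebook-convert` is the command-line converter that ships with calibre and `--enable-heuristics` is its switch for heuristic processing, and `--pdf-engine=tectonic` is how pandoc is pointed at tectonic. passing a cjk font through pandoc's `CJKmainfont` variable is one plausible way to handle chinese text, not necessarily the setup used here, and it needs a font that is actually installed.

```python
import subprocess
from typing import Optional


def calibre_pdf_to_epub(src: str, dst: str) -> list[str]:
    """pdf -> epub with calibre's heuristic processing turned on."""
    return ["ebook-convert", src, dst, "--enable-heuristics"]


def pandoc_to_pdf(src: str, dst: str, cjk_font: Optional[str] = None) -> list[str]:
    """md/org/html/epub -> pdf via pandoc, using tectonic as the LaTeX engine."""
    cmd = ["pandoc", src, "-o", dst, "--pdf-engine=tectonic"]
    if cjk_font:
        # assumption: the named font is installed and visible to the engine
        cmd += ["-V", f"CJKmainfont={cjk_font}"]
    return cmd


def run(cmd: list[str]) -> None:
    """Run one conversion; requires the relevant tool on PATH."""
    subprocess.run(cmd, check=True)
```

for instance, `run(pandoc_to_pdf("essay.md", "essay.pdf", "Noto Serif CJK SC"))` would typeset a markdown file with a cjk font, assuming that font exists on your system.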
dedrming books
- to remove drm from a book, you will need calibre and the dedrmtools plugin. this works for the adobe digital editions pdf/epub books you might get from your school library (i used to use a windows vm for this part), as well as for books from the kindle store. for kindle books, it helps if you have a physical kindle and can use the "download and transfer by usb" option; otherwise you need some particular version of the kindle desktop software. amazon is lenient with refunds, so you can probably buy a book, download and dedrm it, then immediately refund it! however, no guarantees that they won't catch onto you and deny the refund, especially if you do it frequently.
- it also used to be possible to sign up for audible with a new account and dummy credit card information, download the free trial audiobook very quickly before the payment verification fails, then decrypt it and remove the drm. however, i haven't done this in a long time, and i remember having trouble the last time. dedrming audible books should still be easy enough using audible-activator and ffmpeg, you just will probably have to pay for it. n.b. i have not done this in years, and audible may well have changed how it does things.
tools for reading ebooks
i've ended up trying many ebook readers, and only a few have been particularly worthwhile.
- a kindle. you can find these super cheap (i picked up my current one for 30 USD used), since they subsidize them to hell off of ebook purchases. just use a usb cable to stock it with whatever you want. the old ones are as good as the new ones. i'm sure there are other readers that are at least as good; what's important is just that you can install koreader on it.
- voiceover/talkback features on ios/android will let you use text-to-speech on ebooks, but they may require you to keep the phone screen on. this is frustrating when you're trying to do something else, because it drains the battery, doesn't let you pause/play easily, and you will definitely brush the screen by accident in your pocket and mess it up. i use the kindle app on grapheneos to do this and sideload the books.
- on ios, marvin 3 was my ideal epub reader (with cbz support too). notably, it has opds support, and integrates text to speech so that you can turn off the phone screen while it reads and control it with your headphone buttons, instead of faking it with voiceover. i haven't used this in many years, since i don't have ios.
- koreader is great on android (also supports desktop linux and jailbroken ereaders) for anything but text to speech. it's the first app i've seen make reading pdfs manageable on a phone-size screen, through some smart navigational features and text reflow.
- on desktop, zotero's built-in pdf viewer is now not bad. i never used to understand people who used highlighters, but i've lately gotten into mutilating the page with the red highlighter function whenever i disagree with a claim or dislike the author's wording.
- emacs pdf-tools is the other good linux pdf viewer; it has features like recoloring, but it does not have smooth scrolling or continuous pages. with org-noter you can achieve a sophisticated annotation system in emacs, though the only epub mode is nov.el, and it isn't phenomenal.
books that i have put together
i've scanned or dedrmed around fifty books, most of which i uploaded to the now-defunct aaarg (RIP), whose contents have now been merged into anna's archive.