This webpage describes my (quixotic) project to “understand” (and maybe explain) the TeX program.
There are a lot of plans, but very little has been done so far. If you’d like to help with (or take over!) any of this, let me know!
The source code of the TeX program, written in a “literate programming” style, has been published as a book:
The main contents of the book (the program itself), other than some introduction and appendices, is also available as a PDF:
Run texdoc tex if you have a TeX distribution installed, or else find it online here (3.4 MiB).
Apparently a lot of people have successfully read this.
When I tried to read this book (in 2018 or so), I found it not very easy going. Despite the obvious (and obviously correct) reasons for this difficulty—namely, that I am not very good at reading code, and that reading a program is never expected to be very easy anyway—I cannot help thinking that the task could be made easier if the program were presented differently.
I can think of three aspects to this:
The program is written in Donald Knuth’s own literate-programming system called WEB (created for writing TeX in), which can be seen as a preprocessing layer on top of a certain dialect of Pascal. The code mostly makes sense to programmers familiar with a C-like language, but there are several subtle differences. Moreover, the typographical conventions of the typeset output—the use of proportional fonts instead of monospaced fonts, the (today) uncommon style of indentation, putting multiple statements on a line like “paragraphs”, syntax-highlighting using bold/italic fonts rather than colour, etc.—are not what most of us are used to. There are also the special features/innovations specific to literate programming: the printed book has a mini-index on each two-page spread (TODO: add example image below), and in the PDF the references to other sections are hyperlinked. But neither has back-references (where else does this symbol appear?), nor any way of distinguishing variables from functions from WEB macros. Also, the context of a LP module (section) would be useful to have: expand/collapse a section in place, see a quick list of what names (symbols) it defines/uses/indexes, etc.
And so on. So fixing all this would require parsing the code and writing a code formatter and code browser.
A more advanced version of this would be to also translate the code from WEB/Pascal to some other more familiar programming language, one in which functions can return values without assigning to a pseudo-variable, for example. Also, the code browser could highlight the non-trivial control-flow in cases of gotos.
Apart from presenting code nicely or differently, which helps in the small, we could try to make it easier to find answers to the questions that one normally asks about a program as a whole. What are the (major?) components of the program, and how do they interact? What is its architecture?
The TeX program is, partly because of limitations of the language being targeted at the time, written (at the Pascal level) as a single monolithic program (in a single file), of roughly the following form:
```pascal
program TEX;
label 1, 9998, 9999;
const ...;
type ...;
var ...;
procedure p1; ... end;
...
procedure pN; ... end;
begin ... end.
```
That is, a single file which contains a bunch of (global) declarations of labels, constants, types and variables, then a bunch of procedures and functions, then finally a “main” block.
There are no “modules” in the sense of separate compilation units. There are hundreds (about 300?) of global variables. Such a program is usually hard to understand, though the Literate Programming paradigm helps a great deal—there are “modules” (chapters and sections) at the WEB level. So what would be good would be to highlight these different major components—hyphenation, macro expansion, line-breaking, math lists, etc.—and show the precise interactions and interfaces between them. If someone wants to understand just one part of TeX in isolation (with the minimal amount required from other parts), they should be able to do so. The hundreds of global variables are actually mostly used in about one chapter each (apart from some common places like initialization code), so wouldn’t it be cool to automatically associate them with their corresponding chapters? Etc.
“An algorithm must be seen to be believed”, wrote DEK near the start of TAOCP. There are limits to how much we can understand a program purely from a static representation of its code, no matter how well-formatted, hyperlinked, contextualized, etc. It would be useful to see what TeX does as it runs, on a few easy and not-so-easy inputs. E.g., what is TeX’s input processor doing, what is its macro expansion doing, its paragraph builder and page builder, etc. TeX does have debug output for much of this (with \tracingall, for example), but all of it goes to the same terminal in cryptic form, from which it must be decoded.
Ideally, debug output would be in a clear/verbose format that highlights the workings of each part of TeX individually, and in expandable detail. This could be achieved by parsing log output or by running TeX in a regular debugger like gdb, but best would be a custom debugger. It could have separate areas for TeX’s “eyes”, “mouth”, “stomach”—or, in more detail, for say TeX’s input stack, macro expansion, token lists, hyphenation, etc.—with clear highlighting of control passing between them.
[Diagram: the main loop at the top, with action procedures on one side and get_next/get_token leading down to the input processor (the “mouth”).]
Overall, these three (A1, A2, A3) amount to writing a code formatter, code browser, sophisticated static analysis, a custom runtime/interpreter/debugger… all for the sake of a single program, which is clearly insane. Hence, the “quixotic” at the top.
While these (A1, A2, A3) above all involve changing how the program is presented, there are other things we can do in parallel either as complement or alternative:
Just get some familiarity and learn about the WEB language, both the semantics of Pascal/WEB, and the typographical conventions of WEAVE (the typeset output) — the cross-references, etc. “How to read a WEB program”, basically.
Instead of attacking the TeX program directly, work up to it, by reading other programs by Knuth or other examples of WEB, or even Knuth’s more recent (CWEB) programs — just to get practice with the style and programming philosophy.
Read other TeX implementations or programs — you don’t have to read Knuth’s WEB if that’s not the easiest to read.
Look at the TeX program historically. Start with the earliest available version or design document (
TEX.ONE), match with SAILDART, read TeX78, follow along with errorlog, etc. What were the constraints at the time?
Some general remarks about software engineering, about how Knuth is different, why we should expect this to be hard, etc. Probably doesn’t help, but is fun to do.
Just read the program as it is, understand, take notes from understanding. If I had any sense this is what I’d do, but I’m saving it for the last. Preserving my precious state of ignorance/confusion, because once (if) I get too comfortable I’ll no longer have an idea what the difficulties were.
These then — A1, A2, A3, and B1, B2, B3, B4, B5, B6 — are the 9 kinds of activities that this project involves. I’ll (soon) add more detail on each of these below: what has been done (nearly nothing), and what could be done.
I gave a talk about some of this at TUG 2019. At the time I proposed the talk, I was wildly optimistic about how much would be done by the time of the talk. Turns out it was nothing, so (looking back at it now…) the talk looks like nonsense with nothing of value whatsoever, so I feel guilty and embarrassed about it. On the other hand, the talk was (fortunately for me) not recorded, so maybe I can console myself that it was just a few minutes of lighthearted fun for the participants who were physically present. Here’s a link to the slides.
To simply use TeX, you don’t need to know anything about how the TeX program itself is implemented.
See preface from book. Software engineering. E.g. how to do memory allocation, packed tries, ….
Another reason to read the TeX program: sometimes it’s easier to read the program than to understand The TeXbook.
Some highlights from the book and PDF:
Chart of TeX’s code components (dark) and memory regions (light+dark) (page 594), also available here (via this in TUGboat – but note that the middle section is scaled by amount of code as Knuth says in the videos, and not by memory usage: scaling by memory is only for the outer sections).
Parameters like \baselineskip are merely “entrances” (that’s OP’s terminology) for internally stored parameters. Shows the run loop (with its switch) in the prefixed_command procedure that’s called for both of them, and the primitive procedures that set eqtb for the primitives.
A debugger may help. I started writing something (has a really bad interface and is implemented in a stupid way currently).
Pages from the book:
Preface (page v)
How to Read a WEB (pages viii-xiii – temporarily taken down till I can double-check permission issues)
Additionally, to become familiar with the WEB style of writing programs, you may also want to read:
You can read Jensen and Wirth (“Pascal user manual and report”), the original tutorial and definition of Pascal.
An interesting document is Kernighan (the K in K&R) on “Why Pascal is Not My Favorite Programming Language”. Kernighan faced many difficulties with Pascal, and WEB in many ways is a solution to those same difficulties (I wrote more on that at the beginning and end of the annotated version of POOLTYPE).
You can try writing a few of your own small programs in Pascal (with and without WEB), as with my 7-page “Hello, world!” program here.
You can try reading some Pascal programs written by others, so that you can get over at least the idiosyncrasies that are common to all Pascal programs. My suggestions are:
Note that these two programs are both compilers. TeX itself is written like a compiler, so you’ll find many similarities.
You may want to work your way up to TeX from smaller programs written by Knuth in the same style: I’ve prepared a List of WEB files. Specifically, a possible reading order, from smallest to largest, is:
| Program | Pages | Sections | Fraction of TeX |
| --- | --- | --- | --- |
| POOLTYPE | 7 + 4 | 20 + 2 | ≈ 1.4% to 2% |
| GLUE | 8 + 3 | 26 + 1 | ≈ 1.6% to 2% |
| DVITYPE | 47 + 7 | 111 + 2 | ≈ 8% to 10% |
| TANGLE | 66 + 9 | 187 + 2 | ≈ 13.5% to 14% |
| WEAVE | 98 + 12 | 263 + 2 | ≈ 19% to 20.5% |
| TEX | 478 + 57 | 1378 + 2 | 100% |
POOLTYPE and DVITYPE share a lot of their code with TeX (specifically, TeX’s string handling and DVI output sections respectively),
GLUE shows alternatives to some code in TeX (and is probably best read after understanding the corresponding parts of TeX, even though it was published first),
TANGLE and WEAVE are the implementation of WEB, and perhaps worth reading as programs of smaller/intermediate size.
I have separate pages for each of these programs on this site:
Totally unrelated to TeX, but you could look at other “literate programs” entirely: Knuth’s CWEB programs, or the (Academy Award winning!) Physically Based Rendering book (see random chapter).
Instead of reading the Pascal source as written by Knuth, there are many other versions you could read. I’ve collected a bunch here.
Many (not all!) are only of historical interest.
The most relevant may be LuaTeX, which is under active development, and started with a manual translation of the Pascal code to C. For (a randomly picked?) example, compare section 426 of the TeX program with this part of LuaTeX — it is in more familiar C style, but there are also more cases. (Compare with pdfTeX and XeTeX.)
After having read these smaller programs and having gained a bit of familiarity with Pascal, before reading the full TeX program I strongly recommend the series of 12 lectures that Knuth gave in 1982 called The Internal Details of TeX82. They are available on YouTube, but I’ve embedded them on this website, with some comments, here.
Also, as you read the program or watch the videos, you can try solving these exercises from a course DEK gave about TeX:
It is unclear how much this would help, but often the earlier versions of a program are less complex, or at least illuminate how the program got into its current state.