The TeX program: A program of study

Note (2022): This web page was mostly written in 2019, but now (2022) I have a much better understanding of the program and its big picture, and can basically read it off the book somewhat comfortably. I’ll add a new page with my improved understanding (and what helped); hopefully sometime this year.

Short version

This webpage describes my (quixotic) project to “understand” (and maybe explain) the TeX program.

There are a lot of plans, but very little has been done so far. If you’d like to help with (or take over!) any of this, let me know!


Medium-length version

The source code of the TeX program, written in a “literate programming” style, has been published as a book:

The main contents of the book (the program itself), other than some introduction and appendices, is also available as a PDF:

Apparently a lot of people have successfully read this.

When I tried to read this book (in 2018 or so), I found it not very easy going. Despite the obvious (and obviously correct) reasons for this diffculty—namely, that I am not very good at reading code, and that reading a program is never expected to be very easy anyway—I cannot help thinking that the task could be made easier if the program were to be presented differently.

I can think of three aspects to this:

(A1) Superficial aspects of presentation.

The program is written in Donald Knuth’s own literate-programming system called WEB (created for writing TeX in), which can be seen as a preprocessing layer on top of a certain dialect of Pascal. The code mostly makes sense to programmers familiar with a C-like language, but there are several subtle differences. Moreover, the typographical conventions of the typeset of output—the use of proportional fonts instead of monospaced fonts, the (today) uncommon style of indentation, putting multiple statements on a line like “paragraphs”, syntax-highlighting using bold/italic fonts rather than using colour, etc.—are not what most of us are used to. There are also the special features/innovations specific to literate programming: the printed book has a mini-index on each two-page spread (TODO: add example image below); in the PDF the references to other sections are hyperlinked. Neither of them has back-references (where all does this symbol appear?), nor does it have any way of distinguishing variables from functions from WEB macros. Also, the context of a LP module (section) would be useful to have: expand/collapse a section in place, see a quick list of what names (symbols) it defines/uses/indexes, etc.

And so on. So fixing all this would require parsing the code and writing a code formatter and code browser.

A more advanced version of this would be to also translate the code from WEB/Pascal to some other more familiar programming language, one in which functions can return values without assigning to a pseudo-variable, for example. Also, the code browser could highlight the non-trivial control-flow in cases of gotos.

(A2) Overall program structure

Apart from presenting code nicely or differently, which helps in the small, we could try to make it easier to find answers to the questions that one normally asks about a program as a whole. What are the (major?) components of the program, and how do they interact? What is its architecture?

The TeX program is, partly because of limitations of the language that was being targeted at the time, written (at the Pascal level) as a single monolithic program (in a single file) of the form something like:

program TEX;
label
    1, 9998, 9999;
const
    ...;
type
    ...;
var
    ...;

procedure p1
    ...
end;
...
procedure pN
    ...
end;

begin
    ...
end;

That is, a single file which contains a bunch of (global) declarations of labels, constants, types and variables, then a bunch of procedures and functions, then finally a “main” beginend block.

There are no “modules” in the sense of separate compilation units. There are hundreds (about 300?) of global variables. Such a program is hard to understand usually, though the Literate Programming paradigm helps a great deal—there are “modules” (chapters and sections) at the WEB level. So what would be good would be to highlight these different major components—hyphenation, macro expansion, line-breaking, math lists, etc.—and show the precise interactions and interfaces between them. If someone wants to understand just one part of TeX in isolation (with the minimal amount required from other parts), they should be able to do so. The hundreds of global variables are actually mostly used in one chapter or so each (apart from some common places like initialization code), so wouldn’t it be cool to automatically associate them to corresponding chapters? Etc.

(A3) Program execution / dynamic nature.

“An algorithm must be seen to be believed”, wrote DEK near the start of TAOCP. There are limits to how much we can understand a program purely from a static representation of its code, no matter how well-formatted, hyperlinked, contextualized, etc. It would be useful to see what TeX does as it runs, on a few easy and not-so-easy inputs. E.g., what is TeX’s input processor doing, what is its macro expansion doing, its paragraph builder and page builder, etc. TeX does have debug output for much of this (with \tracingall for example), but all this debug output goes to the same terminal in cryptic form, from which it must be decoded.

Ideally, debug output would be in a clear/verbose format that highlights the workings of each part of TeX individually, and in expandable detail. This could be got by parsing log output or by running TeX in a regular debugger like gdb, but best would be a custom debugger. It could have separate areas for TeX’s “eyes”, “mouth”, “stomach”, or, in more detail for say TeX’s input stack, macro expansion, token lists, hyphenation, etc., with clear highlighting of control passing between them.

                 [main loop]
               /         \    \  action procedures
get_next,     /                \
get_token    /                  \ ...
            /
           /
   [Input = mouth]
      /    \
     /      \
   /         \

Overall, these three (A1, A2, A3) amount to writing a code formatter, code browser, sophisticated static analysis, a custom runtime/interpreter/debugger… all for the sake of a single program, which is clearly insane. Hence, the “quixotic” at the top.

While these (A1, A2, A3) above all involve changing how the program is presented, there are other things we can do in parallel either as complement or alternative:

(B1) Preparation.

Just get some familiarity and learn about the WEB language, both the semantics of Pascal/WEB, and the typographical conventions of WEAVE (the typeset output) — the cross-references, etc. “How to read a WEB program”, basically.

(B2) Other programs.

Instead of attacking the TeX program directly, work up to it, by reading other programs by Knuth or other examples of WEB, or even Knuth’s more recent (CWEB) programs — just to get practice with the style and programming philosophy.

(B3) Other TeX implementations.

Read other TeX implementations or programs — you don’t have to read Knuth’s WEB if that’s not the easiest to read.

(B4) History.

Look at the TeX program historically. Start with the earliest available version or design document (TEXDR.AFT and TEX.ONE), match with SAILDART, read TeX78, follow along with errorlog, etc. What were the constraints at the time?

(B5) Kibitz / philosophize.

Some general remarks about software engineering, about how Knuth is different, why we should expect this to be hard, etc. Probably doesn’t help, but is fun to do.

(B6) Read!

Just read the program as it is, understand, take notes from understanding. If I had any sense this is what I’d do, but I’m saving it for the last. Preserving my precious state of ignorance/confusion, because once (if) I get too comfortable I’ll no longer have an idea what the difficulties were.


These then — A1, A2, A3, and B1, B2, B3, B4, B5, B6 — are the 9 kinds of activities that this project involves. I’ll (soon) add more detail on each of these below: what has been done (nearly nothing), and what could be done.


I gave a talk about some of this at TUG 2019. At the time I proposed the talk, I was wildly optimistic about how much would be done by the time of the talk. Turns out it was nothing, so (looking back at it now…) the talk looks like nonsense with nothing of value whatsoever, so I feel guilty and embarrassed about it. On the other hand, the talk was (fortunately for me) not recorded, so maybe I can console myself that it was just a few minutes of lighthearted fun for the participants who were physically present. Here’s a link to the slides.


Long version

(WIP)

Why read TeX?

To simply use TeX, you don’t need to know anything about how the TeX program itself is implemented.

See preface from book. Software engineering. E.g. how to do memory allocation, packed tries, ….

Another reason to read the TeX program: sometimes it’s easier to read the program than to understand The TeXbook.

A1: Superficial aspects of presentation

Of the book and PDF,

A2: Overall program structure

Chart of TeX’s code components (dark) and memory regions (light+dark) (page 594), also available here (via this in TUGboat – but note that the middle section is scaled by amount of code as Knuth says in the videos, and not by memory usage: scaling by memory is only for the outer sections).

A3: Debugger

A debugger may help. I started writing something (has a really bad interface and is implemented in a stupid way currently).

\expandafter\expandafter\expandafter\meaning\expandafter\uppercase\expandafter{a}
\end

B1: Preparation

Pages from the book:

Additionally, to become familiar with the WEB style of writing programs, you may also want to read:

WEB itself

Pascal

B2: Smaller/other programs

You may want to work your way up to TeX from smaller programs written by Knuth in the same style: I’ve prepared a List of WEB files. Specifically, a possible reading order, from smallest to largest, is:

Program Pages Sections Fraction of TeX
POOLTYPE 7 + 4 20 + 2 ≈ 1.4% to 2%
GLUE 8 + 3 26 + 1 ≈ 1.6% to 2%
DVITYPE 47 + 7 111 + 2 ≈ 8% to 10%
TANGLE 66 + 9 187 + 2 ≈ 13.5% to 14%
WEAVE 98 + 12 263 + 2 ≈ 19% to 20.5%
TEX 478 + 57 1378 + 2 100%

Of these,

I have separate pages for each of these programs on this site:

Totally unrelated to TeX, but you could look at other “literate programs” entirely: Knuth’s CWEB programs, or the (Academy Award winning!) Physically Based Rendering book (see random chapter).

B3: Other versions of TeX

B4: History

Videos

After having read these smaller programs and having gained a bit of familiarity with Pascal, before reading the full TeX program I strongly recommend the series of 12 lectures that Knuth gave in 1982 called The Internal Details of TeX82. They are available on YouTube, but I’ve embedded them on this website, with some comments, here.

Also, as you read the program / watch the videos, you can also try solving these exercises from a course DEK gave about TeX:

Earlier TeX versions

It is unclear how much this would help, but often the earlier versions of a program are less complex, or at least illuminate how the program got into its current state.


Random observations


That's the end of this page. If you have any feedback about anything on this site, you can contact me here. To go back to the top-level page (if you are not on it already), click here.