This webpage describes my (quixotic) project to “understand” (and maybe explain) the TeX program.
There are a lot of plans, but very little has been done so far. If you’d like to help with (or take over!) any of this, let me know!
The source code of the TeX program, written in a “literate programming” style, has been published as a book:
The main contents of the book (the program itself), other than some introduction and appendices, is also available as a PDF:
Run texdoc tex if you have a TeX distribution installed, or else find it online here (3.4 MiB).
Apparently a lot of people have successfully read this.
When I tried to read this book (in 2018 or so), I found it not very easy going. Despite the obvious (and obviously correct) reasons for this difficulty—namely, that I am not very good at reading code, and that reading a program is never expected to be very easy anyway—I cannot help thinking that the task could be made easier if the program were presented differently.
I can think of three aspects to this:
The program is written in Donald Knuth’s own literate-programming system called WEB (created for writing TeX in), which can be seen as a preprocessing layer on top of a certain dialect of Pascal. The code mostly makes sense to programmers familiar with a C-like language, but there are several subtle differences. Moreover, the typographical conventions of the typeset output—the use of proportional fonts instead of monospaced fonts, the (today) uncommon style of indentation, putting multiple statements on a line like “paragraphs”, syntax-highlighting using bold/italic fonts rather than colour, etc.—are not what most of us are used to. There are also the special features/innovations specific to literate programming: the printed book has a mini-index on each two-page spread (TODO: add example image below), and in the PDF the references to other sections are hyperlinked. But neither has back-references (where else does this symbol appear?), nor any way of distinguishing variables from functions from WEB macros. Also, the context of a LP module (section) would be useful to have: expand/collapse a section in place, see a quick list of what names (symbols) it defines/uses/indexes, etc.
And so on. So fixing all this would require parsing the code and writing a code formatter and code browser.
A more advanced version of this would be to also translate the code from WEB/Pascal to some other more familiar programming language, one in which functions can return values without assigning to a pseudo-variable, for example. Also, the code browser could highlight the non-trivial control-flow in cases of gotos.
Apart from presenting code nicely or differently, which helps in the small, we could try to make it easier to find answers to the questions that one normally asks about a program as a whole. What are the (major?) components of the program, and how do they interact? What is its architecture?
The TeX program is, partly because of limitations of the language being targeted at the time, written (at the Pascal level) as a single monolithic program (in a single file), of roughly the following form:
```pascal
program TEX;
label 1, 9998, 9999;
const ...;
type ...;
var ...;
procedure p1; ... end;
...
procedure pN; ... end;
begin ... end.
```
That is, a single file which contains a bunch of (global) declarations of labels, constants, types and variables, then a bunch of procedures and functions, then finally a “main” block.
There are no “modules” in the sense of separate compilation units. There are hundreds (about 300?) of global variables. Such a program is usually hard to understand, though the Literate Programming paradigm helps a great deal—there are “modules” (chapters and sections) at the WEB level. So what would be good would be to highlight these different major components—hyphenation, macro expansion, line-breaking, math lists, etc.—and show the precise interactions and interfaces between them. If someone wants to understand just one part of TeX in isolation (with the minimal amount required from other parts), they should be able to do so. The hundreds of global variables are actually mostly used in about one chapter each (apart from some common places like initialization code), so wouldn’t it be cool to automatically associate them with their corresponding chapters? Etc.
“An algorithm must be seen to be believed”, wrote DEK near the start of TAOCP. There are limits to how much we can understand a program purely from a static representation of its code, no matter how well-formatted, hyperlinked, contextualized, etc. It would be useful to see what TeX does as it runs, on a few easy and not-so-easy inputs. E.g., what is TeX’s input processor doing, what is its macro expansion doing, its paragraph builder and page builder, etc. TeX does have debug output for much of this (with \tracingall, for example), but all of it goes to the same terminal in cryptic form, from which it must be decoded.
Ideally, debug output would be in a clear/verbose format that highlights the workings of each part of TeX individually, and in expandable detail. This could be achieved by parsing log output or by running TeX in a regular debugger like gdb, but best would be a custom debugger. It could have separate areas for TeX’s “eyes”, “mouth”, “stomach”—or, in more detail, for say TeX’s input stack, macro expansion, token lists, hyphenation, etc.—with clear highlighting of control passing between them.
[Diagram: the main loop at the top, with action procedures on one side and get_next/get_token leading down to the input processor (the “mouth”).]
Overall, these three (A1, A2, A3) amount to writing a code formatter, code browser, sophisticated static analysis, a custom runtime/interpreter/debugger… all for the sake of a single program, which is clearly insane. Hence, the “quixotic” at the top.
While these (A1, A2, A3) above all involve changing how the program is presented, there are other things we can do in parallel either as complement or alternative:
Just get some familiarity and learn about the WEB language, both the semantics of Pascal/WEB, and the typographical conventions of WEAVE (the typeset output) — the cross-references, etc. “How to read a WEB program”, basically.
Instead of attacking the TeX program directly, work up to it, by reading other programs by Knuth or other examples of WEB, or even Knuth’s more recent (CWEB) programs — just to get practice with the style and programming philosophy.
Read other TeX implementations or programs — you don’t have to read Knuth’s WEB if that’s not the easiest to read.
Look at the TeX program historically. Start with the earliest available version or design document (
TEX.ONE), match with SAILDART, read TeX78, follow along with errorlog, etc. What were the constraints at the time?
Some general remarks about software engineering, about how Knuth is different, why we should expect this to be hard, etc. Probably doesn’t help, but is fun to do.
Just read the program as it is, understand, take notes from understanding. If I had any sense this is what I’d do, but I’m saving it for the last. Preserving my precious state of ignorance/confusion, because once (if) I get too comfortable I’ll no longer have an idea what the difficulties were.
These then — A1, A2, A3, and B1, B2, B3, B4, B5, B6 — are the 9 kinds of activities that this project involves. I’ll (soon) add more detail on each of these below: what has been done (nearly nothing), and what could be done.
I gave a talk about some of this at TUG 2019. At the time I proposed the talk, I was wildly optimistic about how much would be done by the time of the talk. Turns out it was nothing, so (looking back at it now…) the talk looks like nonsense with nothing of value whatsoever, so I feel guilty and embarrassed about it. On the other hand, the talk was (fortunately for me) not recorded, so maybe I can console myself that it was just a few minutes of lighthearted fun for the participants who were physically present. Here’s a link to the slides.
To simply use TeX, you don’t need to know anything about how the TeX program itself is implemented.
See preface from book. Software engineering. E.g. how to do memory allocation, packed tries, ….
Another reason to read the TeX program: sometimes it’s easier to read the program than to understand The TeXbook.
Some highlights from the book and PDF:
Chart of TeX’s code components (dark) and memory regions (light+dark) (page 594), also available here (via this in TUGboat – but note that the middle section is scaled by amount of code as Knuth says in the videos, and not by memory usage: scaling by memory is only for the outer sections).
Parameters like \baselineskip are merely “entrances” (that’s OP’s terminology) for internally stored parameters. Shows the run loop (with its switch) in the prefixed_command procedure that’s called for both of them, and the primitive procedures that set eqtb for the primitives.
A debugger may help. I started writing something (has a really bad interface and is implemented in a stupid way currently).
Pages from the book:
Preface (page v)
How to Read a WEB (pages viii-xiii – temporarily taken down till I can double-check permission issues)
Additionally, to become familiar with the WEB style of writing programs, you may also want to read:
You can read Jensen and Wirth (“Pascal user manual and report”), the original tutorial and definition of Pascal.
An interesting document is Kernighan (the K in K&R) on “Why Pascal is Not My Favorite Programming Language”. Kernighan faced many difficulties with Pascal, and WEB in many ways is a solution to those same difficulties (I wrote more on that at the beginning and end of the annotated version of POOLTYPE).
You can try writing a few of your own small programs in Pascal (with and without WEB), as with my 7-page “Hello, world!” program here.
You can try reading some Pascal programs written by others, so that you can get over at least the idiosyncrasies that are common to all Pascal programs. My suggestions are:
Note that these two programs are both compilers. TeX itself is written like a compiler, so you’ll find many similarities.
You may want to work your way up to TeX from smaller programs written by Knuth in the same style: I’ve prepared a List of WEB files. Specifically, a possible reading order, from smallest to largest, is:
| Program | Pages | Sections | Fraction of TeX |
| --- | --- | --- | --- |
| POOLTYPE | 7 + 4 | 20 + 2 | ≈ 1.4% to 2% |
| GLUE | 8 + 3 | 26 + 1 | ≈ 1.6% to 2% |
| DVITYPE | 47 + 7 | 111 + 2 | ≈ 8% to 10% |
| TANGLE | 66 + 9 | 187 + 2 | ≈ 13.5% to 14% |
| WEAVE | 98 + 12 | 263 + 2 | ≈ 19% to 20.5% |
| TEX | 478 + 57 | 1378 + 2 | 100% |
POOLTYPE and DVITYPE share a lot of their code with TeX (specifically, TeX’s string handling and DVI output sections respectively),
GLUE shows alternatives to some code in TeX (and is probably best read after understanding the corresponding parts of TeX, even though it was published first),
TANGLE and WEAVE are the implementation of WEB, and perhaps worth reading as programs of smaller/intermediate size.
I have separate pages for each of these programs on this site:
Totally unrelated to TeX, but you could look at other “literate programs” entirely: Knuth’s CWEB programs, or the (Academy Award winning!) Physically Based Rendering book (see random chapter).
Instead of reading the Pascal source as written by Knuth, there are many other versions you could read. I’ve collected a bunch here.
Many (not all!) are only of historical interest.
The most relevant may be LuaTeX, which is under active development, and started with a manual translation of the Pascal code to C. For (a randomly picked?) example, compare section 426 of the TeX program with this part of LuaTeX — it is in more familiar C style, but there are also more cases. (Compare with pdfTeX and XeTeX.)
After having read these smaller programs and having gained a bit of familiarity with Pascal, before reading the full TeX program I strongly recommend the series of 12 lectures that Knuth gave in 1982 called The Internal Details of TeX82. They are available on YouTube, but I’ve embedded them on this website, with some comments, here.
Also, as you read the program or watch the videos, you can try solving these exercises from a course DEK gave about TeX:
It is unclear how much this would help, but often the earlier versions of a program are less complex, or at least illuminate how the program got into its current state.