boxes and glue - TeX’s algorithms re-implemented

A presentation at TUG 2022 in July 2022 in by speedata

Slide 1

Slide 1

Title slide

The project I present is called boxes and glue and is named after TeXs model of joining rectangular areas together with flexible width.

This presentation is aimed at developers. But I hope that even if you are not a developer yourself, there are still some interesting slides for you.

Before I show the Why and the How in detail, I try to give a short overview of boxes and glue.

Slide 2

Slide 2

What is boxes and glue?

So, what is boxes and glue?

boxes and glue is a collection of software libraries therefore it is is not a ready to run program such as TeX.

The library is written in the Go programming language

The attempt to bring TEX’s superb typesetting quality to a modern environment and available under an Open source license

Slide 3

Slide 3

Why boxes and glue?

To explain the reasoning why I work on boxes and glue, I have to go back a bit. I am self employed and earn my money producing product catalogs, data sheets and other documents in an unattended workflow using my own software called the speedata Publisher that I have presented here I think It was two years ago.

The speedata Publisher is based on LuaTeX. LuaTeX is used a typesetting backend and my software arranges the visible elements on the pages depending on the provided data and layout instructions.

There are a few example here on this slide and also on my home page.

Slide 4

Slide 4

How does this work?

A PDF page consists of many little items such as glyphs, rules, images, coloured areas and other rectangle shaped things that can be joined together in a horizontal or a vertical box which themselves can be placed in boxes, until there is one big box that represents the page. All the things and rectangles are called nodes in TeX.

Now with LuaTeX you can jump into the Lua mode which allows you do to anything that could be done from the TeX side, just with a different interface.

So for example to create a glyph output from Lua you have to create a new node of the glyph type and fill in the fields. The font has to be loaded in advance to get the glyph dimensions.

Just as an example how to create a horizontal box from this glyph node, the code is shown here. The real code that has to be used is of course much more complicated.

This looks a bit tedious and compared to TeXs input, it really is. but with enough abstraction it is quite doable. This is the way you assemble pages in LuaTeX

Slide 5

Slide 5

Why LuaTeX in the first place?

Contrary to what many believe, LuaTeX is vey, very fast. On my rather new macBook with the Apple Silicon chip, LuaTeX creates about 500 pages per second.

LuaTeX has all features of TeX, but is better programmable. This of course is highly subjective, but I don’t like TeXs input language very much.

It is very flexible, you can create almost any PDF code you want to comply with a huge range of PDF standards

And LuaTeX comes with harfbuzz, a text shaping library. harfbuzz loads a font and turns a series of characters into a list of glyph ids from the font. This is trivial for a single letter such as A, but it comes in handy when you have complex ligatures, non-western scripts such as Arabic where glyph shape changes depending on the position of the character in a word and it also interprets OpenType features. So harfbuzz is a very valuable library.

Slide 6

Slide 6

LuaTeXs limitation

The talk is not just about praising LuaTeX. LuaTeX has in my opinion a lot of shortcomings

  • error handling. It is really hard to trace errors in LuaTeX, sometime the only error message is “this cannot happen”.
  • It might sound trivial, but you cannot create https connections to a web server. In almost every other catalog project I have to retrieve images from a web based database which connects with https, of course.
  • There is no garbage collection of the nodes, so this is something I need to take extra care of
  • Also very subjective: Lua is only suitable for smaller projects. There is no type checking for variables you use, the tooling for development with Lua is almost non-existent especially compared to other programming languages
  • LuaTeX is hard to extend. it is still based on TeX which is not know for its modularity. You can add third party libraries to LuaTeX, but that makes shipping and compiling more complicated.
  • There are little things that can be annoying. For example (I don’t know if this is still true) LuaTeX on windows is not able to be installed in a directory with a non-ascii character in the path name . Or at one time - also on Windows - luatex was unable to include an image with a non-ascii character in the file name.

These limitations hit my daily work more and more so I am thinking about re-implementing TeX for quite a long time now.

Slide 7

Slide 7

Idea: re-implementing TeX

A good thing about TeX is being open source and well documented. You can download the documented source code or get Knuth’s book called Computers and typesetting B and study TeX’s source code.

There has been a re-implemation of TeX before called NTS, which was written in Java. But as far as I know, that has never taken off to be used in the real world. There was a project called extex which took the NTS code, but this also was abandoned.

Slide 8

Slide 8

Idea: re-implementing TeX

I see three possibilities to re-implement TeX and bring it into a modern environment:

The first is to go from the pascal code in the book and translate it one by one into a modern environment. But this would still be TeX, output DVI and would not load OpenType fonts.

The second way would be to take the LuaTeX code and translate it or use some automated tool (a transpiler) to create code for example for the Go programming language. But there are still a lot of dependencies on C libraries and the code would not become modular automatically.

The last option is the one I take. It is to re implement the system by looking at what needs to be done and not how it is done. The footnote here is that I do limit myself to the application I have and I will just port a subset of the current implementation of LuaTeX.

Slide 9

Slide 9

Compatibility with TeX

Now when I think of re-implementing TeX, I have to take into account the compatibility with TeX. Take for example this function from the TeX source code to calculate the badness when for example a line of text in a paragraph has to be stretched to a given horizontal size.

It should be obvious what it does, right? Well, it wasn’t to me and I can show you the formula that it implements.

(formula)

In a modern programming language this could be written like the one line code.

So the question that I have to answer is how compatible I’d like to be with the original Knuth TeX. The code to the left computes only 1095 different values but is probably very fast but also entirely incomprehensible. Since my goal is to be only similar to TeX but not exactly alike, I will take the one-line calculation of the badness which will not give exactly the same results.

Slide 10

Slide 10

So when I write everything from scratch, I have to take a close look which parts of TeX, or which algorithms in TeX I re-implement

From my point of view the most important algorithms in TeX are

  • Hyphenation
  • the famous breaking paragraphs into lines,
  • but also math typesetting
  • the calculation with glues and
  • the input language which based on macro expansion.

These algorithms are very well documented, very clever but not rocket science so it is possible to re-implement those.

Slide 11

Slide 11

Algorithm: hyphenation

I’d like to show the cleverness of one of these algorithms, the hyphenation algorithm to split word into smaller parts. The starting point is a list of hyphenation patterns.

I am not going into detail how this list is created, but it is a mixture of automatic extraction from dictionaries and lots of manual labour.

On the right hand side there is a small excerpt of the german hyphenation pattern file which has about 25000 entries. Each entry consists of letters and numbers and dots wich match the start or end of a word.

Slide 12

Slide 12

Algorithm: hyphenation

Now we take a word to hyphenate and insert spaces. You will see why on the next slides.

Slide 13

Slide 13

Algorithm: hyphenation

We go through each entry in the pattern list and try to match each pattern to the word.

The second pattern in the list matches, because you can write it like this (next slide) below the word where each letter is in the correct position.

Slide 14

Slide 14

Algorithm: hyphenation

Here you can see how the algorithm decides how the pattern matches the input word.

When there is no number between two letters, like the pattern marked here, a zero is assumed to be there.

Slide 15

Slide 15

Algorithm: hyphenation

( A few slides skipped)

We need to go through the whole list and write down all patterns which match until there is none left.

Slide 16

Slide 16

Algorithm: hyphenation

Now what is the next step? We are only interested in the numbers between the letters.

Slide 17

Slide 17

Algorithm: hyphenation

The idea is to get the highest number in each column. Between a and u the maximum is 2, between u and t the maximum is 1 and so on.

Slide 18

Slide 18

Algorithm: hyphenation

Now an odd number means a valid hyphenation point where as even numbers mark unbreakable parts of the word

So the algorithm is rather simple and can only return a signal where hyphenation is allowed or is not allowed but has no notation of a priority or a preference. I will come to this back at the end of the presentation.

Slide 19

Slide 19

I have some design goals for boxes and glue

The first one is obvious. I want similar output quality as TeX

Boxes and glue should be as fast as LuaTeX

TeXs internal data structures which I’ve mentioned before are a very good representation of typesetting items

With Arabic I mean non-western scripts in general, encoded with unicode and containing perhaps mixed left to right and right to left text

PDF standards such as accessibility should be supported

Slide 20

Slide 20

Non design goals

I also have some non-design goals.

Considering that I want to use the software as a replacement for my catalog software that I currently use,

I don’t need TeX compatibility , only the output should be similar

All 8 bit related stuff and DVI is not considered to be part of boxes and glue. I only need OpenType fonts and PDF

The input language is not needed for my purposes.

Slide 21

Slide 21

Architecture of boxes and glue

So how is the library organised? Actually it is split into two parts, a backend with some core functionality which is absolutely necessary to create PDF documents using TeX algorithms and a nice to have frontend with optional stuff like CSS and HTML parsing, and other nice-to-have code.

And of course, since this is a library, one needs an application to use boxes and glue. I currently experiment a lot like HTML typesetting, because I need this quite often, but my real test case is a new version of the speedata Publisher which runs on boxes and glue, since this is the goal of the whole project.

Slide 22

Slide 22

Comparing output of LuaTeX and boxes and glue

As an example I tried to break a paragraph into lines with LuaTeX and with boxes and glue. I’ve disabled micro type because this is not implemented yet on my side and the comparison would not be fair.

The paragraphs look similar, (… next slide)

Slide 23

Slide 23

Comparing output of LuaTeX and boxes and glue

… except for the bottom part. Now I have to find out what causes the difference between the two. But then again I might just say “I don’t care, because I don’t need perfect compatibility with TeX, just a very close visual similarity”.

Slide 24

Slide 24

Todo...

There is still a lot of things to do.

  • I think boxes and glue needs proper input language or an application besides my catalog software, but this is for somebody else to do.

  • A good output routine is needed for easy to use page layout (such as headers, footers and footnotes)

  • Math typesetting is not on my agenda, but probably very important, too

  • Fill in the gaps in the manual. It is started and there is sample code, but it is far away from being good and complete

I could go on of course.

Slide 25

Slide 25

... done

There are a lot of things that I have already implemented such as

  • Fonts and alike and I have include the aforementioned harfbuzz library into boxes and glue. Someone has ported this to Go

  • PDF output

  • TeX algorithms (except for math and the input language)

  • Image inclusion (for PDF, JPEG and PNG)

So I am quite happy about the state of boxes and glue

Slide 26

Slide 26

Next steps / outlook

The next step could be to experiment with the algorithms. For example the hyphenation algorithm that I have shown has no priority of break points, so that would be something to play with. Would the calculated maximum number during the algorithm be a good hint for hyphenation priority?

I have made boxes and glue to be as modular as possible to allow such experiments.

The page break algorithm could be optimised in such a way to take a look across a range of pages and find optimum breaks.

Or the line braking algorithm can be made to render text with different parameters in parallel and take the best result, whatever that is.

Speaking of parallel execution. Modern CPUs have lots of cores which are bored when running TeX, so to speed things up we should use them. Paragraphs can be broken into lines in advance when the parameters are known. A lot of application have very predictive settings for rendering, so there should be a huge potential speed gain.

Slide 27

Slide 27

TeX is dead, long live TeX

I would like to close with my conclusion:

TeX is dead, long live TeX.

Of course LaTeX will be still around in the next twenty years, but for my catalog software it becomes more and more realistic to replace LuaTeX with boxes and glue. So in the long run I will still use TeXs algorithm but without TeX.