HAL Laboratory compression

A bit of history.
The history of my acquaintance with the HAL’s compression algorithm begins in 2004. That time I have just started entering romhaking scene and carefully studied all the documents on hacking console games that fell into my hands. The CaH4e3’s document, of course, has become a milestone not only on my way of learning. Completely different principle (at that time) of pointer searching through debugging opened up wide opportunities for the translation of even most complicated games. Pointers system for NES game Kirby’s Adventure was also described in this document. It is casually mentioned that the tile maps are compressed and at the very beginning of the Kirby’s Adventure paragraph the code addresses relating to the unpacking procedure are given.
In 2009, having already mastered IDA for disassembling necessary code, I have also found this procedure without any problems for the training purposes. You can check the flowchart of this procedure with all subroutines here (warning, high resolution).
In the same 2004, CaH4e3 reverses the algorithm of data compression, describes it briefly in the file ‘kirby-encoding-descr.txt’ and puts it on one of the Russian-language emulation forums. You can read it (with some of my notes) here.
And in the same year, Parasyte also describes this algorithm, and codes a working unpacker in C. Any traces of this have already been lost, but I have kept everything for the record here.
Both of these guys are absolutely right: the game uses a hybrid algorithm to compress tile maps, which, using the 3 most significant bits of the control byte, can unpack the data in six different ways (in short, these are variations of LZ and RLE).
In 2007, I have completed the game several times on the emulator for PSP and was surprised by the advancement of this game. I’ve started to think about translating this game. This, of course, required modification of compressed tile maps.
Back in 2003 (before Cah4e3 and Parasyte activities) the game was translated into Swedish here. I have contacted the hacker of this translation via email and, to my regret, found out that the translation was done by hands, without repacking tile maps, and the similarity of Swedish and English languages helped a lot these guys.
I had also an idea of editing compressed maps manually. Below is screenshot, where I have added two letters to the word inhabitant, having recompressed the stream manually.

But there are several dozens of such maps in the game and changing each of them - a task beyond my mortal powers.
In parallel, in 2007, new details about the location of the game resources are being found here and here. At that time I was writing simple RLE packers and could not understand even the general idea of how an effective packer might look like for such a scheme. I also had the experience of pushing already unpacked data to the game after ROM’s expansion. I translated this way Bee-52 by Codemasters, where graphics was compressed with very hard to understand statistics coder. But in the case of Kirby’s Adventure, the game already contained the maximum number of banks for its MMC3 mapper. I could not and did not want to transfer the game into another mapper, and I would have to rewrite code for the unpacking of the extracted data in so many places. This option was inefficient.
I once even wrote an email to the HAL laboratory (they are still alive), but of course got no answer.
Comparing candidates at each step.
In 2005, appears the first compression tool for this algorithm. I have tried to use the compiled library in every possible way for sure, even reversing its with IDA, but did not achieve anything. The code just didn’t work. Having written to the author - Sukasa, I found out that this library was written in VB, absolutely without testing and re-checking the compression-decompression. The author was so nice that he has sent me the source code, but I did not dig into the someone else’s code in a new language for me - there was no guarantee that it would work at all, if it did not work now. Do not ask me how Sukasa decided to release untested compiled dll at romhacking.net.
Nevertheless, from the source code of this packer, a simple and sensible idea has appeared: simply take a position in the unpacked stream and calculate all six methods of compression for it, choose the one that gives the largest size gain in the compressed stream and save it. Then repeat the same for the position we will get to if we use the previous method. Of course, it’s rather slow, but effective. Later I learned that this is the standard strategy for hybrid packers.
In 2009, CaH4e3 dumps ‘Wario land’ - the Chinese hack of Kirby’s Adventure with a redecorated title screen and a truly repacked tile map of this title.
Apparently, either the Chinese guys wrote a packer (with the same or better efficiency than original), or put at the hex editor a poor Chinese peasant and forced him to manually pack 0x400 bytes of the title screen for a minimum payment. In any case, the likelihood that I will have a working compressor in my hands would not increase, because there was no one to communicate with in China.
Having gained some experience in programming in Delphi and armed with a simple scheme, described above, I have come at creating a tool for compression and decompression. I have written unpacker very carefully, breaking it into a lot of carefully tested procedures and the packer on this basis also came out almost with no debugging. So I have happily escaped the main danger - to drown in a pile of code. The title screen was compressed by my tool in the same size as the original one. After finally inserting it back, I began to test the compressor on other tile maps. But sadly, my packer compressed some intro screens 3-5 bytes worse than the original one. For my purposes of translation, this result was not acceptable. After all, I needed to edit only a few tile maps and I did not need to compress all the maps in a row in the hope that somewhere it would compress better and compensate for the previous increased size. And I began to investigate why the original packer compressed better than mine.
Looking ahead.
Consider the initial stream of characters: ABCDEFG … XABCZZ … | XABCDEFG
The vertical bar indicates the position with which you continue to compress. If only the LZ method is taken into account, LZ is most advantageous at this position, referring to the previous occurrence of XABC (4 symbols). But if you now throw one uncompressed letter X, you can immediately take LZ to the first occurrence of “ABCDEFG” (7 characters). Given the game serialization scheme (see any description of the compression format), the first option will take 3 bytes to the first LZ “XABC” in the compressed stream plus 3 bytes to the second LZ “DEFG” (6 bytes). And in the second case, it will occupy 2 bytes for the raw “Z” character, then 3 bytes for LZ “ABCDEFG” (only 5 bytes). As a result, the second option will result in compressed file size reduction by 1 byte. This situation is quite standard for LZ, but in the case of the HAL algorithm, sometimes there is also a gain when you select [some_method_now] or [raw_byte_and_RLE_method_after].
Several of such situations occurred in the tile map for intro, and resulted in my packer compressing several bytes worse than the original. Well, you can simply take and look 1 byte ahead, comparing already the maximum gains from all six methods at the current position and the next. But what if the original packer looked 2 bytes ahead, or 3? What if all the rest of bytes till the incoming block? I had no other options but to implement a look ahead one byte and check the result.
That time Marat from chief-net helped me a lot: he has modified my Delphi code to look 1 byte ahead, and then, I figured out and rewrote the look ahead code in a form understandable for me. The result was acceptable - tile maps were compressed either into the same size, or even a couple of bytes less. It turns out that the original packer also looked no more than 1 byte.
Later, I found out that looking at 1 byte forward is commonly used practice in string parsing when compressing LZ.
This situation is studied in “The Effect of Non-Greedy Parsing in Ziv-Lempel Compression Methods.” R. Nigel Horspool Dept. of Computer Science, University of Victoria. And, for example, in the “Deflate data format specification” (item 4), this method is called “lazy matching”. Looking forward more than 1 byte is perfectly considered in this note by cbloom. Here it is called a chain scheme. And looking till the end of the incoming block is generally called spatially optimal parsing (you can read about it in serious sources: “On the bit-complexity of Lempel-Ziv compression” by Paolo Ferragina, Igor Nitto, and Rossano Venturini and “Dictionary-symbolwise flexible parsing” by Maxime Crochemore, Laura Giambrunoe, Alessio Langiu, Filippo Mignosi, Antonio Restivo.) and I will say a few words about it later.
Anyhow, for the purpose of translation, such a decision was already enough - you can write batch unpackers / packers and translate the game. By the way, in the beginning of 2018 there was no definitive translation into Russian of this remarkable game. For example, the Shedevr translation here or runesification here (both 2003 - 2004) are incomplete because the hackers did not master the compression of the tile maps. My translation was released in October 2018.
Parsing heuristics.
In 2014, appears exhal - the tool from Revenant, which compresses better than the original packer. Its code is also used in the Kale - Kirby’s Adventure level editor. Inhal also often compressed better than my Delphi tool, even with a look at 1 byte ahead implemented. Despite that the algorithm was similar to mine.
In 2018 I wrote an analogue of inhal in Haskell - kirbyLzRle. Basically, I practiced to write Haskell code, but it is worth to mention, that the code turned out to be relatively small and un-tangled (~300 lines of code for everything, including command line parsing). It is well commented out and, as a consequence, easy to read.
Writing kirbyLzRle, I have also thought a lot about the various subtleties when making the best compression decision at each step. There is vast space for improving the heuristics: for example, the size of the raw byte insertion into the compressed stream will depend on whether there were any raw bytes in the compressed stream before it. And if not, you will need an additional byte for command-length and this must be taken into account during compression. Or what to choose if several methods with the same efficiency are found at the current position? Or what to choose if the best method at the current position is equal in efficiency to the best method at the next position? First, I introduced the concept of method gain - this is the number of bytes that we will gain in compressed stream compared to an uncompressed source. For example, if I found a 10-byte LZ chain, then choosing this method, the method gain will be 7 bytes (10 - 3 as LZ command is serialized by three bytes).
Secondly, I have concluded few simple rules that make it possible to achieve better compression results, than inhal (see benchmark table below):
- If no compression methods are found at the current position, you can surely dump the raw byte into a compressed stream. Of course, it makes no sense to compare raw with the best method at the next position.
- If the best method length exceeds 0x400 bytes (10 bits is the maximum length of the method by the format of the game), then you need to select a method with a length of 0x400 bytes and analyze from this position, even though, perhaps, the method can have a longer length. We simply cannot serialize such method.
- If the gain of the method at the current position > gain of raw byte AND gain of the method at the current position >= gain of the method at the next position (this gain will take into raw byte dump made recently), then we choose the method at the current position. In any other cases, dump raw byte.
  I’ll explain the first inequality by example: suppose, in the very beginning of the stream compression we found a method with zero gain (for example LZ, referring to a chain of 3 bytes). Dumping raw byte gain in the beginning will be -1. 0 > -1 and it’s really better for us to choose LZ, which refers to 3 bytes. But if we are not at the beginning of compression, and before that we have already dumped several raw bytes into the compressed stream, then adding one byte is now exactly the same by gain, like LZ, which refers to three bytes. But for the remaining positions to the end of these three bytes, we have the opportunity to find more efficient chains. Even if we do not find anything better, we’ll just come to the same position with the same efficiency. That is, if gains are equal, it is better for us to choose raw byte dump.
  Now the second inequality: for example, we are again at the very beginning of compression and found a method with zero gain (LZ referring to a chain of 3 bytes), and at the next position the best method is LZ, referring to a chain of 4 bytes (gain = -1 + 1 is also zero efficiency). In the worst case, even if there is nothing after the current LZ at the current position, we just come to the same place with a raw dump in the end. So with equal gains, you need to take the method found at the current position. But as soon as the gain of the next position becomes at least 1 byte more than current, it is necessary to choose the method at the next position.
- If several methods found at the current position have equal gains and this gain is maximum, you need to choose the method that is decompressed into a longer chain (greedy parsing). Generally speaking, this is not always true: for example, at the current position the best methods are: RLE for 2 bytes and LZ referring to 3 bytes. RLE for 2 bytes can lead to a position where the best RLE method is for 3 bytes. And LZ referencing 3 bytes can lead to the next LZ by 3 bytes. That is, instead of Rle 2 - Rle 3 we did LZ 3 - LZ 3 and lost one byte. But in general, on the data of Kirby’s Adventure, greedy parsing often gives us better compression.
- Quite a trifle, but it is desirable that with equal output sizes, methods with faster decompression are chosen. That is, in order of increasing the time for unpacking: Rle8, RleAsc, Rle16, Lz, LzRev, LzInv.
In my code, all these rules are implemented in several lines:
```
getEntries pos raws
| pos >= length haystack = [Raw raws]
| entryGain curPosEntry > entryGain currentRawEntry &&
entryGain curPosEntry >=
entryGain currentRawEntry + entryGain skipPosEntry =
 Raw raws :
 curPosEntry : --emit accumulated raws, current entry and nullify rawbuffer
 getEntries (pos + entryDecodeSize curPosEntry) []
| otherwise =
 if length currentRaws <= maxLen -- no options better, than raw
   then getEntries (pos + 1) currentRaws --dump current raw to output
   else currentRawEntry : getEntries (pos + 1) [] --exceeded max len, break raws chunk
...

candidates = concat [findLzInv, findLzRev, findLz, findRle16, findRleAsc, findRle8]
```
And, finally, the benchmarking of four utilities in the compression ratio for the test binary files (all ripped from the Kirby’s Adventure ROM)
HAL - the output of the original packer, copied from the game’s ROM.
Kirby_Util - my tool, written in Delphi in 2009.
inhal - is Revenant’s tool v.1.21b, launched without a -fast flag for the maximum compression ratio.
kibyLzRle - is my tool in Haskell, written in 2018. v0.1
```
+-------------+---------------------+-----+-------------------+-------+-------------+
|  File name  | Original plain size | HAL | Kirby_Util Delphi | inhal | kirbyLzRle  |
+-------------+---------------------+-----+-------------------+-------+-------------+
| tips        |                 222 | 116 |               116 |   116 |         115 |
| intro       |                2048 | 799 |               806 |   799 |         790 |
| prologue2   |                2048 | 514 |               513 |   513 |         512 |
| prologue    |                3072 | 724 |               716 |   712 |         711 |
| 25_tutorial |                 125 |  37 |                37 |    37 |          37 |
| 34_tutorial |                 125 |  99 |                99 |    99 |          99 |
| title       |                1024 | 568 |               567 |   568 |         566 |
+-------------+---------------------+-----+-------------------+-------+-------------+
```
A few words about the spatially optimal algorithm.
Previously in item 3, I’ve mentioned the optimal compression of the plain stream. In a strict mathematical sense, this algorithm should produce a compressed stream of such size, which (for a given compression format) cannot be made smaller even by one byte. That is, you can get away from the idea of heuristics at each step with the choice of one method and simply choose a chain with minimal serialized form size from all possible chains of methods. To do this, you can represent each stream position as the vertex of the graph and find the minimum path along this graph from the root to the final vertex.
The standard implementations of the algorithms with weights (A* or Dijkstra) are not suitable - the weight of the next raw byte dump depends on the previous path to this node (as I explained in item 4).
I did not go into the intricacies of space optimal parsing, especially as there were positive examples of deflate and exhal working with sufficient efficiency without it.
I only went so far as to take all possible paths (previously finding all possible methods for each position), calculate their total weights, and choose the path with the minimum weight. In fact, implemented deep first search. Of course, I ran into the problem of execution time and (mostly) memory. Even at 40 bytes of input, in some cases the graph grows to several gigabytes - in practice this would be unacceptable. I have compressed several of my binary files, specially designed to check the sanity of various compression methods (no more than 50 bytes in size), and made sure that kirbyLzRle still compresses them in a strictly optimal way.