Tämä asiakirja on saatavilla myös suomeksi. Dieses Dokument ist auch auf Deutsch erhältlich.

Marko Mäkelä’s software projects: Solving Pentomino Puzzles

I like combinatorial problems, but I tend to solve them by brute force. Around the year 2000, I enumerated the solutions of the 6×10 Pentomino puzzle in a couple of CPU weeks on some 266 MHz Pentium II workstations.

My solution in C is a brute-force depth-first traversal method. In 2013, I made an obvious improvement to prune those search trees where there exists a gap whose size is not a multiple of 5. The pruning can pe achieved by flood-filling each gap in a copy of the current board and determining how big each gap was. On an Intel Core i5-2520M system at 2.5 GHz, an optimized build of the C program will cover the whole search space in about 4 minutes.

A C++ version of my solution makes use of the STL. On the platform that I tried, the optimized C++ build was slower than the C program due to the dynamic memory allocation overhead in the queue management of the flood-fill routine. The C version would use an array allocated from the stack, while the C++ version would use std::list.

The performance difference went away in January 2014 when I noticed that the work queue can more appropriately be represented as a set. The original implementation of the flood-fill algorithm could dequeue the same work item multiple times. In this case, the set can be efficiently represented as a bitmap, more precisely in a 64-bit integer, just like the 6×10 Pentomino board. The performance improved by 60% when I exploited the fact that the already filled space can easily be subtracted from the bitmap. In this way, we will never dequeue an item that has already been processed. The performance improved even further (40 seconds to return all 2,339 solutions) when I eliminated symmetric configurations for the first piece. There were 56 configurations for the I-shaped piece before and 14 after the optimization.

In December 2016, I refined the C++ implementation to make use of the C++14 constexpr keyword, so that the bit patterns for the pieces (up to 8 distinct orientations per piece) can be converted at compilation time. I experimented with constructing all 2,014 distinct placements of the pieces at compilation time. This turned out to be a bad idea not only in terms of Kolmogorov complexity (the placements can be constructed in less than 2,014×64 bits of code) but also because clang hit a resource limit while initializing the constexpr data.

In August 2019, I replaced procedural recursion with a loop and an explicit search stack. With a statically allocated array for the stack, it is slightly faster than the original procedural recursion (-DRECURSION). As expected, using dynamic allocation for the stack (-DHEAP_STACK, only implemented in C++) is slightly slower. Note: the option -DRECURSION will traverse the search space in a different order than the explicit search stack.

Earlier, I experimented with expanding the 12-deep recursion at compilation time, using templates, but that resulted in slightly slower execution time.

When compiled with clang 8.0.1 using -O3 -DNDEBUG -mtune=native -march=native, the C++ program finished 8% faster than the C program. With GCC 9.2.1 and the same switches, the C program was 3% faster than as C++ program. The biggest outlier is the clang-compiled C program, which is clearly slower than the others.

Execution times (Intel® Xeon® E5-2630 v4, 2.2 GHz)
variant GCC 9.2.1 clang 8.0.1
C -DRECURSION 27.7 s31.2 s
C++ -DRECURSION 28.7 s27.6 s
C 26.9 s29.4 s
C++ 27.6 s27.1 s
C++ -DHEAP_STACK 27.7 s28.1 s

Solving 3-dimensional puzzles

In early 2014, my wife picked up a 4×4×4 puzzle at a flea market. The cube is filled with one piece of volume 4 and twelve pieces of volume 5, which yields a total volume of 4×4×4=4+12×5=64. Unlike in the Pentomino puzzle, there are many more possible pieces of this size, and there seem to be 20,434 solutions. With minor modifications, the same solution algorithm is applicable:

  1. For simplicity, the pieces are represented as character strings.
  2. Each string is converted to the representation of the puzzle board (or cube).
  3. Each piece will be rotated and shifted to generate all possible positions on the puzzle board.
  4. Start the depth-first search with the 4-unit piece.
  5. If some free space is not divisible by 5, reject the solution, since it would not be possible to place all 5-unit pieces.

In December 2016, when I refactored the 3-dimensional puzzle solver in C++14, I noticed that there had been a bug in my symmetry reducer. When a piece is rotated along each axis, there are 4×4×4 or 64 potential configurations. For the small piece in my puzzle, 12 of these configurations are distinct. But, many of these 12 configurations are actually symmetric with each other, if we place the piece to a specific corner of the box and then rotate the entire box in all 64 ways, without shifting the piece relative to the box. It turns out that there are only 15 distinct configurations of the small piece, instead of the 261 that my original buggy symmetry reduction produced.

GCC 9.2.1 generates a much faster solver for AMD64 than clang 8.0.1. In August 2019, the GCC-compiled solver completed the search in 14,063 seconds (3 hours, 54 minutes and 23 seconds), while the clang-solver needed 15,599 seconds (4 hours, 19 minutes and 59 seconds).


Pentomino solver in C
Pentomino solver in C++
Image of all 2,339 solutions of the 6×10 Pentomino puzzle
A small utility for composing the image of all solutions.
Solver of a 3-dimensional 4×4×4 puzzle in C++
The 20,434 solutions of the 4×4×4 puzzle