Tämä asiakirja on saatavilla myös suomeksi. Dieses Dokument ist auch auf Deutsch erhältlich.
I like combinatorial problems, but I tend to solve them by brute force. Around the year 2000, I enumerated the solutions of the 6×10 Pentomino puzzle in a couple of CPU weeks on some 266 MHz Pentium II workstations.
My solution in C is a brute-force depth-first traversal method. In 2013, I made an obvious improvement to prune those search trees where there exists a gap whose size is not a multiple of 5. The pruning can pe achieved by flood-filling each gap in a copy of the current board and determining how big each gap was. On an Intel Core i5-2520M system at 2.5 GHz, an optimized build of the C program will cover the whole search space in about 4 minutes.
A C++ version
of my solution makes use of the STL.
On the platform that I tried, the optimized C++ build was slower than
the C program due to the dynamic memory allocation overhead in the
queue management of the flood-fill routine. The C version would use
an array allocated from the stack, while the C++ version would use
The performance difference went away in January 2014 when I noticed that the work queue can more appropriately be represented as a set. The original implementation of the flood-fill algorithm could dequeue the same work item multiple times. In this case, the set can be efficiently represented as a bitmap, more precisely in a 64-bit integer, just like the 6×10 Pentomino board. The performance improved by 60% when I exploited the fact that the already filled space can easily be subtracted from the bitmap. In this way, we will never dequeue an item that has already been processed. The performance improved even further (40 seconds to return all 2,339 solutions) when I eliminated symmetric configurations for the first piece. There were 56 configurations for the I-shaped piece before and 14 after the optimization.
In December 2016, I refined the C++ implementation to make use of
constexpr keyword, so that the bit patterns for
the pieces (up to 8 distinct orientations per piece) can be converted
at compilation time. I experimented with constructing all 2,014
distinct placements of the pieces at compilation time. This turned out
to be a bad idea not only in terms of Kolmogorov
complexity (the placements can be constructed in less than
2,014×64 bits of code) but also because clang hit a resource limit while
In August 2019, I replaced procedural recursion with a loop and an
explicit search stack. With a statically allocated array for the
stack, it is slightly faster than the original procedural recursion
-DRECURSION). As expected, using dynamic allocation for
the stack (
-DHEAP_STACK, only implemented in C++) is
slightly slower. Note: the option
traverse the search space in a different order than the explicit
Earlier, I experimented with expanding the 12-deep recursion at compilation time, using templates, but that resulted in slightly slower execution time.
When compiled with clang 8.0.1
-O3 -DNDEBUG -mtune=native -march=native,
the C++ program finished 8% faster than the C program. With GCC 9.2.1 and the same switches, the C
program was 3% faster than as C++ program. The biggest outlier is the
clang-compiled C program, which is clearly slower than the others.
|variant||GCC 9.2.1||clang 8.0.1|
||27.7 s||31.2 s|
||28.7 s||27.6 s|
|C||26.9 s||29.4 s|
|C++||27.6 s||27.1 s|
||27.7 s||28.1 s|
In early 2014, my wife picked up a 4×4×4 puzzle at a flea market. The cube is filled with one piece of volume 4 and twelve pieces of volume 5, which yields a total volume of 4×4×4=4+12×5=64. Unlike in the Pentomino puzzle, there are many more possible pieces of this size, and there seem to be 20,434 solutions. With minor modifications, the same solution algorithm is applicable:
In December 2016, when I refactored the 3-dimensional puzzle solver in C++14, I noticed that there had been a bug in my symmetry reducer. When a piece is rotated along each axis, there are 4×4×4 or 64 potential configurations. For the small piece in my puzzle, 12 of these configurations are distinct. But, many of these 12 configurations are actually symmetric with each other, if we place the piece to a specific corner of the box and then rotate the entire box in all 64 ways, without shifting the piece relative to the box. It turns out that there are only 15 distinct configurations of the small piece, instead of the 261 that my original buggy symmetry reduction produced.
GCC 9.2.1 generates a much faster solver for AMD64 than clang 8.0.1. In August 2019, the GCC-compiled solver completed the search in 14,063 seconds (3 hours, 54 minutes and 23 seconds), while the clang-solver needed 15,599 seconds (4 hours, 19 minutes and 59 seconds).