Lockless Transposition Tables

Transposition table is a lookup table which is used to cache negamax search results when searching for best move in a chess engine. In the previous post we looked at transposition tables and their use in a negamax search in the context of a chess engine. Transposition tables fine when used in a linear negamax search but what about parallelised negamax search? Anyone who has done basic multithreaded programming will be able to see potential problems with using a shared transposition table in a parallelised negamax search. The solution lies in the lockless transposition table.

This post discusses such a lockless transposition table. It is based upon this article by Robert Hyatt, author of the famous Crafty chess engine. The article describes lockless transposition tables in more detail and reading it will. However, if you’ve been following this series of posts or if you want to get a different explanation of lockless transposition tables it would make sense to continue reading this post.

A word about Zobrist hashing

In order to understand the implementation of a lockless transposition table, we need to have a basic understanding of what is called Zobrist hashing. Zobrist hash is a 64-bit hash which represents a chess board position. The lockless hash table will use Zorist hash values as its keys.

Like any other hash, Zobrist hash has following two properties:

– there is a low probability that any two board positions will have the same hash, and
– it is very hard or impossible to derive a board position from a given hash value

It has one more property which makes it so useful for chess engines. Zobrist hashes of two similar chess positions will be very different. Since the positions that you can move to from the current board position will be similar, their hashes will be very different. Why exctly this property helps will be clear towards the end of this post, once we have covered transposition table implementation.

Zobrist hashes are also lightweight and fast to generate. They are used with a  particular board representation called bitboard. Many endgame table-bases, which are databases of solved end games use zobrist hashing as well.

Idea behind Lockless Transposition Table

The lockless transposition table we are talking about is a hash table where the key and value are 64-bit word each. In C it will be unsigned long long. It is different in other languages. Now on a 64-bit maching reading or writing a 64-bit word is atomic operation. Therefore when you read a 64-bit key or value from the hash table, it will either be the existing item or a new one. But it won’t be a partially written item. That is the basic idea behind the lockless transposition table. There are two problems with this approach which we will address in the rest of the post.

1. Since reading the key and value are two separate operations, how do we ensure that the value is correct one, i.e. the one corresponding to the key that we read? Afterall, the value might have been overwritten after we read the key but before we read value.

2. How do we store details about board position, like score, depth searched etc, in a 64-bit word?

Ensuring that key and value correspond

First we will present the solution to this problem and then see why it works.
When storing the hash and its corresponding value we will alter the key as follows:

key = hash ^ value

So we will XOR the hash with value and use the result as key. Value will be the same as it was. When we come to retrieve the key-value pair, we will first XOR the value with key and then compare the result with the hash that we were searching for. If the two match then it means the value corresponds to the hash key we were looking for. If they don’t match then the key-value pair was corrupt and we ignore the entry.

So let’s see why this works. XOR is a reversible operation. XORing twice with the same value will give us the original value, i.e.

x ^ y ^ y = x.

What we are doing is XORing the hash with the value once to get our key. Then we XOR that key again with the value. If the value is the same that we originally XORed the hash with, we will get the correct hash, if not, then we will get some other result.

Storing details in 64-bits

What you want to store in the 64-bit value word will depend upon where you want to use the transposition table. Here we will present a simple way of storing values relevant to chess search. Remember that this is a very basic composition of the value word. In a real chess engine you’ll use different composition.

In our discussion, three things have stood out to be the most relevant:

1. Score associated with that position
2. Depth searched which resulted in the score
3. Best move in that position

The score can be packed into 32 bits, giving us 32-bit signed integer. Depth will take five bits more to give us maximum searchable depth of 32 levels which is more than enough for most engines. Best move can be represented by eight bits giving us 256 possible moves for any given position. Maximum number of legal moves in a chess position have been found to be 218 (see http://chessprogramming.wikispaces.com/Encoding+Moves#MoveIndex). So a byte should be enough.

Usage

When using a transposition table like the one described here, we are constrained by size in memory. That means that we will have to replace older entries by newer ones which is not necessarily a bad thing. However, one may want to consider a replacement strategy to improve the chances of a hit in the transposition table. A simple way to store entries is to modulo the key with the maximum number of entries (i.e. capacity) that the transposition table can hold. This happens to be a common way to store entries in transposition tables. Here we can also see why the fact that Zobrist hashes for two similar positions are very different is helpful. When we generate positions from a given position, they tend to be similar to the one we currently have. So when we store the generated positions in our hash table, they will more likely end up in indexes farther apart from each other, decreasing the chances of overwriting each other. Thus the newer more similar positions are more likely to stay cached in the table. This isa good thing because more recent positions that repeat themselves more frequently.

Advertisements

Using NegaMax & Transposition Tables

This post follows on from the previous post which described negamax. In this post we will discuss how to actually call negamax search in a chess engine. Then we will look at one way to parallelise negamax. Towards the end we will briefly cover the concept of transposition tables which are a way to optimise our search. Sticking with our theme, we will use chess engine as the context for our discussion.

Calling NegaMax

NegaMax usage consists of two methods:

  1. Root method
  2. Actual negamax

When you call negamax, you only call the root negamax which then calls the actual negamax. Actual negamax is the negamax we saw in our previous post. Here’s what the resposibilities of root negamax are:
1. Generate a list of moves available to current player. Let’s say there are n possible moves.
2. Make each move to obtain a new board position. There will be n board positions.
3. Call actual negamax for each of the n board positions and for a specific depth.
4. Multiply the result of each actual negamax call by -1. This is because after making the move in step 2 above, actual negamax will compute scores from the point of view of the opponent. By multiplying by -1, we are reversing the point of view.
5. Find the maximum of the values in step 4 above. That will be the score and the move that resulted in that score will be the best move.

Following is the pseudocode for root negamax:

int root( int depth ) {
if ( depth == 0 ) return evaluate();
int max = -oo, score;
for ( all moves)  {
make move();
score = -1*negaMax(depth – 1);
if(score > max)
{
max = score;
}
}
return max;
}

Paralellised NegaMax

Here we will describe a simple way to paralellise negamax search. This is not necessarily the best way to do it. It is presented here to tie into our next topic which is lockless lookup tables.

Our version of parallel negamax simply takes the for loop in root negamax function and makes each iteration of the for loop to run in parallel. We store the results of each iteration in an array. Be careful to store result of each iteration in a separate array slot. An easy way to do this is to assign an array slot number, 0 to n-1, to each iteration. When that iteration is finished, it updates its own respective array slot. After that we iterate through the array in a separate loop and find the max score. There are different language constructs that allow you to do this but that depends upon which language you use.

A key thing to note here is that we are explicitly avoiding locks. Instead we have accepted the overhead of a linear search through the array for the maximum score but that turns out to be a small price compared the to the overhead of locking and the nuances of it.

Transposition Tables

Before going into transposition tables, let’s do a quick recap of what we have covered. So we have a game tree which a root at the top. Every node represents a board position and every branch represents a move. The root node represents current board position. So far so good. Now, in a real chess game, same board position can be reach more than one sequence of moves. Moreover, those sequences don’t have to have the same number of moves. What does that mean for our game tree? That means that our game tree isn’t really a tree in technical sense. The tree data structure is defined as the one in which there is only one route to go from current node to a destination node. In other words, there are no cycles in a tree. In our game tree, each node represents a board position. Since a same board position can be reached by different sequences of moves (i.e. routes), our game tree has cycles in it. Now this is actually a good thing and we can exploit the cyclic nature of our tree to speed up our search. The way to do it is by using a data structure called transposition table.

A transposition table is nothing more than a lookup table. It contains keys and values corresponding to they keys. In case of a chess search, they are called transposition tables. The key identifies a board position and value stores search results associated with that board position. Here’s how to use transposition tables to speed up negamax search. When we search a position, we get a score associated with that position and that score is dependent upon the depth that we searched that position to. We store the position as key and the score, best move and depth as value. Every time we search a position if we will check the transposition table first. If that position is already in the table, we don’t need to perform the full search and return with the results stored against that position. That saves us the whole search.

NegaMax Search

NegaMax search is used to find the best choice in a zero-sum two player turn based game (e.g. chess, draughts and tic-tac-toe). In reality, negamax is nothing more than a more concise way to code another search algorithm called MiniMax. Understanding the logic of minimax means understanding the logic of negamax. So we will do that first. Then we will derive negamax from minimax.

Decision Tree

We will use the decision tree shown below, to walk through minimax search. Let’s assume it represents a chess game, half way through, between two players named Nimzo and Capa. Each node in the tree represents a board position. Every branch represents a possible move. Root node represents current board position. Every level represents alternating turns between Nimzo and Capa. So the top three branches represent three possible moves available to Nimzo. The next level down, branches represents possible move choices for Capa for every possible move that Nimzo can make now. Third level, in turn, represents Nimzo’s countermoves to Capa’s counter moves. The tree stops at four levels, called depth of the tree, but in real chess game this can carry on for many more levels.

Animation of minimax search

Animation of minimax search

Every node (i.e. chess board position) has a score associated. The score is from the perspective of the player whose turn it is at the root node. So in ours example, the score for each board will be from the perspective of Nimzo because it’s Nimzo’s turn at the root node. Now remember this is a zero-sum game. That means, sum of scores for the two players for each node must be zero. Therefore the score for a node from perspective of Nimzo will be -1 times the score for the same node from perspective of Capa. So we can call Nimzo the maximising player and Capa the minimising player. This point is crucial to our discussion here.

Our objective here is to find the best move for Nimzo.

MiniMax

Nimzo wants to make the best move which leaves Capa with worst best option possible. But then when Capa makes his move he will want to do the same thing, leaving Nimzo with the worst best option. Nimzo would want to factor Capa’s perspective into his calculations. But since Capa would like to do the same with regard to Nimzo, this turns into a recursive calculation. A chess game’s decision tree is impractically tall to calculate all the possible moves and countermoves so we have to limit the depth to a practical level. At that level each node will be a leaf node. Once the minimax search reaches that level, it uses a special function (called Evaluate()) which takes in the board position at each leaf node and uses some heuristics to return an integer which represents the score for that node.

As shown in the animated gif above, minimax works its way in a depth-first manner. Once it reaches the leaf nodes it calls Evaluate() to assign scores to leaf nodes. Then based upon those scores it assigns scores to nodes one level above the deepest level. That is either the maximum or the minimum of the leaf node scores depending upon whose move it is at the second to last level. It will be maximum if it is Nimzo’s turn and minimum if it is Capa’s turn.

Pseudo code and NegaMax

Finally we present pseudo code for minimax which then sets the stage for negamax.

int maxi( int depth ) {
if ( depth == 0 ) return evaluate();
int max = -oo;
for ( all moves) {
score = mini( depth – 1 );
if( score > max )
max = score;
}
return max;
}

int mini( int depth ) {
if ( depth == 0 ) return -evaluate();
int min = +oo;
for ( all moves) {
score = maxi( depth – 1 );
if( score < min )
min = score;
}
return min;
}

Here, the two methods maxi and mini recursively call into eachother until leaf nodes are reached and then work their way back up the tree assigning scores to nodes as they go.

Now recall the fact that it is a zero sum game where one player’s loss is the other gain. Mathematically:

max(a, b) == -min(-a, -b)

NegaMax combines the two methods above, maxi and mini, into a single method. It alternates the calculation of score for maximising and minimising players by reversing the arguments passed to its recursive calls. Following is the pseudocode for negamax:

int negaMax( int depth ) {
if ( depth == 0 ) return evaluate();
int max = -oo;
for ( all moves)  {
score = -negaMax( depth – 1 );
if( score > max )
max = score;
}
return max;
}

Before finishing, let’s quickly see why is there a need for more compact negamax. Minimax is a brute force search and when applied to problems like chess, due to the sheer size of the tree, it becomes impracticably slow. Therefore programmers rely on search optimisations like alpha-beta pruning. Those search optimisations are easier to apply when using a compact format like negamax rather than the two-method minimax format.

The pseudo codes and the animated gif above are from the excellent resource http://chessprogramming.wikispaces.com.

.NET Dictionaries and Overriding GetHashCode()

This post discusses why should we override GetHashCode method on our custom objects (i.e. objects of a class that we’ve written) when using them with .NET dictionaries and what will happen if we don’t. GetHashCode returns an integer hash code. Two of the properties of a hash code are relevant to our discussion here.

  • If two objects have different hash codes, those objects are definitely not equal.
  • If two objects have same hash codes, those two objects may or may not be equal.

Now here’s the central idea of this post: When using a custom object as key in a Dictionary<TKey,TValue>, if the object’s GetHashCode method returns a different value every time, regardless of its contents, then the dictionary’s indexer will always throw a KeyNotFound exception. To find out why, we need to know how Dictionary<TKey,TValue> is implemented.

Mental Model of Dictionary<TKey,TValue>’s Implementation

This description isn’t necessarily how .NET implements Dictionary<TKey,TValue> object but it is a good way to mentally model it so we can understand its behaviour. So think of a .NET dictionary as an array of LinkedList<T>s. The reason we say array of LinkedLists<T> will be clear if we consider the following scenario. Say you want to add an an object myVal1 with a key myKey1 to a Dictionary<TKey,TValue>, where myKey1 is a custom object. Dictionary<TKey,TValue> will call GetHashCode on myKey1 to get the hash code which will be an int. It will then modulo that hash code with the length of its array. That will return another int which will be within the range 0 to length of array – 1 (i.e. a valid array index). Dictionary<TKey,TValue> will then take the LinkedList<T> at that index of its array and add myVal1 to that LinkedList<T> if myKey1 doesn’t already exist or update the value with myVal1 if myKey1 already exists. This equality check is done using Equals method of myKey1.

Next when you want to access your store value using the key with a call like myDictionary[myKey1], the dictionary will again call GetHashCode on myKey1, modulo it by length of the array to arrive at the linked list. Now, the dictionary will iterate through the linked list, using Equals method to check if there is an element where myKey1 matches. If a match is found, then the dictionary returns myVal1. Otherwise a KeyNotFound exception is thrown.

Significance of GetHashCode

Now suppose the myKey1 object’s class implements GetHashCode in such a way that it always returns the same value, then technically, it can be used as a key in Dictionary<TKey,TValue>. However, it will come at a huge performance cost and negate all the benefits of using a dictionary rather than a linked list or a binary search tree for example.

On the other hand, if myKey1 returns a different value every time GetHashCode is called, the Dictionary<TKey,TValue> object will not be able to match two hash codes in the first place and will most likely throw a KeyNotFound exception.

Dynamic Memory Allocation on the Stack

Static memory is memory that is defined at compile time and dynamic memory is memory that is defined at run time. Most programmers are aware that static memory allocation takes place on the stack whereas dynamic memory allocation usually takes place on the heap. What isn’t very common knowledge is the fact that memory can be dynamically allocated on the stack too.  This however comes with its own benefits and caveats.

C programmers will be familiar with malloc() function which is used to dynamically allocate memory on heap. C language also provides alloca() function which dynamically allocates memory on the stack. Usage of alloca() is the same as malloc(). In following sections we will discuss benefits and caveats of alloca().

Benefits

Two key benefits of alloca() are automatic memory cleanup and better performance. They are explained below.

Memory Cleanup

To prevent memory leaks, every chunk of memory acquired by calling malloc() should be released by calling free(). In contrast, when memory is acquired by calling alloca() it doesn’t need to be released by calling an equivalent of free. This is because before the function starts executing its instructions, the frame pointer is saved. This stage is called function prolog. Then when the function completes execution the frame pointer is restored, effectively freeing up everything that was allocated on the stack for that function. This last stage is called function epilog. Thus prolog and epilog ensure that memory leaks don’t happen.

Performance

Memory allocation on the heap involves walking the heap and searching for a suitable chunk of memory to allocate. Memory allocation on stack involves just updating a CPU register. This is the stack pointer register (ESP on 32-bit and RSP on 64-bit CPUs). This makes alloc() much faster than malloc(). Moreover, since stack memory isn’t paged in and out, it doesn’t suffer from page faults. Memory on heap may encounter page faults which will put a small but noticeable drag on performance.

Caveat

Some would say the caveat of alloca() don’t necessarily outweigh its benefits. However, it is a very important caveat with practical implications. Stack memory is a lot smaller than heap. Therefore, it is more precious. What this means is that:

  1. it should be used sparingly and
  2. it shouldn’t be occupied for too long

Therefore, alloca() is suitable for small amounts of memory used by short-lived functions.

Conclusion

Given its advantages, alloca() is surprisingly little known. That however could be due to legacy reasons from the days when memory came at a premium. Now-a-days with 8GB RAMs in common use and RAM sizes having the potential to go much higher*, certainly programmers can do away with more frequent use dynamic stack memory.

——————-

*With the prevalent 64-bit architectures, a program can now potentially address up to 16 exabytes* (i.e. 16 giga-gigabytes) of memory. Most hardware architectures however allow 48-52 bits for addressing rather than all 64 bits. That still means a hefty 281GB – 4503 GB of addressable memory!