Fast-Forward search will be a totally dumb search engine: brute-force fixed depth with negamax. It’s only purpose is to search the tree at a maximum speed, not to be a good Chess Engine (won’t be for sure).
I based my work on the work of Srdja, because some of the ideas in it’s implementation are great, it’s based on Glaurung and thus Magic Bitboards, and I found a way to avoid divergence as much as possible, limiting memory read-write and register usage.
The basic pinciple is to be as simple as possible. Negamax will be implemented, but that’s all for the first iteration of the project. Evaluation will be pure material evaluation at first, that will be done during deepening, not during leaf evaluation. So in fact there will be no leaf evaluation per-se on the first iteration.
The main difference with Srdja work is the move generation. Removing move ordering is mandatory for full speed, and thus there’s no need to construct a list of move, instead bbWork+bbMove (16 bytes) variables are enough to obtain the next move in any position, we just have to store them on each depth of the tree for each thread, Bitboard to represent the board as well a bonus/malus or material delta could lay in registers being updated back when descending the tree. It will enable us to have a maximum number of threads flying on the GPU, relying on L1 cache (anbd L2 cache) to store the bbWord+bbMove registers.
Symmetry = few divergence
To do a move or undo it are symmetrical operations in Fast-Forward. Essentially you take the next move using bbWork+bbMove, with 2 little point of divergence.
When you get the next move, you don’t update bbWork+bbMove, you keep them as-is, to be able to get it again (and update register this time) while descending a level of the tree, to undo it.
Having a move, the difference on Bitboard update is just a matter of swapping from to etc, that are fast operations, the slow 64bit logical operations could then be totally identical!
You may add that bonus/malus or material should be added when going deep, and substracted when going back to the root of the tree, no matter!
If your doMove() and unodMove() are essentially identical except a few intermediate instructions, bbWord+bbMove update on one side, some register swapping on the other, you may group these operations together on the same point of the code: we are folding the code to have every core (or each SIMD unit) to follow the same execution path wether the go deep on the tree, or go-back.
Parsing the search tree is essentially that, going forward on the next-level as fast as possible, up to a leaf, then using dynamic evaluation and bonus/malus, going back, eventually cuting a branch using negamax.
The target is to be able to parse a maximum number of node and leaf at a warp speed, and to have warp speed on GPGPU you should limit the number of registers, memory usage and avoid divergence as much as possible.
Because the Engine will never have backward operations, it will always proceed with the same operation, that is search for next move then execute it, only a few instructions will be conditionals to reverse their behaviors to enable to go back near the root of the tree.