Researchers reverse-engineered a convolutional recurrent neural network (RNN) trained using model-free reinforcement learning to play the game Sokoban.
The RNN solves more levels when given additional test-time compute, and its internal computation resembles classic bidirectional search.
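A minimal sketch of what "more test-time compute" can mean for a recurrent policy: run extra recurrent ticks on the same observation before committing to an action. The toy architecture and names here (`ConvRNNCore`, `act_with_thinking`, `extra_ticks`) are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class ConvRNNCore(nn.Module):
    """Toy convolutional recurrent core: one tick updates the hidden state."""
    def __init__(self, channels=32):
        super().__init__()
        self.update = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, obs_feats, hidden):
        # One recurrent tick: mix observation features into the hidden state.
        return torch.tanh(self.update(torch.cat([obs_feats, hidden], dim=1)))

def act_with_thinking(core, policy_head, obs_feats, hidden, extra_ticks=0):
    """Spend extra ticks on the same observation before acting.

    More ticks = more test-time compute; the hypothesis is that each tick
    lets the network extend or refine its internal plan.
    """
    for _ in range(1 + extra_ticks):
        hidden = core(obs_feats, hidden)
    logits = policy_head(hidden.mean(dim=(2, 3)))  # pool to one action choice
    return logits.argmax(dim=-1), hidden

core, policy_head = ConvRNNCore(32), nn.Linear(32, 4)
obs = torch.randn(1, 32, 10, 10)
h = torch.zeros(1, 32, 10, 10)
action, h = act_with_thinking(core, policy_head, obs, h, extra_ticks=4)
```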
The RNN plans by representing candidate moves as activations tied to a specific direction at each board square.
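A sketch of how such a representation could be read out, assuming a probe-decoded array `plan_acts[r, c, d]` holding the activation for "move in direction `d` at square `(r, c)`"; the `read_plan` helper and its threshold are hypothetical.

```python
import numpy as np

DIRS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def read_plan(plan_acts, start, max_len=20, threshold=0.5):
    """Trace a planned path out of per-square directional activations."""
    r, c = start
    path = [(r, c)]
    for _ in range(max_len):
        d = int(np.argmax(plan_acts[r, c]))
        if plan_acts[r, c, d] < threshold:
            break  # no confident planned move at this square
        dr, dc = DIRS[d]
        r, c = r + dr, c + dc
        if not (0 <= r < plan_acts.shape[0] and 0 <= c < plan_acts.shape[1]):
            break
        path.append((r, c))
    return path

plan_acts = np.zeros((10, 10, 4))
plan_acts[5, 5, 3] = plan_acts[5, 6, 3] = plan_acts[5, 7, 1] = 1.0
print(read_plan(plan_acts, start=(5, 5)))  # [(5, 5), (5, 6), (5, 7), (6, 7)]
```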
These state-action activations act analogously to a value function: their magnitudes determine when the network backtracks and which plans survive pruning.
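To make the value-function analogy concrete, a sketch of pruning between candidate plans, where the summed activation along a path plays the role of the value that decides survival; the candidate paths and the `plan_value` helper are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
plan_acts = rng.random((10, 10, 4))  # stand-in for probe-decoded activations

def plan_value(path):
    """Sum the state-action activations along a candidate path.

    This scalar plays the role of a value estimate: it is what decides
    whether a plan survives pruning.
    """
    return sum(plan_acts[r, c, d] for r, c, d in path)

# Two hypothetical candidate plans, each a list of (row, col, direction) steps.
plan_a = [(5, 5, 3), (5, 6, 3), (5, 7, 1)]
plan_b = [(5, 5, 1), (6, 5, 1), (7, 5, 3)]

# The weaker plan is pruned; behaviorally this looks like backtracking.
survivor = max([plan_a, plan_b], key=plan_value)
```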
Specialized convolutional kernels extend these activations forward and backward along candidate paths, effectively implementing a transition model.
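A sketch of the extension idea, emulating forward and backward kernel applications with explicit shifts; the real network uses learned convolution kernels, and `shift`/`extend_paths` are illustrative stand-ins.

```python
import numpy as np

DIRS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def shift(grid, dr, dc):
    """Translate a 2D activation map by (dr, dc), zero-padding the border."""
    out = np.roll(grid, shift=(dr, dc), axis=(0, 1))
    if dr > 0:   out[:dr, :] = 0
    elif dr < 0: out[dr:, :] = 0
    if dc > 0:   out[:, :dc] = 0
    elif dc < 0: out[:, dc:] = 0
    return out

def extend_paths(plan_acts, decay=0.9):
    """Grow each direction channel one square forward and one backward.

    Forward: "move right at (r, c)" spreads to (r, c+1), proposing the next
    step; backward: it also spreads to (r, c-1), proposing how the path can
    be reached. Knowing which square a move leads to (and comes from) is
    the transition knowledge such kernels encode.
    """
    out = plan_acts.copy()
    for d, (dr, dc) in DIRS.items():
        out[..., d] = np.maximum(out[..., d], decay * shift(plan_acts[..., d], dr, dc))
        out[..., d] = np.maximum(out[..., d], decay * shift(plan_acts[..., d], -dr, -dc))
    return out
```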
The RNN deviates from classical search methods in that it lacks a unified state representation: it plans for each box individually.
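One way to picture the difference, assuming per-box plan grids rather than a joint state space (the array shapes are illustrative):

```python
import numpy as np

H, W, N_DIRS, N_BOXES = 10, 10, 4, 3

# Classical search would expand a single tree over complete board states;
# here each box instead gets its own directional plan grid, maintained and
# extended independently of the others.
box_plans = np.zeros((N_BOXES, H, W, N_DIRS))
box_plans[0, 4, 2, 3] = 1.0  # "push box 0 rightward through square (4, 2)"
```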
Each layer in the network maintains its own plan representation and value function, which increases the search depth achieved in a single forward pass.
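A sketch of how stacked layers can deepen the search, assuming each layer applies one extension step to its own copy of the plan (border handling is elided via `np.roll` for brevity):

```python
import numpy as np

def layer_step(plan, decay=0.9):
    """One layer's contribution: extend every direction channel one square."""
    out = plan.copy()
    for d, (dr, dc) in enumerate([(-1, 0), (1, 0), (0, -1), (0, 1)]):
        out[..., d] = np.maximum(
            out[..., d], decay * np.roll(plan[..., d], (dr, dc), axis=(0, 1)))
    return out

plan = np.zeros((10, 10, 4))
plan[5, 5, 3] = 1.0              # seed: "move right at (5, 5)"
for _ in range(3):               # three layers, each with its own plan copy:
    plan = layer_step(plan)      # one forward pass extends the plan 3 steps
```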
This shows that a mechanism for leveraging test-time compute, learned through model-free training, can be understood in familiar planning terms.