Neural networks are often considered black boxes, making it challenging to understand their internal workings.
A new perspective, termed the Reflection Hypothesis, proposes that patterns in a network's raw population activity mirror regularities in the training data.
Chunking methods are proposed to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts.
Three such methods are presented and shown to extract these entities reliably across different model sizes and architectures, pointing toward a new direction for interpretability in complex learning systems.
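As a rough illustration of what chunking population activity can mean in practice, the sketch below discretizes per-timestep hidden-state vectors into a small set of states and counts recurring contiguous state sequences. It is not one of the three methods described above; the function name `extract_chunks`, the k-means discretization, and all thresholds are illustrative assumptions.

```python
# A minimal, illustrative sketch of one way to "chunk" neural population
# activity: discretize per-timestep population vectors into states, then
# count recurring contiguous state sequences. Parameters and approach are
# assumptions for illustration, not the methods from the paper.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def extract_chunks(activations, n_states=8, max_len=4, min_count=3):
    """Segment a (timesteps x units) activation matrix into recurring chunks.

    activations : np.ndarray of shape (T, D), hidden states over a sequence
    n_states    : number of discrete population states (assumed)
    max_len     : longest chunk (in timesteps) to search for
    min_count   : minimum recurrence for a state sequence to count as a chunk
    """
    # 1. Discretize each timestep's population vector into a state label.
    states = KMeans(n_clusters=n_states, n_init=10, random_state=0).fit_predict(activations)

    # 2. Count contiguous state n-grams; frequent ones are candidate chunks.
    counts = Counter()
    for length in range(2, max_len + 1):
        for t in range(len(states) - length + 1):
            counts[tuple(states[t:t + length])] += 1

    return [(seq, c) for seq, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "population activity": a 3-step motif repeated with small noise.
    motif = rng.normal(size=(3, 16))
    acts = np.vstack([motif + 0.05 * rng.normal(size=motif.shape) for _ in range(20)])
    for chunk, count in extract_chunks(acts)[:5]:
        print(f"chunk {chunk} recurs {count} times")
```

In this toy setup, the repeated motif surfaces as a frequently recurring state sequence, which is the intuition behind treating recurring segments of population dynamics as interpretable units.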