Researchers have introduced a novel task of clustering trajectories from offline reinforcement learning datasets, where each cluster center represents the policy that generated the trajectories assigned to that cluster.
The clustering objective is formulated as the KL divergence between the offline trajectory distribution and a mixture of policy-induced trajectory distributions.
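A minimal sketch of how such an objective could be written, assuming an empirical offline trajectory distribution $p_{\mathcal{D}}$, mixture weights $w_k$, and candidate policies $\pi_k$ (this notation is ours, not necessarily the paper's):

```latex
\min_{\pi_1,\dots,\pi_K,\; w \in \Delta^{K-1}}
  D_{\mathrm{KL}}\!\left( p_{\mathcal{D}}(\tau) \,\middle\|\, \sum_{k=1}^{K} w_k\, p_{\pi_k}(\tau) \right),
\qquad
p_{\pi}(\tau) = \rho(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t).
```

Because the initial-state distribution $\rho$ and the dynamics $P$ are shared across policies, the log-likelihood of a trajectory under $\pi_k$ differs from $\sum_t \log \pi_k(a_t \mid s_t)$ only by a policy-independent constant, which is what makes assignment by policy generation probability tractable.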
To address this task, two methods are proposed: Policy-Guided K-means (PG-Kmeans) and the Centroid-Attracted Autoencoder (CAAE).
PG-Kmeans alternates between training behavior cloning policies and assigning each trajectory to the policy most likely to have generated it, while CAAE guides the latent representations of trajectories toward specific codebook entries to form clusters.
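As a rough illustration of the alternating structure this description implies, here is a minimal, self-contained sketch for discrete state-action trajectories; the count-based behavior cloning, the data format, and all function names are our own simplifications rather than the paper's implementation.

```python
# Sketch of a PG-Kmeans-style loop: alternate between fitting one behavior
# cloning policy per cluster and reassigning each trajectory to the policy
# most likely to have generated it. Tabular/count-based BC is an assumption.
import numpy as np

def fit_bc_policy(trajs, n_states, n_actions, alpha=1.0):
    """Count-based behavior cloning: estimate P(a|s) with Laplace smoothing."""
    counts = np.full((n_states, n_actions), alpha)
    for traj in trajs:
        for s, a in traj:
            counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def traj_loglik(traj, policy):
    """Log-probability that `policy` generated the actions observed in `traj`."""
    return sum(np.log(policy[s, a]) for s, a in traj)

def pg_kmeans(trajs, k, n_states, n_actions, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(trajs))          # random initial assignment
    for _ in range(n_iters):
        # M-step: fit one behavior cloning policy per cluster
        policies = [fit_bc_policy([t for t, c in zip(trajs, labels) if c == j],
                                  n_states, n_actions) for j in range(k)]
        # E-step: reassign each trajectory to its most likely generating policy
        new_labels = np.array([np.argmax([traj_loglik(t, p) for p in policies])
                               for t in trajs])
        if np.array_equal(new_labels, labels):          # assignments stable: stop
            break
        labels = new_labels
    return labels, policies
```

For example, on trajectories stored as lists of (state, action) index pairs from a small GridWorld, `pg_kmeans(trajs, k=2, n_states=25, n_actions=4)` would return one cluster label per trajectory together with the fitted per-cluster policies.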
The finite-step convergence of PG-Kmeans is proven theoretically, and the analysis highlights policy-induced conflicts as a key challenge in offline trajectory clustering.
Experiments on the D4RL benchmark and custom GridWorld environments show that both PG-Kmeans and CAAE partition trajectories into meaningful clusters.
The research suggests that these methods offer a promising framework for policy-based trajectory clustering, applicable in offline RL and beyond.