Large-scale machine learning models with sparse weight matrices are widely used to reduce computation and memory costs. Models with block-wise sparse weight matrices map better onto hardware accelerators and can further reduce costs during inference. However, existing methods for training block-wise sparse models are inefficient and start from full, dense models. The proposed efficient training algorithm reduces both computation and memory costs while maintaining performance.
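
As background, block-wise sparsity can be sketched as follows. This is an illustrative example, not the paper's algorithm: the function `apply_block_mask`, the block size, and the random block-selection rule are all assumptions made for the sketch. It shows why block-wise sparsity suits accelerators: entire contiguous tiles of the weight matrix are zeroed, so hardware can skip whole blocks rather than scattered individual weights.

```python
import numpy as np

def apply_block_mask(w, block_size, keep_fraction, rng):
    """Zero out whole blocks of a weight matrix (illustrative sketch).

    Keeps roughly `keep_fraction` of the (block_size x block_size)
    blocks and zeroes the rest; the random selection rule here is a
    placeholder, not the proposed training method.
    """
    rows, cols = w.shape
    assert rows % block_size == 0 and cols % block_size == 0
    br, bc = rows // block_size, cols // block_size
    # One boolean per block: True means the block is kept.
    block_mask = rng.random((br, bc)) < keep_fraction
    # Expand the block-level mask to element granularity.
    full_mask = np.kron(block_mask, np.ones((block_size, block_size), dtype=bool))
    return w * full_mask

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
sparse_w = apply_block_mask(w, block_size=4, keep_fraction=0.5, rng=rng)
```

Each 4x4 block of `sparse_w` is either identical to the corresponding block of `w` or entirely zero, which is the structural property accelerator kernels exploit.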