Microarticles and updates.
Found a great paper on transformer efficiency. The key insight: you can prune attention heads during inference without retraining.
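A minimal sketch of the idea (not the paper's code — all names and shapes here are illustrative): at inference time, a head can be "pruned" by simply zeroing its output before the final projection, with no weight updates at all.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask):
    # x: (seq, d_model); per-head weights: (n_heads, d_model, d_head)
    n_heads, d_model, d_head = Wq.shape
    outs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))
        # Pruning: multiply the head's output by 0 to drop it at
        # inference time -- no retraining, the other heads are untouched.
        outs.append(head_mask[h] * (attn @ v))
    concat = np.concatenate(outs, axis=-1)  # (seq, n_heads * d_head)
    return concat @ Wo

rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 4, 8, 2, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_head, d_model))

full = multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask=np.array([1.0, 1.0]))
pruned = multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask=np.array([1.0, 0.0]))
```

In a real model you'd pick which heads to mask by scoring their contribution (e.g. output norm or a sensitivity measure) rather than by hand, but the mechanism is the same mask.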