Worth reading

Found a great paper on transformer efficiency. The key insight: you can prune attention heads during inference without retraining.
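
For a rough sense of what this looks like mechanically, here's a minimal PyTorch sketch of inference-time head pruning: a standard multi-head attention forward pass with a per-head binary mask that zeroes out pruned heads. The function name, shapes, and choice of which heads to prune are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of inference-time head pruning (illustrative, not the
# paper's method): zero out selected heads via a binary mask, no retraining.
import torch
import torch.nn.functional as F

def masked_multihead_attention(x, w_q, w_k, w_v, w_o, head_mask):
    """Multi-head self-attention with a per-head binary mask.

    x:         (batch, seq, d_model)
    w_q/k/v:   (d_model, d_model) projection weights
    w_o:       (d_model, d_model) output projection
    head_mask: (n_heads,) -- 1.0 keeps a head, 0.0 prunes it
    """
    batch, seq, d_model = x.shape
    n_heads = head_mask.shape[0]
    d_head = d_model // n_heads

    # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
    def split(t):
        return t.view(batch, seq, n_heads, d_head).transpose(1, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    attn = F.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1)
    out = attn @ v                                # (batch, n_heads, seq, d_head)
    out = out * head_mask.view(1, n_heads, 1, 1)  # pruned heads contribute zero
    out = out.transpose(1, 2).reshape(batch, seq, d_model)
    return out @ w_o

# Usage: prune heads 1 and 3 of a hypothetical 4-head layer at inference time.
torch.manual_seed(0)
d_model, n_heads = 64, 4
x = torch.randn(2, 10, d_model)
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) * 0.05 for _ in range(4))
mask = torch.tensor([1.0, 0.0, 1.0, 0.0])
y = masked_multihead_attention(x, w_q, w_k, w_v, w_o, mask)
print(y.shape)  # torch.Size([2, 10, 64])
```

In a real model you'd apply the same mask to pretrained weights and skip computing the pruned heads entirely; the point is that nothing here requires a gradient step.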

Link to paper →