Abstract: We investigate the problem of generating landmark-based manipulation instructions (e.g., "move the blue block so that it touches the red block on the right") from image pairs that show a visual scene before and after a manipulation. We present a transformer model with difference attention heads that learns to attend to target and landmark objects in consecutive images via a difference key. Our model outperforms the state of the art for instruction generation on the BLOCKS dataset and, in particular, improves the accuracy of generated target and landmark references. Furthermore, our model outperforms state-of-the-art models on a difference spotting dataset.
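
To make the idea of a difference key concrete, the following is a minimal sketch of one attention head whose keys are formed from the change between the two images. The abstract does not specify how the difference key is computed, so everything here is an assumption made for illustration: the function name `difference_attention`, the per-region feature vectors, the use of element-wise subtraction (after minus before) as the key, and the use of the after-image features as values. It should not be read as the model's actual implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def difference_attention(query, before, after):
    """One attention head with hypothetical 'difference keys'.

    query:  list of d floats (e.g. a decoder state).
    before: list of n region feature vectors from the before image.
    after:  list of n region feature vectors from the after image.

    Assumption: the key for region i is after[i] - before[i], and the
    values are the after-image features. The source does not confirm
    either choice.
    """
    d = len(query)
    # Difference keys: regions that did not change get an all-zero key.
    keys = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(after, before)]
    # Scaled dot-product scores between the query and each difference key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of after-image features.
    context = [sum(w * region[j] for w, region in zip(weights, after)) for j in range(d)]
    return weights, context


# Toy scene with three regions; only region 1 changes between the images.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after = [[1.0, 0.0], [2.0, 3.0], [1.0, 1.0]]
query = [1.0, 1.0]

weights, context = difference_attention(query, before, after)
```

In this toy scene the unchanged regions produce zero keys and therefore equal, lower scores, so the head places most of its weight on the region that changed. A learned model would additionally apply trained projections to the query, keys, and values, which this sketch omits.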