A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks