A Multi-Modal Agent that Learns from Natural Language and Demonstrations