I finally figured this out. It looks like EMR takes care of the LZO compression for you, but for the sequence file format, you need to add the following HADOOP_INPUT_FORMAT field to your MRJob class:
class MyMRJob(MRJob):
HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
def mapper(self, _, line):
# mapper code...
def reducer(self, key, value):
# reducer code...
There's another gotcha, too (quoting from the AWS-hosted Google NGrams page):
The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable.
That means each row is prepended with and extra Long + TAB, so any line parsing you do in your mapper method needs to account for the prepended info as well.