I'd like to define some functions for processing natural language text. Each of these functions adds some "annotations" to the text, e.g.:
class Annotation(val begin: Int, val end: Int)
class Sentence(begin: Int, end: Int) extends Annotation(begin, end)
class Token(begin: Int, end: Int) extends Annotation(begin, end)
So I might have a Tokenizer function that adds Token annotations, a SentenceSegmenter function that adds Sentence annotations, etc. These functions have some constraints on the order in which they can be run. For example, the Tokenizer might require Sentence annotations, so it would have to be run after the SentenceSegmenter. In this case, I'd like to get a compile error if I accidentally compose these functions in the wrong order. So sentenceSegmenter andThen tokenizer should compile, but tokenizer andThen sentenceSegmenter should not.
Below is my attempt. I defined a special container type for the text, where the type parameter specifies (via compound types) what annotations have been added to the text, and then the functions specify their type parameters appropriately to ensure that they can't be run until their prerequisites are part of the compound type.
trait AnalyzedText[T] {
  def text: String
  def ++[U](annotations: Iterator[U]): AnalyzedText[T with U]
}
val begin: (AnalyzedText[Any] => AnalyzedText[Any]) = identity
def sentenceSegmenter[T]: (AnalyzedText[T] => AnalyzedText[T with Sentence]) = ???
def tokenizer[T <: Sentence]: (AnalyzedText[T] => AnalyzedText[T with Token]) = ???
// compiles
val pipeline = begin andThen sentenceSegmenter andThen tokenizer
// fails to compile -- good!
//val brokenPipeline = begin andThen tokenizer andThen sentenceSegmenter
So far, so good. The problem arises when I try to actually define one of the functions. For example, I'd like to define tokenizer something like:
def tokenizer[T <: Sentence]: (AnalyzedText[T] => AnalyzedText[T with Token]) =
  text => text ++ "\\S+".r.findAllMatchIn(text.text).map(m => new Token(m.start, m.end))
But the Scala compiler can't figure out how to infer the type argument for the ++ method, and unless I manually specify the type parameter, text.++[Token](...), this produces the error:
type mismatch; found: Iterator[Token] required: Iterator[Nothing]
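For reference, here is a minimal sketch of the variant with the explicit type argument, which does type-check. The wrapper object ExplicitTypeArgument and the SimpleText implementation are hypothetical, added only so that the snippet can be compiled and run on its own; they are not part of my actual code.

```scala
object ExplicitTypeArgument {
  class Annotation(val begin: Int, val end: Int)
  class Sentence(begin: Int, end: Int) extends Annotation(begin, end)
  class Token(begin: Int, end: Int) extends Annotation(begin, end)

  trait AnalyzedText[T] {
    def text: String
    def ++[U](annotations: Iterator[U]): AnalyzedText[T with U]
  }

  // Hypothetical minimal implementation: stores annotations untyped and
  // tracks what has been added only in the phantom type parameter T.
  final case class SimpleText[T](text: String, annotations: Vector[Any])
      extends AnalyzedText[T] {
    def ++[U](more: Iterator[U]): AnalyzedText[T with U] =
      SimpleText[T with U](text, annotations ++ more)
  }

  // Passing [Token] explicitly fixes U, so it is no longer inferred as Nothing.
  def tokenizer[T <: Sentence]: (AnalyzedText[T] => AnalyzedText[T with Token]) =
    text => text.++[Token](
      "\\S+".r.findAllMatchIn(text.text).map(m => new Token(m.start, m.end)))
}
```

This works, but having to spell out the type argument at every call to ++ is exactly what I'd like to avoid.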
Is there a way to get this type parameter to be inferred? Or, alternatively, am I thinking about the problem wrong? Is there a better way to capture these kinds of function-composition constraints in Scala?