Question

I'd like to define some functions for processing natural language text. Each of these functions adds some "annotations" to the text, e.g.:

class Annotation(val begin: Int, val end: Int)
class Sentence(begin: Int, end: Int) extends Annotation(begin, end)
class Token(begin: Int, end: Int) extends Annotation(begin, end)

So I might have a Tokenizer function that adds Token annotations, a SentenceSegmenter function that adds Sentence annotations, etc. These functions have some constraints on the order in which they can be run. For example, the Tokenizer might require Sentence annotations, so it would have to be run after the SentenceSegmenter. In this case, I'd like to get a compile error if I accidentally compose these functions in the wrong order. So sentenceSegmenter andThen tokenizer should compile, but tokenizer andThen sentenceSegmenter should not.

Below is my attempt. I defined a special container type for the text, where the type parameter specifies (via compound types) what annotations have been added to the text, and then the functions specify their type parameters appropriately to ensure that they can't be run until their prerequisites are part of the compound type.

trait AnalyzedText[T] {
  def text: String
  def ++[U](annotations: Iterator[U]): AnalyzedText[T with U] 
}

val begin: (AnalyzedText[Any] => AnalyzedText[Any]) = identity
def sentenceSegmenter[T]: (AnalyzedText[T] => AnalyzedText[T with Sentence]) = ???
def tokenizer[T <: Sentence]: (AnalyzedText[T] => AnalyzedText[T with Token]) = ???

// compiles
val pipeline = begin andThen sentenceSegmenter andThen tokenizer
// fails to compile -- good!
//val brokenPipeline = begin andThen tokenizer andThen sentenceSegmenter

So far, so good. The problem arises when I try to actually define one of the functions. For example, I'd like to define tokenizer like this:

def tokenizer[T <: Sentence]: (AnalyzedText[T] => AnalyzedText[T with Token]) =
  text => text ++ "\\S+".r.findAllMatchIn(text.text).map(m => new Token(m.start, m.end))

But the Scala compiler can't infer the type argument for the ++ method, and unless I specify it manually as text.++[Token](...), I get the error:

type mismatch;
 found   : Iterator[Token]
 required: Iterator[Nothing]
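
For what it's worth, spelling out the type argument by hand does compile; here is a minimal sketch of that manual workaround (the name tokenizerExplicit is just for illustration):

def tokenizerExplicit[T <: Sentence]: (AnalyzedText[T] => AnalyzedText[T with Token]) =
  // giving ++ its type argument explicitly sidesteps the inference failure
  text => text.++[Token]("\\S+".r.findAllMatchIn(text.text).map(m => new Token(m.start, m.end)))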

Is there a way to get this type parameter to be inferred? Or, alternatively, am I thinking about the problem wrong? Is there a better way to capture these kinds of function-composition constraints in Scala?


Solution

This looks an awful lot like a bug. In the meantime there's a very simple workaround—just define your processor as a method and omit the return type:

def tokenizer[T <: Sentence](text: AnalyzedText[T]) =
  text ++ "\\S+".r.findAllMatchIn(text.text).map(m => new Token(m.start, m.end))

Now you can define your pipeline in exactly the same way, and eta-expansion (§6.26.5 of the language specification) will turn the method into a function.
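
Concretely, reusing begin and sentenceSegmenter from the question, the original pipeline should compile unchanged against this method version:

// tokenizer is now a method; eta-expansion turns it into the function that andThen expects
val pipeline = begin andThen sentenceSegmenter andThen tokenizer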


As a footnote: the weird part is that the following is just fine, given the definition of tokenizer above:

def tokFunc[T <: Sentence]: (AnalyzedText[T] => AnalyzedText[T with Token]) =
  tokenizer _

I glanced at the issue tracker but didn't find anything that was obviously relevant. It might be worth digging around some more and filing an issue or emailing one of the mailing lists if you have the time.
