| Class | StanfordParser::StandoffDocumentPreprocessor |
| In: | lib/stanfordparser.rb |
| Parent: | DocumentPreprocessor |
A preprocessor that segments text into sentences and tokens. Each token carries character offset and token context information that can be used for standoff annotation.
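To illustrate what character offsets buy you, here is a minimal pure-Ruby sketch. It does not use the Stanford tokenizer: `OffsetToken` and `tokenize_with_offsets` are hypothetical names, and splitting on whitespace is an assumption made only to keep the example self-contained. The point is that each token records where it starts and ends in the original string, so annotations can refer to the source text without modifying it.

```ruby
# Hypothetical token type for illustration only; the real tokens are
# produced by the Java tokenizer, not by this struct.
OffsetToken = Struct.new(:text, :begin_offset, :end_offset)

# Splits on whitespace (an assumption; the PTB tokenizer is far more
# sophisticated) and records each token's character span in the source.
def tokenize_with_offsets(source)
  tokens = []
  source.scan(/\S+/) do
    match = Regexp.last_match
    tokens << OffsetToken.new(match[0], match.begin(0), match.end(0))
  end
  tokens
end

source = "Standoff  annotation works."
tokenize_with_offsets(source).each do |token|
  # The half-open range begin_offset...end_offset recovers the token
  # text from the untouched source string.
  puts "#{token.text.inspect} at #{token.begin_offset}...#{token.end_offset}"
end
```

Because the offsets index into the original string, irregular whitespace (the double space above) is preserved rather than normalized away.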
# File lib/stanfordparser.rb, line 264
def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
  # PTBTokenizer.factory is a static function, so use RJB to call it
  # directly instead of going through a JavaObjectWrapper. We do it this
  # way because the Stanford parser Java code does not provide a
  # constructor that allows you to set the second parameter,
  # invertible, to true, and we need this to write character offset
  # information into the tokens.
  ptb_tokenizer_class = Rjb::import(tokenizer)
  # See the documentation for
  # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
  # description of these parameters.
  ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
  super(ptb_tokenizer_factory)
end
Returns a list of the sentences in a string. Each returned sentence is wrapped in a StandoffSentence object.
# File lib/stanfordparser.rb, line 281
def getSentencesFromString(s)
  super(s).map! { |sentence| StandoffSentence.new(sentence) }
end
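The override pattern used here (call the parent, then wrap each element in place with `map!`) can be sketched without the Java bridge. All class names below are hypothetical stand-ins, not part of stanfordparser.rb, and the regex sentence splitter is an assumption that only approximates what the real preprocessor does.

```ruby
# Hypothetical stand-in for the parent DocumentPreprocessor.
class SketchPreprocessor
  # Naive splitter: breaks on whitespace that follows ., ! or ?
  def getSentencesFromString(s)
    s.split(/(?<=[.!?])\s+/)
  end
end

# Hypothetical stand-in for StandoffSentence; it just holds the raw value.
SketchSentence = Struct.new(:raw)

class SketchStandoffPreprocessor < SketchPreprocessor
  # Same shape as the method above: delegate to the parent, then
  # replace each element of the returned array with a wrapper object.
  def getSentencesFromString(s)
    super(s).map! { |sentence| SketchSentence.new(sentence) }
  end
end

preprocessor = SketchStandoffPreprocessor.new
preprocessor.getSentencesFromString("It rains. We stay in.").each do |sentence|
  puts sentence.raw
end
```

Using `map!` mutates the array returned by the parent rather than allocating a new one, which is safe here because that array is not shared with any other caller.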