Class StanfordParser::StandoffDocumentPreprocessor
In: lib/stanfordparser.rb
Parent: DocumentPreprocessor

A preprocessor that segments text into sentences and tokens. Each token carries its character offsets and token context information, which can be used for standoff annotation.
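To make the idea of character offsets concrete, here is a minimal sketch in plain Ruby. It is not the tokenizer this class uses: it splits on whitespace only, and `StandoffToken` and `tokenize_with_offsets` are hypothetical names introduced for illustration. It shows how each token can point back into the original, unmodified string.

```ruby
# Illustrative only: a whitespace tokenizer that records character offsets.
# StandoffToken and tokenize_with_offsets are hypothetical names and are not
# part of stanfordparser.rb.
StandoffToken = Struct.new(:text, :begin_offset, :end_offset)

def tokenize_with_offsets(text)
  tokens = []
  # In its block form, String#scan sets the last-match data for each match,
  # so the match position is available through Regexp.last_match.
  text.scan(/\S+/) do |word|
    match = Regexp.last_match
    tokens << StandoffToken.new(word, match.begin(0), match.end(0))
  end
  tokens
end

text = "Parse this sentence."
tokenize_with_offsets(text).each do |token|
  # The half-open range [begin_offset, end_offset) indexes the original text.
  puts "#{token.text.inspect} [#{token.begin_offset}, #{token.end_offset})"
end
```

Because the offsets refer to the source string, annotations can be stored separately from the text and later mapped back onto it, which is the point of standoff annotation.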

Methods

Public Class methods

# File lib/stanfordparser.rb, line 264
    def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
      # PTBTokenizer.factory is a static function, so use RJB to call it
      # directly instead of going through a JavaObjectWrapper.  We do it this
      # way because the Stanford parser Java code does not provide a
      # constructor that allows you to set the second parameter,
      # invertible, to true, and we need this to write character offset
      # information into the tokens.
      ptb_tokenizer_class = Rjb::import(tokenizer)
      # See the documentation for
      # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
      # description of these parameters.
      ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
      super(ptb_tokenizer_factory)
    end

Public Instance methods

Returns a list of the sentences in a string, with each sentence wrapped in a StandoffSentence object.

# File lib/stanfordparser.rb, line 281
    def getSentencesFromString(s)
      super(s).map!{|sentence| StandoffSentence.new(sentence)}
    end
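
The override above follows a common pattern: call the parent implementation, then wrap each result in a richer object. The sketch below shows that pattern with stand-in classes so that it runs without a JVM. `SketchPreprocessor`, `SketchSentence`, and `SketchStandoffPreprocessor` are hypothetical names, and the naive punctuation-based sentence splitting is an assumption made for illustration, not how the Stanford parser segments text.

```ruby
# Hypothetical stand-ins; these are not the classes in stanfordparser.rb.
class SketchPreprocessor
  # Naively splits on sentence-final punctuation and returns one array of
  # word strings per sentence.
  def sentences_from_string(text)
    text.scan(/[^.!?]+[.!?]?/).map { |sentence| sentence.split }
  end
end

# Wrapper that gives each sentence its own object to hang extra data on.
class SketchSentence
  attr_reader :tokens

  def initialize(tokens)
    @tokens = tokens
  end
end

class SketchStandoffPreprocessor < SketchPreprocessor
  # Reuse the parent's segmentation, then wrap every sentence.
  def sentences_from_string(text)
    super(text).map { |tokens| SketchSentence.new(tokens) }
  end
end

preprocessor = SketchStandoffPreprocessor.new
preprocessor.sentences_from_string("It rains. We stay in.").each do |sentence|
  puts sentence.tokens.join(" | ")
end
```

Keeping segmentation in the parent and wrapping in the subclass means callers get the same method name and sentence order, with only the element type changed.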