Class EngTagger
In: lib/engtagger.rb
Parent: Object

English part-of-speech tagger class

Methods

Constants

VERSION = '0.1.1'
Regexps to match XML-style part-of-speech tags:
NUM = get_ext('cd')
GER = get_ext('vbg')
ADJ = get_ext('jj[rs]*')
PART = get_ext('vbn')
NN = get_ext('nn[sp]*')
NNP = get_ext('nnp')
PREP = get_ext('in')
DET = get_ext('det')
PAREN = get_ext('[lr]rb')
QUOT = get_ext('ppr')
SEN = get_ext('pp')
WORD = get_ext('\w+')
TAGS = Hash[*tags]

Attributes

conf  [RW]  Hash storing config values:
  • :unknown_word_tag
     => (String) Tag to assign to unknown words
    
  • :stem
     => (Boolean) Stem single words using Porter module
    
  • :weight_noun_phrases
     => (Boolean) When returning occurrence counts for a noun phrase, multiply
         the value by the number of words in the NP.
    
  • :longest_noun_phrase
     => (Integer) Will ignore noun phrases longer than this threshold. This
         affects only the get_words() and get_nouns() methods.
    
  • :relax
     => (Boolean) Relax the Hidden Markov Model: this may improve accuracy for
         uncommon words, particularly words used polysemously
    
  • :tag_lex
     => (String) Name of the YAML file containing a hash of adjacent part of
          speech tags and the probability of each
    
  • :word_lex
     => (String) Name of the YAML file containing a hash of words and corresponding
         parts of speech
    
  • :unknown_lex
     => (String) Name of the YAML file containing a hash of tags for unknown
         words and corresponding parts of speech
    
  • :tag_path
     => (String) Directory path of tag_lex
    
  • :word_path
     => (String) Directory path of word_lex and unknown_lex
    
  • :debug
     => (Boolean) Print debug messages
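The effect of :weight_noun_phrases and :longest_noun_phrase on a table of phrase counts can be pictured with the minimal sketch below. The method name, its keyword arguments, and the default threshold of 5 are assumptions made for this illustration and are not taken from lib/engtagger.rb.

```ruby
# Hypothetical helper: apply the two noun-phrase options to a
# { phrase => count } hash.
def weight_phrases_sketch(counts, longest: 5, weight: true)
  counts.each_with_object({}) do |(phrase, count), out|
    size = phrase.split.size
    next if size > longest              # drop phrases over the threshold
    out[phrase] = weight ? count * size : count
  end
end
```

With weighting enabled, a two-word phrase seen twice would be reported with a value of 4.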
    

Public Class methods

Convert a Treebank-style, abbreviated tag into verbose definitions
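As a rough illustration of this lookup, the sketch below builds a hash from a flat list, in the same way as TAGS = Hash[*tags], and resolves a tag to a description. The three entries and both method names are assumptions made for the example; the real table lives in lib/engtagger.rb.

```ruby
# Hypothetical excerpt of a tag table, built from a flat
# [tag, description, tag, description, ...] list.
def sample_tag_table
  flat = ["NN", "Noun", "VB", "Verb, infinitive", "JJ", "Adjective"]
  Hash[*flat] # pairs up consecutive elements as key => value
end

# Look the tag up case-insensitively; return the input when it is unknown.
def explain_tag_sketch(tag)
  sample_tag_table.fetch(tag.to_s.upcase, tag)
end
```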

Return a regexp from a string argument that matches an XML-style pos tag
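A minimal sketch of such a helper follows, assuming that a tagged word looks like <nn>dog</nn> and that trailing whitespace belongs to the match. The name tag_regexp_sketch is hypothetical and the pattern is an illustration, not necessarily the one the library uses.

```ruby
# Build a regexp matching one XML-style POS element, e.g. <nn>dog</nn>,
# plus any whitespace after it. `tag` may itself be a regexp fragment
# such as 'nn[sp]*'.
def tag_regexp_sketch(tag)
  return nil if tag.nil?
  Regexp.new("<#{tag}>[^<]+</#{tag}>\\s*")
end
```

For example, tag_regexp_sketch('nn[sp]*') would match <nns>dogs</nns> but not <vbp>bark</vbp>.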

Return a class variable that holds probability data

Return a class variable that holds lexical data

Take a hash of parameters that override default values. See above for details.
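The override behaviour can be sketched as a plain hash merge. Only the keys below come from the attribute list above; the default values and method names are assumptions made for illustration.

```ruby
# Hypothetical defaults; the real values are set in lib/engtagger.rb.
def default_conf_sketch
  {
    unknown_word_tag: "",
    stem: false,
    weight_noun_phrases: false,
    longest_noun_phrase: 5,
    relax: false,
    debug: false
  }
end

# Values passed by the caller win over the defaults.
def build_conf_sketch(params = {})
  default_conf_sketch.merge(params)
end
```

Calling build_conf_sketch(stem: true) would change only :stem and leave every other key at its default.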

Public Instance methods

Examine the string provided and return it fully tagged in XML style

Given a preceding tag, assign a tag to a word. Called by the add_tags method. This method is a modified version of the Viterbi algorithm for part-of-speech tagging.
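One way to picture this step is a greedy, single-step choice: among the tags the lexicon lists for a word, pick the one with the highest product of transition probability and word count. The sketch below is a simplification under that assumption, not a full Viterbi decoder and not the library's code; transitions and lexicon are hypothetical stand-ins for the probability and lexical data.

```ruby
# Choose the tag for `word` that maximises
#   P(tag | prev_tag) * (count(word, tag) + 1)
# Returns nil when the word or the transition is unknown.
def assign_tag_sketch(prev_tag, word, transitions, lexicon)
  best_tag = nil
  best_score = 0.0
  lexicon.fetch(word, {}).each do |tag, count|
    prob = transitions.dig(prev_tag, tag)
    next if prob.nil?
    score = prob * (count + 1)
    if score > best_score
      best_score = score
      best_tag = tag
    end
  end
  best_tag
end
```

The same word can therefore receive different tags depending on what precedes it, for instance a noun reading after a determiner and a verb reading after "to".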

This changes any word not appearing in the lexicon to identifiable classes of words handled by a simple unknown word classification metric. Called by the clean_word method.

Strip the provided text of HTML-style tags and separate off any punctuation in preparation for tagging

This method determines whether a word should be considered in its lower or upper case form. This is useful in considering proper nouns and words that begin sentences. Called by add_tags.

Given a POS-tagged text, this method returns only the maximal noun phrases. May be called directly, but is also used by get_noun_phrases

This returns a compiled regexp for extracting maximal noun phrases from a POS-tagged text.

Similar to get_words, but requires a POS-tagged text as an argument.

Given a POS-tagged text, this method returns all nouns and their occurrence frequencies.
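The counting itself can be sketched with a scan over the tagged text. This illustration assumes the <tag>word</tag> format and the noun pattern nn[sp]* from the constants above; it ignores stemming and the other options, and the method name is hypothetical.

```ruby
# Count words wrapped in noun elements (<nn>, <nns>, <nnp>, ...).
def noun_counts_sketch(tagged)
  counts = Hash.new(0)
  tagged.scan(%r{<(nn[sp]*)>([^<]+)</\1>}) do |_tag, word|
    counts[word] += 1
  end
  counts
end
```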

Given a POS-tagged text, this method returns a hash of all proper nouns and their occurrence frequencies. The method is greedy and will return multi-word phrases, if possible, so it would find "Linguistic Data Consortium" as a single unit, rather than as three individual proper nouns. This method does not stem the found words.
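The greedy grouping can be illustrated by matching runs of adjacent <nnp> elements as a single unit. This is a sketch under the assumption that proper nouns are tagged <nnp>word</nnp>; the method name is hypothetical.

```ruby
# Treat one or more adjacent <nnp> elements as a single phrase
# and count how often each phrase occurs.
def proper_noun_counts_sketch(tagged)
  counts = Hash.new(0)
  tagged.scan(%r{(?:<nnp>[^<]+</nnp>\s*)+}) do |run|
    phrase = run.gsub(/<[^>]+>/, "").split.join(" ")
    counts[phrase] += 1
  end
  counts
end
```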

Return an easy-on-the-eyes tagged version of a text string. Applies add_tags and reformats to be easier to read.
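The text above does not spell out the readable format. The sketch below assumes a word/TAG layout purely for illustration, and the method name is hypothetical.

```ruby
# Rewrite each <tag>word</tag> element as word/TAG.
def readable_sketch(tagged)
  tagged.gsub(%r{<(\w+)>([^<]+)</\1>}) { "#{$2}/#{$1.upcase}" }
end
```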

Return an array of sentences (without POS tags) from a text.

Given a text string, return as many nouns and noun phrases as possible. Applies add_tags and involves three stages:

  • Tag the text
  • Extract all the maximal noun phrases
  • Recursively extract all noun phrases from the MNPs

Reads some included corpus data and saves it in a stored hash on the local file system. This is called automatically if the tagger can't find the stored lexicon.

Downcase the first letter of word

Load the 2-grams into a hash from YAML data: This is a naive (but fast) YAML data parser. It will load a YAML document with a collection of key: value entries ( {pos tag}: {probability} ) mapped onto single keys ( {tag} ). Each map is expected to be on a single line; e.g., det: { jj: 0.2, nn: 0.5, vb: 0.0002 }

Load the word lexicon into a hash from YAML data: This is a naive (but fast) YAML data parser. It will load a YAML document with a collection of key: value entries ( {pos tag}: {count} ) mapped onto single keys ( {a word} ). Each map is expected to be on a single line; e.g., key: { jj: 103, nn: 34, vb: 1 }
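Both loaders read the same one-map-per-line layout and differ only in the value type, so a single sketch can cover them. The method name and the use of a block to convert values are assumptions for this illustration; lines that do not fit the layout are simply skipped.

```ruby
# Parse lines such as `det: { jj: 0.2, nn: 0.5 }` into a nested hash.
# The block converts each value string (for example to Float or Integer).
def load_flat_map_sketch(text)
  table = {}
  text.each_line do |line|
    match = line.match(/\A(\S+):\s*\{\s*(.*?)\s*\}\s*\z/)
    next unless match # skip blank lines and document markers
    inner = {}
    match[2].split(/\s*,\s*/).each do |pair|
      tag, value = pair.split(/\s*:\s*/, 2)
      inner[tag] = yield(value)
    end
    table[match[1]] = inner
  end
  table
end
```

Probabilities would be read with a block such as { |v| Float(v) } and counts with { |v| Integer(v) }. Skipping a general YAML parser trades robustness for speed, which matches the "naive (but fast)" description.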

This method will reset the preceding tag to a sentence ender (PP). This prepares the first word of a new sentence to be tagged correctly.

Separate punctuation from words, where appropriate. This leaves trailing periods in place to be dealt with later. Called by the clean_text method.

This handles all of the trailing periods, keeping those that belong on abbreviations and removing those that seem to be at the end of sentences. This method makes some assumptions about the use of capitalization in the incoming text

Return the word stem as given by Stemmable module. This can be turned off with the class parameter @conf[:stem] => false.

Return a text string with the part-of-speech tags removed
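A minimal sketch of tag removal, assuming the XML-style <tag>word</tag> format; the method name is hypothetical and whitespace is normalised to single spaces.

```ruby
# Remove every <...> element marker and collapse runs of whitespace.
def strip_tags_sketch(tagged)
  tagged.gsub(/<[^>]+>/, "").split.join(" ")
end
```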

Upcase the first letter of word
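The two case helpers can be sketched as below; the method names are hypothetical. Unlike String#capitalize, this approach changes only the first character and leaves the rest of the word untouched.

```ruby
# Upcase only the first character of the word.
def upcase_first_sketch(word)
  word.sub(/\A./) { |c| c.upcase }
end

# Downcase only the first character of the word.
def downcase_first_sketch(word)
  word.sub(/\A./) { |c| c.downcase }
end
```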

Check whether the text is a valid string
