| Class | EngTagger |
| In: | lib/engtagger.rb |
| Parent: | Object |
English part-of-speech tagger class
| VERSION | = | '0.1.1' | |
| NUM | = | get_ext('cd') | Regexps to match XML-style part-of-speech tags |
| GER | = | get_ext('vbg') | |
| ADJ | = | get_ext('jj[rs]*') | |
| PART | = | get_ext('vbn') | |
| NN | = | get_ext('nn[sp]*') | |
| NNP | = | get_ext('nnp') | |
| PREP | = | get_ext('in') | |
| DET | = | get_ext('det') | |
| PAREN | = | get_ext('[lr]rb') | |
| QUOT | = | get_ext('ppr') | |
| SEN | = | get_ext('pp') | |
| WORD | = | get_ext('\w+') | |
| TAGS | = | Hash[*tags] | |
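The constants above are regexps built by get_ext, whose body is not shown on this page. As a rough sketch of how such patterns might be used, the hypothetical helper `tag_pattern` below builds a regexp for one tag name, assuming tagged text of the form `<nn>cat</nn> <vbz>sits</vbz>`. The helper name, the exact pattern, and the sample sentence are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for get_ext (the real implementation is not shown here).
# Assumption: tagged text looks like "<nn>cat</nn> <vbz>sits</vbz>".
def tag_pattern(tag)
  Regexp.new("<#{tag}>[^<]+</#{tag}>\\s*")
end

# Strip the surrounding tags from every element matching the pattern.
def words_matching(tagged, pattern)
  tagged.scan(pattern).map { |element| element.gsub(/<[^>]+>/, '').strip }
end

tagged  = '<det>the</det> <cd>two</cd> <nns>cats</nns> <vbd>sat</vbd>'
nouns   = words_matching(tagged, tag_pattern('nn[sp]*'))
numbers = words_matching(tagged, tag_pattern('cd'))
```

With this sample input, `nouns` holds the plural noun and `numbers` holds the cardinal number.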
| conf | [RW] | Hash storing config values |
Given the preceding tag, assign a tag to the current word. Called by the add_tags method. This method is a modified version of the Viterbi algorithm for part-of-speech tagging.
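The idea of choosing a tag from the preceding tag can be sketched as a greedy first-order step: score each candidate tag by the transition probability from the previous tag multiplied by how often the word carries that tag. This is not the library's code. The method name `pick_tag`, the fallback values, and the sample data are assumptions made for illustration.

```ruby
# Greedy, first-order sketch of tag assignment (not the library's implementation).
# transitions: { previous_tag => { tag => probability } }   (hypothetical data)
# lexicon:     { word => { tag => count } }                 (hypothetical data)
def pick_tag(prev_tag, word, transitions, lexicon)
  # Assumption: a word missing from the lexicon is treated as a noun.
  candidates = lexicon.fetch(word, { 'nn' => 1 })
  row = transitions.fetch(prev_tag, {})
  # Assumption: unseen transitions get a small floor probability.
  candidates.max_by { |tag, count| row.fetch(tag, 0.0001) * count }.first
end

transitions = {
  'det' => { 'nn' => 0.5, 'vb' => 0.0002 },
  'to'  => { 'nn' => 0.05, 'vb' => 0.6 }
}
lexicon = { 'run' => { 'nn' => 34, 'vb' => 103 } }

after_det = pick_tag('det', 'run', transitions, lexicon)
after_to  = pick_tag('to', 'run', transitions, lexicon)
```

The same word receives a different tag depending on the tag before it, which is the behavior the description above relies on.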
Maps any word that does not appear in the lexicon to an identifiable word class, using a simple unknown-word classification metric. Called by the clean_word method.
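A minimal sketch of such an unknown-word classifier follows. The class labels and the patterns are hypothetical and are not taken from the library. They only illustrate mapping an unseen word to a coarse class based on its surface form.

```ruby
# Sketch of an unknown-word classifier. All labels below are hypothetical.
def classify_unknown(word)
  case word
  when /\A-?\d+(?:[.,]\d+)*\z/   then '*NUMBER*'       # 42, 3.14, 1,000
  when /\A\d+(?:st|nd|rd|th)\z/i then '*ORDINAL*'      # 42nd, 3rd
  when /\A[A-Z][a-z]+\z/         then '*CAPITALIZED*'  # likely a proper noun
  when /ly\z/                    then '*ADVERB_LIKE*'  # suffix heuristic
  when /ing\z/                   then '*GERUND_LIKE*'  # suffix heuristic
  else '*UNKNOWN*'
  end
end
```

The order of the `when` branches matters: the first matching pattern wins, so the more specific shapes are listed first.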
Strip the provided text of HTML-style tags and separate off any punctuation in preparation for tagging.
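The cleaning step might look roughly like the sketch below. The method name `strip_and_split` and the punctuation set are assumptions. Periods are deliberately left attached to their words, since trailing periods are described as being handled separately.

```ruby
# Sketch of text cleaning before tagging (hypothetical helper, not the library's code).
def strip_and_split(text)
  plain = text.gsub(/<[^>]*>/, ' ')           # replace HTML-style tags with spaces
  plain = plain.gsub(/([,;:!?()"])/, ' \1 ')  # pad punctuation so it becomes its own token
  plain.split                                 # split on runs of whitespace
end

tokens = strip_and_split('<p>Hello, <b>world</b>!</p>')
```

For this input, `tokens` contains the two words and the two punctuation marks as four separate entries.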
This method determines whether a word should be considered in its lower or upper case form. This is useful in considering proper nouns and words that begin sentences. Called by add_tags.
Given a POS-tagged text, this method returns only the maximal noun phrases. May be called directly, but is also used by get_noun_phrases.
Given a POS-tagged text, this method returns a hash of all proper nouns and their occurrence frequencies. The method is greedy and will return multi-word phrases, if possible, so it would find "Linguistic Data Consortium" as a single unit, rather than as three individual proper nouns. This method does not stem the found words.
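The greedy grouping can be sketched with a regexp that consumes a whole run of adjacent proper-noun elements at once. The sketch assumes tagged text of the form `<nnp>Word</nnp>`. The method name `proper_noun_counts` and the sample sentence are hypothetical.

```ruby
# Sketch of greedy proper-noun extraction (not the library's implementation).
# Assumption: proper nouns are wrapped as <nnp>Word</nnp>.
def proper_noun_counts(tagged)
  counts = Hash.new(0)
  # One or more consecutive <nnp> elements are treated as a single phrase.
  tagged.scan(%r{(?:<nnp>[^<]+</nnp>\s*)+}) do |run|
    phrase = run.scan(%r{<nnp>([^<]+)</nnp>}).flatten.join(' ')
    counts[phrase] += 1
  end
  counts
end

tagged = '<det>the</det> <nnp>Linguistic</nnp> <nnp>Data</nnp> ' \
         '<nnp>Consortium</nnp> <vbz>releases</vbz> <nns>corpora</nns> ' \
         '<in>for</in> <nnp>Ruby</nnp>'
counts = proper_noun_counts(tagged)
```

Because the outer pattern is greedy, the three adjacent proper nouns are counted once as one phrase instead of three times as separate words.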
Return an easy-on-the-eyes tagged version of a text string. Applies add_tags and reformats to be easier to read.
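The reformatting step could be sketched as below. The `word/TAG` output layout is an assumption about what a readable form might look like, and the method name `readable` is hypothetical.

```ruby
# Sketch of a readable rendering of tagged text.
# Assumption: input is <tag>word</tag>, output is word/TAG.
def readable(tagged)
  tagged.gsub(%r{<(\w+)>([^<]+)</\1>}) { "#{$2}/#{$1.upcase}" }
end

pretty = readable('<det>the</det> <nn>cat</nn> <vbz>sits</vbz>')
```

The backreference `\1` ensures that an opening tag is only paired with a closing tag of the same name.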
Given a text string, return as many nouns and noun phrases as possible. Applies add_tags and involves three stages:
Reads some included corpus data and saves it in a stored hash on the local file system. This is called automatically if the tagger can't find the stored lexicon.
Load the 2-grams into a hash from YAML data. This is a naive (but fast) YAML data parser. It will load a YAML document with a collection of key: value entries ( {pos tag}: {probability} ) mapped onto single keys ( {tag} ). Each map is expected to be on a single line; e.g., det: { jj: 0.2, nn: 0.5, vb: 0.0002 }
Load the word lexicon into a hash from YAML data. This is a naive (but fast) YAML data parser. It will load a YAML document with a collection of key: value entries ( {pos tag}: {count} ) mapped onto single keys ( {a word} ). Each map is expected to be on a single line; e.g., key: { jj: 103, nn: 34, vb: 1 }
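A line-oriented parser for the single-line map format described in the two entries above might look like the following sketch. It is not the library's code. The method name `parse_flat_maps` is hypothetical, and the sketch assumes that keys contain no whitespace and that values are plain numbers.

```ruby
# Sketch of a naive single-line YAML map parser (hypothetical helper).
# Handles lines such as:  det: { jj: 0.2, nn: 0.5, vb: 0.0002 }
def parse_flat_maps(text)
  result = {}
  text.each_line do |line|
    match = line.strip.match(/\A(\S+):\s*\{(.*)\}\z/)
    next unless match # skip blank lines and anything not in the expected shape

    key = match[1]
    result[key] = match[2].strip.split(/\s*,\s*/).each_with_object({}) do |pair, map|
      tag, value = pair.split(':', 2).map(&:strip)
      # Values with a decimal point become Floats, all others Integers.
      map[tag] = value.include?('.') ? value.to_f : value.to_i
    end
  end
  result
end

data = parse_flat_maps(<<~YAML)
  det: { jj: 0.2, nn: 0.5, vb: 0.0002 }
  run: { nn: 34, vb: 103 }
YAML
```

Reading line by line with one regexp avoids building a full YAML parse tree, which is the trade-off the descriptions call naive but fast. The cost is that any document not in this exact shape is silently skipped.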
Separate punctuation from words, where appropriate. This leaves trailing periods in place to be dealt with later. Called by the clean_text method.
This handles all of the trailing periods, keeping those that belong on abbreviations and removing those that seem to be at the end of sentences. This method makes some assumptions about the use of capitalization in the incoming text.
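One way to picture the capitalization assumption is the sketch below. A trailing period is kept when the word is a known abbreviation or when the next token starts with a lowercase letter, and is split off otherwise. The method name `split_trailing_period`, the abbreviation list, and the exact heuristic are assumptions and are not taken from the library.

```ruby
# Sketch of trailing-period handling (hypothetical heuristic).
# Returns the token unchanged, or the word and the period as two tokens.
def split_trailing_period(token, next_token, abbreviations)
  return [token] unless token.length > 1 && token.end_with?('.')

  stem = token.chomp('.')
  known_abbreviation = abbreviations.include?(stem.downcase)
  # Assumption: a lowercase continuation means the sentence has not ended.
  continues = !next_token.nil? && next_token.match?(/\A[a-z]/)
  known_abbreviation || continues ? [token] : [stem, '.']
end

abbreviations = %w[dr mr mrs etc]
kept  = split_trailing_period('Dr.', 'Smith', abbreviations)
split = split_trailing_period('home.', 'Then', abbreviations)
```

A heuristic like this fails on text that is all lowercase or all uppercase, which is why the capitalization assumption matters.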