Statistical techniques for natural language analysis



- The dog ate.
- Salespeople sold the dog biscuits.

The principle of choosing a part of speech

- Dumb (baseline) algorithm: 90% accuracy
- State-of-the-art taggers: 97%
- Human annotators: 98%

Hidden Markov Models

Another approach (transformational tagging)

- Apply the dumb algorithm first.
- There is a set of rules:
  - Change a word's tag from X to Y if the tag of the previous word is Z.
- Apply these rules several times.
- Runs faster
- HMM training vs. TT training

(No starting base)
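
The transformational tagging steps above can be sketched as follows. This is a minimal illustration, not a trained tagger: it assumes the "dumb algorithm" means assigning each word its most frequent tag, and the tiny corpus, the tag names, and the single rule are hypothetical values made up for the example.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def apply_rules(tags, rules, passes=2):
    """Apply every (from_tag, to_tag, prev_tag) rule left to right, several times."""
    tags = list(tags)
    for _ in range(passes):
        for from_tag, to_tag, prev_tag in rules:
            for i in range(1, len(tags)):
                if tags[i] == from_tag and tags[i - 1] == prev_tag:
                    tags[i] = to_tag
    return tags


def tag(words, lexicon, rules, default="noun"):
    """Baseline tagging followed by the transformation rules."""
    initial = [lexicon.get(word, default) for word in words]
    return apply_rules(initial, rules)


# Hypothetical training data: "can" is a modal twice and a noun once.
corpus = [
    [("the", "det"), ("dog", "noun"), ("can", "modal"), ("run", "verb")],
    [("we", "pron"), ("can", "modal"), ("run", "verb")],
    [("the", "det"), ("can", "noun"), ("fell", "verb")],
]
lexicon = train_baseline(corpus)

# Hypothetical rule: change tag "modal" to "noun" if the previous tag is "det".
rules = [("modal", "noun", "det")]

print(tag(["the", "can", "fell"], lexicon, rules))
```

The baseline alone would tag "can" as a modal in "the can fell"; the rule repairs that case because the previous tag is a determiner.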


- Build trees from the sentence using the existing grammar rules.
- Example:

(s (np (det the) (noun stranger))
   (vp (verb ate)
       (np (det the) (noun donut)
           (pp (prep with) (np (det a) (noun fork))))))
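
A bracketed tree like the one above can be read into a nested structure with a few lines of code. The sketch below is one possible reader, assuming well-balanced brackets and whitespace-separated tokens; the function names `parse_brackets` and `leaves` are made up for this example.

```python
def parse_brackets(text):
    """Parse a bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token          # a plain word (leaf)
        node = [tokens[pos]]      # the first token after "(" is the node label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                  # skip the closing ")"
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


text = """
(s (np (det the) (noun stranger))
   (vp (verb ate)
       (np (det the) (noun donut)
           (pp (prep with) (np (det a) (noun fork))))))
"""
tree = parse_brackets(text)
print(tree[0], leaves(tree))
```

An unbalanced input (a missing closing bracket) makes this reader fail with an `IndexError`, which is a crude but useful check on hand-typed trees.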

Building your own statistical parser

- Evaluation
- The Penn Treebank provides ready-made parsed examples
- Compare the parser's output with them
- Find the rules to apply
- Assign probabilities to the rules
- Find the most likely parse

PCFG (probabilistic context-free grammars)

- s → np vp
- vp → verb np
- vp → verb np np
- np → det noun
- np → noun
- np → det noun noun
- np → np np


The probability of a constructed tree is computed as the product of the probabilities of the rules used to build it.
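
As a rough illustration, the probability of a tree under a PCFG can be computed by multiplying the probabilities of the rules it uses. The rule probabilities below are hypothetical numbers chosen only for the example, trees are represented as nested lists `[label, child, ...]`, and the choice of words under part-of-speech nodes is ignored for simplicity.

```python
from math import prod

# Hypothetical rule probabilities, for illustration only.
RULE_PROB = {
    ("s", ("np", "vp")): 1.0,
    ("vp", ("verb", "np")): 0.8,
    ("np", ("det", "noun")): 0.5,
    ("np", ("noun",)): 0.3,
}


def rules_used(tree):
    """Yield (lhs, rhs) for every node whose children are themselves nodes."""
    label, children = tree[0], tree[1:]
    if all(isinstance(child, str) for child in children):
        return  # part-of-speech node over a word: no grammar rule counted here
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from rules_used(child)


def tree_probability(tree):
    """Product of the probabilities of all rules used in the tree."""
    return prod(RULE_PROB[rule] for rule in rules_used(tree))


tree = ["s",
        ["np", ["det", "the"], ["noun", "dog"]],
        ["vp", ["verb", "ate"], ["np", ["noun", "biscuits"]]]]

print(tree_probability(tree))  # product 1.0 * 0.5 * 0.8 * 0.3
```

When a sentence has several possible trees, the parser would prefer the one for which this product is largest.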

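Building your own PCFG: the simple option

- Take the ready-made Penn Treebank
- Read all the trees from it
- For each tree, add every new rule it contains
- P(rule) = number of occurrences of the rule divided by the total number of occurrences of rules with the same left-hand side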
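
The counting procedure above might look like the sketch below. It makes two assumptions: the "total" is taken to be the count of all rules with the same left-hand side (the usual normalization for a PCFG), and a two-tree toy list of nested lists stands in for the treebank, since reading the real Penn Treebank files is outside the scope of the example.

```python
from collections import Counter


def count_rules(tree, counts):
    """Add every rule used in the tree to the counter."""
    label, children = tree[0], tree[1:]
    if all(isinstance(child, str) for child in children):
        return  # part-of-speech node over a word: not counted as a grammar rule
    counts[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        count_rules(child, counts)


def estimate_pcfg(trees):
    """Relative-frequency estimate: count(rule) / count(rules with the same lhs)."""
    counts = Counter()
    for tree in trees:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Toy stand-in for a treebank (hypothetical trees).
trees = [
    ["s", ["np", ["det", "the"], ["noun", "dog"]], ["vp", ["verb", "ate"]]],
    ["s", ["np", ["noun", "salespeople"]],
          ["vp", ["verb", "sold"],
                 ["np", ["det", "the"], ["noun", "biscuits"]]]],
]

pcfg = estimate_pcfg(trees)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

With this normalization the probabilities of all rules sharing a left-hand side sum to one, which is what makes the product over a tree a proper probability.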

Two state-of-the-art statistical parsers. Markov grammars

- Solve the problem of very rare rules.
- Idea: instead of storing explicit rules, we estimate the probabilities that, for example,
  - np = prep + ...

Lexicalized parsing

$$p(s, \pi) = \prod_{c} p(h(c) \mid m(c), t(c)) \cdot p(r(c) \mid h(c))$$

Here π is the parse of sentence s, and the product runs over the constituents c of π.


- Assign to each node of the tree a word (its head) that characterizes it.
- p(r | h) is the probability that rule r is applied at a node with head h.
- p(h | m, t) is the probability that h is the head of a node that has tag t and is a child of a node with head m.

Lexicalized parsing

- Example

(S (NP The (ADJP most troublesome) report)
   (VP may
       (VP be
           (NP (NP the August merchandise trade deficit)
               (ADJP due (ADVP out) (NP tomorrow))))))

- p(h | m, t) = p(be | may, vp)
- p(r | h) = p(posvp → aux np | be)
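
To make the two factors concrete, the sketch below multiplies p(h | m, t) and p(r | h) over a list of constituents. Everything numeric here is hypothetical: the probability tables, the rule strings, and the two constituents are placeholders chosen to show the shape of the computation, not values from a trained model.

```python
from math import prod
from typing import NamedTuple


class Constituent(NamedTuple):
    head: str         # h(c): head word of the constituent
    parent_head: str  # m(c): head word of the parent constituent
    tag: str          # t(c): tag of the constituent
    rule: str         # r(c): rule applied at the constituent


# Hypothetical probability tables, for illustration only.
P_HEAD = {                       # p(h | m, t), keyed by (h, m, t)
    ("be", "may", "vp"): 0.2,
    ("deficit", "be", "np"): 0.01,
}
P_RULE = {                       # p(r | h), keyed by (r, h)
    ("vp -> aux np", "be"): 0.3,
    ("np -> np adjp", "deficit"): 0.1,
}


def parse_probability(constituents):
    """Product over constituents of p(h | m, t) * p(r | h)."""
    return prod(
        P_HEAD[(c.head, c.parent_head, c.tag)] * P_RULE[(c.rule, c.head)]
        for c in constituents
    )


constituents = [
    Constituent(head="be", parent_head="may", tag="vp", rule="vp -> aux np"),
    Constituent(head="deficit", parent_head="be", tag="np", rule="np -> np adjp"),
]
print(parse_probability(constituents))
```

A real parser would estimate both tables from a treebank and would need smoothing, because most (head, parent head, tag) combinations never occur in the training data.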

Lexicalized parsing

- “the August merchandise trade deficit”
- rule = np → det propernoun noun noun noun

Conditioning events    p(“August”)    p(rule)
(none)                 2.7 × 10^-4    3.8 × 10^-5
Part of speech         2.8 × 10^-3    9.4 × 10^-5
h(c) = “deficit”       1.9 × 10^-1    6.3 × 10^-3
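
The effect of conditioning is easier to see as ratios. The short calculation below uses only the figures from the table, taking its first row as the baseline; the dictionary keys and the helper name `gains` are labels chosen for this example.

```python
# Values copied from the table: (p("August"), p(rule)) for each conditioning event.
table = {
    "baseline (first row)": (2.7e-4, 3.8e-5),
    "part of speech": (2.8e-3, 9.4e-5),
    'h(c) = "deficit"': (1.9e-1, 6.3e-3),
}


def gains(rows, baseline_key):
    """Ratio of each row's probabilities to the baseline row's probabilities."""
    base_word, base_rule = rows[baseline_key]
    return {
        key: (p_word / base_word, p_rule / base_rule)
        for key, (p_word, p_rule) in rows.items()
    }


for key, (word_gain, rule_gain) in gains(table, "baseline (first row)").items():
    print(f"{key:22} word x{word_gain:6.1f}   rule x{rule_gain:6.1f}")
```

Conditioning on the head “deficit” raises the probability of “August” by a factor of roughly 700 and the probability of the rule by a factor of roughly 165, far more than conditioning on the part of speech alone.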


