Deep Learning for NLP (without Magic)

Richard Socher

Stanford / MetaMind. ML Summer School, Lisbon. With a big thank you to Chris Manning and Yoshua Bengio, with whom I did previous versions of this lecture.

[Figure: hand-designed feature pipelines for deep learning vs. NER, drawing on resources such as WordNet, SRL, and a parser]

Most current machine learning works well because of human-designed representations and input features. Machine learning then becomes just optimizing weights to best make a final prediction. Representation learning attempts to automatically learn good features or representations. Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction.

A Deep Architecture

Mainly, work has explored deep belief networks (DBNs), Markov random fields with multiple layers, and various types of multiple-layer neural networks.

• Output layer: here predicting a supervised target
• Hidden layers: these learn more abstract representations as you head up
• Input layer: raw sensory inputs (roughly)

Part 1.1: The Basics

Five Reasons to Explore Deep Learning

#1 Learning representations

Handcrafting features is time-consuming. The features are often both over-specified and incomplete, and the work has to be done again for each task/domain/… We must move beyond handcrafted features and simple ML.

Humans develop representations for learning and reasoning; our computers should do the same. Deep learning provides a way of doing this.

#2 The need for distributed representations

Current NLP systems are incredibly fragile because of their atomic symbol representations.

[Figure: a parser error illustrating this fragility, e.g. a crazy sentential complement, such as for “likes [(being) crazy]”]

#2 The need for distributional & distributed representations

Learned word representations help enormously in NLP: they provide a powerful similarity model for words.

Distributional-similarity-based word clusters greatly help most applications:
• +1.4% F1 in dependency parsing, a 15.2% error reduction (Koo & Collins 2008, Brown clustering)
• +3.4% F1 in named entity recognition, a 23.7% error reduction (Stanford NER, exchange clustering)

Distributed representations can do even better by representing more dimensions of similarity.

#2 The need for distributed representations

[Figure: clustering vs. multi-clustering of the input space into regions C1, C2, C3]

Learning features that are not mutually exclusive can be exponentially more efficient than nearest-neighbor-like or clustering-like models.

Distributed representations deal with the curse of dimensionality. Generalizing locally (e.g., nearest neighbors) requires representative examples for all relevant variations! Classic solutions:
• Manual feature design
• Assuming a smooth target function (e.g., linear models)
• Kernel methods (linear in terms of a kernel based on data points)
Neural networks parameterize and learn a “similarity” kernel.

#3 Unsupervised feature and weight learning

Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning). But almost all data is unlabeled, so most information must be acquired unsupervised.

Fortunately, a good model of observed data can really help you learn classification decisions.

#4 Learning multiple levels of representation

Biologically inspired learning: the cortex seems to have a generic learning algorithm, and the brain has a deep architecture.

We need good intermediate representations that can be shared across tasks. Multiple levels of latent variables allow combinatorial sharing of statistical strength. Insufficient model depth can be exponentially inefficient.

[Figure: outputs for Task 1, Task 2, and Task 3 computed from shared intermediate layers over a common linguistic input]

#4 Learning multiple levels of representation [Lee et al. ICML 2009; Lee et al. NIPS 2009]

Successive model layers learn deeper intermediate representations (Layer 1 → Layer 2 → Layer 3), up to high-level linguistic representations.

Handling the recursivity of human language

Human sentences are composed from words and phrases. We need compositionality in our ML models. Recursion: the same operator (same parameters) is applied repeatedly on different components.

[Figure: a sequence model over inputs x_{t−1}, x_t, x_{t+1} with latent states z_{t−1}, z_t, z_{t+1}, next to a parse tree for “A small crowd quietly enters the historic church” (S → NP “A small crowd”, VP → “quietly enters” + NP → Det. “the”, Adj. “historic”, N. “church”) mapped to semantic representations]

#5 Why now?

Despite prior investigation and understanding of many of the algorithmic techniques, before 2006 training deep architectures was unsuccessful. What has changed?
• Faster machines and more data help DL more than other algorithms
• New methods for unsupervised pre-training have been developed (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.)
• More efficient parameter estimation methods
• Better understanding of model regularization, and more

Deep Learning models have already achieved impressive results for HLT

Neural Language Model [Mikolov et al. Interspeech 2011], WSJ ASR task:

Model                    | Eval WER
KN5 baseline             | 17.2
Discriminative LM        | 16.9
Recurrent NN combination | 14.4

MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]:

Acoustic model & training          | Recog \ WER   | RT03S FSH   | Hub5 SWB
GMM 40-mix, BMMI, SWB 309h         | 1-pass −adapt | 27.4        | 23.6
DBN-DNN 7 layers x 2048, SWB 309h  | 1-pass −adapt | 18.5 (−33%) | 16.1 (−32%)
GMM 72-mix, BMMI, FSH 2000h        | k-pass +adapt | 18.6        | 17.1

“The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product.”

Deep Learning Models Have Interesting Performance Characteristics

Deep learning models can now be very fast in some circumstances:
• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
• WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL chunking 94.32% F1

Changes in computing technology favor deep learning:
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster
These trends are even stronger with multi-core CPUs and GPUs.

Outline of the Tutorial

1. The Basics
   1. Motivations
   2. From logistic regression to neural networks
   3. Word representations
   4. Unsupervised word vector learning
   5. Backpropagation training
   6. Learning word-level classifiers: POS and NER
   7. Sharing statistical strength
2. Recursive Neural Networks
   1. Motivation
   2. Recursive Neural Networks for Parsing
   3. Optimization and Backpropagation Through Structure
   4. Compositional Vector Grammars: Parsing
   5. Recursive Autoencoders (short): Paraphrase Detection
   6. Matrix-Vector RNNs (short): Relation Classification
   7. Recursive Neural Tensor Networks: Sentiment Analysis
   8. Dependency Tree RNNs: Sentence-Image Search
3. Other Models, Applications, Discussion, and Resources
   1. Assorted Speech and NLP Applications
   2. Deep Learning: General Strategy and Tricks
   3. Resources (readings, code, …)
   4. Discussion

Part 1.2: The Basics

From logistic regression to neural nets


Demystifying neural networks

Neural networks come with their own terminological baggage … just like SVMs. But if you understand how logistic regression or maxent models work, then you already understand the operation of a basic neural network neuron!

A single neuron: a computational unit with n (here 3) inputs and 1 output, and parameters W, b.

[Figure: a single neuron with inputs, an activation function, and an output; the bias unit corresponds to the intercept term]

From Maxent Classifiers to Neural Networks

In NLP, a maxent classifier is normally written as:

  P(c | d, λ) = exp(Σ_i λ_i f_i(c, d)) / Σ_{c′∈C} exp(Σ_i λ_i f_i(c′, d))

Supervised learning gives us a distribution for datum d over the classes in C.

Vector form:

  P(c | d, λ) = e^{λᵀf(c,d)} / Σ_{c′} e^{λᵀf(c′,d)}

Such a classifier is used as-is in a neural network (“a softmax layer”), often as the top layer: J = softmax(λ·x). But for now we’ll derive a two-class logistic model for one neuron.

From Maxent Classifiers to Neural Networks

Make it two-class, starting from the vector form:

  P(c1 | d, λ) = e^{λᵀf(c1,d)} / (e^{λᵀf(c1,d)} + e^{λᵀf(c2,d)})
               = 1 / (1 + e^{λᵀ[f(c2,d) − f(c1,d)]})
               = 1 / (1 + e^{−λᵀx})        for x = f(c1,d) − f(c2,d)
               = f(λᵀx)

for f(z) = 1/(1 + exp(−z)), the logistic function – a sigmoid non-linearity.

This is exactly what a neuron computes

  h_{w,b}(x) = f(wᵀx + b),   with f(z) = 1 / (1 + e^{−z})

w, b are the parameters of this neuron, i.e., this logistic regression model. For b, we can have an “always on” feature, which gives a class prior, or separate it out as a bias term.
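A minimal numpy sketch of that neuron; the values of x, w, and b below are arbitrary toy numbers chosen only to show the computation.

```python
import numpy as np

def logistic(z):
    """f(z) = 1 / (1 + exp(-z)), the sigmoid non-linearity."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """h_{w,b}(x) = f(w^T x + b): exactly a two-class logistic regression."""
    return logistic(w @ x + b)

x = np.array([1.0, 0.5, -0.3])   # n = 3 inputs
w = np.array([0.2, -0.4, 0.1])   # weights
b = 0.05                         # bias (intercept / "always on" feature)
print(neuron(x, w, b))           # an output in (0, 1)
```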

A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs … but we don’t have to decide ahead of time what variables these logistic regressions are trying to predict!

… which we can feed into another logistic regression function. It is the training criterion that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Before we know it, we have a multilayer neural network….

Matrix notation for a layer

We have
  a1 = f(W11 x1 + W12 x2 + W13 x3 + b1)
  a2 = f(W21 x1 + W22 x2 + W23 x3 + b2)
  etc.

In matrix notation:
  z = Wx + b
  a = f(z)

where f is applied element-wise: f([z1, z2, z3]) = [f(z1), f(z2), f(z3)]
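As a small illustration, the whole layer is one matrix-vector product plus an element-wise nonlinearity; the sizes and random parameters below are toy assumptions.

```python
import numpy as np

def layer(x, W, b, f=np.tanh):
    """One layer: z = Wx + b, a = f(z), with f applied element-wise."""
    return f(W @ x + b)

rng = np.random.default_rng(1)
x = rng.normal(size=3)        # three inputs x1, x2, x3
W = rng.normal(size=(3, 3))   # weights W_ij
b = rng.normal(size=3)        # biases b_i
a = layer(x, W, b)
print(a.shape, a)             # three activations a1, a2, a3
```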

How do we train the weights W? • For a single supervised layer, we train just like a maxent model – we calculate and use error derivatives (gradients) to improve • Online learning: Stochastic gradient descent (SGD) • Batch learning: Conjugate gradient or L-BFGS • A multilayer net could be more complex because the internal (“hidden”) logistic units make the function non-convex … just as for hidden CRFs [Quattoni et al. 2005, Gunawardana et al. 2005] • But we can use the same ideas and techniques • Just without guarantees …

• We “backpropagate” error derivatives through the model 29

Non-linearities: Why they’re needed • For logistic regression: map to probabilities • Here: function approximation, e.g., regression or classification • Without non-linearities, deep neural networks can’t do anything more than a linear transform • Extra layers could just be compiled down into a single linear transform • Probabilistic interpretation unnecessary except in the Boltzmann machine/graphical models • People often use other non-linearities, such as tanh, as we’ll discuss in part 3


Summary

Knowing the meaning of words! You now understand the basics and the relation to other models:
• Neuron = logistic regression or similar function
• Input layer = input training/test vector
• Bias unit = intercept term / always-on feature
• Activation = response
• Activation function is a logistic (or similar “sigmoid”) nonlinearity
• Backpropagation = running stochastic gradient descent backward layer-by-layer in a multilayer network
• Weight decay = regularization / Bayesian prior

Effective deep learning became possible through unsupervised pre-training
[Erhan et al., JMLR 2010] (with RBMs and denoising auto-encoders)

[Figure: 0–9 handwritten digit recognition error rate (MNIST data), purely supervised neural net vs. with unsupervised pre-training]

Part 1.3: The Basics

Word Representations


The standard word representation

The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk. In vector space terms, this is a vector with one 1 and a lot of zeroes:

  [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)

We call this a “one-hot” representation. Its problem:

  motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]  AND
  hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]  =  0
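A tiny numpy illustration of why one-hot vectors carry no similarity information; the four-word vocabulary is just an assumption for the example.

```python
import numpy as np

vocab = ["hotel", "conference", "walk", "motel"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, size=len(vocab)):
    v = np.zeros(size)
    v[index[word]] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal, so "motel" AND "hotel" = 0:
print(one_hot("motel") @ one_hot("hotel"))   # 0.0 -- no notion of similarity
```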

Distributional similarity based representations

You can get a lot of value by representing a word by means of its neighbors. “You shall know a word by the company it keeps” (J. R. Firth 1957: 11). This is one of the most successful ideas of modern statistical NLP.

  government debt problems turning into banking crises as has happened in
  saying that Europe needs unified banking regulation to replace the hodgepodge

These context words will represent banking. You can vary whether you use local or large context to get a more syntactic or semantic clustering.

Class-based (hard) and soft clustering word representations

Class-based models learn word classes of similar words based on distributional information (~ class HMM):
• Brown clustering (Brown et al. 1992)
• Exchange clustering (Martin et al. 1998, Clark 2003)
• Desparsification, and a great example of unsupervised pre-training

Soft clustering models learn, for each cluster/topic, a distribution over words of how likely that word is in each cluster:
• Latent Semantic Analysis (LSA/LSI), random projections
• Latent Dirichlet Allocation (LDA), HMM clustering

Neural word embeddings as a distributed representation

Similar idea: combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Turian et al. 2010). In all of these approaches, including deep learning models, a word is represented as a dense vector, e.g.

  linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Neural word embeddings – visualization

[Figure: 2-D visualization of learned neural word embeddings]

Stunning new result at this conference! Mikolov, Yih & Zweig (NAACL 2013)

These representations are way better at encoding dimensions of similarity than we realized!
• Analogies testing dimensions of similarity can be solved quite well just by doing vector subtraction in the embedding space
• Syntactically: x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families, and similarly for verb and adjective morphological forms
• Semantically (SemEval 2012 task 2): x_shirt − x_clothing ≈ x_chair − x_furniture
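A toy sketch of the vector-arithmetic idea; the 3-dimensional embeddings below are invented purely for illustration and are not real learned vectors.

```python
import numpy as np

# Made-up toy embeddings, arranged so the morphological offset is consistent.
emb = {
    "apple":    np.array([0.9, 0.1, 0.0]),
    "apples":   np.array([0.9, 0.1, 0.7]),
    "car":      np.array([0.1, 0.8, 0.0]),
    "cars":     np.array([0.1, 0.8, 0.7]),
    "family":   np.array([0.4, 0.4, 0.0]),
    "families": np.array([0.4, 0.4, 0.7]),
}

def closest(v, exclude=()):
    """Return the word whose vector has the highest cosine similarity to v."""
    return max((w for w in emb if w not in exclude),
               key=lambda w: v @ emb[w] / (np.linalg.norm(v) * np.linalg.norm(emb[w])))

# x_apple - x_apples + x_cars should land near x_car
query = emb["apple"] - emb["apples"] + emb["cars"]
print(closest(query, exclude={"apple", "apples", "cars"}))   # -> "car"
```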

Stunning new result at this conference! Mikolov, Yih & Zweig (NAACL 2013)

Syntactic analogies:

Method       | Syntax (% correct)
LSA 320 dim  | 16.5 [best]
RNN 80 dim   | 16.2
RNN 320 dim  | 28.5
RNN 1600 dim | 39.6

Semantic relations (SemEval 2012 task 2):

Method                  | Spearman ρ
UTD-NB (Rink & H. 2012) | 0.230 [SemEval winner]
LSA 640                 | 0.149
RNN 80                  | 0.211
RNN 1600                | 0.275

Advantages of the neural word embedding approach

Compared to a method like LSA, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks (“discriminative fine-tuning”). For instance, sentiment is usually not captured in unsupervised word embeddings but can be in neural word vectors.

We can also build representations for larger linguistic units (see Part 2).

Part 1.4: The Basics

Unsupervised word vector learning


A neural network for learning word vectors (Collobert et al. JMLR 2011)

Idea: a word and its context is a positive training sample; a random word in that same context gives a negative training sample:

  positive: cat chills on a mat
  negative: cat chills Jeju a mat

Similar: implicit negative evidence in contrastive estimation (Smith and Eisner 2005).

A neural network for learning word vectors

How do we formalize this idea? Ask that

  score(cat chills on a mat) > score(cat chills Jeju a mat)

How do we compute the score?
• With a neural network
• Each word is associated with an n-dimensional vector

Word embedding matrix

• Initialize all word vectors randomly to form a word embedding matrix L of size n × |V|, with one column per vocabulary word (the, cat, mat, …)
• These are the word features we want to learn
• Also called a look-up table
• Conceptually you get a word’s vector by left-multiplying a one-hot vector e by L: x = Le

Word vectors as input to a neural network

• score(cat chills on a mat)
• To describe a phrase, retrieve (via index) the corresponding word vectors from L: cat, chills, on, a, mat
• Then concatenate them into a 5n-dimensional vector x
• How do we then compute score(x)?

A Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity: a = f(Wx + b)
• The neural activations a can then be used to compute some function
• For instance, the score we care about: score(x) = Uᵀa

Summary: Feed-forward Computation

Computing a window’s score with a 3-layer neural net: for the window “cat chills on a mat”,

  s = score(cat chills on a mat) = Uᵀ f(Wx + b)

where x is the concatenation of the five word vectors.

Summary: Feed-forward Computation

• s = score(cat chills on a mat)
• s_c = score(cat chills Jeju a mat)
• Idea for the training objective: make the score of the true window larger and the corrupt window’s score lower (until they’re good enough): minimize J = max(0, 1 − s + s_c); see the sketch after this list
• This is continuous, so we can perform SGD
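Putting the window scoring and this objective together, here is a minimal numpy sketch; the hidden size, the hinge margin of 1, and the random vectors standing in for word embeddings are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, window = 4, 5                                   # embedding size, window length
W = rng.normal(scale=0.1, size=(8, n * window))    # hidden layer weights
b = np.zeros(8)
U = rng.normal(scale=0.1, size=8)                  # scoring vector

def score(word_vectors):
    """s = U^T f(Wx + b) for the concatenated window x."""
    x = np.concatenate(word_vectors)               # 5n-dimensional input
    return U @ np.tanh(W @ x + b)

def hinge_loss(s_true, s_corrupt):
    """J = max(0, 1 - s + s_c): push the true window at least 1 above the corrupt one."""
    return max(0.0, 1.0 - s_true + s_corrupt)

true_window = [rng.normal(size=n) for _ in range(window)]       # e.g. "cat chills on a mat"
corrupt_window = list(true_window)
corrupt_window[2] = rng.normal(size=n)                          # e.g. "cat chills Jeju a mat"
print(hinge_loss(score(true_window), score(corrupt_window)))
```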

Training with Backpropagation

Assuming cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c with respect to all the involved variables: U, W, b, x.

• Let’s consider the derivative of a single weight W_ij
• W_ij only appears inside a_i
• For example: W_23 is only used to compute a_2

[Figure: single-hidden-layer scorer with inputs x1, x2, x3, bias +1, hidden units a1, a2, weight W23, and output score s]

Training with Backpropagation

Derivative of a single weight W_ij:

  ∂s/∂W_ij = U_i f′(z_i) x_j = δ_i x_j,   where z_i = W_i· x + b_i

Here δ_i = U_i f′(z_i) is the local error signal and x_j is the local input signal (for logistic f, f′(z) = f(z)(1 − f(z))).

Training with Backpropagation

• From a single weight W_ij to the full W: we want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product ∂s/∂W = δ xᵀ, where δ is the “responsibility” coming from each activation a
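A small numpy sketch of these gradients for the single-hidden-layer scorer, with a numerical check; all sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)
U = rng.normal(size=2)

f = np.tanh
f_prime = lambda z: 1.0 - np.tanh(z) ** 2

z = W @ x + b
a = f(z)
s = U @ a                       # the window score

delta = U * f_prime(z)          # "responsibility" of each hidden unit
grad_W = np.outer(delta, x)     # ds/dW = delta x^T
grad_b = delta                  # ds/db = delta
grad_x = W.T @ delta            # ds/dx, re-using delta for the layer below

# Numerical check of ds/dW_23 (i = 2, j = 3 in the slide's 1-based indexing):
eps = 1e-6
W2 = W.copy(); W2[1, 2] += eps
s2 = U @ f(W2 @ x + b)
print(grad_W[1, 2], (s2 - s) / eps)   # the two values should agree closely
```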

Training with Backpropagation

• For the biases b, we get: ∂s/∂b_i = δ_i, i.e. ∂s/∂b = δ

Training with Backpropagation

That’s almost backpropagation: it’s simply taking derivatives and using the chain rule! Remaining trick: we can re-use derivatives computed for higher layers when computing derivatives for lower layers. Example: the last derivatives of the model are those for the word vectors in x.

• Take the derivative of the score with respect to a single word vector (for simplicity a 1-d vector, but the same holds if it is longer)
• Now we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above; x_j influences the overall score through all of them, hence ∂s/∂x_j = Σ_i δ_i W_ij, re-using the δ (the part of the previous derivative) computed for the layer above

Simple Window Model

This model is good for illustrating backpropagation, but it is now obsolete when it comes to word vectors; in Part 3 we will see better models for single word vectors. Still, the window-based model family is indeed useful!

Training with Backpropagation: softmax

What is the major benefit of deeply learned word vectors? The ability to also propagate labeled information into them, via a softmax/maxent layer on top of the hidden layer:

  P(c | d, λ) = e^{λᵀf(c,d)} / Σ_{c′} e^{λᵀf(c′,d)}

[Figure: the window network with a softmax output layer over classes c1, c2, c3 on top of the hidden units a1, a2]

Part 1.5: The Basics

Backpropagation Training


Back-Prop • Compute gradient of example-wise loss wrt parameters • Simply applying the derivative chain rule wisely

• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient 61

Simple Chain Rule

  If z = f(y) and y = g(x), then dz/dx = (dz/dy)(dy/dx).

Multiple Paths Chain Rule (and the general form)

  If z depends on x through several intermediate variables y_1, …, y_n, then ∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x).

Chain Rule in Flow Graph

A flow graph is any directed acyclic graph:
• node = computation result
• arc = computation dependency
The gradient at a node is accumulated from its successors using the multiple-paths chain rule.

Back-Prop in Multi-Layer Net

[Figure: a multi-layer net with hidden layer h = sigmoid(Vx); the same backpropagation rule is applied layer by layer]

Back-Prop in General Flow Graph

For a single scalar output:
1. Fprop: visit nodes in topological-sort order; compute the value of each node given its predecessors
2. Bprop: initialize the output gradient to 1; visit nodes in reverse order, computing the gradient with respect to each node using the gradients with respect to its successors

Automatic Differentiation • The gradient computation can be automatically inferred from the symbolic expression of the fprop. • Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output. • Easy and fast prototyping 68

Part 1.6: The Basics

Learning word-level classifiers: POS and NER

The Model (Collobert & Weston 2008; Collobert et al. 2011)

• Similar to the word vector learning model, but replaces the single scalar score with a softmax/maxent classifier
• Training is again done via backpropagation, which gives an error similar to the score in the unsupervised word vector learning model

[Figure: the window network with a softmax layer over classes c1, c2, c3 on top of hidden units a1, a2 over the input window x]

The Model – Training

• We already know the softmax classifier and how to optimize it
• The interesting twist in deep learning is that the input features are also learned, similar to learning word vectors with a score

The secret sauce is the unsupervised word vector pre-training on a large text collection

Model                                                  | POS WSJ (acc.) | NER CoNLL (F1)
State-of-the-art*                                      | 97.24          | 89.31
Supervised NN                                          | 96.37          | 81.47
Unsupervised pre-training followed by supervised NN**  | 97.20          | 88.87
  + hand-crafted features***                           | 97.29          | 89.59

* Representative systems: POS (Toutanova et al. 2003), NER (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word window and a 100-unit hidden layer – for 7 weeks! – then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER

Supervised refinement of the unsupervised word representation helps

Model                  | POS WSJ (acc.) | NER CoNLL (F1)
Supervised NN          | 96.37          | 81.47
NN with Brown clusters | 96.92          | 87.15
Fixed embeddings*      | 97.10          | 88.87
C&W 2011**             | 97.29          | 89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the last slide

Part 1.7

Sharing statistical strength


Multi-Task Learning

• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks

[Figure: task outputs y1, y2, y3 computed from a shared intermediate representation h over the raw input x]

Combining Multiple Sources of Evidence with Shared Embeddings

• Relational learning
• Multiple sources of information / relations
• Some symbols (e.g. words, Wikipedia entries) are shared
• Shared embeddings help propagate information among data sources: e.g., WordNet, XWN, Wikipedia, FreeBase, …

Sharing Statistical Strength

• Besides very fast prediction, the main advantage of deep learning is statistical
• Potential to learn from fewer labeled examples because of the sharing of statistical strength:
  • Unsupervised pre-training & multi-task learning
  • Semi-supervised learning

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: a purely supervised decision boundary vs. a semi-supervised boundary that also respects the structure of the unlabeled data]

Deep autoencoders

An alternative to contrastive unsupervised word learning (another is RBMs, Hinton et al. 2006, which we don’t cover today). Works well for fixed input representations.

1. Definition, intuition and variants of autoencoders
2. Stacking for deep autoencoders
3. Why do autoencoders improve deep neural nets so much?

Auto-Encoders

• Multilayer neural net with target output = input
• reconstruction = decoder(encoder(input)); the encoder output is the code (latent features)
• Probable inputs have small reconstruction error

PCA = Linear Manifold = Linear Auto-Encoder

For a 0-mean input x: features = code = h(x) = Wx, and reconstruction(x) = Wᵀh(x) = WᵀWx, where W is the principal eigen-basis of Cov(X). The reconstruction error vector is the distance from x to the linear manifold.

LSA example: x = (normalized) distribution of co-occurrence frequencies.
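A short numpy sketch of this equivalence on toy data: encoding with the top principal eigenvectors and reconstructing with their transpose is exactly a linear autoencoder. The data, dimensions, and code size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy data, 500 samples
X = X - X.mean(axis=0)                                    # 0-mean, as assumed above

k = 2
# W = top-k principal eigen-basis of Cov(X)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
W = eigvecs[:, ::-1][:, :k].T          # k x d encoder

H = X @ W.T                            # codes h(x) = W x
X_hat = H @ W                          # reconstruction(x) = W^T W x
print("avg reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```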

The Manifold Learning Hypothesis • Examples concentrate near a lower dimensional “manifold” (region of high density where small changes are only allowed in certain directions)


Auto-Encoders Learn Salient Variations, like a non-linear PCA

Minimizing reconstruction error forces latent representation of “similar inputs” to stay on manifold


Auto-Encoder Variants

• Discrete inputs: cross-entropy or log-likelihood reconstruction criterion (similar to that used for discrete targets in MLPs)
• Preventing them from learning the identity everywhere:
  • Undercomplete (e.g. PCA): bottleneck code smaller than the input
  • Sparsity: penalize hidden unit activations so they are at or near 0 [Goodfellow et al. 2009]
  • Denoising: predict the true input from a corrupted input [Vincent et al. 2008]
  • Contractive: force the encoder to have small derivatives [Rifai et al. 2011]
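As one concrete variant, here is a minimal sketch of a denoising autoencoder trained with squared reconstruction error and masking noise. The layer sizes, learning rate, and corruption level are arbitrary assumptions, and the encoder/decoder weights are left untied for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, h, lr, drop = 20, 8, 0.1, 0.3
W_enc = rng.normal(scale=0.1, size=(h, d)); b_enc = np.zeros(h)
W_dec = rng.normal(scale=0.1, size=(d, h)); b_dec = np.zeros(d)

def step(x):
    """One SGD step: corrupt x, encode, decode, and regress onto the *clean* x."""
    global W_enc, b_enc, W_dec, b_dec
    x_tilde = x * (rng.random(d) > drop)           # denoising corruption (masking noise)
    code = sigmoid(W_enc @ x_tilde + b_enc)        # latent features
    recon = W_dec @ code + b_dec                   # linear reconstruction
    err = recon - x                                # squared-error gradient at the output
    delta = (W_dec.T @ err) * code * (1 - code)    # backprop through the sigmoid encoder
    W_dec -= lr * np.outer(err, code); b_dec -= lr * err
    W_enc -= lr * np.outer(delta, x_tilde); b_enc -= lr * delta
    return 0.5 * np.sum(err ** 2)

x = rng.random(d)
print([round(step(x), 4) for _ in range(5)])       # reconstruction error should trend down
```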

Sparse autoencoder illustration for images

[Figure: patches of natural images and the learned bases, which look like “edges”. A test patch is reconstructed as roughly 0.8 × one basis + 0.3 × another + 0.5 × a third, giving the sparse feature representation [a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0].]

Stacking Auto-Encoders • Can be stacked successfully (Bengio et al NIPS’2006) to form highly non-linear representations


Layer-wise Unsupervised Pre-training

[Figure sequence: (1) train an auto-encoder on the input, comparing the reconstruction of the input to the input itself; (2) keep the learned features and train a second auto-encoder on them, comparing the reconstruction of the features to the features; (3) repeat to obtain more and more abstract features; (4) supervised fine-tuning: add an output layer f(X) and train the whole stack against the target Y.]

Why is unsupervised pre-training working so well? • Regularization hypothesis: • Representations good for P(x) are good for P(y|x) • Optimization hypothesis: • Unsupervised initializations start near better local minimum of supervised training error • Minima otherwise not achievable by random initialization Erhan, Courville, Manzagol, Vincent, Bengio (JMLR, 2010) 97

Part 2

Recursive Deep Learning


Building on Word Vector Space Models

[Figure: a 2-D word vector space in which Germany and France lie near each other, as do Monday and Tuesday, and in which the phrases “the country of my birth” and “the place where I was born” should also end up close together]

But how can we represent the meaning of longer phrases? By mapping them into the same vector space!

How should we map phrases into a vector space?

Use the principle of compositionality: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

[Figure: the phrases “the country of my birth” and “the place where I was born” composed word by word into vectors that land near Germany and France in the same space]

Models in this section can jointly learn parse trees and compositional vector representations.

Semantic Vector Spaces

Goal: vectors representing phrases and sentences that do not ignore word order and that capture semantics for NLP tasks. Two existing extremes:

Single word vectors
• Distributional techniques, Brown clusters
• Useful as features inside models, e.g. CRFs for NER, etc.
• Cannot capture longer phrases

Document vectors
• Bag-of-words models: LSA, LDA
• Great for IR, document exploration, etc.
• Ignore word order; no detailed understanding

Recursive Deep Learning

1. Motivation
2. Recursive Neural Networks for Parsing
3. Optimization and Backpropagation Through Structure
4. Compositional Vector Grammars: Parsing
5. Recursive Autoencoders: Paraphrase Detection
6. Matrix-Vector RNNs: Relation Classification
7. Recursive Neural Tensor Networks: Sentiment Analysis

Sentence Parsing: What we want

[Figure: parse tree for “The cat sat on the mat.” with S, VP, PP and NP nodes, and a small vector (e.g. [9 1] for “The”, [4 3] for “mat.”) attached to every word]

Learn Structure and Representation

[Figure: the same parse tree, but now every internal node (NP, PP, VP, S) also carries its own learned vector, e.g. [5 4] at the root S]

Recursive Neural Networks for Structure Prediction

Inputs: two candidate children’s representations
Outputs:
1. The semantic representation if the two nodes are merged
2. A score of how plausible the new node would be

[Figure: the neural network merges the vectors of “on” ([8 5]) and “the mat.” ([3 3]) into a parent vector [8 3] with score 1.3]

Recursive Neural Network Definition

Given children c1, c2:

  p = tanh(W [c1; c2] + b)
  score = Uᵀp

The same W parameters are used at all nodes of the tree.
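A minimal numpy sketch of this composition and scoring step; the dimensionality, random parameters, and random word vectors are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # dimensionality of every node vector
W = rng.normal(scale=0.1, size=(n, 2 * n))   # shared composition matrix
b = np.zeros(n)
U = rng.normal(scale=0.1, size=n)            # scoring vector

def compose(c1, c2):
    """p = tanh(W [c1; c2] + b); the same W is used at every node of the tree."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, U @ p                          # (parent vector, plausibility score)

on, the, mat = (rng.normal(size=n) for _ in range(3))
the_mat, s1 = compose(the, mat)              # e.g. merge "the" + "mat."
pp, s2 = compose(on, the_mat)                # then "on" + "the mat."
print(s1, s2)
```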

Related Work to Socher et al. (ICML 2011) Pollack (1990): Recursive auto-associative memories Previous Recursive Neural Networks work by Goller & Küchler (1996), Costa et al. (2003) assumed fixed tree structure and used one hot vectors. Hinton (1990) and Bottou (2011): Related ideas about recursive models and recursive operators as smooth versions of logic operations

107

Parsing a sentence with an RNN

[Figure sequence: the network scores every pair of adjacent candidate constituents in “The cat sat on the mat.” (e.g. 0.1, 0.4, 2.3 for different adjacent pairs), greedily merges the highest-scoring pair, recomputes scores with the new node included, and repeats until the full tree, with a vector at every node, is built]

Max-Margin Framework – Details

• The score of a tree is computed as the sum of the parsing decision scores at each node: s(x, y) = Σ_{n ∈ nodes(y)} s_n
• Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective is used
• The loss penalizes all incorrect decisions
• The structure search for A(x) was maximally greedy (a toy sketch of this greedy variant follows below)
• Instead: beam search with a chart
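To illustrate the greedy variant described above (before moving to beam search), here is a toy sketch that repeatedly merges the best-scoring adjacent pair and sums the per-merge scores into a tree score. The word vectors and parameters are random stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(scale=0.1, size=(n, 2 * n)); b = np.zeros(n)
U = rng.normal(scale=0.1, size=n)

def compose(c1, c2):
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, float(U @ p)

def greedy_parse(leaves):
    """Repeatedly merge the highest-scoring adjacent pair; the tree score is the
    sum of the per-merge scores (cf. the max-margin objective above)."""
    nodes = [(w, rng.normal(size=n)) for w in leaves]   # toy random word vectors
    total = 0.0
    while len(nodes) > 1:
        cands = [(compose(nodes[i][1], nodes[i + 1][1]), i) for i in range(len(nodes) - 1)]
        ((p, s), i) = max(cands, key=lambda c: c[0][1])
        total += s
        nodes[i:i + 2] = [((nodes[i][0], nodes[i + 1][0]), p)]   # replace the pair by its parent
    return nodes[0][0], total

tree, score = greedy_parse(["The", "cat", "sat", "on", "the", "mat."])
print(tree, round(score, 3))
```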

Backpropagation Through Structure

Introduced by Goller & Küchler (1996). Principally the same as general backpropagation, with two differences resulting from the tree structure:
• Split derivatives at each node
• Sum derivatives of W from all nodes

BTS: Split derivatives at each node

During forward prop, the parent is computed using its two children:

  p = tanh(W [c1; c2] + b)

Hence, the errors need to be computed with respect to each of them, where each child’s error is n-dimensional: the parent’s error vector is split into the part sent down to c1 and the part sent down to c2.

BTS: Sum derivatives of all nodes

You can actually assume it’s a different W at each node. Intuition via example: if we take separate derivatives of each occurrence of W and then add them up, we get the same result as differentiating once with respect to the single shared W.

BTS: Optimization

• As before, we can plug the gradients into a standard off-the-shelf L-BFGS optimizer
• Best results with AdaGrad (Duchi et al. 2011), which scales each parameter’s step by the inverse square root of its accumulated squared gradients (sketched below)
• For a non-continuous objective, use the subgradient method (Ratliff et al. 2007)
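A minimal sketch of the AdaGrad update on a toy quadratic objective; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def adagrad_update(theta, grad, hist, lr=0.5, eps=1e-8):
    """AdaGrad (Duchi et al. 2011): per-parameter step sizes shrink with the
    accumulated squared gradient, which suits sparse NLP gradients well."""
    hist += grad ** 2
    theta -= lr * grad / (np.sqrt(hist) + eps)
    return theta, hist

# Minimize f(theta) = 0.5 * ||theta||^2 as a toy check; its gradient is theta itself.
theta = np.array([3.0, -2.0])
hist = np.zeros_like(theta)
for _ in range(500):
    theta, hist = adagrad_update(theta, theta.copy(), hist)
print(theta)   # both coordinates shrink steadily toward 0
```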

Discussion: Simple RNN

• Good results with a single-matrix RNN (more later)
• A single weight matrix RNN can capture some phenomena, but it is not adequate for more complex, higher-order composition and for parsing long sentences
• The composition function is the same for all syntactic categories, punctuation, etc.

Solution: Syntactically-Untied RNN • Idea: Condition the composition function on the syntactic categories, “untie the weights” • Allows for different composition functions for pairs of syntactic categories, e.g. Adv + AdjP, VP + NP • Combines discrete syntactic categories with continuous semantic information

Solution: CVG = PCFG + Syntactically-Untied RNN • Problem: Speed. Every candidate score in beam search needs a matrix-vector product. • Solution: Compute score using a linear combination of the log-likelihood from a simple PCFG + RNN • Prunes very unlikely candidates for speed • Provides coarse syntactic categories of the children for each beam candidate • Compositional Vector Grammars: CVG = PCFG + RNN

Details: Compositional Vector Grammar • Scores at each node computed by combination of PCFG and SU-RNN:

• Interpretation: Factoring discrete and continuous parsing in one model:

• Socher et al. (2013): More details at ACL

Related Work • Resulting CVG Parser is related to previous work that extends PCFG parsers • Klein and Manning (2003a) : manual feature engineering • Petrov et al. (2006) : learning algorithm that splits and merges syntactic categories • Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each category with a lexical item • Hall and Klein (2012) combine several such annotation schemes in a factored parser. • CVGs extend these ideas from discrete representations to richer continuous ones • Hermann & Blunsom (2013): Combine Combinatory Categorial Grammars with RNNs and also untie weights, see upcoming ACL 2013

Experiments

• Standard WSJ split, labeled F1
• Based on a simple PCFG with fewer states
• Fast pruning of the search space, few matrix-vector products
• 3.8% higher F1, 20% faster than the Stanford factored parser

Parser                                                     | Test F1, all sentences
Stanford PCFG (Klein and Manning, 2003a)                   | 85.5
Stanford Factored (Klein and Manning, 2003b)               | 86.6
Factored PCFGs (Hall and Klein, 2012)                      | 89.4
Collins (Collins, 1997)                                    | 87.7
SSN (Henderson, 2004)                                      | 89.4
Berkeley Parser (Petrov and Klein, 2007)                   | 90.1
CVG (RNN) (Socher et al., ACL 2013)                        | 85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                     | 90.4
Charniak, self-trained (McClosky et al. 2006)              | 91.0
Charniak, self-trained and reranked (McClosky et al. 2006) | 92.1

SU-RNN Analysis

• Learns a notion of soft head words (e.g. for DT-NP and VP-NP combinations)

Analysis of the resulting vector representations (nearest-neighbor sentences):

All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder wouldn’t comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal wouldn’t disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.

SU-RNN Analysis

• Can transfer semantic information from a single related example
• Train sentences:
  • He eats spaghetti with a fork.
  • She eats spaghetti with pork.
• Test sentences:
  • He eats spaghetti with a spoon.
  • He eats spaghetti with meat.

[Figure: the resulting parse trees for the test sentences]

Labeling in Recursive Neural Networks

• We can use each node’s representation (e.g. the vector of an NP node) as features for a softmax classifier over labels
• Training is similar to the model in Part 1, with a standard cross-entropy error plus the structure scores

Scene Parsing

The same principle of compositionality applies to images:
• The meaning of a scene image is also a function of smaller regions,
• how they combine as parts to form larger objects,
• and how the objects interact.

Algorithm for Parsing Images

The same recursive neural network as for natural language parsing! (Socher et al. ICML 2011)

[Figure: parsing a natural scene image; image segments are combined via features into regions (grass, people, building, tree) and then into semantic representations of the whole scene]

Multi-class segmentation (Stanford Background Dataset, Gould et al. 2009)

Method                                         | Accuracy
Pixel CRF (Gould et al., ICCV 2009)            | 74.3
Classifier on superpixel features              | 75.9
Region-based energy (Gould et al., ICCV 2009)  | 76.4
Local labelling (Tighe & Lazebnik, ECCV 2010)  | 76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010)   | 77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010) | 77.5
Recursive Neural Network                       | 78.1

Recursive Deep Learning

1. Motivation
2. Recursive Neural Networks for Parsing
3. Theory: Backpropagation Through Structure
4. Compositional Vector Grammars: Parsing
5. Recursive Autoencoders: Paraphrase Detection
6. Matrix-Vector RNNs: Relation Classification
7. Recursive Neural Tensor Networks: Sentiment Analysis

Paraphrase Detection

  Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses
  Basically, the plaintiffs did not show that omissions in Merrill’s research caused the claimed losses

  The initial report was made to Modesto Police December 28
  It stems from a Modesto police report

How do we compare the meaning of two sentences?

Unsupervised Recursive Autoencoders

Similar to a recursive neural net, but instead of a supervised score we compute a reconstruction error at each node (Socher et al. EMNLP 2011). Over inputs x1, x2, x3:

  y1 = f(W[x2; x3] + b)
  y2 = f(W[x1; y1] + b)
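A small numpy sketch of one such node: encode two children, decode them back, and measure the reconstruction error that training would minimize. The dimensions, the tanh nonlinearity, and the random inputs are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W_enc = rng.normal(scale=0.1, size=(n, 2 * n)); b_enc = np.zeros(n)
W_dec = rng.normal(scale=0.1, size=(2 * n, n)); b_dec = np.zeros(2 * n)

def encode(c1, c2):
    """y = f(W[c1; c2] + b): the same composition as the RNN, but unsupervised."""
    return np.tanh(W_enc @ np.concatenate([c1, c2]) + b_enc)

def reconstruction_error(c1, c2):
    """Decode the parent back into two children and measure the squared error."""
    y = encode(c1, c2)
    c1_hat, c2_hat = np.split(W_dec @ y + b_dec, 2)
    return y, 0.5 * (np.sum((c1_hat - c1) ** 2) + np.sum((c2_hat - c2) ** 2))

x1, x2, x3 = (rng.normal(size=n) for _ in range(3))
y1, e1 = reconstruction_error(x2, x3)   # y1 = f(W[x2; x3] + b)
y2, e2 = reconstruction_error(x1, y1)   # y2 = f(W[x1; y1] + b)
print(round(e1, 3), round(e2, 3))       # errors to be minimized during training
```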

Unsupervised unfolding RAE

Attempt to encode the entire tree structure at each node: each node’s vector is decoded back down to all the leaves it spans, and the reconstruction error is measured against those leaves.

Recursive Autoencoders for Full Sentence Paraphrase Detection

Unsupervised unfolding RAE plus a pair-wise sentence comparison of the nodes in the parsed trees (Socher et al. NIPS 2011).

Recursive Autoencoders for Full Sentence Paraphrase Detection

Experiments on the Microsoft Research Paraphrase Corpus (Dolan et al. 2004):

Method                                             | Acc. | F1
Rus et al. (2008)                                  | 70.6 | 80.5
Mihalcea et al. (2006)                             | 70.3 | 81.3
Islam et al. (2007)                                | 72.6 | 81.3
Qiu et al. (2006)                                  | 72.0 | 81.6
Fernando et al. (2008)                             | 74.1 | 82.4
Wan et al. (2006)                                  | 75.6 | 83.0
Das and Smith (2009)                               | 73.9 | 82.3
Das and Smith (2009) + 18 surface features         | 76.1 | 82.7
F. Bu et al. (ACL 2012): String Re-writing Kernel  | 76.3 | --
Unfolding Recursive Autoencoder (NIPS 2011)        | 76.8 | 83.6

Recursive Autoencoders for Full Sentence Paraphrase Detection

[Figure: example sentence pairs and the pair-wise similarity of their tree nodes]

Recursive Deep Learning

1. Motivation
2. Recursive Neural Networks for Parsing
3. Theory: Backpropagation Through Structure
4. Compositional Vector Grammars: Parsing
5. Recursive Autoencoders: Paraphrase Detection
6. Matrix-Vector RNNs: Relation Classification
7. Recursive Neural Tensor Networks: Sentiment Analysis

Compositionality Through Recursive Matrix-Vector Spaces

Standard RNN composition:  p = tanh(W [c1; c2] + b)

One way to make the composition function more powerful was by untying the weights W. But what if words act mostly as operators, e.g. “very” in “very good”? Proposal: a new composition function.

Compositionality Through Recursive Matrix-Vector Recursive Neural Networks

  Standard RNN:  p = tanh(W [c1; c2] + b)
  MV-RNN:        p = tanh(W [C2 c1; C1 c2] + b)

Each child has both a vector c and a matrix C; the matrices act as operators on the other child’s vector before the usual composition.
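A toy numpy sketch of the MV-RNN composition; the vector part follows the formula above, while the parent-matrix update P = W_M [C1; C2] is taken from the MV-RNN paper rather than from this slide, and all sizes and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.normal(scale=0.1, size=(n, 2 * n)); b = np.zeros(n)
W_M = rng.normal(scale=0.1, size=(n, 2 * n))     # combines the two child matrices

def mv_compose(c1, C1, c2, C2):
    """MV-RNN: each child has a vector c and a matrix C; the matrices act as
    operators on the other child's vector before the usual composition."""
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)
    P = W_M @ np.vstack([C1, C2])                # parent matrix, shape (n, n)
    return p, P

# e.g. "very" acting mostly as an operator on "good"
c_very, c_good = rng.normal(size=n), rng.normal(size=n)
C_very, C_good = rng.normal(scale=0.1, size=(n, n)), np.eye(n)
p, P = mv_compose(c_very, C_very, c_good, C_good)
print(p.shape, P.shape)
```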

Predicting Sentiment Distributions Good example for non-linearity in language

142

MV-RNN for Relationship Classification

Relationship          | Sentence with labeled nouns for which to predict the relationship
Cause-Effect(e2,e1)   | Avian [influenza]e1 is an infectious disease caused by type a strains of the influenza [virus]e2.
Entity-Origin(e1,e2)  | The [mother]e1 left her native [land]e2 about the same time and they were married in that city.
Message-Topic(e2,e1)  | Roadside [attractions]e1 are frequently advertised with [billboards]e2 to attract tourists.

Sentiment Detection Sentiment detection is crucial to business intelligence, stock trading, …

144

Sentiment Detection and Bag-of-Words Models

Most methods start with a bag of words plus linguistic features/processing/lexica. But such methods (including tf-idf) can’t distinguish:

  + white blood cells destroying an infection
  − an infection destroying white blood cells

Sentiment Detection and Bag-of-Words Models

• The common belief is that sentiment detection is “easy”
• Detection accuracy for longer documents is ~90%
• There are lots of easy cases (… horrible … or … awesome …)
• But on a dataset of single-sentence movie reviews (Pang and Lee, 2005), accuracy never reached above 80% for more than 7 years
• Harder cases require actual understanding of negation and its scope, and other semantic effects

Data: Movie Reviews

  Stealing Harvard doesn’t care about cleverness, wit or any other kind of intelligent humor.
  There are slow and repetitive parts but it has just enough spice to keep it interesting.

Two missing pieces for improving sentiment

1. Compositional training data
2. A better compositional model

1. New Sentiment Treebank

• Parse trees of 11,855 sentences
• 215,154 phrases with labels
• Allows training and evaluating with compositional information

2. New Compositional Model

• Recursive Neural Tensor Network (RNTN)
• More expressive than any other RNN so far
• Idea: allow more interactions between vectors

[Figure: the RNTN composition, in which the two child vectors interact through a tensor in addition to the standard matrix layer]
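A minimal sketch of the tensor composition the RNTN uses; the dimensionality and random parameters are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
V = rng.normal(scale=0.1, size=(n, 2 * n, 2 * n))   # tensor: one slice per output dimension
W = rng.normal(scale=0.1, size=(n, 2 * n)); b = np.zeros(n)

def rntn_compose(c1, c2):
    """RNTN: the children interact through a tensor in addition to the standard layer:
    p = tanh( [c1;c2]^T V^[1:n] [c1;c2] + W [c1;c2] + b )."""
    c = np.concatenate([c1, c2])
    tensor_term = np.array([c @ V[k] @ c for k in range(n)])
    return np.tanh(tensor_term + W @ c + b)

c1, c2 = rng.normal(size=n), rng.normal(size=n)
print(rntn_compose(c1, c2))
```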

Experimental Results on the Treebank

• The RNTN can capture constructions like “X but Y”
• RNTN accuracy of 72%, compared to MV-RNN (65), biNB (58) and RNN (54)

Negation Results

• Most methods capture that negation often makes things more negative (see Potts, 2010); analysis on a negation dataset
• But how about negating negatives? The positive activation should increase!

Visualizing Deep Learning: Word Embeddings

[Figure: visualization demo of learned word embeddings]

Visual Grounding

• Idea: map sentences and images into a joint space

Discussion: Compositional Structure

• Recursive neural networks so far used constituency trees, which results in more syntactically influenced representations
• Instead: use dependency trees, which capture more semantic structure

Dependency Tree Recursive Neural Network

[Figure: DT-RNN composition over a dependency parse, where l(i) = number of leaf nodes at node i is used to weight the children’s contributions]

Convolutional Neural Network for Images

• CNN trained on ImageNet (Le et al. 2013)
• The RNN is trained to give large inner products between sentence and image vectors

Results

[Figure: example image-to-sentence and sentence-to-image retrieval results]

Model                                     | Describing Images (mean rank) | Image Search (mean rank)
Random                                    | 92.1                          | 52.1
Bag of Words                              | 21.1                          | 14.6
CT-RNN                                    | 23.9                          | 16.1
Recurrent Neural Network                  | 27.1                          | 19.2
Kernelized Canonical Correlation Analysis | 18.0                          | 15.9
DT-RNN                                    | 16.9                          | 12.5

Overview of RNN Model Variations

• Objective functions
  • Supervised scores for structure prediction
  • Classifiers for sentiment, relations, visual objects, logic
  • Unsupervised autoencoding of immediate children or the entire tree structure
• Composition functions
  • Syntactically-untied weights
  • Matrix-Vector RNN
  • Tensor-based models
• Tree structures
  • Constituency parse trees
  • Dependency parse trees
  • Combinatory Categorial Grammar trees (Hermann and Blunsom, 2013)
  • Fixed tree structures (connections to CNNs)

Summary: Recursive Deep Learning

• Recursive deep learning can predict hierarchical structure and classify the structured output using compositional vectors
• State-of-the-art performance (all with code on www.socher.org):
  • Parsing on the WSJ (Java code soon)
  • Sentiment analysis on multiple corpora
  • Paraphrase detection with unsupervised RNNs
  • Relation classification on SemEval 2011, Task 8
  • Object detection on the Stanford Background and MSRC datasets

[Figure: parsing natural scene images (segments → features → semantic representations of grass, people, building, tree) and parsing natural language sentences (“A small crowd quietly enters the historic church”: S → NP “A small crowd”, VP → “quietly enters” + NP → Det. “the”, Adj. “historic”, N. “church”, mapped to word indices and semantic representations)]

Part 3

1. Assorted Speech and NLP Applications
2. Deep Learning: General Strategy and Tricks
3. Resources (readings, code, …)
4. Discussion

Part 3.1: Applications

Assorted Speech and NLP Applications

Existing NLP Applications

• Language modeling (speech recognition, machine translation)
• Word-sense learning and disambiguation
• Reasoning over knowledge bases
• Acoustic modeling
• Part-of-speech tagging
• Chunking
• Named entity recognition
• Semantic role labeling
• Parsing
• Sentiment analysis
• Paraphrasing
• Question answering

Convolutional Neural Networks! • Phil will talk about them in the evening

173

Language Modeling

• Predict P(next word | previous words)
• Gives a probability for a longer sequence
• Applications to speech, translation and compression
• Computational bottleneck: a large vocabulary V means that computing the output costs #hidden units × |V|

Neural Language Model

• Bengio et al. NIPS 2000 and JMLR 2003, “A Neural Probabilistic Language Model”
• Each word is represented by a distributed continuous-valued code
• Generalizes to sequences of words that are semantically similar to training sequences

Application to Machine Translation • Language model conditioned on source and target

• Speed improvements (pre-computation, unnormalized output) • Devlin et al. ACL 2014, best paper 176

Word Vectors have linear relationships (Mikolov et al. 2013)

[Figure: word-analogy relationships appear as roughly parallel vector offsets between related word pairs]

The best word vectors are not “deep”

• They all capture co-occurrence statistics in some way
• Global methods:
  • LSA (Deerwester et al.), LDA (Blei et al.), HAL (Lund & Burgess), Hellinger-PCA (Lebret & Collobert)
  • Scale with vocabulary size and make efficient use of statistics
• Window-based / neural-network-based models:
  • NNLM, HLBL, RNN, ivLBL, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mnih & Kavukcuoglu; Mikolov et al.)

Word Vectors: Recent development (Pennington et al. 2014)

• Capturing local co-occurrence statistics
• Produces state-of-the-art linear semantic relationships
• Efficient use of statistics: can train on (comparably) little data, and on gigantic data!
• Fast: only non-zero counts matter
• Good performance with small (100-300) dimensions, which is important for downstream tasks

Word Vectors: Recent development

• Spearman correlation between human judgments and cosine similarity between word vectors [results figure]
• Linear relationship (analogy) prediction accuracy [results figure]

Learning Multiple Word Vectors

• Tackles problems with polysemous words
• Can be done with standard tf-idf based methods [Reisinger and Mooney, NAACL 2010]
• The neural word vector model of [Huang et al. ACL 2012] learns multiple prototypes using both local and global context
• State-of-the-art correlations with human similarity judgments

Learning Multiple Word Vectors

[Figure: visualization of the multi-prototype word vectors learned by Huang et al. (ACL 2012)]

Common Sense Reasoning Inside Knowledge Bases Question: Can Neural Networks learn to capture logical inference, set inclusions, part-of and hypernym relationships?

187

Neural Networks for Reasoning over Relationships

• Higher scores for a triplet T = (e1, R, e2) indicate that the entities are more likely to be in relationship R
• Training uses a contrastive estimation function, similar to word vector learning
• NTN scoring function: the two entity vectors interact through a relation-specific tensor as well as a standard layer, and the result is scored by a relation-specific vector (sketched below)
• Cost: a max-margin objective that ranks true triplets above corrupted ones
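A hedged sketch of an NTN-style scoring function of this form, with toy dimensions; the exact nonlinearity, initialization, and the shape g(e1, R, e2) = u_R^T f(e1^T W_R^[1:k] e2 + V_R [e1; e2] + b_R) are assumptions, since the slide only names the components.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3                                    # entity dimension, tensor slices
W_R = rng.normal(scale=0.1, size=(k, d, d))    # relation-specific tensor
V_R = rng.normal(scale=0.1, size=(k, 2 * d))   # relation-specific standard layer
b_R = np.zeros(k)
u_R = rng.normal(scale=0.1, size=k)

def ntn_score(e1, e2):
    """Score a triplet (e1, R, e2): bilinear tensor term plus a standard layer,
    passed through a nonlinearity and scored by u_R."""
    tensor_term = np.array([e1 @ W_R[i] @ e2 for i in range(k)])
    return u_R @ np.tanh(tensor_term + V_R @ np.concatenate([e1, e2]) + b_R)

e_cat, e_feline = rng.normal(size=d), rng.normal(size=d)
print(ntn_score(e_cat, e_feline))   # higher scores mean the triplet is more plausible
```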

Accuracy of Predicting True and False Relationships

Related work:
• Bordes, Weston, Collobert & Bengio, AAAI 2011
• Bordes, Glorot, Weston & Bengio, AISTATS 2012

Model                | FreeBase | WordNet
Distance Model       | 68.3     | 61.0
Hadamard Model       | 80.0     | 68.8
Standard Layer Model (

Smile Life

When life gives you a hundred reasons to cry, show life that you have a thousand reasons to smile

Get in touch

© Copyright 2015 - 2024 PDFFOX.COM - All rights reserved.