
Attractor Neural Networks with Hypercolumns

Christopher Johansson, Anders Sandberg and Anders Lansner
Department of Numerical Analysis and Computing Science, Royal Institute of Technology, 100 44 Stockholm, Sweden
{cjo, asa, ala}@nada.kth.se

Abstract. We investigate attractor neural networks with a modular structure, where a local winner-takes-all rule acts within the modules (called hypercolumns). We make a signal-to-noise analysis of storage capacity and noise tolerance, and compare the results with those from simulations. Introducing local winner-takes-all dynamics improves storage capacity and noise tolerance, while the optimal size of the hypercolumns depends on network size and noise level.

Introduction

A problem that arises in many kinds of recurrent neural networks (NNs) is that of properly regulating network activity, i.e. keeping an appropriate level of activity in the network. This is one motivation for investigating a local winner-takes-all dynamics that regulates the total activity within network modules. This form of activity control, while not as sophisticated as other control strategies, is easy to implement and also somewhat biologically plausible. Here, these local winner-takes-all modules are called hypercolumns in analogy with the hypercolumns described in cortex [1]. A unit in such a hypercolumn is assumed to correspond to a cortical minicolumn rather than to an individual neuron.

Another motivation for exploring networks composed of modules with internal activity control comes from the study of Bayesian Confidence Propagation Neural Networks (BCPNN), a class of networks derived from statistical considerations [2, 3]. The hypercolumnar structure is a natural consequence of their derivation [4]. The units of a hypercolumn typically represent the different values of a discrete feature, e.g. color. Computational experiments have shown that such a hypercolumnar structure improves noise tolerance, convergence speed and capacity compared to a non-modular structure [5]. It thus becomes relevant to examine whether this effect is universal among different types of attractor NNs, which is the question we address here.

The hypercolumn architecture used here is somewhat related to the Potts neuron models [6], although the latter focus on combinatorial optimization without learning rather than on associative memory. There are also some similarities to other kinds of modular networks [7, 8]; the difference is that we do not consider autoassociation within the hypercolumns.

Neural Networks with Hypercolumns

The hypercolumn NNs are constructed with N units divided equally into H hypercolumns, each containing U units. For simplicity we let all hypercolumns consist of the same number of units, so that N = HU. Within each hypercolumn the activity always sums to one, which is ensured by the use of a winner-takes-all function.

We use patterns $\xi \in \{0,1\}^N$ constructed with a fraction a = 1/U of the positions set to 1. The patterns are, like the NNs, divided into H hypercolumns, and in each hypercolumn exactly one position is set to 1 while the rest are set to 0. Assuming that each hypercolumn in a pattern is independent of all the others and that each unit is equally likely to be active, the information $I_h$ contained in one hypercolumn becomes

$$I_h = -\sum_{i=1}^{U} P(x_i)\log_2 P(x_i) = -\sum_{i=1}^{U}\frac{1}{U}\log_2\frac{1}{U} = \log_2 U \qquad (1)$$

and multiplying eq. (1) by the number of hypercolumns in the NN gives the amount of information $I_P$ in one pattern:

$$I_P = I_h H = H\log_2 U \qquad (2)$$

Learning and Update Rule

The learning rule used is that of the sparse Hopfield network (eq. (3)). After learning P patterns the weights become

$$w_{ij} = \frac{1}{N}\sum_{\mu}(\xi_i^{\mu} - a)(\xi_j^{\mu} - a) \qquad (3)$$

The introduction of the hypercolumns does not affect the learning rule. All weights $w_{ij}$ with i = j are set to 0. The update rule consists of summing the contributions of the other units,

$$h_i = \sum_{j} w_{ij} o_j \qquad (4)$$

and applying a winner-takes-all function to the units within each hypercolumn: the unit with the highest $h_i$ has its output set to 1, the others to 0.
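To make the construction concrete, here is a minimal NumPy sketch of the hypercolumn patterns, the sparse Hopfield learning rule (eq. (3)) and the support-plus-winner-takes-all update (eq. (4)). It is our own illustration, not the authors' code; the class and method names are invented, and a single synchronous update step is assumed.

```python
import numpy as np

class HypercolumnHopfield:
    """Sketch of a sparse Hopfield NN with H hypercolumns of U units each (N = H*U)."""

    def __init__(self, H, U):
        self.H, self.U, self.N = H, U, H * U
        self.a = 1.0 / U                          # fraction of active units, a = 1/U
        self.W = np.zeros((self.N, self.N))

    def random_pattern(self, rng):
        """One active unit per hypercolumn; carries I_P = H*log2(U) bits (eqs. (1)-(2))."""
        xi = np.zeros(self.N)
        for h in range(self.H):
            xi[h * self.U + rng.integers(self.U)] = 1.0
        return xi

    def store(self, patterns):
        """Sparse Hopfield learning rule, eq. (3), with w_ii = 0."""
        for xi in patterns:
            d = xi - self.a
            self.W += np.outer(d, d) / self.N
        np.fill_diagonal(self.W, 0.0)

    def update(self, o):
        """Support summation, eq. (4), followed by winner-takes-all per hypercolumn."""
        h = self.W @ o
        out = np.zeros(self.N)
        for hc in range(self.H):
            block = h[hc * self.U:(hc + 1) * self.U]
            out[hc * self.U + int(np.argmax(block))] = 1.0   # activity sums to one per hypercolumn
        return out

# Example (assumed usage): N = 100 units in 10 hypercolumns, as in the simulations below.
rng = np.random.default_rng(0)
net = HypercolumnHopfield(H=10, U=10)
patterns = [net.random_pattern(rng) for _ in range(15)]
net.store(patterns)
print("cue stable:", np.array_equal(net.update(patterns[0]), patterns[0]))
print("bits per pattern:", net.H * np.log2(net.U))           # eq. (2)
```

Note that the winner-takes-all step keeps exactly one unit active per hypercolumn, so the overall activity level is fixed by construction.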

Analysis of NN with Hypercolumns

We define a stochastic variable $X \equiv (\xi_i - a)(\xi_j - a)$. Evaluation of the product $(\xi_i - a)(\xi_j - a)$ yields three different results, $x_1 = (1-a)(1-a)$, $x_2 = -(1-a)a$ and $x_3 = a^2$. These three values are attained with the probabilities $p_X(x_1) = 1/U^2$, $p_X(x_2) = 2(U-1)/U^2$ and $p_X(x_3) = (U-1)^2/U^2$. The mean and variance of X can now be identified:

$$E(X) = \sum_{i=1}^{3} x_i\, p_X(x_i) = 0 \qquad (5)$$

$$V(X) = \sum_{i=1}^{3} (x_i - E(X))^2\, p_X(x_i) = \frac{(U-1)^2}{U^4} \qquad (6)$$

If we consider the case where the activity is set to that of a pattern (retrieval cue) $v$, $o = \xi^v$, the support of each unit can then be written as

$$h_i = \sum_j w_{ij} o_j = \frac{1}{N}\sum_j\sum_{\mu}(\xi_i^{\mu} - a)(\xi_j^{\mu} - a)\xi_j^v = \frac{1}{N}\sum_j\sum_{\mu} X_{ij}^{\mu}\xi_j^v \qquad (7)$$

where we make the important assumption that each instance of X is independent. Now we want to estimate the mean and variance of the support values $h^-$ of the units that do not participate in the active pattern:

$$E(h^-) = E\!\left(\frac{1}{N}\sum_j^N\sum_{\mu}^P X_j^{\mu}\xi_j^v\right) = \frac{1}{N}\sum_j^H\sum_{\mu}^P E(X_j^{\mu}) = \frac{HP}{N}E(X) = \frac{P}{U}E(X) = 0 \qquad (8)$$

$$V(h^-) = V\!\left(\frac{1}{N}\sum_j^N\sum_{\mu}^P X_j^{\mu}\xi_j^v\right) = \frac{1}{N^2}\sum_j^H\sum_{\mu}^P V(X_j^{\mu}) = \frac{HP}{N^2}V(X) = \frac{P}{NU}V(X) = \frac{P(U-1)^2}{NU^5} \qquad (9)$$

Next we want to estimate the mean and variance of the support values $h^+$ of the units that participate in the active pattern $v$. The parameter r corresponds to the number of hypercolumns that do not contribute to building up the support of the active pattern. The default value of r is 1, since the units have no recurrent connections to themselves. Increasing the value of r can simulate noise in the input patterns.

$$E(h^+) = E\!\left(\frac{1}{N}\sum_j^N\sum_{\mu}^P X_j^{\mu}\xi_j^v\right) = \frac{1}{N}\left(\sum_j^{H-r} E(X_j^v) + \sum_j^H\sum_{\mu\neq v}^{P-1} E(X_j^{\mu})\right) = \frac{1}{N}\big((H-r)E(x_1) + H(P-1)E(X)\big) = \frac{(H-r)x_1}{N} = -\frac{(rU-N)(U-1)^2}{NU^3} \qquad (10)$$

$$V(h^+) = V\!\left(\frac{1}{N}\sum_j^N\sum_{\mu}^P X_j^{\mu}\xi_j^v\right) = \frac{1}{N^2}\left(\sum_j^{H-r} V(X_j^v) + \sum_j^H\sum_{\mu\neq v}^{P-1} V(X_j^{\mu})\right) = \frac{1}{N^2}\big((H-r)V(x_1) + H(P-1)V(X)\big) = \frac{H(P-1)}{N^2}V(X) = \frac{(P-1)(U-1)^2}{NU^5} \qquad (11)$$

We are now in a position where we can formulate the mean and standard deviation of a normal distribution that expresses the probability that the support value of a unit participating in the active pattern will be larger than that of a non-participating unit. We form the stochastic variable $Y = h^+ - h^-$, which has mean and standard deviation

$$m_Y = E(h^+) - E(h^-) = -\frac{(rU-N)(U-1)^2}{NU^3} \qquad (12)$$

$$\sigma_Y = \sqrt{V(h^+) + V(h^-)} = \sqrt{\frac{(2P-1)(U-1)^2}{NU^5}} \qquad (13)$$

and is assumed to follow the normal distribution $Y \in N(m_Y, \sigma_Y)$. We can now calculate the probability of occurrence of an error, i.e. the probability that the support of a unit not participating in a pattern will be larger than that of a participating unit:

$$p_{uerr} = P(Y \leq 0) = \Phi\!\left(\frac{-m_Y}{\sigma_Y}\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-m_Y/\sigma_Y} e^{-t^2/2}\,dt \qquad (14)$$
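As a quick numerical check of the expressions above, eqs. (12)-(14) can be evaluated directly; the sketch below is our own and uses only the Python standard library, writing $\Phi$ via the complementary error function.

```python
import math

def unit_error_prob(N, U, P, r=1):
    """p_uerr of eq. (14): probability that a unit outside the pattern gets larger
    support than the participating unit, under the normal approximation."""
    m_Y = -(r * U - N) * (U - 1) ** 2 / (N * U ** 3)                 # eq. (12)
    sigma_Y = math.sqrt((2 * P - 1) * (U - 1) ** 2 / (N * U ** 5))   # eq. (13)
    # Phi(-m_Y / sigma_Y), using Phi(x) = 0.5 * erfc(-x / sqrt(2))
    return 0.5 * math.erfc((m_Y / sigma_Y) / math.sqrt(2))

# Example (assumed parameters): N = 100 units, U = 10 units per hypercolumn, P = 20 patterns.
print(unit_error_prob(N=100, U=10, P=20))
```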

We now turn to consider a single hypercolumn. For the unit participating in a pattern to have the largest support within its hypercolumn, the event Y ≤ 0 must not occur for any of the other U − 1 units. We can thus write the probability of a hypercolumn producing a correct output as $p_{Hcorr} = (1 - p_{uerr})^{U-1}$.

The probability that the network produces a correct output, i.e. that the active pattern remains stable, can now be stated. Let Z represent the number of hypercolumns with a correct output, $Z \in \mathrm{Bin}(H, p_{Hcorr})$. If we consider the case where no errors are allowed in the retrieved pattern, the probability of a correct output from the NN is $p_{netcorr} = P(Z = H) = p_{Hcorr}^H$. In order to simplify the calculations we rewrite $H = N^b$ and $U = N^{1-b}$, where $b \in (0,1)$. This reduces the number of parameters in the expression for $p_{netcorr}$ to three: N, P and b. The storage capacity of a particular network (given N and b) can be estimated by solving the equation $p_{netcorr} = 0.9$ for P.¹ The amount of information stored in a network is obtained by multiplying the information content of one pattern by the number of patterns stored in the NN. Numerical solutions of $p_{netcorr} = 0.9$ for P when a > 0 and N → ∞ indicate that the number of stored patterns roughly increases as $N^{2-b}$.
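The resulting capacity estimate can be evaluated numerically. The following sketch is ours, not the authors' code: it repeats eqs. (12)-(14) so that it runs on its own, rounds H and U to whole numbers (an assumption of the sketch), and finds the largest P satisfying $p_{netcorr} \geq 0.9$ by a simple search.

```python
import math

def p_net_correct(N, b, P, r=1):
    """p_netcorr = p_Hcorr^H for H = N^b hypercolumns of U = N^(1-b) units."""
    H = max(1, round(N ** b))
    U = max(2, round(N / H))                                          # assumes N ~ H*U after rounding
    m_Y = -(r * U - N) * (U - 1) ** 2 / (N * U ** 3)                  # eq. (12)
    sigma_Y = math.sqrt((2 * P - 1) * (U - 1) ** 2 / (N * U ** 5))    # eq. (13)
    p_uerr = 0.5 * math.erfc((m_Y / sigma_Y) / math.sqrt(2))          # eq. (14)
    p_hcorr = (1.0 - p_uerr) ** (U - 1)
    return p_hcorr ** H

def capacity(N, b, target=0.9):
    """Largest P with p_netcorr >= target; stored information is then P * H * log2(U) bits."""
    P = 1
    while p_net_correct(N, b, P + 1) >= target:
        P += 1
    return P

# Example (assumed parameters): N = 100, b = 0.5, i.e. 10 hypercolumns of 10 units.
P_max = capacity(100, 0.5)
print(P_max, "patterns,", P_max * 10 * math.log2(10), "bits")
```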

Simulation Results

The storage capacity of four different types of NNs is presented in Fig. 1a. A pattern retrieved without any errors was classified as correct. The number of stored patterns was measured as the point where 90% of the patterns were correctly retrieved. The sparse Hopfield NN stored more patterns than the standard Hopfield NN, and the Hopfield NN with hypercolumns stored more patterns than the sparse Hopfield NN. The information content was smallest in a hypercolumn pattern, slightly larger in a pattern used with the sparse Hopfield NN, and largest in a pattern used with the Hopfield NN. This meant that the amount of stored information in the three different types of Hopfield NN was almost equal.
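The capacity measurement described here can be scripted along the following lines. This sketch reuses the HypercolumnHopfield class from the earlier code sketch, uses a single synchronous update from a clean cue as the stability test (a simplification of the iterative relaxation), and the function name and loop limits are our own choices.

```python
import numpy as np

def measure_capacity(H, U, max_P=60, criterion=0.9, seed=0):
    """Largest P for which at least `criterion` of the P stored patterns are
    retrieved without any error. Assumes the HypercolumnHopfield class above."""
    rng = np.random.default_rng(seed)
    best = 0
    for P in range(1, max_P + 1):
        net = HypercolumnHopfield(H, U)
        pats = [net.random_pattern(rng) for _ in range(P)]
        net.store(pats)
        n_correct = sum(np.array_equal(net.update(p), p) for p in pats)
        if n_correct / P >= criterion:
            best = P
    return best

print(measure_capacity(H=10, U=10))   # 100-unit network with 10 hypercolumns, as in Fig. 1
```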


Fig. 1. The storage capacity of four different NNs (standard Hopfield, sparse Hopfield, Hopfield with hypercolumns and BCPNN), each consisting of 100 units. The sparse Hopfield NN was used with sparse random patterns with a fraction a of the units set to 1. The sparse Hopfield NN is very sensitive to how the threshold is set (here the threshold was 0.04). The hypercolumn NNs were partitioned into 10 hypercolumns (b = 0.5). (a) For each network, the left bar shows the number of patterns stored and the right bar the amount of information stored. (b) Successfully retrieved patterns as a function of noise; the number of errors in the retrieval cues was varied from zero to ten.

¹ The factor 0.9 represents the probability of retrieving a pattern, and this probability roughly corresponds to the fraction of patterns that can be retrieved out of the P patterns stored in the NN.


Fig. 2. Empirical fit of network capacity to the analytic model. (a) Capacity as a function of b (number of hypercolumns H = N^b). (b) Capacity as a function of network size N for b = 0.5.

Fig. 1b shows how the number of retrieved patterns depends on the number of errors (flipped active units) in the retrieval cue. The sparse Hopfield NN shows a sharp drop in the number of retrieved patterns when the number of errors is increased beyond 3; this can be avoided by using active threshold regulation.

Empirical data was fitted against the analytic model of the sparse Hopfield NN with hypercolumns (Fig. 2). Fig. 2a shows how the number of retrieved patterns depends on the number of hypercolumns. A NN with 120 units was studied with zero and one error in the retrieval cues. When there are no errors in the cue the maximal capacity is reached for small b, i.e. few hypercolumns, while noisy cues favour a greater number of hypercolumns. The storage capacity increases with the number of units in the network. Fig. 2b shows that the model provides a tight fit to the empirical data when the NN is scaled up. The NN was partitioned into √N hypercolumns (b = 0.5).
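Noisy retrieval cues of the kind used in Fig. 1b and Fig. 2a can be produced by moving the active unit within a chosen number of randomly selected hypercolumns. The helper below is our own sketch with hypothetical names, not the authors' code.

```python
import numpy as np

def add_cue_noise(pattern, H, U, n_errors, rng):
    """Return a copy of `pattern` in which the active unit of `n_errors`
    randomly chosen hypercolumns has been replaced by another unit."""
    noisy = pattern.copy()
    for hc in rng.choice(H, size=n_errors, replace=False):
        start = hc * U
        active = int(np.argmax(noisy[start:start + U]))
        other = int(rng.choice([u for u in range(U) if u != active]))
        noisy[start + active] = 0.0
        noisy[start + other] = 1.0
    return noisy

# Example: corrupt 3 of the 10 hypercolumns of a stored pattern before retrieval, e.g.
# cue = add_cue_noise(patterns[0], H=10, U=10, n_errors=3, rng=np.random.default_rng(1))
```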


Fig. 3. The normalized amount of information stored, plotted against b for networks of size N = 10², 10⁴ and 10⁶. (a) With noise-free retrieval cues, the optimal value of b is shifted towards zero as the network size is increased. (b) With noisy retrieval cues, the optimal value is shifted towards one.

When the size of the NN is increased the value of b maximizing the information stored is shifted towards zero if there is no noise in the NN (Fig. 3a). If noise is introduced the optimal value of b is shifted towards one (Fig. 3b). In this simulation the input patterns had a constant level of 10 errors.

Discussion

The introduction of hypercolumns into the sparse Hopfield NN improved its performance significantly, beyond the performance increment due to the more structured hypercolumn patterns. The hypercolumn structure increased both the noise tolerance and the capacity of the network. The theoretical model provides a good fit to the empirical data even though it does not consider the iterative relaxation process.

A conclusion from the results is that the optimal partitioning of the NN into hypercolumns depends on the expected noise level of the cues. If cues are noiseless, large networks achieve optimal capacity with very large hypercolumns, while more numerous and smaller hypercolumns are more efficient in dealing with noisy cues. The three different types of Hopfield NN have roughly equal storage capacity measured in bits. There are learning rules that achieve a higher storage capacity, e.g. the BCPNN.

Finally, such a division into smaller modules is also relevant for the parallel implementation of large networks. Partitioning the NN into modules makes it possible to choose a computational grain size that suits the hardware, and it also helps to reduce the communication between the units.

References

1. Hubel, D.H. and Wiesel, T.N.: Uniformity of monkey striate cortex: A parallel relationship between field size, scatter and magnification factor. J. Comp. Neurol. 158 (1974) 295-306.
2. Holst, A. and Lansner, A.: A Higher Order Bayesian Neural Network for Classification and Diagnosis. In: Gammerman, A. (ed.): Applied Decision Technologies: Computational Learning and Probabilistic Reasoning. John Wiley & Sons, New York (1996) 251-260.
3. Lansner, A. and Ekeberg, Ö.: A one-layer feedback artificial neural network with a Bayesian learning rule. Int. J. Neural Systems 1 (1989) 77-87.
4. Sandberg, A., et al.: A Bayesian attractor network with incremental learning. To appear in Network: Computation in Neural Systems 13(2) (2002).
5. Ström, B.: A Model of Neocortical Memory, Using Hypercolumns and a Bayesian Learning Rule. Nada, KTH, Stockholm (2000).
6. Peterson, C. and Söderberg, B.: A New Method for Mapping Optimization Problems onto Neural Networks. Int. J. Neural Systems 1(3) (1989).
7. Levy, N., Horn, D. and Ruppin, E.: Associative Memory in a Multi-modular Network. Neural Computation 11 (1999) 1717-1737.
8. O'Kane, D. and Treves, A.: Short- and long-range connections in autoassociative memory. J. Phys. A 25 (1992) 5055-5069.
