
Progress in Business Innovation & Technology Management 002 (2012) 031-038

Contents lists available at ProBITM

Progress in Business Innovation & Technology Management APBITMS Homepage: http://apbitm.org

Reinforcement Learning Combined With Radial Basis Function Neural Network To Solve Job-Shop Scheduling Problem

Kuswara Setiawan*
Faculty of Computer Science, Universitas Pelita Harapan Surabaya, CITO Superblock Waru, Surabaya, Indonesia, 60234
Email: [email protected]

ABSTRACT
To complete jobs/tasks within their designated time periods, manufacturing companies utilize multiple machines, and job-shop scheduling is a critical element in completing those jobs/tasks. The schedule is a sequence of consecutive jobs that should finish in the minimum amount of time, while avoiding any conflict between the raw materials used by each job and its resource pool. This research applied the Reinforcement Learning (RL) method, implemented as Temporal Difference Learning (TDL). The TDL used the Gradient-Descent method, with a Radial Basis Function Neural Network serving as the approximation function. The input was an initial critical-path schedule that is not conflict-free; using the methods above, the conflicts could be eliminated gradually, so that a flexible job-shop schedule can readily be made by any manufacturing company. The implementation language is Borland Delphi 7.0, and all object structures and methods are kept as simple as possible so that they can be reused for the same problem in different applications.
© 2012 APBITM Society. All rights reserved.
Keywords: Machine, Industry, Job-shop scheduling, Reinforcement learning, Radial basis function neural network

I. Introduction
In the past few years, many manufacturing industries calculated the time needed by their machines to finish a set of jobs/tasks manually. This was time consuming and prone to human error. Computer technology, especially computer science, can help overcome this problem by using methods that automatically produce what is called the job-shop schedule. This research applies the Reinforcement Learning (RL) method, implemented as Temporal Difference Learning (TDL) with a Gradient-Descent update, in which a Radial Basis Function Neural Network appears as the approximation function.

II. Job-Shop Scheduling
Job-shop scheduling is defined as arranging a set of jobs/tasks on a set of machines in a proper order. Each machine can process only one job (task) at a time, and the raw materials are assumed to be sufficient at least for that machine. In other words, job-shop scheduling is the problem of scheduling a set of machines, each able to process one job at a time, over the many jobs that must be handled by those machines. The result is a schedule - the job sequence for each machine, without any conflict in the use of raw materials - whose total duration is minimal. There are many methods for obtaining a job-shop schedule; one of them is CPM (the Critical Path Method), but its result is not conflict-free and therefore not optimal. The CPM schedule is used here as the initial schedule, and by using the methods mentioned above this initial state is gradually improved until it finally yields a conflict-free schedule. Figure 1 shows 8 tasks with their durations, predecessors, and resource types.

Figure 1: Simple scheduling problem

Before making the critical path, the constraint diagram must be drawn.

Figure 2: Constraint diagram of simple scheduling problem (from Fig. 1)

Resource Pool   Type   Capacity
r11             R1     5
r12             R1     3
r21             R2     5
r22             R2     3

Figure 3: Resource pools of simple scheduling problem (from Fig. 1)
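The task data in Figure 1 is not reproduced in this text, so the sketch below uses hypothetical task names, durations, and predecessor lists to illustrate how a CPM-style earliest-start schedule (the initial state s0) can be computed; resource capacities are ignored, which is exactly why the result is generally not conflict-free.

```python
# Minimal CPM-style forward pass: earliest-start schedule that respects
# precedence constraints only (resource conflicts are ignored, so the
# result is generally not conflict-free). The task data is hypothetical.

tasks = {
    # name: (duration, [predecessors])
    "A": (3, []),
    "B": (2, ["A"]),
    "C": (4, ["A"]),
    "D": (2, ["B", "C"]),
}

def earliest_start_schedule(tasks):
    """Return {task: (start, finish)} using the CPM forward pass."""
    schedule = {}

    def finish(name):
        if name not in schedule:
            duration, preds = tasks[name]
            start = max((finish(p) for p in preds), default=0)
            schedule[name] = (start, start + duration)
        return schedule[name][1]

    for name in tasks:
        finish(name)
    return schedule

if __name__ == "__main__":
    for name, (start, end) in earliest_start_schedule(tasks).items():
        print(f"{name}: start={start}, finish={end}")
```

The makespan of this earliest-start schedule is the length of the critical path; in the approach described here it serves only as the starting point that the learning process then repairs.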


III. Reinforcement Learning
RL is the way a computer decides which action to take by observing its environment. Since the computer does not know in advance what the right action is, it must try one possible alternative; by making several trials it may reach the desired, optimal target. Every time it succeeds or fails, a reward is given. These two components - trial-and-error search and delayed reward - are the most important aspects of RL. The basic idea of RL is to capture the essential aspects of the problem an agent faces while interacting with its environment to achieve a specific goal. The agent is defined as something that continuously learns about its environment and makes decisions according to the current situation; the reward is the feedback given to the agent after it has made a decision.

An advantage of RL not available in other learning methods is the interplay between exploration and exploitation. To collect as many rewards as possible, an agent has to predict which action will be better than those taken previously; at the same time, while seeking the best action, it must also try actions that have never been taken and explore for better actions in the near future. Exploration and exploitation should be combined to achieve the best probability of a good result.

In solving the job-shop scheduling problem, the agent must handle a huge state space carefully. This space follows the Markov model: after the agent moves to a new state, the transitions between states must be known and their pattern found, so that the schedule improves.
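As a purely illustrative sketch of the trial-and-error loop with the exploration/exploitation balance described above, the code below runs one episode with an ε-greedy choice between a random action (exploration) and the action whose predicted successor value is highest (exploitation). The environment interface (`reset`, `actions`, `preview`, `step`) and the `value` function are hypothetical placeholders, not the paper's API.

```python
import random

def epsilon_greedy_episode(env, value, epsilon=0.1, max_steps=1000):
    """Run one trial, mixing exploration (random action) with
    exploitation (action whose predicted successor value is highest)."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        actions = env.actions(state)
        if not actions:
            break
        if random.random() < epsilon:
            action = random.choice(actions)   # exploration
        else:                                 # exploitation: one-step lookahead
            action = max(actions, key=lambda a: value(env.preview(state, a)))
        state, reward, done = env.step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward
```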

Figure 4: RL Modeling

The easiest way to define the reward in RL is the length of the schedule. Because schedule lengths vary widely, the length should be normalized using the length of the critical-path schedule. Based on the critical-path schedule, the utilization of the resources must be known. The following indices are useful as measures of resource utilization (a short code sketch of these indices is given after the list):

1. capacity(i) is the fixed capacity of resource type i, i.e. the total capacity of all resource pools of type i; u(i,t) is the utilization of resource type i in the schedule at time t. If u(i,t) > capacity(i), the utilization of resource type i exceeds its capacity at time t. The Resource Utilization Index RUI(i,t) for resource type i at time t is:

   RUI(i,t) = \max\!\left(1, \frac{u(i,t)}{capacity(i)}\right)

   If the resource does not exceed its limit, then RUI(i,t) = 1; otherwise the value is the ratio u(i,t)/capacity(i).

2. The Total Resource Utilization Index TRUI(s) for a schedule s of length l is the sum of the resource utilization index over all n resource types and all times t:

   TRUI(s) = \sum_{i=1}^{n} \sum_{t=1}^{l} RUI(i,t)

3. The total resource utilization index does not consider the location of the resource pools; it is determined only by the scheduling times. This index is useful for measuring the size of a schedule without paying attention to where the resources are located.

4. capacity(i,j) is the fixed capacity of resource pool j of resource type i, and u(i,j,t) is the usage of pool j at time t. If u(i,j,t) > capacity(i,j), resource pool j exceeds its limit at time t. The Pool Utilization Index PUI(i,j,t) for resource type i at time t is:

   PUI(i,j,t) = \max\!\left(1, \frac{u(i,j,t)}{capacity(i,j)}\right)

   If the pool does not exceed its limit, then PUI(i,j,t) = 1; otherwise the value is the ratio u(i,j,t)/capacity(i,j). The Total Pool Utilization Index TPUI(s) for a schedule s of length l, where resource type i has m_i pools, is:

   TPUI(s) = \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} \sum_{t=1}^{l} PUI(i,j,t)

5. The Resource Dilation Factor (RDF) of a schedule s is the ratio between the total pool utilization index of s and the total resource utilization index of the initial critical-path schedule s0:

   RDF(s) = \frac{TPUI(s)}{TRUI(s_0)}
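The sketch below computes these indices under the assumption that per-time-step usages u(i,t) and u(i,j,t) are available as nested dictionaries of lists; that data layout, and the helper names, are illustrative assumptions rather than the paper's data structures.

```python
# Sketch of the utilization indices. u_type[i][t] is the usage of resource
# type i at time t; u_pool[i][j][t] is the usage of pool j of type i at time t.

def rui(u_type, capacity, i, t):
    """Resource Utilization Index: max(1, u(i,t) / capacity(i))."""
    return max(1.0, u_type[i][t] / capacity[i])

def trui(u_type, capacity, length):
    """Total Resource Utilization Index: sum of RUI over all types and times."""
    return sum(rui(u_type, capacity, i, t)
               for i in u_type
               for t in range(length))

def pui(u_pool, pool_capacity, i, j, t):
    """Pool Utilization Index: max(1, u(i,j,t) / capacity(i,j))."""
    return max(1.0, u_pool[i][j][t] / pool_capacity[i][j])

def tpui(u_pool, pool_capacity, length):
    """Total Pool Utilization Index: per-type pool averages summed over times."""
    total = 0.0
    for i, pools in u_pool.items():
        m_i = len(pools)
        total += sum(pui(u_pool, pool_capacity, i, j, t)
                     for j in pools
                     for t in range(length)) / m_i
    return total

def rdf(u_pool, pool_capacity, length, trui_s0):
    """Resource Dilation Factor: TPUI(s) / TRUI(s0)."""
    return tpui(u_pool, pool_capacity, length) / trui_s0
```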

The Resource Dilation Factor (RDF) is used to normalize the variation of schedule length across different scheduling problems. For the final state s of any scheduling solution, TRUI(s) = TPUI(s) always holds, because in the final schedule no resource exceeds its limit, so RUI(i,t) = PUI(i,j,t) = 1. The total resource utilization index of the initial schedule, TRUI(s0), counts the resources that exceed their limits in the initial state s0 and gives a rough indication of how difficult the scheduling problem is. This index also has a fixed value for every critical-path schedule, regardless of how the resource pools are allocated. Thus it is useful for normalizing the final schedule length to yield the Resource Dilation Factor (RDF).

IV. Deterministic Decision Process
The scheduling problem solved with reinforcement learning can be seen as a Deterministic Decision Process. Each state of this process is a complete schedule whose temporal constraints are satisfied but which may still contain resource conflicts. The initial state, s0, is the critical-path schedule. Two operations are used to improve this initial schedule: pool-reassignment and task-move. The final state is a conflict-free schedule. The process is deterministic, and each operation yields a unique schedule. The search space contains cycles (repetitions); for example, a task may be moved earlier than its time interval in order to resolve a conflict in one resource type and then moved back to its initial position because of a conflict in another resource type.

The reinforcement R(s,a,s') is set to a negative value -c = -0.001, fixed for each action that does not yield a conflict-free schedule. This encourages the learner to find the desired schedule in as few actions as possible and helps the learning process avoid cycling. For each action that produces a conflict-free schedule s', the reinforcement value R(s,a,s') equals -RDF(s'). This value pushes the search toward short-route schedules, since a longer route incurs a bigger penalty. Using the final RDF-based reward makes the search bolder while still guaranteeing that a final schedule is reached, because a mechanism for avoiding repetitions is already in place. Thus, the reinforcement function used is:

 

   R(s,a,s') = \begin{cases} -\mathrm{RDF}(s'), & s' \text{ is a final (conflict-free) state} \\ -0.001, & \text{otherwise} \end{cases}

The use of a fixed cost for non-final actions is based not only on the desire to find the shortest route from the initial state to the final one, but also on minimizing the total CPU time needed to find that schedule. In general, a move operation needs more time than a pool-reassignment, but the difference in execution time between these two operations is small compared with the time needed for the improvement steps themselves: for one improvement, the CPU spends most of its time generating and evaluating all possible improvements. The cost c is determined by two factors: the permitted relative error \varepsilon_{RDF} in accepting a better final schedule, and the number of extra steps, Steps, allowed to reach a schedule that is better by \varepsilon_{RDF}. It can therefore be written as:

   c = \frac{\varepsilon_{RDF}}{Steps}

The smaller c is, the more accurate the reinforcement learning model becomes, so that the optimal schedule is found. If a function approximation is used, however, a value of c that is too small loses its effect. In this research c is set to 0.001, meaning \varepsilon_{RDF} = 0.01 and Steps = 10: a schedule is considered best if its error \varepsilon_{RDF} < 0.01 and it can be detected within 10 steps ahead.
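A minimal sketch of this reinforcement function is given below; `is_conflict_free` and `rdf_of` are hypothetical helpers standing in for the conflict check and the RDF computation sketched earlier.

```python
STEP_COST = 0.001  # the fixed cost c for every action that is not final

def reinforcement(next_schedule, is_conflict_free, rdf_of):
    """R(s, a, s'): -RDF(s') when s' is a final (conflict-free) schedule,
    otherwise the small fixed penalty -c that discourages long repair paths."""
    if is_conflict_free(next_schedule):
        return -rdf_of(next_schedule)
    return -STEP_COST
```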

As a value function, the reward is not a discounted cumulative sum, since the final reward RDF is very important and should not be discounted. The actual value function is therefore a linear combination of the final RDF and the number of improvements needed to reach the final schedule. For any state s, the value of the policy, f^{\pi}(s), is given by the following formula, where s_N is the conflict-free schedule and N is the number of steps taken to reach s_N:


f  s    R( s j , a, s j 1 ) s0  s j 0

N 1

  Rs j , a, s j 1  j 0

  RDF sN   0.001N  1

V. Temporal Difference Learning TD(λ)
It is not necessary to build a policy that covers the whole state space; the learning process is directed toward a policy that searches only the interesting part of the state space. TD(λ) is the best-suited algorithm for this problem, and it can be combined with a feed-forward neural network as the value function. In this research the neural network used is a Radial Basis Function Neural Network, which accelerates both learning and execution. The value function is written f(s, W), where W is the weight vector of the neural network.

TD(λ) works as follows. Assume that s0, s1, s2, ..., sn is the sequence of states visited under the current policy, with reinforcements R(s1), R(s2), ..., R(sn). At step j+1, the temporal-difference error for step j is:

   J_j = \left[ f(s_{j+1}, W) + R(s_{j+1}) \right] - f(s_j, W)

TD(λ) then computes a smoothed gradient (the eligibility trace):

   e_j = \nabla_W f(s_j, W) + \lambda\, e_{j-1}

and changes the weights of the neural network according to:

   \Delta W = J_j\, e_j

The factor λ is a parameter that gradually combines the previous gradients with the present one in e_j. In reinforcement learning, the TD(λ) algorithm is applied to learn the value function of the optimal policy, starting from a random policy.

To adapt TD(λ) learning to job-shop scheduling, TD(λ) is used on-line over the sequence of visited states, and the reinforcement in each state follows from the action chosen with the current estimate of the value function f. During learning, each state s is evaluated with a one-step lookahead using that function: the states resulting from applying each applicable operator are evaluated, and the action leading to the resulting state s' with the maximal predicted value is chosen. After taking this action and receiving the reward, the estimate of f is adjusted toward the difference between the value f(s) and the more informative target R(s') + f(s'). This means the policy keeps changing during the learning process.
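As a hedged illustration of the update rules above, the sketch below implements one on-line TD(λ) step with an eligibility trace. The value function `f`, its gradient `grad_f`, and the learning rate `alpha` are assumptions for the sketch (the paper does not specify a learning rate); with the linear RBF approximator of the next section, `grad_f` would simply return the feature vector.

```python
import numpy as np

def td_lambda_step(W, e, s_j, s_next, R_next, f, grad_f, alpha=0.01, lam=0.7):
    """One on-line TD(lambda) update with an eligibility trace.

    W      : weight vector of the value approximator f(s, W)
    e      : eligibility trace from the previous step (zeros at episode start)
    f      : callable f(s, W) -> estimated value of state s
    grad_f : callable grad_f(s, W) -> gradient of f w.r.t. W at state s
    alpha  : learning rate (an assumed hyperparameter, not given in the paper)
    lam    : the lambda factor that blends old gradients into e
    """
    # Temporal-difference error J_j: target R(s') + f(s') versus current f(s).
    J = (f(s_next, W) + R_next) - f(s_j, W)
    # Smoothed gradient (eligibility trace): e_j = grad f(s_j, W) + lambda * e_{j-1}.
    e = grad_f(s_j, W) + lam * e
    # Weight change Delta W = alpha * J_j * e_j (alpha added for a usable step size).
    W = W + alpha * J * e
    return W, e
```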


VI. Implementation
The block diagram of the learning system using TD learning with a neural network is shown in Figure 5 below.

Figure 5: Block Diagram of Learning System

It is important to define how a scheduling state is represented as a vector, because the neural network used as the function approximator can only accept inputs in vector form, and the length of a schedule varies from one instance to another. A fixed set of features must therefore be chosen so that the neural network can predict a state value. They are:
1. mean and standard deviation of the free capacity of the bottleneck pools
2. mean and standard deviation of slack
3. RDF
4. redundant allocation index
5. percentage of the conflict window

The Radial Basis Function provides a linear function approximation:

   \phi_s(i) = \exp\!\left( -\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right)

The gradient-descent update also uses this linear approximation; for a linear architecture, the gradient of the value estimate with respect to the parameters is the feature vector itself:

   \vec{\phi}_s = \nabla_{\vec{\theta}_t} V_t(s)

The experiment in this research used a data set of 13 instances, with 2 kinds of resources and 9-10 jobs each. The results are shown in the following table and graph.

Spread   k    Comparison between RBF and Backpropagation
0.001    4    Radial Basis Function 2.76x Backpropagation
0.001    15   Radial Basis Function 2.29x Backpropagation
0.9      4    Backpropagation 1.46x Radial Basis Function
0.9      15   Radial Basis Function 3.85x Backpropagation

Table 1: Result Comparison


Figure 6: RBF (Spread=0.001; k=15) compared with Backpropagation NN

VII. Conclusion
• Memory usage is quite large, relative to the number of resources, resource pools, and jobs.
• Reinforcement learning and temporal difference learning with RBF as a linear function approximation cooperate very well with the gradient-descent method.
• RBF as a function approximation is an alternative to backpropagation, and in some cases RBF outperforms backpropagation by a factor of 1 to 3.8.

REFERENCES
Conway, Richard W., Maxwell, William L., and Miller, Louis W. Theory of Scheduling. Addison-Wesley Publishing Company.
Kaelbling, Leslie Pack, Littman, Michael L., and Moore, Andrew W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4, 237-285.
Matteucci, Matteo. A Tutorial on Clustering Algorithms. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
Setiawan, Kuswara (2004). Paradigma Sistem Cerdas. PT Bayumedia.
Sutton, Richard S. and Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (A Bradford Book). http://www-anw.cs.umass.edu/~rich/book/the-book.html
Zhang, Wei (1996). Reinforcement Learning for Job-Shop Scheduling. PhD dissertation, Oregon State University.
