
Curriculum Vitae

Name: Dean Earl Wright III

Permanent Address: 422 Shannon Court, Frederick, MD 21701

Degree and date to be conferred: Doctor of Philosophy, May 2009.

Date of Birth: 27 November 1954.

Place of Birth: La Rochelle, France.

Secondary Education: McCluer High School, Florissant, Missouri.

Previous Degrees:
Hood College, Master of Business Administration, 2005
Hood College, Master of Science (Computer Science), 2001
Hood College, Bachelor of Science (Computer Science), 1998
Frederick Community College, Associate in Arts (Business Administration), 1993

Professional publications:
Michael L. Anderson, Matt Schmill, Tim Oates, Don Perlis, Darsana Josyula, Dean Wright and Shomir Wilson. Toward Domain-Neutral Human-Level Metacognition. In Proceedings of the 8th International Symposium on Logical Formalizations of Commonsense Reasoning, pages 1–6, 2007.
M. Schmill, D. Josyula, M. Anderson, S. Wilson, T. Oates, D. Perlis, D. Wright and S. Fults. Ontologies for Reasoning about Failures in AI Systems. In Proceedings of the Workshop on Metareasoning in Agent-Based Systems, May 2007.
Michael L. Anderson, Scott Fults, Darsana P. Josyula, Tim Oates, Don Perlis, Matthew D. Schmill, Shomir Wilson, and Dean Wright. A Self-Help Guide For Autonomous Systems. To appear in AI Magazine, 2007.

Professional positions held:
CRW/Logicon/Northrop Grumman, January 1985–July 2007
Scientific Time Sharing Corporation (STSC), January 1977–January 1985

ABSTRACT

Title of Thesis: Reengineering the Metacognitive Loop

Dean Earl Wright III, Ph.D. Computer Science, 200x

Dissertation directed by: Tim Oates, Professor, Department of Computer Science and Electrical Engineering

The field of Artificial Intelligence has seen steady advances in cognitive systems, but many of these systems perform poorly when faced with situations outside of their training or in a dynamic environment. This brittleness is a major problem in the field today. Adding metacognition to such systems can improve their operation in the face of perturbations. The Metacognitive Loop (MCL) (Anderson et al. 2006) works with a host system, monitoring its sensors and expectations. When a failure is indicated, MCL advises the host system on corrective actions. Past implementations of MCL have been hand-crafted and tightly integrated into their host systems. MCL is being reengineered to provide a C language API and to do Bayesian inference over a set of indication, failure, and response ontologies. These changes will allow MCL to be used with a wide variety of systems. To prevent brittleness within MCL itself, several items need to be addressed. MCL must be able to resolve host system failures when there is more than one indication of the failure or when a second indication occurs while MCL is attempting to help the host system recover from the failure. MCL also needs the ability to monitor itself and improve its own operation. A twenty-month plan is proposed to enhance MCL as described and to measure (1) the effectiveness of MCL in improving the operation of the host system; (2) MCL's operational efficiency in terms of additional computational resources required; and (3) the effort needed to incorporate MCL into the host system.

Reengineering the Metacognitive Loop

by Dean Earl Wright III

Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2009

© Copyright Dean Earl Wright III 2007

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

Chapter 1  INTRODUCTION

Chapter 2  BACKGROUND
  2.1  Metacognition
  2.2  Metacognitive Loop
    2.2.1  Note
    2.2.2  Assess
    2.2.3  Guide
    2.2.4  Example

Chapter 3  TECHNICAL APPROACH
  3.1  MCL Ontologies
    3.1.1  Indications
    3.1.2  Failures
    3.1.3  Responses
    3.1.4  Intra-ontology linkages
  3.2  Bayesian Inference
  3.3  Support for Reasoning Over Time
  3.4  Recursive Invocation
  3.5  C Language Application Program Interface
  3.6  Extending the API to Multiple Languages

Chapter 4  METHODOLOGY
  4.1  Evaluation Criteria
    4.1.1  Effectiveness in Improving Host System Operation
    4.1.2  Additional Computational Resources Required
    4.1.3  Implementation Effort
    4.1.4  Breadth of Deployment
  4.2  Evaluation Domains
    4.2.1  Chippy
    4.2.2  Windy Grid World
    4.2.3  WinBolo

Chapter 5  PRELIMINARY RESULTS
  5.1  Grid World Implementation
  5.2  BoloSoar Implementation
  5.3  Chippy Agent With Ontology-Based MCL

Chapter 6  RELATED WORK
  6.1  Pre-ontology Metacognitive Loop
  6.2  Early Ontology Metacognitive Loop
  6.3  Model-Based Reflection
  6.4  Multi-agent Metacognition

Chapter 7  FUTURE WORK
  7.1  Automatic Expectation Generation
  7.2  Automatic Ontology Expansion/Linking
  7.3  Application to Multi-agent systems
  7.4  Transferring Learning with MCL Networks
  7.5  Modeling dynamic environments

Chapter 8  TIMETABLE
  8.1  Activities
  8.2  Monthly Schedule

REFERENCES

LIST OF FIGURES

2.1  Metacognitive Monitoring and Control
2.2  Software agent interactions with the environment
2.3  Software agent with metacognition
2.4  Host systems with MCL support
3.1  Ontologies as additional metaknowledge
3.2  Ontological Linkages
3.3  Divergence Nodes in the Indications Ontology
3.4  Failure Ontology
3.5  Response Ontology
3.6  Example Ontology Connections
3.7  Conditional probability tables for portion of MCL response ontology
3.8  Reentrant MCL
3.9  Expectations arranged in single (a) and multiple groups (b)
3.10 MCL providing metacognitive services to MCL
4.1  An 8x8 “Chippy” grid world with two rewards
4.2  A Chippy policy after 1,000 moves
4.3  A Chippy policy after 5,000 moves
4.4  A Chippy policy after 1,000,000 moves
4.5  Q-Learning in the Chippy Grid World
4.6  Q-Learning before and after perturbation
4.7  Q-Learning with exploration rate set to 0 after policy learned
4.8  The windy grid world with both single and double offsetting columns
4.9  A 15 step path from the start to the goal
4.10 The two moves that lead to the goal and the seven squares that can not be entered
4.11 SARSA exploration of the Windy Grid World showing cumulative moves over multiple episodes
4.12 A Windy grid world policy learned by SARSA after 170 episodes
4.13 Windy grid world optimum policy with Q values. The best path from the start to the goal is underlined.
4.14 A WinBolo console with menus, actions, primary and secondary displays.
5.1  Effect on Chippy perturbation recovery of varying the learning rate
5.2  Effect on Chippy perturbation recovery of varying the exploration rate
5.3  Effect on Chippy perturbation recovery of varying the discount rate
5.4  WinBolo tank outside small maze
5.5  C code to add WinBolo status information to Soar's input structure
5.6  Soar rules to land tank
5.7  C++ code to initialize MCL interface for Chippy
5.8  C++ code to define the sensors for Chippy using the MCL API
5.9  C++ code to define the expectations for Chippy using the MCL API
5.10 C++ code for Chippy to implement suggestions from MCL
5.11 Average Rewards per Step for Chippy with and without MCL monitoring
6.1  Monitoring a multi-agent system with a sentinel is isomorphic to using metacognition with a single cognitive agent.

LIST OF TABLES

1.1  Research contributions
3.1  Ontologies used with NAG cycle
3.2  Indication Ontology Nodes
3.3  Sensor, Divergence and Core Node Indications Ontology Linkages
3.4  Concrete Responses
4.1  MCL Platforms and Languages
4.2  Wind Speed and Directions in the Seasonal Windy Grid World
5.1  Rewards for Chippy after perturbation with varying Q-Learning rates
6.1  Monitor and control in MCL and Sentinels
8.1  Monthly Timetable for MCL Reengineering

Chapter 1

INTRODUCTION

The field of Artificial Intelligence has seen steady advances in cognitive systems. AI has acquitted itself well in one area after another: theorem proving, game playing, machine learning, and more. While advances in computer speeds and memory sizes have certainly helped, the majority of the achievements have come from new algorithms and experience in applying them. Without detracting from AI's accomplishments, many of these cognitive systems perform poorly when faced with situations outside of their training or in a dynamic environment. A robot trained to search out a dark blue goal may or may not detect a light blue one. A robotic car trained to drive on American roads will likely be a danger to itself and others if transported to a country with left-hand driving rules. A switch between instruments reading in miles and instruments reading in kilometers may doom a spacecraft. The transition between laboratory training and successful real-world operation remains a major challenge. To cope with possible future encounters, additional rules and/or more training can be used, but this increases the cost and lengthens the time between conception and deployment. Alternatively, greater ability can be given to the agent to explore, learn, and reason about its environment, but this too raises the manufacturing and operational costs.


Adding more capabilities to an agent to cope with possible perturbations increases the complexity of the agent and, except during periods of perturbation, may decrease performance. Adding metacognition to such systems can improve their operation in the face of such perturbations without sacrificing performance during normal operations. Only when a problem has been noticed does the problem-correction code need to be active. Metacognition can monitor an agent's performance and invoke corrective action only when needed.

The Metacognitive Loop (MCL) is a metacognitive approach that has been applied to a variety of systems (Anderson & Perlis 2005). It works with a host system, monitoring its sensors to ensure that they are within expected levels. As long as all the expectations are met, MCL remains purely in a monitoring state, allowing the agent to function unimpeded by extraneous problem-handling code. When a failure is indicated by a violation of one or more expectations, MCL examines the violations, determines possible failures, and advises the host system on potential corrective actions. The agent implements the corrective action (e.g., removing a recent rule from the KB, adjusting the exploration parameter, re-evoking a previous procedure). The agent then continues to operate, leaving the performance evaluation tasks to MCL.

Past implementations of MCL have been hand-crafted and tightly integrated into their host systems. MCL had direct access to the agent's sensors. The failure of expectations was tied directly to corrective actions. The logic of the agent and MCL were separate, but their implementations were intertwined. These implementations were successful in showing that an MCL-enhanced agent performed better in a dynamic, perturbed situation than an agent without such support.

To allow MCL to be implemented with a larger number of systems, MCL is being reengineered to provide an Application Programming Interface (API). This will provide a clean separation between MCL and the host system. The API will be developed for the C language and then (with the use of the open source SWIG package) extended to other programming languages. But, more than just having a new facade, MCL is being reengineered internally to do Bayesian inference over a set of indication, failure, and response ontologies. The three ontologies and the links between them capture the knowledge of how problems manifest themselves and what the appropriate corrective actions are. Using Bayesian inference over probabilities on inter- and intra-ontology links, the most likely failure and the most effective corrective action can be determined.

By having an API that can be used regardless of the implementation language of the host (and internal processes that are generalized and probabilistic, per above), a wide variety of systems will be able to incorporate MCL for metacognitive monitoring and control. MCL will no longer have to be implemented in the agent's programming language and intertwined with the agent's processing, with every new use of MCL requiring MCL to be re-implemented. Now a single implementation can be created, leveraging the MCL/host-system division made possible by the API and the computational power inherent in the Bayesian-augmented ontologies.

The reengineering of MCL will also allow some brittleness problems within MCL itself to be addressed. When a system fails to achieve its goal, one or more expectations may have been violated. Multiple expectation violations may indicate that there is more than one problem or only a single problem. Likewise, multiple problems can manifest themselves in a single expectation violation, so a single violation is not a guarantee of a single problem. Being able to disambiguate failures from expectation violations is expected to come as a benefit of the new ontologies and the use of Bayesian inference.

A second problem that will be addressed within the new MCL framework is that of reentrant invocations, that is, the ability to resolve a host system failure when a second exception occurs while MCL is attempting to help the host system recover from an earlier exception. To effectively guide the host system we need to know whether this second exception is an indication of the same problem as the first failure or a new problem. The ontologies and Bayesian inference will provide part of the answer, but MCL must also be able to maintain (create, update, and eventually discard) a context record that allows it to reason about failures over time.

Just as MCL can improve the operation of a host system, MCL should also be able to assist itself through a recursive invocation. The meta-MCL would provide the same services and facilities that MCL provides to a host system: monitoring sensors for exceptions and suggesting corrective action when an expectation has been violated. In the case of meta-MCL, the sensors would be MCL's internal performance metrics and the corrective actions would be to adjust MCL's ontologies and conditional probability tables.

While it may not be possible to create an AI agent that can function optimally in all situations, metacognition in the form of the Metacognitive Loop can be efficiently applied to many types of AI architectures to improve their performance in dynamic environments. The new MCL will be used with several systems (some of which have been used with the older version of MCL, along with some new domains). The performance of the MCL-enhanced systems versus the base versions will be compared to show the effectiveness of MCL. Source lines of code will be used as a measure of the implementation cost of MCL. CPU, elapsed time, and memory usage will be used to show that the cost of adding MCL is nominal.

In summary, I propose to improve the operation of AI agents by enhancing and extending the existing MCL. Changes will be made to improve the generality and the robustness of MCL. The contributions of this research are listed in Table 1.1.


Table 1.1. Research contributions

Category     Changes
Generality   Ontologies
             Bayesian inference
             Cross platform/language API
Robustness   Reentrant invocation
             Recursive invocation
Other        Metrics and evaluation

Chapter 2

BACKGROUND

This chapter describes metacognition for use with computer systems and the particular implementation, the Metacognitive Loop (MCL), whose extensions will be the basis for the contributions of the proposed research.

2.1 Metacognition

The philosophical origins of metacognition may be traced to the dictum “know thyself.” Metacognition is studied as part of developmental and other branches of psychology. While there are several different approaches, one common model is a cognitive process that is monitored and controlled by a metacognitive process, as in Figure 2.1. Metacognition can be studied in conjunction with metaknowledge (knowledge about knowledge) and metamemory (memory about memory) (Cox 2005).

FIG. 2.1. Metacognitive Monitoring and Control

The canonical depiction of a software agent (Figure 2.2) has sensors to perceive the environment and activators with which the agent tries to control it. Metacognition can be layered onto a software agent, with its metamemory and metaknowledge, so that the metacognitive process monitors and controls the cognitive process of the software agent, as shown in Figure 2.3.

FIG. 2.2. Software agent interactions with the environment

FIG. 2.3. Software agent with metacognition

Metacognition improves the performance of the agent in the environment by providing two control functions. The first is to inform the agent when a cognitive task (e.g., the selection of the next action to perform) has been satisfactorily achieved so that the agent can move on to another task (such as performing the selected action). For some agents, the cognitive sufficiency test is built into the cognitive process itself. For example, the cognitive task may be limited to selecting (based on a specified estimated utility) between a fixed number of choices. In such a case there is no opportunity for metacognitive intervention.

The second metacognitive control function is to reflect on the performance of the agent. This can be done at the completion of a successful task, but it is most often performed after a failure. The metacognitive process evaluates the decisions made by the agent and determines where an alternative selection would have been more appropriate, or it may suggest a change to the agent's current cognitive state such as invoking a learning module. Reflection can also be done on a continuous basis, allowing small deviations from the expected to trigger controls that prevent future failures.

2.2 Metacognitive Loop

The purpose of the metacognitive loop (MCL) is to improve the operation of the host system by dealing with unexpected events (Anderson & Perlis 2005). It does this by adding a metacognitive layer to the host that is concerned with monitoring the operation of the host system and taking corrective action when it detects a problem. Figure 2.4 shows a cognitive host system with MCL.

FIG. 2.4. Host systems with MCL support

MCL consists of three phases that implement its metacognitive knowledge about problem detection, fault isolation, and corrective action for cognitive agents. These three phases correspond to the process often used by humans, where (1) we notice that something is not working, (2) we make decisions about it (whether the problem is important, how likely it is to get worse in the future, whether it is fixable, etc.), and then (3) we implement a response based on the decisions that were made (ignore the problem, ask for help, attempt to fix the problem using trial-and-error, etc.).

2.2.1 Note

The MCL process starts with the Note phase, which provides the host system with a “self-awareness” component. MCL monitors the host system to detect differences between expectations and observations. An anomaly occurs when an expectation is violated. An expectation is a statement about the allowable values for a sensor. Statements such as “the mixing vat temperature will not exceed 170 degrees” and “the flow in the coolant pipe will be between 80 and 90 gallons per second” are expectations made about external sensors. Expectations can also be about internal host processes, such as “a new plan will be generated no more than 5 seconds after a new subgoal has been made the current subgoal.” When the sensor information is at odds with the expected values, an anomaly is noted and MCL moves to the Assess state.
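To make the Note phase concrete, the sketch below shows one minimal way an expectation could be represented in code: a named predicate over a sensor reading that is checked on every update. This is an illustrative C++ sketch only; the Expectation type and note() function are hypothetical and are not part of the MCL API described in Chapter 3.

#include <functional>
#include <string>
#include <vector>

// Illustrative sketch only: an expectation as a named predicate over a sensor value.
struct Expectation {
    std::string description;
    std::function<bool(double)> withinLimits;   // true while the expectation holds
};

// Note phase in miniature: return the descriptions of any violated expectations.
std::vector<std::string> note(const std::vector<Expectation>& expectations,
                              double sensorValue) {
    std::vector<std::string> violations;
    for (const auto& e : expectations) {
        if (!e.withinLimits(sensorValue)) {
            violations.push_back(e.description);
        }
    }
    return violations;
}

int main() {
    std::vector<Expectation> vatExpectations = {
        {"mixing vat temperature will not exceed 170 degrees",
         [](double temperature) { return temperature <= 170.0; }},
    };
    // A reading of 180 degrees violates the expectation, which would start the NAG cycle.
    return note(vatExpectations, 180.0).empty() ? 0 : 1;
}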

2.2.2 Assess

In the Assess state, MCL attempts to determine the cause of the problem that led to the anomaly and the severity of the problem. The computation done in this phase need not be excessive. Indeed, it is the philosophy of MCL that lightweight, efficient problem analysis is better than ignoring the problem, attempting to design out every conceivable problem, or attempting to model and monitor large portions of the world. In some implementations of MCL this phase is almost nonexistent, with a direct connection between the exception and the corrective action.

2.2.3 Guide

The third state of MCL is Guide, where MCL attempts to guide the host system back to proper operation by offering a suggestion as to what action(s) will return the sensor values to within the limits set by the expectations. The suggestions available in this phase vary depending on the features of the host system. Once the suggestion has been made, MCL returns to the exception monitoring state. Any new exceptions will cause MCL to again enter the Note, Assess, and Guide phases of the NAG cycle.

2.2.4 Example

A Mars rover tasked with exploring geological formations on the red planet also has to manage power consumption. When a low battery alarm causes the rover to initiate a return to the recharging station, it plans a path avoiding known obstacles. The path leads it over a dust field which, while not an obstacle as such, requires additional motive power. The additional power consumption drains the rover's battery, ending its mission. If the rover had an MCL component, the additional power consumption would have been noted as an indication of a problem. An assessment would have been made, with a response of re-planning the path to the recharging station with the dust field now classified as an obstacle.

Chapter 3

TECHNICAL APPROACH

The Metacognitive Loop has been shown to be effective in lessening the problem of brittleness in cognitive systems. MCL was added to those systems as a customized enhancement, each one slightly different to correspond to the implementation language and target machine of the cognitive system. The Note, Assess, and Guide steps were also tailored to the domain and host system. While this approach works fine on a small scale, making MCL available for cognitive systems in general will require an implementation that works for a variety of host systems in different domains, implemented in different languages for different machines.

Several areas of MCL will need enhancing to provide the benefits of metacognition as a general service. This work will require both research and system engineering efforts. The research areas include:

• Using Bayesian inference over Indications, Failure and Response ontologies for metacognitive reflection;

• Reasoning over time about multiple error indications;

• Using metacognition to improve the metacognitive process itself.

The system engineering effort will be to provide MCL services in a package that can be easily used across many different implementation environments.

3.1 MCL Ontologies

For MCL to serve as a general-purpose tool for the brittleness problem for cognitive systems, it must be able to perform its Note, Assess, and Guide phases without needing extensive tailoring for each domain. MCL should be able to reason using mainly abstract, domain-neutral concepts to determine why a system is failing and how to cope with the problem. To support this, three ontologies were created (one for each phase of the NAG cycle). These ontologies are additional metaknowledge for MCL, as shown in Figure 3.1. Each of the three ontologies is used by a different phase of MCL (see Table 3.1). The Indications ontology is used in the Note phase when sensor input shows that an expectation has been violated. The Assess phase uses the Failure ontology to determine likely causes of the violated expectations. Once likely causes of the failure have been identified, the Guide phase uses the Response ontology to determine appropriate responses to the failure.

Table 3.1. Ontologies used with NAG cycle

Phase   Ontology
Note    Indications
Assess  Failure
Guide   Responses

Elements within each ontology are linked to others in the same ontology to show an “is-a” relationship. For example, the “sensor not responding” node in the Failure ontology is connected to the “sensor failure” node to show that “sensor not responding” is a type of “sensor failure.”


FIG. 3.1. Ontologies as additional metaknowledge

Elements in one ontology may also be linked to elements in a different ontology to show a possible “cause-and-effect” or “problem-solution” relationship. The general pattern of ontology linkage is shown in Figure 3.2. This figure also shows how the expectations are linked to the indications ontology elements and how the elements of the response ontology lead to suggestions that MCL gives to the host system. The sensors and expectations are part of the “concrete” realm of the host system. Processing by MCL moves from the concrete expectations to the abstract indication, failure, and response ontologies, and then back to the concrete suggestions implemented by the host system. Figure 3.2 shows the division between the concrete and abstract processing. The next sections expand on this process, going into each of the three ontologies in greater detail.


FIG. 3.2. Ontological Linkages

3.1.1 Indications

The Indications ontology is comprised of three types of nodes (see Table 3.2). The purely abstract indication nodes support concepts that cross multiple domains. These make up the core of the indications ontology. These nodes represent concepts such as “deadline missed”, “failed to change state”, and “reward not received”. The sensor nodes of the indications ontology model the sensors of the host system and their attributes. When the sensors of the host system are defined, sensor nodes are added to the indications ontology. Additional nodes are added to the ontology for the expectations for the values of the sensors. The third set of nodes in the indications ontology forms a linkage from the concrete sensor and expectation nodes to the abstract, core nodes of the ontology. The divergence nodes define how expectations can be violated. Figure 3.3 shows these nodes and the relationships between them.

Table 3.2. Indication Ontology Nodes

Type        Nodes
Core        deadlineMissed, rewardNotReceived, resourceOverflow, resourceDeficit, failedStateChange, unanticipatedStateChange, assertedControlUnchanged
Sensor      state, control, temporal, resource, reward, message, ambient, objectProp, spatial, critical, noncritical, discrete, ordinal, maintenance, effect
Divergence  divergence, aberration, cwa-violation, cwa-decrease, cwa-increase, breakout-low, breakout-high, missed-target, missed-unchanged, short-of-target, long-of-target, over, under, late

The three free-standing nodes (over, late, and under) are not part of the divergence tree structure but, like the other divergence nodes, can be used to further define the exception. In the lower left nodes, “cwa” stands for Closed World Assumption.

FIG. 3.3. Divergence Nodes in the Indications Ontology

It is the violation of expectations that starts the MCL NAG cycle. The type of violation and the type of sensor are linked together to a core indications ontology node. Table 3.3 shows sensor and divergence nodes linked to core nodes.

Table 3.3. Sensor, Divergence and Core Node Indications Ontology Linkages

Core Node                   Sensor Node  Divergence Node
Deadline missed             temporal     late
Reward not received         reward       under
Resource overflow           resource     over
Resource deficit            resource     under
Failed state change         state        missed-unchanged
Unanticipated state change  state        aberration
Asserted control unchanged  control      missed-unchanged

3.1.2 Failures

Once the violated expectations have been evaluated in the Note phase, MCL proceeds, in the Assess phase, to evaluate the problem indications and determine the cause. The Failure ontology is used in this problem determination. This phase is used (rather than mapping indications directly to responses) because of the ambiguous nature of failures and their indications: two different failures which need different responses might have the same initial problem indications, and a single problem might manifest itself with multiple indications. The Failure ontology (Figure 3.4) is a catalog of the various problems that can befall cognitive systems. This includes problems with sensors, effectors, resources, and the domain model (or models). The links in the failure ontology are all of the is-a variety. Thus a “sensor malfunction” is-a “sensor error” is-a “knowledge error” is-a “failure”. The “failure” node is the root of the Failure ontology and all Failure ontology nodes eventually lead to it.


FIG. 3.4. Failure Ontology

3.1.3 Responses

As the Failure ontology was an itemized list of everything that can go wrong with a cognitive system, the Response ontology (Figure 3.5) is a list of everything that can be done about it. There are two types of nodes in the Response ontology: abstract and concrete. The abstract nodes represent general problem-solving techniques and the concrete nodes represent specific suggestions that MCL can send to the host system. Table 3.4 lists the concrete responses and the abstract nodes that directly link to them. The remaining links within the response ontology are for “is-a” relationships. For example, “Strategic Change” is-a “System Response” is-a “Internal Response” which is-a “Response”. The “Response” node is the root of the Response ontology.

Table 3.4. Concrete Responses

Concrete Node            Abstract Node
Solicit suggestion       Ask for help
Relinquish control       Ask for help
Run sensor diagnostic    Run diagnostic
Run effector diagnostic  Run diagnostic
Activate Learning        Modify Predictive Model
Rebuild Models           Modify Predictive Model
Adjust Parameters        Modify Procedure Model
Revisit Assumptions      Modify Procedure Model
Revise Expectations      Modify Avoid
Algorithm Swap           Strategic Change
Change HLC               Strategic Change
Try Again                System


FIG. 3.5. Response Ontology

3.1.4 Intra-ontology linkages

The three ontologies (Indications, Failure, and Response) are connected by inter-ontology links. Core nodes in the Indications ontology connect to nodes in the Failure ontology. Many nodes of the Failure ontology are connected to nodes in the Response ontology. The linkages form a chain of reasoning from the violated expectation to a suggestion that may correct the problem. Figure 3.6 shows such a path for a Q-learner faced with a dynamic grid world domain where the rewards have been moved. Note that this is a very simplified diagram with most of the nodes and links removed. When the Q-learner moves to the grid square that no longer contains the expected reward, the expectation of getting the reward in that square is violated. This activates the “Reward not received” node in the Indications ontology. That node is connected (via an inter-ontology link) to the “Model error” node of the Failure ontology. The “Model error” node has two children, “Procedure model error” and “Predictive model error”. The “Predictive model error” node has an inter-ontology link to the “Modify Predictive Model” node of the Response ontology. The “Modify Predictive Model” node has a child node, “Rebuild Models”, that is a concrete node for generating the “Rebuild models” suggestion. This set of inter- and intra-ontology linkages allows reasoning from the failed expectation of obtaining a reward to rebuilding the Q-learner's Q-table.
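The chain just described can be pictured as a small graph of typed nodes. The C++ sketch below writes that chain out explicitly; the node names are abridged from the example above, and the data layout is an assumption made for illustration, not the actual MCL implementation.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical node record: the ontology a node belongs to, its is-a parents
// within that ontology, and its inter-ontology links toward the next ontology.
struct OntologyNode {
    std::string ontology;                  // "indication", "failure", or "response"
    std::vector<std::string> isA;          // intra-ontology (is-a) parents (abridged)
    std::vector<std::string> interLinks;   // inter-ontology links
};

int main() {
    std::map<std::string, OntologyNode> nodes = {
        {"rewardNotReceived",     {"indication", {},                        {"modelError"}}},
        {"modelError",            {"failure",    {},                        {}}},
        {"predictiveModelError",  {"failure",    {"modelError"},            {"modifyPredictiveModel"}}},
        {"modifyPredictiveModel", {"response",   {},                        {}}},
        {"rebuildModels",         {"response",   {"modifyPredictiveModel"}, {}}},  // concrete suggestion
    };

    // The reasoning path for the Q-learner example: a violated reward expectation
    // activates rewardNotReceived and ends at the concrete rebuildModels suggestion.
    for (const std::string& step : {"rewardNotReceived", "modelError", "predictiveModelError",
                                    "modifyPredictiveModel", "rebuildModels"}) {
        std::cout << nodes[step].ontology << ": " << step << "\n";
    }
    return 0;
}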

3.2 Bayesian Inference

For each problem indication we want to be able to determine the most likely cause or causes of the failure. For each failure we want to be able to determine which responses are the most likely to correct the failure. Thus, given the sensors and the violated expectations, we would like to find the responses with the highest probability of working. (Actually we want to find the response with the highest utility, but for this discussion we will let all the costs be the same and focus just on the probabilities.)


FIG. 3.6. Example Ontology Connections

The three ontologies and their inter-ontology linkages (which form a directed graph) can be viewed as a Bayes net. Direct observations can be made of the sensors. By associating conditional probability tables (CPTs) with each node in the three MCL ontologies, we can use Bayesian inference to compute the needed probabilities for the responses. Figure 3.7 shows the addition of CPTs to a small section of the Response ontology. The Bayesian inference will be implemented using the Intel-contributed open source Probabilistic Network Library (PNL), available at https://sourceforge.net/projects/openpnl/.
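The sketch below shows, with toy numbers, the kind of computation this enables: given that an indication has been observed, infer the probability of the underlying failure and then the probability that a candidate response will help. The probabilities are illustrative only (they are not taken from the MCL ontologies), and the example is worked by hand rather than with PNL.

#include <cstdio>

int main() {
    // Illustrative conditional probability table entries for one indication node I,
    // one failure node F, and one response node R (all binary, toy values only).
    double pF          = 0.10;   // P(failure)
    double pI_given_F  = 0.90;   // P(indication | failure)
    double pI_given_nF = 0.05;   // P(indication | no failure)
    double pR_given_F  = 0.80;   // P(response corrects the problem | failure)
    double pR_given_nF = 0.20;   // P(response appears to help | no failure)

    // Bayes' rule: P(failure | indication observed).
    double pI         = pI_given_F * pF + pI_given_nF * (1.0 - pF);
    double pF_given_I = pI_given_F * pF / pI;

    // Marginalize over the failure node to score the response.
    double pR_given_I = pR_given_F * pF_given_I + pR_given_nF * (1.0 - pF_given_I);

    std::printf("P(failure | indication)            = %.3f\n", pF_given_I);
    std::printf("P(response effective | indication) = %.3f\n", pR_given_I);
    return 0;
}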


FIG. 3.7. Conditional probability tables for portion of MCL response ontology

3.3 Support for Reasoning Over Time

Errors can occur either once or multiple times. When the reward for a Q-Learner is moved, the learner's policy will drive it to repeatedly enter the square that used to contain the reward. If there is an expectation that the square will give a positive reward, that expectation will be repeatedly violated. But all of these violations are an indication of the same problem: that the reward has been moved. Even if MCL were invoked on the first occurrence of the unexpected reward and the Q-Learner immediately adjusted the learning and/or exploration rates in response, the old reward square would still be visited several times while learning the new policy. Thus, MCL needs to be able to associate multiple exceptions with the same error even while recovering from the initial exceptions.

The recovery process itself may give rise to additional errors. Following MCL's suggestion to increase the exploration rate, a Q-learner may experience a longer interval between rewards. Assigning additional resources to recover from a problem in one area may cause a scarcity of resources in another, triggering a resource deficiency exception. This exception and the original one should both be considered by MCL when determining further corrective suggestions for the host cognitive system. To correctly assess multiple indications, MCL needs to remember what expectation violations it has seen in the past and what suggestions were provided as recommended responses to those exceptions. Figure 3.8 shows the addition of previous violation information as part of the meta-knowledge of the enhanced Metacognitive Loop.

FIG. 3.8. Reentrant MCL

The mclFrame is the data structure for holding the context information that will be used by the enhanced MCL to allow reasoning over time. It consists of the MCL ontologies with the calculated probabilities (plus a few other pieces of information). One mclFrame will be created for each exception violation. Multiple frames can be merged in the Guide phase of MCL if the frames are determined to represent the same problem. This will be done by comparing the probabilities associated with each of the nodes in the Failure ontology. To provide an organized method of retaining and using mclFrames, each expectation will be associated with an expectation group. An expectation group can hold zero, one, or more expectations. Expectation groups can have a parent group so that hierarchies can be created. Figure 3.9 shows expectations arranged in a single expectation group and then in a hierarchy. Grouping expectations by the host system's functional categories should provide better problem resolution.

FIG. 3.9. Expectations arranged in single (a) and multiple groups (b)

An mclFrame can be associated with each expectation group to provide a memory of past violations for reasoning about errors over time. This is done by including the probabilities retained in the mclFrame of an expectation group when calculating the probabilities of any of the group's violated expectations.
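One plausible shape for these data structures is sketched below in C++. The field names and layout are assumptions made for illustration; only the terms mclFrame, expectation, and expectation group come from the design described above.

#include <map>
#include <memory>
#include <string>
#include <vector>

// Context record for reasoning about one (possibly recurring) problem over time.
struct MclFrame {
    std::map<std::string, double> nodeProbability;  // current probability per ontology node
    int violationCount = 0;                         // how many violations fed this frame
    long firstViolationStep = 0;                    // when the problem was first noted
};

struct Expectation {
    std::string description;
};

// Expectation groups can form a hierarchy and remember past violations via a frame.
struct ExpectationGroup {
    std::string name;
    ExpectationGroup* parent = nullptr;        // optional parent group
    std::vector<Expectation> expectations;     // zero, one, or more expectations
    std::unique_ptr<MclFrame> frame;           // memory of past violations for this group
};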

3.4 Recursive Invocation

MCL is a cognitive AI system like the host system it monitors. Like that host system, it receives perceptions from the environment and acts upon those perceptions to attempt to change the environment. But, while the host system is situated in the real world (or a simulation of it), MCL's environment is the host system. Whatever constitutes the host system's environment, MCL is only aware of the shadows of that environment as projected by the exceptions and sensors of the host system. MCL's actuators are the suggestions it passes to the host system.

Like any other AI system, MCL is susceptible to perturbation when its environment (the host system) changes unexpectedly. And like any other AI system, we should be able to improve the performance of MCL in times of perturbation by invoking MCL to note the problem indications, assess those indications, and guide MCL to a solution. Figure 3.10 shows a meta-MCL monitoring the operation of an MCL that is monitoring a cognitive agent. The environment of the meta-MCL is the MCL agent, and the environment of the cognitive agent forms the meta-environment of the meta-MCL.

mclFrames provide the mechanism for handling multiple exceptions in MCL, and they also serve the same function in the recursive use of MCL. The mclFrames of the meta-MCL are separate and distinct from the mclFrames used with the exceptions and exception groups of the MCL monitoring the cognitive agent.


FIG. 3.10. MCL providing metacognitive services to MCL

While allowing MCL to monitor itself should improve its operation over time, there is also the possibility of introducing major problems.

Excessive resource consumption: If the recursive MCL monitors many expectations, or these expectations are written so that they are often violated, the recursive MCL could use a large amount of resources (i.e., memory and CPU). Recursive MCL should be limited, or done only as a low-priority task when resources are not needed elsewhere.

Infinite regress: If MCL can be used to improve MCL, then MCL can be used to improve that MCL, and so on. It is expected (but by no means proved) that layering on more and more MCLs will be subject to diminishing returns. MCL regression will be capped at one level.

Destructive changes: Having MCL make changes to MCL could be described as letting a surgeon cut on his own brain. Changes should be limited in scope to prevent catastrophic failure.

3.5 C Language Application Program Interface

The original MCL implementations were hand-crafted and tightly bound to the host system. To provide metacognitive support for a variety of systems, a clean delineation is needed between the host system and the metacognitive monitor. An initial API has been created for C++ (the language in which the new MCL is being constructed). This interface will be used as the basis for a C language API. This will actually require only a few changes to the C++ API but will allow more applications to use MCL. The C language API (as well as the C++ API) will be documented with examples of its use.
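To give a sense of scale, a C-level interface of the kind described here might look something like the header sketched below. The function names and signatures are hypothetical, invented for this illustration; the actual API is the one being derived from the existing C++ interface.

/* Hypothetical sketch of a C-level MCL interface; not the actual API. */
#ifndef MCL_API_SKETCH_H
#define MCL_API_SKETCH_H

typedef struct mcl_handle mcl_handle;   /* opaque MCL instance */

mcl_handle *mcl_open(const char *host_system_name);
void        mcl_close(mcl_handle *mcl);

/* Declare a sensor and an expectation about its allowable values. */
int mcl_declare_sensor(mcl_handle *mcl, const char *sensor_name);
int mcl_declare_expectation(mcl_handle *mcl, const char *sensor_name,
                            double lower_bound, double upper_bound);

/* Report a new sensor reading; returns nonzero when MCL has a suggestion ready. */
int mcl_update_sensor(mcl_handle *mcl, const char *sensor_name, double value);

/* Retrieve the current concrete suggestion, e.g. "rebuild_models". */
const char *mcl_get_suggestion(const mcl_handle *mcl);

#endif /* MCL_API_SKETCH_H */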

3.6 Extending the API to Multiple Languages

The open source Simplified Wrapper and Interface Generator project, SWIG (www.swig.org), provides API generators that take a C (or C++) language API. These generators create APIs for more than a dozen languages (Allegro CL, C#, Chicken, Guile, Java, Modula-3, Mzscheme, OCAML, Perl, PHP, Python, Ruby, and Tcl). In addition to the APIs created using SWIG, an API will be created that allows use of MCL with the SOAR general cognitive architecture (sitemaker.umich.edu/soar/home). SOAR has a large community of interest with its own newsletters and conferences. Giving them easy access to MCL will allow many applications to receive the benefits of metacognitive monitoring and control. These APIs will be documented with examples of their use.
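For a C API, the SWIG input needed to generate such wrappers is small. The interface file below is a minimal example of real SWIG syntax; the header name mcl_api.h is a placeholder, not the actual MCL header.

// mcl.i -- minimal SWIG interface file (mcl_api.h is a placeholder name)
%module mcl

%{
#include "mcl_api.h"
%}

%include "mcl_api.h"

Running, for example, "swig -python mcl.i" or "swig -java mcl.i" then generates the wrapper source for that target language, which is compiled and linked against the MCL library.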

Chapter 4

METHODOLOGY

The primary measure for the success of the proposed research is how well the enhanced MCL improves the performance of the host system. This chapter describes the method of testing for that and other criteria, as well as the problem domains that will be used in the testing.

4.1 Evaluation Criteria

The hypotheses driving this research proposal are (1) that MCL augmented by ontologies and Bayesian inference provides cognitive systems a solution to the brittleness problem when the environment changes in unexpected ways; (2) that MCL is efficient, both in terms of operating costs and in the effort to add MCL to the host system; and (3) that the MCL solution is broadly applicable across a variety of domains, implementation languages, and operating systems. In this section the evaluation criteria for each of the hypotheses are presented. The next section will describe the problem domains that will be used in testing.


4.1.1 Effectiveness in Improving Host System Operation

To evaluate the effectiveness of MCL in improving the operation of the host system, base, optimized, and MCL-enhanced versions of the host systems will be compared. The host systems used in the evaluation are grid world reinforcement learners (described in Sections 4.2.1 and 4.2.2) and the Bolo tank simulation (Section 4.2.3). Average reward will be used on periodic grid worlds, while number of steps to goal will be used with episodic grid worlds. For the Bolo domain the time to complete the task will be used. The base measurement will be done without any tuning of the cognitive system to the domain. This base value will be used to measure the improvement when the host system is optimized (e.g., for Q-learning, selecting the best alpha and epsilon values). The MCL-enhanced system will also be compared to the base system to make sure that it does indeed improve performance. It will then be compared to the optimized system to see if MCL improves the system beyond what would normally be done for a system. To determine if any improvement is statistically significant, the unpaired t test will be used.
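For reference, for two groups of runs of sizes n_1 and n_2 with sample means \bar{x}_1, \bar{x}_2 and sample variances s_1^2, s_2^2, the unpaired t statistic takes the standard pooled-variance form

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}},
\qquad
s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2},

with n_1 + n_2 - 2 degrees of freedom.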

4.1.2 Additional Computational Resources Required

The CPU time, wall clock time, working set size, and partition size will be measured for the base, optimized, and MCL-enhanced versions of the host systems. The unpaired t test will be used to determine if any additional resource usage of the MCL system is nominal or significant.

4.1.3 Implementation Effort

The number of source lines of code will be counted for the base, optimized, and MCL-enhanced versions of the host systems. Many of the lines that need to be added to support MCL are the same (or nearly the same) in most implementations. The number of these “boiler plate” lines will be counted separately from the custom code.

4.1.4 Breadth of Deployment

The metacognitive loop is to be available on multiple platforms for multiple computer languages. Table 4.1 shows the platforms and languages that MCL will be tested on to ensure that MCL can be widely used. The TBD entries for OS X and Solaris reflect the dependency on the Intel-originated Bayesian inference library, PNL. While it is hoped that the PNL library will work on non-Intel processors, this has not yet been attempted.

Table 4.1. MCL Platforms and Languages

Platform  C    C++  Java  Python  Soar
Windows   Yes  Yes  Yes   Yes     Yes
Linux     Yes  Yes  Yes   Yes     TBD
Mac OS X  TBD  TBD  TBD   TBD     —
Solaris   TBD  TBD  TBD   TBD     —

4.2 Evaluation Domains

Several cognitive systems will be augmented with the enhanced MCL to demonstrate increased resilience to brittleness due to changes in the environment. These include domains investigated in the initial MCL literature and new domains.

4.2.1 Chippy

The Chippy grid world (Anderson et al. 2006) is an 8 by 8 square matrix as shown in Figure 4.1. An agent (in this case a chipmunk) can move in the four cardinal directions from square to square. An attempt to move off the board from one of the edge squares leaves the agent in the same square. The lower left (R1) and upper right (R2) squares provide rewards and transport the agent (chipmunk) to the opposite corner. The agent starts in one of the center squares and continues to move (and occasionally transport) until the simulation is stopped.

FIG. 4.1. An 8x8 “Chippy” grid world with two rewards

With R1 = 10 and R2 = -10, Figure 4.2 shows the policy learned by a Q-Learner in a Chippy grid world after 1,000 moves. Since only 2 of the 64 squares contain a reward, the Q-Learner makes many moves (on average about 98) before even seeing the first reward so that learning can begin. By about 5,000 moves, an optimal policy has been learned (Figure 4.3) which achieves a reward of 10 every 14 moves. Note that a large portion of the grid remains unexplored. Even after a million moves (Figure 4.4), it is possible that some squares may never be visited.
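For reference, the learner referred to here applies the textbook one-step Q-learning update after taking action a in state s, receiving reward r, and arriving in state s':

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where \alpha is the learning rate and \gamma is the discount rate; the specific parameter values used are given in the perturbation experiments below.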

FIG. 4.2. A Chippy policy after 1,000 moves

FIG. 4.3. A Chippy policy after 5,000 moves

FIG. 4.4. A Chippy policy after 1,000,000 moves

Perturbation. Perturbation is introduced into the Chippy Grid World by swapping the values of the two goal squares (R1 and R2). Using R1 = 10 and R2 = -10 as above, Figure 4.5 shows the average rewards earned by a Q-Learner in the Chippy grid world. With the standard parameters (α = 0.5, γ = 0.9, ε = 0.05), Q-Learning produces a policy that converges in about 5,000 steps, as can be seen by the flattening of the curve. From that point onward it gets a reward of 10 every 14 steps, plus an occasional exploratory move. If this were to remain a static world, the exploration rate could be lowered to zero and we could achieve a slightly higher average reward (7.1). Keeping some exploration proves very useful if the world changes. In Figure 4.6 the rewards of (10, -10) are changed to (-10, 10) at step 10,000. The Q-Learner continues to make adjustments to its policy based on the rewards received and eventually achieves a new policy with a reward of 10 every 14+ moves. If, however, we turn off exploration once the optimum policy is learned (to get the extra reward), then once the perturbation occurs the best we can do is learn a very sub-optimal (but at least positive) policy, as shown in Figure 4.7.

FIG. 4.5. Q-Learning in the Chippy Grid World (learning rate = 0.5, discount rate = 0.9, exploration = 0.05; x axis = steps, y axis = average reward)

FIG. 4.6. Q-Learning before and after perturbation (annotations: “What we learned in 5,000 steps”; “Takes almost 10,000 to relearn”)

4.2.2 Windy Grid World

The windy grid world (Sutton & Barto 1998) is a 7×10 matrix as shown in Figure 4.8. An agent can move in the four cardinal directions from square to square. Attempting to move off the board from one of the edge squares leaves the agent in the same square. The starting (0,3) and goal (7,3) squares are labeled “S” and “G” respectively. Each move receives a reward of −1 until the goal is reached. Movement is affected by a “northerly wind” that offsets movement one or two squares upward (as indicated by the single and double arrows).

FIG. 4.7. Q-Learning with exploration rate set to 0 after policy learned (annotations: “Turning off exploration increases rewards once the optimum policy has been learned”; “But not after a perturbation: no exploration means a very long time to learn the new optimum policy”)

Figure 4.9 shows a path from the start to the goal and demonstrates how the winds offset movement. Movement right (east) from the start is unaffected until point a is reached. If there were no wind, then another east move would go to b, but instead the movement is to c. Another eastern action (with a northerly shift) moves to d. Here the wind is even stronger, causing a two-space shift. Moving east from d goes to e, which is only a single upward shift because it is limited by the edge of the world. The path to the goal continues east to the upper right square f. Now, unopposed by the wind, four southern actions lead to square g. No wind alters the westward movement to h. A second westward action (with a northern offset) reaches the goal. This path (which is the shortest possible) takes 15 moves.

Unless placed on an edge, the goal in a grid world should be accessible from the four cardinal directions. The wind offset alters the spaces that lead to the goal. Figure 4.10 shows the two squares that lead to the goal and the direction that leads there. It also marks (with an “X”) the seven squares that cannot be reached.

Figure 4.11 shows the number of episodes completed increasing (the length of the path taken from S to G decreasing) for a SARSA reinforcement learner. Figure 4.12 shows the policy at the end of the 170 episodes. Figure 4.13 shows the optimum policy for the Windy Grid World. Goal, near-goal, and inaccessible squares should all be discernible in any reasonably complete policy learned for the windy grid world.

FIG. 4.8. The windy grid world with both single and double offsetting columns

Perturbation. Perturbation can be added to the Windy Grid World by changing the strength and direction of the “winds”. The Seasonal Windy Grid World varies the winds according to a fixed repeating pattern given in Table 4.2. The normal Windy Grid World is the “summer” season, with strong winds from the south. The “winter” season reverses the direction of those winds. The “spring” and “fall” seasons have only unit-one winds where the Windy Grid World has its unit-two winds. The two “equinox” seasons have no winds. The number of steps that the world spends in each season is called the rotational speed.
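A small sketch of one way the current season could be derived from the step count and the rotational speed is given below; this is an assumption about the implementation, since the proposal only defines the two terms.

#include <array>
#include <string>

// Returns the season for a given step (step >= 0), cycling through the six
// seasons of Table 4.2, each lasting rotationalSpeed steps.
std::string currentSeason(long step, long rotationalSpeed) {
    static const std::array<std::string, 6> cycle = {
        "summer", "equinox", "fall", "winter", "equinox", "spring"};
    return cycle[(step / rotationalSpeed) % cycle.size()];
}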

4.2.3 WinBolo

WinBolo (Morrison 2006) is a networked simulation game that has multiple players, alone or in teams, driving tanks on an island world (Figure 4.14). The players explore the island, capture resources, and attack other players' tanks.

FIG. 4.9. A 15 step path from the start to the goal

FIG. 4.10. The two moves that lead to the goal and the seven squares that can not be entered

FIG. 4.11. SARSA exploration of the Windy Grid World showing cumulative moves over multiple episodes (episodes completed plotted against time steps)

FIG. 4.12. A Windy grid world policy learned by SARSA after 170 episodes

FIG. 4.13. Windy grid world optimum policy with Q values. The best path from the start to the goal is underlined.

Table 4.2. Wind Speed and Directions in the Seasonal Windy Grid World

    Season     Strength   Direction
    Summer     Strong     South
    Equinox    None       -
    Fall       Weak       North
    Winter     Strong     North
    Equinox    None       -
    Spring     Weak       South


FIG. 4.14. A WinBolo console with menus, actions, primary and secondary displays.

island, capture resources, and attack other players’ tanks. In tournament play, the goal is to have the tank capture and hold the most resources in a time-limited contest. WinBolo was derived from Bolo, a MAC 68K game that was, in turn, inspired by an older (two player) video game. Each WinBolo player runs a copy of WinBolo that connects to a WinBolo server (running on either Windows or Linux). In single-player games, the same Windows machine is both the client and the server. A player uses the keyboard to drive the tank, turning left (O) and right (P), speeding up (Q), slowing down (A), shooting (space), and laying mines (tab). The player drives over refueling bases to capture them (as well as pillboxes once they have been shot sufficiently to reduce their armor to zero). The object of the game is to capture all of the refueling

bases. Captured pillboxes can be relocated and used to defend the player's bases. The player can also build roads, bridges, and buildings, provided that enough trees have been harvested to obtain the raw materials. It is the complexities of deciding whether to attack or defend, to harvest or build, and to use speed or stealth, in combination with complex terrain and multiple agents, that make WinBolo a suitably rich environment for AI research. Version 1.15 of WinBolo (the latest version as of March 2007) will be used for this research.

WinBolo's API

WinBolo calls a program using its C language API a “brain.” The API is defined in a single file, “brain.h”, and described in a short text document (Morrison & Cheshire 2000). The API defines a single function, BrainMain(), that WinBolo calls, giving control (briefly) to the brain. It is called once for initialization (BRAIN_OPEN), multiple times during the course of a game (BRAIN_THINK), and then once more at termination (BRAIN_CLOSE). The brain is given the state of the world in a large C structure called BrainInfo. From this sensor information, the brain should decide on a plan of action and then modify certain elements of the interface structure to implement its actions. The brain code is compiled into a Windows DLL with a file type of “BRN” and is activated by choosing it from the “Brains” menu of the WinBolo user console. The BrainInfo structure contains information about the player's tank (location on a 65536 × 65536 plane, speed (0-64), direction (0-255), and others), the terrain near the tank (mapped onto a 256 × 256 grid) [1], and the location and status of nearby tanks, pillboxes, and bases. There is also static information such as the interface version number. To indicate an action, the brain makes changes to the BrainInfo structure; only certain changes are allowed. The most important variables are holdkeys and tapkeys, which are used to request single or multiple changes in direction and speed, or to shoot the gun. Setting items in the BuildInfo structure allows harvesting or building at the specified map coordinates. The brain can also send text messages to one or more players; this is used for coordination in multi-player (agent) tournaments and can be useful as a debugging tool.

[1] WinBolo comes with a built-in map, “Everard Island.” There are also other maps included in the download. More importantly, map editors are available for custom terrain maps.
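To make the calling convention concrete, the fragment below sketches the overall shape of a brain. Only BrainMain(), the three call reasons, BrainInfo, and the holdkeys and inboat fields come from the description above and Figure 5.5; the operation field, the KEY_ACCELERATE constant, the exact signature, and the return value are assumptions standing in for whatever brain.h actually defines.

#include "brain.h"   /* WinBolo's brain interface; definitions assumed below */

/* A minimal sketch of a brain entry point, not a working WinBolo brain. */
extern "C" int BrainMain(BrainInfo *brainInfo)
{
    switch (brainInfo->operation) {       /* assumed field naming the reason */
    case BRAIN_OPEN:                      /* once, when the brain is loaded  */
        /* allocate and initialize agent state here */
        break;

    case BRAIN_THINK:                     /* many times during the game      */
        /* read sensors from brainInfo, pick an action, then request it by
           setting key bits; KEY_ACCELERATE is a hypothetical constant       */
        if (brainInfo->inboat)
            brainInfo->holdkeys |= KEY_ACCELERATE;
        break;

    case BRAIN_CLOSE:                     /* once, at termination            */
        /* release agent state here */
        break;
    }
    return 0;                             /* assumed success value           */
}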

Perturbation

There is a small amount of randomness inherent in the WinBolo world, introduced by network and processing speeds. Larger-scale perturbation will be added by training an agent to perform on one Bolo map and then testing it on another. A WinBolo map allows (almost) complete customization of the WinBolo world in terms of tank start points, terrain features (forest, rivers, buildings, etc.), and the location and strength of pillboxes. An example of such a perturbation is to train a WinBolo tank on a map where all the pillboxes are strength 0 and then to deploy the tank on a map where the pillboxes are strength 1.

Chapter 5

PRELIMINARY RESULTS

I joined the UMBC/UMCP MCL working group in the spring of 2006. This group included Tim Oates and Matt Schmill of UMBC and Don Perlis, Mike Anderson, Darsana Josyula, and others from UMCP. My initial assignment was porting an MCL-enhanced Bolo agent from Linux to Windows and constructing WinBolo maps that would test the agent's response to perturbation. In the summer and fall of 2006, I joined the others in producing the Indications, Failure, and Response ontologies and in preparing papers that discussed using them with MCL. These are the ontologies presented in the Technical Approach chapter. As Matt Schmill's implementation of an ontology-based MCL progressed, I provided the Windows port and assisted in the detection and correction of problems and the implementation of missing features. Along with assisting in the group efforts on the ontologies and the WinBolo domain, I worked on creating programs for evaluating MCL in the Chippy and Windy World domains. The next three sections discuss the progress made in constructing the test domains and the initial testing of MCL with Bayesian inference for the Chippy grid world.


5.1  Grid World Implementation

The Methodology chapter of this proposal describes the test domains that will be used. The Chippy and Windy grid worlds have been implemented in C++. Both the baseline and non-MCL optimized versions of these agents have been created. The optimized versions were found by varying the learning parameters across a wide range and then using the parameters that produced the highest reward after perturbation. The optimization effort was carried out by varying the learning rate (Figure 5.1), the exploration rate (Figure 5.2), and the discount rate (Figure 5.3). In each case a single parameter was changed, leaving the other two parameters at their nominal values (α = 0.5, γ = 0.9, ε = 0.05). The total reward received after perturbation is given in Table 5.1, with the value for the best parameter highlighted. The learning rate (α) was best at 0.9, with all of the high values performing better than the low ones; a higher learning rate allowed quicker replacement of the old Q values. The exploration rate (ε) that returned the highest reward was 0.06, which is not far from the nominal value of 0.05. The best path in Chippy is the same before and after perturbation, although the direction of travel changes. Too low an exploration rate keeps the agent from learning the new direction quickly, and too high a value prevents exploiting the optimum path once it is learned. The best discount rate (γ), at 0.6, was much lower than the nominal 0.9. The reward earned with (α = 0.5, γ = 0.6, ε = 0.05), 5,333, was the best of the post-perturbation rewards. For Chippy, lowering the discount rate improved post-perturbation performance more than changing the learning or exploration rate did.
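For reference, these three constants appear in the standard Q-Learning update and ε-greedy action choice, sketched below in simplified form; the state encoding, array sizes, and helper names are illustrative rather than taken from the actual Chippy agent.

#include <algorithm>
#include <cstdlib>

const int STATES = 64, ACTIONS = 4;      // sizes are illustrative only
double Q[STATES][ACTIONS];

// Epsilon-greedy action selection: explore with probability epsilon,
// otherwise take the action with the highest Q value.
int chooseAction(int s, double epsilon)
{
    if ((double) std::rand() / RAND_MAX < epsilon)
        return std::rand() % ACTIONS;
    int best = 0;
    for (int a = 1; a < ACTIONS; ++a)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}

// One-step Q-Learning update with learning rate alpha and discount rate gamma.
void updateQ(int s, int a, double reward, int sNext, double alpha, double gamma)
{
    double maxNext = Q[sNext][0];
    for (int aa = 1; aa < ACTIONS; ++aa)
        maxNext = std::max(maxNext, Q[sNext][aa]);
    Q[s][a] += alpha * (reward + gamma * maxNext - Q[s][a]);
}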


FIG. 5.1. Effect on Chippy perturbation recovery of varying the learning rate

Table 5.1. Rewards for Chippy after perturbation with varying Q-Learning rates

    alpha   rewards     epsilon   rewards     gamma   rewards
    .1       619.64      .01      3320.31      .1     4580.51
    .2      3079.54      .02      3938.82      .2     4933.39
    .3      3916.66      .03      4225.87      .3     5227.81
    .4      4427.12      .04      4353.92      .4     5199.27
    .5      4587.84      .05      4486.38      .5     5256.53
    .6      4624.60      .06      4629.67      .6     5333.90
    .7      4809.90      .07      4595.64      .7     5099.67
    .8      4776.69      .08      4590.09      .8     4929.31
    .9      5123.45      .09      4618.18      .9     4589.14


FIG. 5.2. Effect on Chippy perturbation recovery of varying the exploration rate


FIG. 5.3. Effect on Chippy perturbation recovery of varying the discount rate

5.2  BoloSoar Implementation

An initial implementation of the WinBolo/Soar interface was done as part of course work for a class on Agent Architecture and Multi-Agent Systems. In a demonstration, a WinBolo tank controlled by a set of Soar rules found the solution to a small maze using a random walk (see Figure 5.4).

FIG. 5.4. WinBolo tank outside small maze

Connecting WinBolo to Soar required putting the WinBolo tank's sensor information onto Soar's input-link. A portion of the code to accomplish this is shown in Figure 5.5. Once the sensor information has been set, control is turned over to Soar, which applies the rules in its knowledge base and then puts actions on the output-link. Figure 5.6 shows the three Soar rules used to land the tank. There were a total of 36 rules defined in the random walk demonstration.

/* 3. Put tank information on ^input-link */
integer_to_input_link(pInputLink, &pspeed,   "speed",     brainInfo->speed);
integer_to_input_link(pInputLink, &pdir,     "direction", brainInfo->direction);
integer_to_input_link(pInputLink, &ptankx,   "tankx",     brainInfo->tankx);
integer_to_input_link(pInputLink, &ptanky,   "tanky",     brainInfo->tanky);
integer_to_input_link(pInputLink, &pinboat,  "inboat",    brainInfo->inboat);
integer_to_input_link(pInputLink, &pnewtank, "newtank",   brainInfo->newtank);

FIG. 5.5. C code to add WinBolo status information to Soar's input structure
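The helper integer_to_input_link() used in Figure 5.5 is the author's own; the sketch below shows one plausible way such a helper could be written on top of Soar's SML C++ client library, so the header name, the g_agent pointer, and the update-on-change policy are assumptions rather than the actual implementation.

#include "sml_Client.h"   // Soar Markup Language (SML) C++ client interface

extern sml::Agent *g_agent;   // assumed to be created when Soar is initialized

// Create an integer WME under ^input-link on first use; on later calls only
// touch working memory when the sensor value has actually changed.
void integer_to_input_link(sml::Identifier *pInputLink,
                           sml::IntElement **ppWme,
                           const char *attribute,
                           int value)
{
    if (*ppWme == 0)
        *ppWme = g_agent->CreateIntWME(pInputLink, attribute, value);
    else if ((*ppWme)->GetValue() != value)
        g_agent->Update(*ppWme, value);
}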

5.3  Chippy Agent With Ontology-Based MCL

The Indications, Failure, and Response ontologies have been created but are still being revised, as are the linkages between the ontologies and the conditional probability tables for inter- and intra-ontology links. These have progressed far enough for initial testing. A Q-Learning agent for the Chippy Grid World was enhanced (via the C++ MCL API) to set expectations and receive suggestions from MCL. Figures 5.7, 5.8, 5.9, and 5.10 show the code added to the agent for initialization, defining sensors, setting expectations, and implementing MCL's suggestions; a simplified view of how these calls fit into the agent's control loop is sketched below. Figure 5.11 shows the results for Chippy with and without MCL.
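As a rough orientation before the listings, the sketch shows where the calls from Figures 5.7 through 5.10 sit in the agent's step loop; the header name, the float sensor array, and chippyStep() itself are assumptions, and only the mclAPI::monitor() call and the responseVector type come from the listings.

#include "mcl_api.h"   // hypothetical header name for the MCL C++ API

// Simplified shape of one step of the MCL-enhanced Chippy agent (illustrative).
void chippyStep(int step, int old_x, int old_y, int new_x, int new_y, int reward)
{
    // ... the ordinary Q-Learning action selection and update happen here ...

    // Publish the step through the eight sensors registered in Figure 5.8
    // (reward0 and reward1 are filled in when those rewards are observed).
    float sensors[8] = { (float) step,
                         (float) old_x, (float) old_y,
                         (float) new_x, (float) new_y,
                         (float) reward, 0.0f, 0.0f };

    // Ask MCL whether any declared expectation was violated and walk through
    // its suggested responses (Figure 5.10 shows the real call).
    responseVector m = mclAPI::monitor(sensors, 8);
    for (responseVector::iterator rvi = m.begin(); rvi != m.end(); ++rvi) {
        // Interpret *rvi here, e.g. re-activate learning, rebuild models, or
        // revise the violated expectation, per the responses enabled in
        // Figure 5.7.
    }
}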


# ------------------------------------------------------
# landing phase
#   1. If still in the boat, speed up to get to shore
#   2. Once we reach shore, slow down
#   3. Once on shore and stopped, landing phase is done
# ------------------------------------------------------

sp {propose*lp-speed-up
   (state <s> ^io.input-link <il>)
   (<s> ^phase landing)
   (<il> ^inboat 1)
-->
   (<s> ^operator <o> +)
   (<o> ^name speed ^value increase)
}

sp {propose*lp-slow-down
   (state <s> ^io.input-link <il>)
   (<s> ^phase landing)
   (<il> ^inboat 0 -^speed 0)
-->
   (<s> ^operator <o> +)
   (<o> ^name speed ^value decrease)
}

sp {propose*lp-end-landing
   (state <s> ^io.input-link <il>)
   (<s> ^phase landing)
   (<il> ^inboat 0 ^speed 0)
-->
   (<s> ^operator <o> +)
   (<o> ^name end-landing)
}

FIG. 5.6. Soar rules to land tank


// 1. Introduce ourselves to MCL
mclAPI::initializeMCL("Chippy2", 0);

// 2. Define properties
mclAPI::setPV(PCI_INTENTIONAL,        PC_NO);
mclAPI::setPV(PCI_EFFECTORS_CAN_FAIL, PC_NO);
mclAPI::setPV(PCI_SENSORS_CAN_FAIL,   PC_NO);
mclAPI::setPV(PCI_PARAMETERIZED,      PC_YES);
mclAPI::setPV(PCI_DECLARATIVE,        PC_NO);
mclAPI::setPV(PCI_RETRAINABLE,        PC_YES);
mclAPI::setPV(PCI_HLC_CONTROLLING,    PC_NO);
mclAPI::setPV(PCI_HTN_IN_PLAY,        PC_NO);
mclAPI::setPV(PCI_PLAN_IN_PLAY,       PC_NO);
mclAPI::setPV(PCI_ACTION_IN_PLAY,     PC_NO);

mclAPI::setPV(CRC_IGNORE,              PC_YES);
mclAPI::setPV(CRC_NOOP,                PC_YES);
mclAPI::setPV(CRC_TRY_AGAIN,           PC_YES);
mclAPI::setPV(CRC_SOLICIT_HELP,        PC_NO);
mclAPI::setPV(CRC_RELINQUISH_CONTROL,  PC_NO);
mclAPI::setPV(CRC_SENSOR_DIAG,         PC_NO);
mclAPI::setPV(CRC_EFFECTOR_DIAG,       PC_NO);
mclAPI::setPV(CRC_ACTIVATE_LEARNING,   PC_YES);
mclAPI::setPV(CRC_ADJ_PARAMS,          PC_NO);
mclAPI::setPV(CRC_REBUILD_MODELS,      PC_YES);
mclAPI::setPV(CRC_REVISIT_ASSUMPTIONS, PC_NO);
mclAPI::setPV(CRC_AMEND_CONTROLLER,    PC_NO);
mclAPI::setPV(CRC_REVISE_EXPECTATIONS, PC_YES);
mclAPI::setPV(CRC_ALG_SWAP,            PC_NO);
mclAPI::setPV(CRC_CHANGE_HLC,          PC_NO);

FIG. 5.7. C++ code to initialize MCL interface for Chippy


// 3. Define the sensors
mclAPI::registerSensor("step");     // [0]
mclAPI::registerSensor("old_x");    // [1]
mclAPI::registerSensor("old_y");    // [2]
mclAPI::registerSensor("new_x");    // [3]
mclAPI::registerSensor("new_y");    // [4]
mclAPI::registerSensor("reward");   // [5]
mclAPI::registerSensor("reward0");  // [6]
mclAPI::registerSensor("reward1");  // [7]

// 4. Define the property values for the sensors
mclAPI::setSensorProp("step",    PROP_DT, DT_INTEGER);  // [0]
mclAPI::setSensorProp("old_x",   PROP_DT, DT_INTEGER);  // [1]
mclAPI::setSensorProp("old_y",   PROP_DT, DT_INTEGER);  // [2]
mclAPI::setSensorProp("new_x",   PROP_DT, DT_INTEGER);  // [3]
mclAPI::setSensorProp("new_y",   PROP_DT, DT_INTEGER);  // [4]
mclAPI::setSensorProp("reward",  PROP_DT, DT_INTEGER);  // [5]
mclAPI::setSensorProp("reward0", PROP_DT, DT_INTEGER);  // [6]
mclAPI::setSensorProp("reward1", PROP_DT, DT_INTEGER);  // [7]

mclAPI::setSensorProp("step",    PROP_SCLASS, SC_TEMPORAL);  // [0]
mclAPI::setSensorProp("old_x",   PROP_SCLASS, SC_SPATIAL);   // [1]
mclAPI::setSensorProp("old_y",   PROP_SCLASS, SC_SPATIAL);   // [2]
mclAPI::setSensorProp("new_x",   PROP_SCLASS, SC_SPATIAL);   // [3]
mclAPI::setSensorProp("new_y",   PROP_SCLASS, SC_SPATIAL);   // [4]
mclAPI::setSensorProp("reward",  PROP_SCLASS, SC_REWARD);    // [5]
mclAPI::setSensorProp("reward0", PROP_SCLASS, SC_REWARD);    // [6]
mclAPI::setSensorProp("reward1", PROP_SCLASS, SC_REWARD);    // [7]

FIG. 5.8. C++ code to define the sensors for Chippy using the MCL API


// 5. Define the expectation group.
//    We will add the expectations when we get the rewards.
mclAPI::declareExpectationGroup((void *)this);

// Set reward expectation 0 or 1
char sensor_name[15];
sprintf(sensor_name, "reward%d", index);
expected[index] = reward;
mclAPI::declareExpectation((void *)this, sensor_name,
                           EC_MAINTAINVALUE, (float) reward);

FIG. 5.9. C++ code to define the expectations for Chippy using the MCL API

// 5. Tell MCL what we know
responseVector m = mclAPI::monitor(sensors, 8);

// 6. Evaluate the suggestions from MCL
if (m.size() > 0) {
    int q=1;
    for (responseVector::iterator rvi = m.begin(); rvi!=m.end(); rvi++) {
        cout
