
Saliency and Task-Based Eye Movement Prediction and Guidance

by

Srinivas Sridharan

A dissertation proposal submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology, February 2015

B. THOMAS GOLISANO COLLEGE OF COMPUTING AND INFORMATION SCIENCES
ROCHESTER INSTITUTE OF TECHNOLOGY
ROCHESTER, NEW YORK

CERTIFICATE OF APPROVAL

Ph.D. DEGREE PROPOSAL

The Ph.D. Degree Proposal of Srinivas Sridharan has been examined and approved by the dissertation committee as satisfactory for the dissertation required for the Ph.D. degree in Computing and Information Sciences.

Dr. Reynold J Bailey, Dissertation Advisor

Coordinator Ph.D. Degree Program

Dr. Joe M Geigel

Dr. Anne Haake

Dr. Linwei Wang


Saliency and Task-Based Eye Movement Prediction and Guidance

by Srinivas Sridharan

Submitted to the B. Thomas Golisano College of Computing and Information Sciences in partial fulfillment of the requirements for the Doctor of Philosophy Degree at the Rochester Institute of Technology

Abstract

The ability to predict and guide viewer attention has important applications in computer graphics, image and scene understanding, object detection, visual search, and training. Human eye movements have long interested researchers because they provide insight into the cognitive processes involved in task performance, and because they help reveal what guides viewer attention in a scene. It has been shown that saliency in the image, scene context, and the task at hand play a significant role in guiding attention. Many computational models have been proposed to predict the regions in a scene that are most likely to attract human attention. These models primarily deal with bottom-up visual attention and typically involve free viewing of the scene. In this proposal we would like to develop a more comprehensive computational model of visual attention that uses scene context, scene saliency, the task at hand, and eye movement data to predict future eye movements of the viewer. We would also like to explore the possibility of guiding viewer attention about the scene in a subtle manner based on the predicted gaze obtained from the model. Finally, we would like to tackle the challenging inverse problem: inferring the task being performed by the viewer based on scene information and eye movement data.


Contents

List of Tables
List of Figures

1 Introduction

2 Saliency and Task Based Eye Movement Prediction
  2.1 Problem Definition
  2.2 Research Objectives and Contributions
  2.3 Background and Related Work
    2.3.1 Bottom-up Saliency Based Visual Attention
    2.3.2 Top-Down Cognition Based Visual Attention
  2.4 Proposed Approach
    2.4.1 Scene Context Extraction
    2.4.2 Saliency Map Generation
  2.5 Comprehensive Model
    2.5.1 Training Phase
    2.5.2 Testing Phase
  2.6 Evaluation Measurement
    2.6.1 Kullback-Leibler (KL) Divergence
    2.6.2 Normalized Scanpath Saliency (NSS)
    2.6.3 Linear Correlation Coefficient (LCC)

3 Adaptive Subtle Gaze Guidance Using Estimated Gaze
  3.1 Problem Definition
  3.2 Research Objectives and Contributions
  3.3 Background and Related Work
    3.3.1 Subtle Gaze Direction
  3.4 Proposed Approach
    3.4.1 Adaptive Subtle Gaze Direction Using Estimated Gaze
  3.5 Evaluation
    3.5.1 User Study

4 Task Inference Problem
  4.1 Problem Definition
  4.2 Research Objectives and Contributions
  4.3 Background and Related Work
  4.4 Approach
  4.5 Evaluation
    4.5.1 User Study

5 Timeline

6 Conclusion

A Eye Tracking Datasets

Bibliography

List of Tables

A.1 Eye tracking datasets over still images. The D, T, and d columns stand for viewing distance in centimeters, stimuli presentation time in seconds, and screen size in inches, respectively. Reproduced from [1]


List of Figures

2.1 (A) Information preserved by the global features for two images. (B) The average of the output magnitude of the multiscale-oriented filters on a polar plot. (C) The coefficients (global features) obtained by projecting the averaged output filters onto the first 20 principal components. (D) Noise images with filtered outputs at 1, 2, 4, and 8 cycles per image, representing the gist of the scene and maintaining the spatial organization and texture characteristics of the original image. The texture contained in this representation is still relevant for scene categorization (e.g., open, closed, indoor, outdoor, natural, or urban scenes). Reproduced from [2]

2.2 Schematic representation of the Koch and Ullman model for computing saliency using primitive feature maps and the center-surround neurophysiological properties of the human eye.

2.3 Flowchart of the model developed by Itti for computing a saliency map based on the Koch and Ullman model, showing the filtering process, the extraction of feature maps, center-surround normalization, and the methods used to combine feature maps into the saliency map. Reproduced from [3]

2.4 Schematic diagram of the model for predicting task-based eye movements. The training phase shows the selected training images, eye tracking data on the training images, task-based feature extraction, the saliency map for each image, and training fixations extracted from the eye tracking data. All features are combined into a training feature set, with PCA/ICA reduction applied if necessary, and a Trainer is trained on this feature set. The second half shows the testing phase, where the corresponding testing features are extracted and the Trainer is used to predict the eye position, which is then compared to testing fixations serving as ground-truth data.

3.1 A mammogram image. The large red circle shows the area marked by the expert as an irregularity.

3.2 Hypothetical image with current fixation region F and predetermined region of interest A. The inset illustrates the geometric dot product used to compute θ.

3.3 Gaze distributions for an image under static and modulated conditions. Input image (top). Gaze distribution for the static image (bottom left). Gaze distribution for the modulated image (bottom right). White crosses indicate locations preselected by researchers for modulation.

3.4 The image on the left is viewed by the subject when assigned the task of counting the number of deer in the scene; the red circles indicate the viewer's fixation data. The image on the right shows the corresponding task-based saliency map, highlighting task-relevant regions used to direct the viewer's attention.

4.1 Image A, a street view image with eye tracking data where the task given to the viewer was to locate the cars in the scene, and Image B, a similar image that can be classified as a street image and also contains cars, making it task relevant.

4.2 The experiment conducted by Yarbus in 1967. The top-left image shows the picture "Family and an unexpected visitor" together with the scanpaths of a subject for each task in the experiment while viewing the stimulus image.

4.3 (A) Eye movements of a subject given the task of checking the rear-view mirrors. (B) Eye movements of the subject given the task of checking the gauges on the dashboard. The red/green circles indicate the fixations made by the subject when performing the task, and the numbers inside the circles show the order of fixations. Note that the subject also gathers information about the road and the GPS while performing the task at hand.

Chapter 1

Introduction

Predicting human gaze behavior and guiding viewer attention in a given scene are very challenging tasks. Humans perform a wide variety of tasks such as navigation, reading, playing sports, and interacting with objects in the environment. Each task performed depends on input from the environment and from memory about the task. Attention research has been concerned with understanding the input stimuli, neural mechanisms, information processing, memory, and motor signals involved in task performance. Eye movements provide information about the regions attended in an image and give insight into the underlying cognitive processes [4]. Saliency in the image has been shown to guide attention, e.g., regions with high local contrast, high edge density, or bright colors (a bottom-up effect) [9, 3, 10]. Humans are also immediately drawn to faces or regions with contextual information (a top-down effect) [8]. Finally, the pattern of eye movements depends not only on the scene being viewed but also on the viewer's intent or assigned task [5, 6, 7]. Researchers continue to debate whether it is salient features, contextual information, or both that ultimately drive attention during free viewing (no task specified) of static images [11, 12, 13, 14]. There are many computational models that predict regions that are most likely to attract viewer attention in a scene. These computational models are designed based on bottom-up attention, top-down attention, or a combination of both. However, many of these models only consider free viewing and as such do not take into account the impact of any specific task on eye movements. In the proposed work we plan to:

1. Develop and evaluate a comprehensive model of human visual attention prediction that incorporates:
   • Scene context (GIST, SIFT, SURF, Bag of Words, etc.)
   • Bottom-up scene saliency
   • Task at hand
   • Eye movement data across multiple subjects

2. Develop and evaluate a novel adaptive approach to guide viewer attention about a scene that requires no permanent or overt changes to the scene being viewed, and has minimal impact on the viewing experience.

3. Develop a framework for task inference based on scene information and eye movement data. This framework attempts to differentiate eye movements made for task performance from eye movements made to gather information in the scene.

In the next three chapters each of these research objectives is explained in more detail; specifically, we provide the problem definition, background and related work, proposed approach, and evaluation measures. "Saliency and Task Based Eye Movement Prediction" is presented in Chapter 2, "Adaptive Subtle Gaze Guidance Using Estimated Gaze" is presented in Chapter 3, and the "Task Inference Problem" is presented in Chapter 4. The timeline for the proposed work is presented in Chapter 5. Chapter 6 presents the conclusion and highlights potential future work that is beyond the scope of this proposal. An appendix lists several research datasets (images and corresponding eye movements) that will be utilized over the course of this work.

Chapter 2

Saliency and Task Based Eye Movement Prediction

2.1 Problem Definition

Predicting the gaze behavior of a human in a given scene is a very challenging task. There are multiple factors that influence human gaze behavior. The salient features in the scene, the task at hand, and prior knowledge of the scene are some of the factors that strongly influence gaze behavior. Visual saliency based models predict regions of interest that attract the gaze of a subject based on image features such as contrast, color, orientation, etc. [15, 16, 17, 18]. There are other top-down computational models that combine saliency maps and scene context. Some top-down models use face detection, object detection, and image blobs along with visual saliency to gather visual attention details in the scene [19, 20, 21]. The task being undertaken has a very strong influence on the deployment of attention [5]. It has been shown that humans process visual information in a need-based manner. We look for things that are relevant for the current task and pay less attention to irrelevant objects in the scene. Researchers have shown that there is a high correlation between visual cognition and eye movements when dealing with complex tasks [22]. When subjects are asked to perform a visually guided task, their fixations are found to be on task-relevant locations. This finding was established using the "block-copying" task, where subjects were asked to assemble building blocks and it was shown that the subjects' eye movements reveal the algorithm used for completing the task [23]. Others have studied gaze behavior while performing tasks in natural environments such as driving, sports, and walking [6, 24, 22, 25]. The view of many is that both bottom-up and top-down factors combine to direct our attention, and many computational models have used Bayesian approaches to integrate top-down and bottom-up salient cues [26]. Eye-tracking technology helps to estimate the visual attention of a subject while performing a task. Eye trackers provide fixation and saccade information in real time that can give insight into top-down, task-based visual attention, while scene features provide the bottom-up saliency map. Many gaze prediction algorithms have been proposed based on image scene features and visual saliency maps [27, 28]. These computational models lack two key factors in gaze prediction: 1) they seldom account for the top-down visual attention that can be obtained by considering the scene's context, and 2) the training data used to develop these models were obtained during free viewing and so do not take into account the impact of specific tasks on eye movements. Hence, there is a need for a comprehensive computational model of human visual attention prediction that can identify the regions in the scene most likely to be attended for a given task at hand.

2.2 Research Objectives and Contributions

The goal of this aspect of the proposed work is to develop a comprehensive model of human visual attention prediction that incorporates scene context, bottom-up scene saliency, the task at hand, and eye movement data obtained across multiple subjects to build a task-based saliency map. The task-based saliency map will predict regions in the scene that attract the viewer's attention while performing an assigned task. The model is further trained to predict viewer gaze on new (related) stimulus images. Such a model will allow researchers to gain more insight into task-solving behavior and also to predict the task-solving approach under different input conditions. The model can then be used to understand a subject's gaze behavior for a given task and compare it with other tasks on similar stimuli. The proposed model will also help address the time-consuming burden of creating manual annotations of the regions of images used in perception-related experiments. Our proposed model will be able to aid people performing repeated image search tasks by suggesting regions of interest based on the predicted gaze. While gaze target prediction techniques are prone to false positives, they can still be very valuable in providing additional suggestions for viewing. For example, consider a radiologist searching for abnormal regions in a mammogram. At the end of the task, our prediction system can suggest other regions to look at which he/she might have missed. In this manner, the technology is seen as providing assistance rather than attempting to replace the expert. This model can also be used to study differences in visual attention between subjects. Hence, for a given task and gaze behavior it could be possible to differentiate experts from non-experts.

2.3 Background and Related Work

When we look around us, we perceive some objects in the scene to be more interesting than others. Certain objects in the scene "pop out" and grab our attention over others. The drawing of our attention in this fashion is termed "bottom-up" or "saliency-based" visual attention. Our focused attention can be thought of as a rapidly shifting spotlight, and the areas focused on are the salient regions in the scene. These salient regions can be represented as a 2-dimensional saliency map that captures the regions of high attention. However, human visual attention is not simply a feed-forward, spatially selective filtering process. There is also cognitive feedback to the visual system that focuses attention in a "top-down" manner. For example, there may be contextually relevant areas of the image (such as faces) that also draw our attention. Several computational models have been proposed to model bottom-up attention, top-down attention, or both in order to understand visual attention.

2.3.1 Bottom-up Saliency Based Visual Attention

Saliency-based attention models are classified based on their saliency computation mechanism. Most saliency-based models aim to highlight regions of interest that attract attention in the scene. Bottom-up visual saliency models can be broadly classified in several ways [29]:

• Cognitive Models: Models that closely relate the psychological and neurophysiological attributes of the human visual system to compute saliency. These models account for contrast sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions [15, 30].

• Bayesian Models: Models using Bayes' rule to detect objects of interest or salient regions by probabilistically combining extracted scene features with prior knowledge of the scene or scene context [17, 31].

• Decision-Theoretic Models: Visual attention is believed to produce decisions about the state of the scene being viewed such that there is an optimal decision based on minimizing the probability of error. Hence, salient features can be defined as the classes best recognized over all other visual classes available [32, 33].

• Information-Theoretic Models: Information-theoretic models define saliency as the regions that maximize the information sampled from a given scene. The most informative regions are selected from all possible regions available in the scene [34, 35].

• Graphical Models: Visual attention models computed from eye movements. The eye movements are treated as a time series, and the hidden variables influencing them can be modeled using Hidden Markov Models, Dynamic Bayesian Networks, and Conditional Random Fields [36, 18].

• Spectral Analysis Models: A digitized scene viewed in the spatial domain can be converted to the frequency domain, and saliency models are derived based on the premise that similar regions in the frequency domain imply redundancy. Such models are simpler to explain and compute but do not necessarily explain the psychological and neurophysiological attributes of the human visual system [37, 38, 39].

2.3.2 Top-Down Cognition Based Visual Attention

Top-down models, on the other hand, are goal-driven or task-driven, whereas bottom-up cues are mainly based on the characteristics of the visual scene [40]. Top-down visual attention is determined by cognitive factors such as knowledge of the scene, expectations, rewards, the task at hand, and goals. Bottom-up attention, being feed-forward, tends to be both involuntary and fast. Top-down attention, in contrast, is slow, task-driven, and voluntary; it is also referred to as a closed loop [41, 42]. Two major sources that influence top-down, cognition-based visual attention have been explored. The first is scene context: the layout of a scene has been shown to influence eye movements and visual attention; for example, people are naturally attracted to faces and to regions relevant to the scene. The second is the task: complex tasks such as driving or reading strongly influence when and where humans fixate in the scene. It has been proposed that humans are able to express interest in targets by changing the relative gains on the different basic features that attract attention [43]. For example, when asked to look for an object of a specific color, a higher gain is assigned to that particular color among the other colors available in the scene. A computational model was developed that integrates visual cues for target detection by maximizing the signal-to-noise ratio of target versus background [44]. An evolutionary algorithm was also developed to search the parameter space of the basic saliency model for target objects [20]. In contrast to gain adjustment for fixed feature detectors, other top-down attention models have been suggested in which preferred features are obtained by tuning the width of the feature detectors [45]. These models study the role of object features in visual search and are similar to object detection techniques in computer vision. However, they are based on human visual attention, whereas computer vision models use predefined feature templates for detecting/tracking cars, humans, or faces [46, 47].

2.4 Proposed Approach

Both top-down and bottom-up computational models provide a single saliency map that indicates the regions in the image most likely to be attended. These saliency maps are generated under free viewing, with feature detectors tuned for specific image attributes. A major drawback of such a saliency map is that the location with the highest saliency value does not necessarily translate to the region most attended; it has been shown that the majority of fixations fall on task-relevant locations [22]. It is also very difficult to predict the order of attention from a single saliency map. The saliency map does, however, provide a mask that eliminates locations likely to be least attended and highlights locations most likely to attract the viewer's attention. In this aspect of the proposed work we aim to develop a comprehensive model of human visual attention that uses scene context, the saliency map, the task at hand, and eye movement data to obtain a task-based saliency map. The model is further trained to test whether it is capable of predicting human fixations in other similar images for the same task.

2.4.1 Scene Context Extraction

Scene context plays a vital role in attracting visual attention to specific regions of a scene. Humans describe a scene or image with a high degree of accuracy even with viewing times as low as ~80 ms. This ability enables humans to capture enough information to obtain a rough representation or "gist" of the scene [48], and to quickly classify the scene, for example as indoor vs. outdoor, urban vs. rural, or natural vs. man-made. It has been shown that semantic associations play a vital role in guiding visual attention. When searching for shoes, for example, humans are more likely to look for them on the floor than on top of a table or on the ceiling [49, 50]. Several models utilizing low-level features have been presented to obtain the gist of a scene. A computational model was proposed that is based on the spatial envelope, using a low-dimensional representation of the scene. The model generates a multidimensional space, reduced by applying principal component analysis and independent component analysis, in which scenes sharing membership in semantic categories are projected [2]. Gabor filters have been applied to input images to extract a selected number of universal textons (obtained from the training set using K-means clustering) [51]. Researchers have also used biological center-surround features (receptive fields) from the orientation, color, and intensity channels to model gist [52]. Gist representation is well known in computer vision, as it provides global scene information that is especially useful for searching scene databases with many images. It has also been used to limit the region for object search in a scene rather than processing the scene in its entirety. The most important use of gist representation is in the modeling of top-down attention [53, 54]. In this proposal we use the gist of the scene [2] to obtain a low-dimensional representation that does not require explicit segmentation of image regions and objects. Gist refers to the meaningful information that an observer can identify from a glimpse of the scene. We use the gist description to include the semantic label of the scene together with a few objects and their surface characteristics and layout. It represents the global properties of the space that the scene subtends and does not necessarily include the individual objects that the scene contains. Every scene is defined by eight categories, namely naturalness, openness, expansion, depth, roughness, complexity, ruggedness, and symmetry. Each scene is described as a vector of meaningful values indicating the image's degree of naturalness, openness, roughness, expansion, mean depth, etc. The gist of the scene will help classify similar images and also provide a global low-dimensional representation for image groups. Figure 2.1 shows two representative sample images, a polar plot showing the average responses of multiscale-oriented filters on these images (global feature templates), the global features projected onto the first 20 principal components, and a low-frequency representation (noise image) representing the gist, which maintains the spatial organization and texture characteristics of the original image.

Figure 2.1: (A) This figure illustrates the information preserved by the global features for two images. (B) The average of the output magnitude of the multiscale-oriented filters on a polar plot. (C) The coefficients (global features) obtained by projecting the averaged output filters onto the first 20 principal components. (D) Noise images with filtered outputs at 1, 2, 4, and 8 cycles per image, representing the gist of the scene and maintaining the spatial organization and texture characteristics of the original image. The texture contained in this representation is still relevant for scene categorization (e.g., open, closed, indoor, outdoor, natural, or urban scenes). Reproduced from [2]

Local features within an image, as well as features shared between images, can be obtained after computing the gist for each image. Feature detection algorithms such as the Scale-Invariant Feature Transform (SIFT) [55], Speeded Up Robust Features (SURF) [56], Maximally Stable Extremal Regions (MSER) [57], and the Histogram of Oriented Gradients (HOG) can be used to identify key local features in the scene. Local feature detection and matching algorithms help identify regions that are similar within an image and also regions that are similar between classified images. This will further enable us to build region-based scene context from similar features and group them into labeled categories. A list of such categories (e.g., a bag of words) can be used to associate regions of the scene with a task at hand.
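As a rough illustration of this step, the sketch below detects and matches SIFT keypoints between two images using OpenCV. It is a minimal example rather than the proposal's actual pipeline; the image file names, the ratio-test threshold, and the brute-force matcher are placeholder choices.

```python
import cv2

# Placeholder image paths; any pair of related scene images would do.
img_a = cv2.imread("scene_a.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("scene_b.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute descriptors for each image.
sift = cv2.SIFT_create()
kp_a, desc_a = sift.detectAndCompute(img_a, None)
kp_b, desc_b = sift.detectAndCompute(img_b, None)

# Match descriptors across the two images and keep the more reliable matches
# (Lowe's ratio test); surviving matches indicate regions that are similar
# between the images and can seed a bag-of-words style grouping.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(desc_a, desc_b, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

print(f"{len(kp_a)} keypoints in A, {len(kp_b)} in B, {len(good)} good matches")
```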

2.4.2 Saliency Map Generation

Most attention models are directly or indirectly inspired by the physiological or neurophysiological properties of the human eye. The basic model proposed by Itti et al. [3] uses four assumptions. First, visual input is represented in the form of topographic feature maps. The feature maps are constructed based on the idea of a center-surround representation of features at different spatial scales and competition among features for visual attention. The second assumption is that these feature maps are combined to give a single local saliency measure for any location with respect to its neighborhood. Third, the maximum of the saliency map is the most salient location at a given time, and it also helps determine the next location for an attention shift. Fourth, attention is shifted to different parts of the stimulus based on the saliency map, and the order of attention shifts is given by the decreasing order of saliency in the map. Figure 2.2 shows the schematic representation proposed by Koch and Ullman, and Figure 2.3 shows the model proposed by Itti et al.

Figure 2.2: (a) Schematic representation of the Koch and Ullman model for computing saliency using primitive feature maps and the center-surround neurophysiological properties of the human eye.

Figure 2.3: (b) Flowchart of the model developed by Itti for computing a saliency map based on the Koch and Ullman model. This flowchart shows the filtering process involved, the extraction of feature maps, center-surround normalization, and the methods used to combine feature maps to obtain the saliency map. Reproduced from [3]

In the early model proposed by Koch and Ullman [58], low-level features of the visual system such as color, intensity, and orientation were computed to obtain a set of pre-attentive feature maps based on the retinal input to the eye. The activity of all these feature maps was combined for a given location, and this combination of feature maps provides the topographic saliency map. A simple winner-take-all network was designed to detect the most salient location. The second part of the figure (Figure 2.3) shows the schematic diagram used in that study, which builds on the Koch and Ullman architecture and provides a complete implementation of all processing stages. Multi-scale spatial images (eight spatial scales per channel) are computed, and the center-surround differences for each feature (3 features) are computed to obtain the local spatial feature maps (42 feature maps). A lateral inhibition scheme is used to initiate competition for saliency within each feature map. These individual feature maps are then combined to form a single "conspicuity map" for each feature type. The conspicuity maps are then combined to obtain a single topographic saliency map.
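To make the center-surround idea concrete, the following sketch computes a toy intensity-only saliency map from a Gaussian pyramid. It is only a simplified illustration under assumed scale choices, not the full 42-map Itti et al. implementation (which also uses color and orientation channels and iterative normalization).

```python
import cv2
import numpy as np

def toy_center_surround_saliency(image_bgr, centers=(2, 3), deltas=(3, 4)):
    """Intensity-only center-surround saliency (simplified illustration)."""
    intensity = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0

    # Gaussian pyramid: a multi-scale representation of the intensity channel.
    pyramid = [intensity]
    for _ in range(7):
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    h, w = intensity.shape
    saliency = np.zeros((h, w), np.float32)

    # Center-surround differences: a fine "center" scale minus a coarse
    # "surround" scale, accumulated over a few scale pairs.
    for c in centers:
        for d in deltas:
            center = cv2.resize(pyramid[c], (w, h))
            surround = cv2.resize(pyramid[c + d], (w, h))
            saliency += np.abs(center - surround)

    # Normalize the combined map to [0, 1].
    saliency -= saliency.min()
    if saliency.max() > 0:
        saliency /= saliency.max()
    return saliency

# Usage: saliency = toy_center_surround_saliency(cv2.imread("scene.jpg"))
```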

2.5 Comprehensive Model

We propose a comprehensive computational model of human visual attention prediction that can identify the regions in the scene most likely to be attended, using the scene context, the saliency map, the task at hand, and the eye movement data of the subject. A set of images from publicly available image databases is chosen, and the gist and saliency maps for these images are pre-computed. The task to be performed when viewing this set of images is determined ahead of time. Subjects' eye movements are recorded for these images while performing the given task for a specified period of time. The images are then randomly divided into training and testing datasets. Figure 2.4 shows the schematic representation of the proposed model. The model is divided into a "Training" and a "Testing" phase, and each phase is explained in detail below.

Figure 2.4: Schematic diagram of the model for predicting task-based eye movements. The training phase shows the selected training images, eye tracking data on the training images, task-based feature extraction, the saliency map for each image, and training fixations extracted from the eye tracking data. We combine all the features to obtain the training feature set and then perform PCA/ICA reduction if necessary to reduce the feature size. A Trainer is implemented to train on this feature set. The second half of the figure shows the testing phase, where the corresponding testing features are extracted as in the training phase and the Trainer is used to predict the eye position. This predicted eye position can then be compared to the testing fixations, which serve as ground-truth data.

2.5.1 Training Phase

The stimulus images are randomly separated into training and testing images. Fixation data is gathered from subjects looking at images in the training set using a remote eye-tracking device, and a fixation map (averaged across all subjects) is created. The eye tracking data is then split into two groups: the n initial fixations, which serve as a feature vector for the model, and the remaining fixations, which are used as data to train the model. The saliency map obtained from the saliency-based visual attention model is also used as a feature vector. The gist and local image-based features are also extracted and provided as inputs to the model. The task at hand is encoded as an independent variable of the model. Using the gist, local image features, saliency map, task at hand, and eye tracking data, the final feature vector is generated. This feature vector is of very high dimensionality, hence the feature space is reduced using techniques such as principal component analysis (PCA) or independent component analysis (ICA). A learner (linear model, neural network, Gaussian mixture, or support vector machine) is then trained on the reduced features for the stimulus images. The learning algorithm is thus trained on the saliency, the scene context (gist plus additional local features), and the n initial eye movements. The final learned model (Trainer) will assign weights to features based on the training fixations. To counter over-fitting, the training images are split randomly using an 80/20 rule: the model learns on 80% of the images in the training dataset and is tested on the remaining 20% of the images to re-parameterize the model.
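The sketch below illustrates one way this training pipeline could be wired together with scikit-learn: the per-image cues are concatenated, reduced with PCA, and fed to a regressor that predicts the next fixation coordinates. All array shapes, the random placeholder data, the choice of SVR, and the number of components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_images = 200  # placeholder size for the training split

# Placeholder feature blocks; real ones would come from gist, local features,
# the bottom-up saliency map, the encoded task, and the n initial fixations.
gist = rng.normal(size=(n_images, 512))
local_hist = rng.normal(size=(n_images, 256))      # e.g., bag-of-words histogram
saliency = rng.normal(size=(n_images, 1024))       # flattened, downsampled map
init_fix = rng.normal(size=(n_images, 6))          # x, y of the first 3 fixations
task_id = rng.integers(0, 4, size=(n_images, 1))   # encoded task label
target_fix = rng.normal(size=(n_images, 2))        # next fixation (x, y), ground truth

# Concatenate all cues into one high-dimensional vector per image, reduce with
# PCA, and train one regressor per gaze coordinate.
X = np.hstack([gist, local_hist, saliency, init_fix, task_id])
model_x = make_pipeline(StandardScaler(), PCA(n_components=50), SVR())
model_y = make_pipeline(StandardScaler(), PCA(n_components=50), SVR())
model_x.fit(X, target_fix[:, 0])
model_y.fit(X, target_fix[:, 1])

# Testing phase: extract the same features for held-out images and predict.
pred = np.column_stack([model_x.predict(X[:5]), model_y.predict(X[:5])])
print(pred)
```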

2.5.2 Testing Phase

The stimulus images that have not been used for training are used as testing images. Fixation data from several subjects is also gathered for these images using a remote eye-tracker. The eye tracking data is preprocessed and split into two groups: the n initial fixations, which serve as features for the model (as in the training phase), and the remaining fixations, which act as ground-truth data. As in the training phase, saliency-based features are extracted and the resulting saliency map is used as a feature for the model. The gist and local image-based features are also extracted as inputs. The final features are reduced in dimension using PCA/ICA and provided as input to the learned model. The output of the model is the predicted gaze position (point-based or region-based). This predicted gaze position is then compared to the ground-truth data (the remaining fixations).

2.6 Evaluation Measurement

Many attention and gaze prediction models are validated against eye tracking data from human observers. Eye movements provide an accessible way to understand the cognitive processes involved in image perception and how eye movements vary with task. We can compare the predicted gaze obtained from the model with the eye movement data obtained from a human observer viewing the scene. The evaluation can be classified as 1) point-based, 2) region-based, and 3) subjective evaluation. In the point-based approach, the predicted gaze points are compared with the ground-truth eye tracking gaze points and a distance measure is obtained. In the region-based approach, instead of evaluating a single gaze point, we compare the estimated gaze region with the region of fixations from multiple subjects. Subjective scores can also be obtained from experts to evaluate the estimated gaze on a Likert scale. However, subjective evaluation is time-consuming, error-prone, and not quantitative compared to methods 1 and 2. The following are the most widely used evaluation techniques in the literature.

2.6.1 Kullback-Leibler (KL) Divergence

KL divergence, also known as information divergence, is a measure of the difference between two probability distributions P and Q and is denoted $D_{KL}(P \parallel Q)$. In the context of saliency and gaze prediction it is used as a distance metric between distributions: P is the discrete probability distribution of the predicted gaze and Q is the ground-truth distribution. Models that predict human fixations well exhibit higher KL divergence, since human subjects fixate on a few regions with maximum model response and avoid most of the regions with lower response [59]. KL divergence is sensitive to any difference between the distributions and is invariant to reparameterizations, which does not affect the scoring.
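A minimal sketch of this measure, assuming both maps have been discretized and flattened into histograms (the epsilon regularization is an implementation convenience, not part of the definition):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) between two discrete gaze distributions.

    p: predicted gaze distribution (e.g., a normalized predicted saliency map).
    q: ground-truth fixation distribution. Both are flattened and renormalized.
    """
    p = np.asarray(p, dtype=np.float64).ravel()
    q = np.asarray(q, dtype=np.float64).ravel()
    p = p / p.sum()
    q = q / q.sum()
    # eps avoids log(0) and division by zero for empty histogram bins.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```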

2.6.2 Normalized Scanpath Saliency (NSS)

The normalized scanpath saliency is the value of the predicted gaze map at a given fixation position after the map has been normalized to zero mean and unit standard deviation: $NSS = \frac{1}{\sigma_S}\left(S(x, y) - \mu_S\right)$. NSS is computed once for each fixation, and subsequently the mean and standard error are computed across the set of NSS scores. An NSS value of 1 indicates that the subject's eye fixations fall in a region where the predicted gaze is one standard deviation above average, i.e., the predicted region is more likely to be fixated than any other region in the image [60]. An NSS value of 0 indicates that the model does not perform any better than randomly picking a fixation location in the scene.
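A short sketch of the computation, assuming the predicted map and the recorded fixations are already in the same pixel coordinate frame:

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency: mean normalized map value at fixated pixels.

    saliency_map: 2-D predicted gaze/saliency map.
    fixations: iterable of (row, col) fixation coordinates.
    """
    s = np.asarray(saliency_map, dtype=np.float64)
    s = (s - s.mean()) / (s.std() + 1e-12)  # zero mean, unit standard deviation
    return float(np.mean([s[r, c] for r, c in fixations]))
```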

2.6.3 Linear Correlation Coefficient (LCC)

The linear correlation coefficient measures the strength of a linear relationship between two variables. The LCC measure is widely used to compare two images for registration, feature matching, object recognition, and disparity measurement.

$$LCC(Q, P) = \frac{\sum_{x,y} \left(Q(x, y) - \mu_Q\right)\left(P(x, y) - \mu_P\right)}{\sqrt{\sigma_Q^2\, \sigma_P^2}} \tag{2.1}$$

In equation 2.1, P and Q represent the predicted fixation region and the ground-truth subject fixations in the region around x, y, respectively. $\mu$ and $\sigma^2$ represent the mean and variance of the pixel values in the region around x and y. The advantage of using LCC is that it is bounded, in comparison to KL divergence, and it is easier to compute than NSS or AUC. A correlation value of +1/−1 indicates a perfect linear relationship between the two variables, and a value of 0 indicates no correlation.
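A minimal sketch of this score in the standard Pearson form (equivalent to equation 2.1 up to the normalization constant), assuming the predicted map and the ground-truth fixation map are arrays of the same shape:

```python
import numpy as np

def lcc(q, p):
    """Linear correlation coefficient between a ground-truth fixation map q
    and a predicted gaze map p (both 2-D arrays of the same shape)."""
    q = np.asarray(q, dtype=np.float64).ravel()
    p = np.asarray(p, dtype=np.float64).ravel()
    qc, pc = q - q.mean(), p - p.mean()
    return float(np.sum(qc * pc) / (np.sqrt(np.sum(qc**2) * np.sum(pc**2)) + 1e-12))
```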

Chapter 3

Adaptive Subtle Gaze Guidance Using Estimated Gaze

3.1 Problem Definition

The previous chapter focused on the problem of gaze prediction. In this chapter we focus on the related problem of gaze guidance. When viewing traditional static images, the viewer's gaze pattern is guided by a variety of influences (bottom-up and top-down). For example, the pattern of eye movements may depend on the viewer's intent or task [5, 6]. Image content also plays a significant role: it is natural for humans to be immediately drawn to faces or other informative regions of an image [8]. Additionally, research has shown that our gaze is drawn to regions of high local contrast or high edge density [9, 10]. Although traditional images are limited to these passive modes of influencing gaze patterns, digital media offers the opportunity for active gaze control. The ability to direct a viewer's attention has important applications in computer graphics, data visualization, image analysis, and training. Existing computer-based gaze manipulation techniques, which direct a viewer's attention about a display, have been shown to be effective for spatial learning, search task completion, and medical training applications. We propose a novel mechanism for guiding visual attention about a scene. Our proposed approach guides the viewer in a manner that has minimal impact on the viewing experience. It also requires no permanent alterations to the scene to highlight areas of interest. Previous work on guiding visual attention typically involved having the researchers manually select the relevant regions of the scene. This process is slow and tedious. We propose to overcome this issue by combining our gaze guidance technique with our gaze prediction framework. While gaze prediction techniques are prone to false positives, they can still be very valuable in providing additional suggestions for viewing.

3.2 Research Objectives and Contributions

Our proposed gaze guidance mechanism will be developed with the following goals in mind:

• It should perform in real time.
• It should adapt to the image/scene content as well as the viewing configuration.
• It should adapt to the task assigned to the viewer.
• The technique should be subtle and have minimal impact on the viewing experience.

The proposed model is adaptive (in real time) in selecting task-relevant regions in the image based on the regions not previously fixated by the user and the task-based saliency map. The regions predicted by the model are used to actively select regions in the scene toward which to guide the viewer's attention. The adaptive model can highlight task-relevant regions that have not been viewed, or other salient regions in the image, to assist the viewer in completing the task. An adaptive gaze guidance technique will enable researchers to quickly and accurately direct the viewer's attention to unattended relevant regions in the image. Such a model is novel in that it selects regions of interest in the image to guide a viewer based on the current viewing pattern. The location and order of fixations of no two viewers are the same; hence, manually pre-selecting regions to guide attention (as done in previous work) is not ideal. A gaze guidance model of this nature eliminates the need for manual intervention and adapts in real time for each image being viewed. The model can learn over time and provide assistance to the viewer in real time while performing the task at hand. Our adaptive subtle gaze guidance technique can also be deployed in psychophysical experiments involving short-term information recall, learning, visual search, and problem solving tasks.

3.3 Background and Related Work

Jonides [62] explored the differences between voluntary and involuntary attention shifts and referred to cues that trigger involuntary eye movements as pull cues. Computer-based techniques for providing these pull cues are often overt. These include simulating the depth-of-field effect from traditional photography to bring different areas of an image in or out of focus, or directly marking up the image to highlight areas of interest [63, 64]. The issue with these types of approaches is that they require permanent, overt changes to the image, which impact the overall viewing experience and may even hide or obscure important information in the image. Figure 3.1, for example, shows a mammogram with a red circle drawn to visually identify an abnormal region in the image. Actively guiding the viewer's attention to relevant information has been shown to improve problem solving [64, 65]. Guiding attention has been shown to enhance spatial learning by improving the recollection of the location, size, and shape of objects in images [66, 67, 68]. It has also been shown to improve training, learning, and education [71, 72, 73, 74]. Gaze manipulation strategies have also been used to improve performance on visual search tasks by either guiding attention to previously unattended regions [69] or guiding attention directly to the relevant regions in a scene [70]. Subtle techniques have been proposed to guide the viewer's attention effectively to regions of interest in a scene using remote eye trackers [61].


Figure 3.1: A mammogram image. The large red circle shows the area marked by the expert as an irregularity.

Our proposed approach is based on the Subtle Gaze Direction (SGD) technique, which works by briefly introducing motion cues (image-space modulations) in the peripheral region of the field of view [61]. Since the human visual system is highly sensitive to motion, these brief modulations serve as excellent pull cues. To achieve subtlety, these modulations are presented only in the peripheral regions of the field of view, as determined using a real-time eye tracking device. The eye tracker provides the current gaze position, giving an accurate location of where the subject is foveated. The peripheral modulations are terminated before the viewer can scrutinize them with their high-acuity foveal vision.

3.3.1 Subtle Gaze Direction

Figure 3.2 shows a hypothetical image; suppose the goal is to direct the viewer's gaze to some predetermined area of interest A. Let F be the position of the last recorded fixation, let $\vec{V}$ be the velocity of the current saccade, let $\vec{W}$ be the vector from F to A, and let $\theta$ be the angle between $\vec{V}$ and $\vec{W}$. Modulations are performed on the pixel region A. Once the modulation commences, saccadic velocity is monitored using feedback from an eye tracker, and the angle $\theta$ is continually updated using the geometric interpretation of the dot product. A small value of $\theta$ indicates that the center of gaze is moving towards the modulated region. In such cases, the modulation is terminated immediately. This contributes to the overall subtlety of the technique. By repeating this process for other predetermined areas of interest, the viewer's gaze is directed about the scene.

Figure 3.2: Hypothetical image with current fixation region F and predetermined region of interest A. The inset illustrates the geometric dot product used to compute $\theta$.

A user study conducted with 10 participants showed that the "activation time" (from the start of the modulation to the detection of movement towards the modulation) was within 0.5 seconds for nearly 75% of the target regions, indicating that participants responded to the majority of modulations. Nearly 70% of the fixations were within one perceptual span of the modulation and 93% were within two perceptual spans. Finally, Figure 3.3 shows that it is possible to guide the viewer's attention to regions of interest in a subtle manner.

Figure 3.3: Gaze distributions for an image under static and modulated conditions. Input image (top). Gaze distribution for the static image (bottom left). Gaze distribution for the modulated image (bottom right). White crosses indicate locations preselected by researchers for modulation.

The user study shows that it is possible to guide the subject's attention to relevant regions of the scene. However, while these observations show that the SGD technique is successful in directing gaze, it does not necessarily mean that the viewer fully processed the visual details of the modulated regions or remembered them. To better understand the impact of Subtle Gaze Direction on short-term spatial information recall and its applicability to training scenarios, we have already conducted several studies. See [68, 71, 72, 75] for more details.
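The termination test at the heart of SGD can be sketched in a few lines; the angle threshold below is an assumed placeholder value, and in practice the decision would be driven by the eye tracker's streaming gaze samples.

```python
import numpy as np

def should_terminate_modulation(prev_gaze, curr_gaze, target_center,
                                angle_threshold_deg=20.0):
    """Return True when the saccade is heading toward the modulated region A.

    prev_gaze, curr_gaze: successive gaze samples (x, y) from the eye tracker.
    target_center: center (x, y) of the modulated region of interest A.
    The angle between the gaze-velocity vector V and the vector W (fixation
    to A) is computed via the dot product; a small angle means the gaze is
    moving toward A, so the modulation should stop.
    """
    v = np.asarray(curr_gaze, float) - np.asarray(prev_gaze, float)      # saccade direction V
    w = np.asarray(target_center, float) - np.asarray(prev_gaze, float)  # vector W to target
    if np.linalg.norm(v) < 1e-6 or np.linalg.norm(w) < 1e-6:
        return False  # no detectable eye movement yet
    cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    theta = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    return theta < angle_threshold_deg
```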

3.4 Proposed Approach

Our approach combines the subtle gaze direction technique with the saliency and task-based eye movement prediction model (Chapter 2) to actively and adaptively guide the viewer's attention to task-relevant regions in the scene. By combining the two methods we can guide the viewer's attention in real time based on the predicted gaze obtained from the comprehensive model, while achieving the subtlety needed to ensure minimal impact on the overall viewing experience.

3.4.1 Adaptive Subtle Gaze Direction Using Estimated Gaze

The biggest challenge for gaze guidance is that the viewer's next fixation is not available ahead of time; it has to be computed from the direction of the eye movement (saccade velocity) with the help of an eye tracker. Furthermore, in previous work the regions to which the subject's gaze is to be guided were pre-computed manually and the sequence of regions was fixed ahead of time. This approach is both time consuming and cumbersome, since each viewer's scanpath is unique and changes based on the task at hand. Our saliency and task-based eye movement prediction model can be used to automatically generate task-relevant regions for the gaze guidance technique.

Figure 3.4: The image on the left is viewed by the subject when assigned the task of counting the number of deer in the scene; the red circles indicate the viewer's fixation data. The image on the right shows the corresponding task-based saliency map, highlighting task-relevant regions used to direct the viewer's attention.


Figure 3.4 shows an image viewed by a subject whose assigned task was to count the number of deer in the scene. The corresponding task-based saliency map is shown on the right, highlighting the regions of the image that are task relevant. The subject is eye tracked during the task, and our proposed model predicts the gaze of the user based on the series of fixations recorded. The intensity map (right image) indicates the priority and task relevance of regions in the scene. Task-relevant regions are placed in a queue based on their saliency value and are moved to the end or popped once the subject has scrutinized the region for a desired duration of time. The model can then guide the subject, using SGD, to those task-relevant regions that were previously unattended. The viewer's gaze is directed to task-relevant regions by presenting a brief luminance modulation in the peripheral region of the field of view, and the modulation is terminated as soon as the saccade is directed towards the region of interest. This approach ensures that our model can subtly guide viewer attention to task-relevant regions that were previously unattended by the subject, and that maximum visual coverage is achieved for successful completion of the task.
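One possible data structure for this bookkeeping is sketched below: a priority queue of task-relevant regions ordered by task-based saliency, where a region is retired once it has accumulated enough dwell time. The region format and the dwell threshold are assumptions for illustration.

```python
import heapq

class TaskRegionQueue:
    """Priority queue of task-relevant regions ordered by task-based saliency."""

    def __init__(self, regions, dwell_needed=0.3):
        # regions: list of (saliency, (x, y, w, h)); dwell_needed is an assumed
        # per-region scrutiny time in seconds. heapq is a min-heap, so saliency
        # is negated to pop the most salient region first.
        self._heap = [(-sal, i, box) for i, (sal, box) in enumerate(regions)]
        heapq.heapify(self._heap)
        self._dwell = {i: 0.0 for i in range(len(regions))}
        self._dwell_needed = dwell_needed

    def record_fixation(self, region_index, duration):
        # Accumulate how long the viewer has scrutinized this region.
        self._dwell[region_index] += duration

    def next_unattended(self):
        # Return the most salient region not yet scrutinized long enough;
        # sufficiently attended regions are popped and retired.
        while self._heap:
            neg_sal, idx, box = self._heap[0]
            if self._dwell[idx] >= self._dwell_needed:
                heapq.heappop(self._heap)
                continue
            return idx, box
        return None  # every task-relevant region has been attended

# Example: queue = TaskRegionQueue([(0.9, (40, 60, 32, 32)), (0.4, (200, 80, 32, 32))])
```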

3.5 Evaluation

3.5.1 User Study

The goal of the user study will be to test the effectiveness of adaptive subtle gaze guidance using gaze estimates from the proposed model. Participants will be chosen randomly and eye tracked while viewing a collection of static images. All participants will have normal or corrected-to-normal vision with no cases of color blindness, and each will undergo a brief calibration procedure to ensure proper eye tracking. The images are pre-processed so that the saliency map, gist, and local image features are computed along with the previously recorded eye movement data, as described in Chapter 2. After the subject has viewed the scene for a short period of time, the model gathers the subject's eye movement data in real time and attempts to guide their attention to unattended task-relevant regions.


This ensures that all task-relevant regions are attended and that the image is sufficiently scrutinized to successfully complete the task at hand. The relevant regions are highlighted by briefly presenting motion cues (image-space modulations) in the peripheral region of the field of view. Eye tracking data and scene stimuli for each subject are recorded, and task performance will be compared against a control group that is not guided using the adaptive subtle gaze guidance technique. The following methods will be used to evaluate performance:

Activation Time

Activation time is defined as the time elapsed between the start of the modulation and the detection of eye movement in the direction of the modulation. In the original Subtle Gaze Direction study [61], the criterion for terminating the modulation was met within 0.5 seconds for approximately 75 percent of the target regions and within 1 second for approximately 90 percent of the target regions, indicating that participants responded to the majority of the modulations. Adaptive subtle gaze guidance should be tested to ensure that its activation time is similar to or better than that of the original SGD technique. In the SGD technique, modulated regions were manually pre-selected to ensure fast onset and termination of the visual cues. In adaptive subtle gaze guidance, the model has to predict the next possible fixation while keeping account of the sequence of fixations previously made by the viewer; it must run in real time and accurately predict the next task-relevant gaze location in order to decide whether the viewer's attention should be guided to a new location.

Accuracy Measurement

For tasks involving problem solving, training, or visual search it is important to measure the accuracy of performance. The adaptively gaze-guided group is compared with an unguided (static) group to determine whether it performed significantly better on the given task. The accuracy of the groups is evaluated using the following measures (a brief computational sketch of these measures is given after the list):

• Binary Classification Statistics


Binary classification statistics [76] can be used to establish measures of accuracy as well as sensitivity and specificity. To calculate these properties it is necessary to categorize the test outcomes as true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Sensitivity is computed as follows:

$$\text{Sensitivity} = \frac{\#\text{TP} \times 100}{\#\text{TP} + \#\text{FN}} \qquad (3.1)$$

Specificity is defined as follows:

$$\text{Specificity} = \frac{\#\text{TN} \times 100}{\#\text{TN} + \#\text{FP}} \qquad (3.2)$$

The sensitivity and specificity values can then be combined to produce a binary classification based measure of accuracy as follows:

$$\text{Accuracy} = \frac{(\#\text{TP} + \#\text{TN}) \times 100}{\#\text{TP} + \#\text{TN} + \#\text{FN} + \#\text{FP}} \qquad (3.3)$$

The accuracy value can be compared between the adaptive subtle gaze guided and control groups; higher accuracy would indicate better performance on the task at hand.

• Area Under Curve (AUC)

The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) treats region selection as a binary classifier with a variable threshold. The AUC is used to assess the performance of the adaptive gaze guidance technique; a value of 1 indicates perfect classification of task-relevant regions. The ROC curve is effectively used to test whether the regions selected by the viewers are classified better than random, ensuring that neither the control group's nor the adaptive gaze guided group's performance is due to chance. The AUC or ROC, together with the accuracy measure from


binary classification, will provide a complete picture of group performance.

• Levenshtein Distance

Levenshtein distance [77, 78] is a string metric, developed in information theory and computer science, for computing differences between sequences. It provides an appropriate measure for comparing performance on tasks that require an ordered sequence of regions. To compare sequences using Levenshtein distance, the correct (intended) viewing order of each image is converted into a string sequence; all responses from each participant are likewise converted to string sequences to facilitate comparison with the correct sequence. Since the number of relevant regions varies across images, the distance computed for each image is normalized by dividing by the number of correct regions for the task. Each correct region is assigned a label. For example, suppose the correct viewing order of eight relevant regions in a scene is [ABCDEFGH]. A Levenshtein distance of 0 (response [ABCDEFGH]) indicates no difference, while a heavily shuffled response such as [DCBAGHFE] approaches the maximum possible distance of 8.
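As a rough illustration, the following Python sketch implements equations (3.1)–(3.3) and the normalized Levenshtein comparison described above. It is only a sketch under stated assumptions: the function names, the encoding of viewing orders as label strings, and the example values are ours, not part of the proposed system.

```python
def sensitivity(tp, fn):
    # Equation (3.1): percentage of task-relevant regions correctly attended.
    return 100.0 * tp / (tp + fn)

def specificity(tn, fp):
    # Equation (3.2): percentage of irrelevant regions correctly ignored.
    return 100.0 * tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    # Equation (3.3): overall binary-classification accuracy.
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two label strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_viewing_error(correct_order, response_order):
    """Levenshtein distance normalized by the number of correct regions."""
    return levenshtein(correct_order, response_order) / len(correct_order)

# Example values from the text: a shuffled response against the intended order.
print(accuracy(tp=6, tn=10, fp=2, fn=2))                      # 80.0
print(normalized_viewing_error("ABCDEFGH", "DCBAGHFE"))       # 7/8 = 0.875
```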

Chapter 4

Task Inference Problem

4.1 Problem Definition

It has been shown that the task at hand greatly influences visual attention. The best-known demonstration of task-based, top-down attention was provided by Yarbus in 1967 [5, 79]. Eye movements convey vital information about the cognitive processes involved in performing tasks such as driving, reading, visual search, and scene understanding. Eye movements reveal shifts of attention, and a sequence of eye movements relates closely to the task at hand; the difficulty and complexity of the task also significantly influence eye movements. This is based on the assumption that eye movements and visual cognition are highly correlated [22]. Eye movements can be used both as data for understanding the underlying cognitive processes and as a means of validating computational models of visual attention. Thus, eye movements are used to represent the task at hand, and the extracted fixations are used as features for the computational model described in Chapter 2. However, the inverse of this process, determining the task at hand from eye movement data, is very difficult. Eye movements are made both to perform the task at hand and to gather additional information about the scene while performing it, and salient regions of the image that are task irrelevant also attract visual attention. Differentiating eye movements as task-based or information-gathering is the holy grail of eye tracking.


Hence the task inference problem can be defined as identifying the task performed by the user while viewing the scene, with the help of image features and real-time eye movement data. A generic model that predicts an arbitrary task from eye movement data is far from reach, so the problem needs to be simplified: for a given set of stimulus images and relevant tasks, is it possible to identify the task based on eye movement data? It is also important to extend the idea to any new image that can be classified into an existing image group in the data set and for which the tasks defined for that group are relevant. For example, if image A in Figure 4.1 belongs to the street image group and the task is to "locate the cars" in the image, then image B can be a new stimulus image that is classified as a street image and to which the same task applies.

Figure 4.1: Image A is a street-view image with eye tracking data, where the task provided to the viewer was to locate the cars in the scene. Image B shows a similar image that can be classified as a street image and also contains cars, making it task relevant.

The problem can now be defined as follows: given a set of $p$ images in a group ($i_{1..p} \subset I$), eye movement data $E_i$ for each image in the group, and $n$ assigned tasks ($t_{1..n} \subset T$), the model should be able to identify the task performed by the viewer on a new stimulus image $i_{new}$ that can be classified in $I$ and to which the tasks $t_{1..n} \subset T$ are relevant.
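To make this formulation concrete, the sketch below frames task inference as selecting, for a new image, the task whose task-based saliency map best explains the observed fixations. The map representation, the simple scoring rule (mean map value at fixation locations), and all names are assumptions introduced for illustration; the proposed model combines additional features as described in Chapter 2.

```python
import numpy as np

def fixation_score(saliency_map, fixations):
    """Mean task-based saliency value sampled at the observed fixation points."""
    h, w = saliency_map.shape
    values = [saliency_map[min(int(y), h - 1), min(int(x), w - 1)]
              for (x, y) in fixations]
    return float(np.mean(values)) if values else 0.0

def infer_task(task_maps, fixations):
    """
    task_maps: dict mapping each task t in T to its task-based saliency map
               for the new image i_new (maps assumed normalized to [0, 1]).
    fixations: list of (x, y) fixation coordinates recorded on i_new.
    Returns the task whose map best explains the fixations, plus all scores.
    """
    scores = {task: fixation_score(m, fixations) for task, m in task_maps.items()}
    return max(scores, key=scores.get), scores

# Hypothetical usage with two candidate tasks on a 480 x 640 image.
rng = np.random.default_rng(0)
maps = {"locate cars": rng.random((480, 640)),
        "count pedestrians": rng.random((480, 640))}
task, scores = infer_task(maps, [(100, 200), (320, 240), (50, 400)])
print(task, scores)
```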

4.2 Research Objectives and Contributions

In this chapter the objective is to develop a framework for predicting the task being undertaken based on scene context, bottom-up scene saliency, and eye movement data. The model is first trained on a set of stimulus images to compute task-based saliency maps for each task across multiple human subjects. A human observer (not part of the training phase) is then presented with a new image that is similar to the training images (via scene classification) and relevant to the task at hand. The model must then, in real time, accurately predict the task the user is performing on the stimulus image based on the gathered eye movement data. Such a model can provide vital information about the viewer's intent in repeated image search tasks. For example, TSA experts look for specific (hazardous or harmful) objects in images, and this search process is highly repetitive; the model can be tuned to specific image groups, enabling assistance in the visual search process. The idea can be applied broadly to image search tasks, and it can also be used in training and learning environments to better understand viewers' eye movements. Finally, this model can provide a rich data set of stimulus images and corresponding task-dependent eye movements, which can serve as ground-truth information for other visual attention models. This data set can also be used for validation and for conducting empirical and performance studies with different saliency models.

4.3 Background and Related Work

Yarbus showed that eye movements depend not only on the scene presented but also on the current task at hand. Subjects were asked to view a picture (a room with a family and an unexpected visitor entering the room) under different task conditions, including estimating the ages and material circumstances of the family, judging the family's reaction, and free viewing. Figure 4.2 shows the scan paths of a subject for the various tasks while viewing the same stimulus image. Attention in humans has also been broadly differentiated into two kinds, "covert" and "overt".


Figure 4.2: The experiment conducted by Yarbus in 1967. The top-left image shows the picture of the "family and an unexpected visitor"; the remaining panels show the scan paths of a subject for each task while viewing the stimulus image.

Overt attention is the process of directing the fovea toward a desired object or stimulus in order to fixate on it and gather information. Covert attention, on the other hand, is the process of gathering information about surrounding objects while focusing on another object, without necessarily making an eye movement. An example of covert attention occurs while driving: the driver, while focusing on the road, covertly keeps track of the gauges, road signs, and traffic lights. Covert attention allows the visual system to quickly gather information about other interesting objects or features in the scene beyond the one currently fixated. It has been attributed to the physiology of the eye, which


directs attention to other locations in the scene to gather information for the next fixation [80]. However, researchers are still trying to understand the complex interactions between overt and covert attention. Many computational models try to find the regions that attract eye fixations and thereby explain overt attention; however, there are no computational frameworks that explain the reasons and mechanisms of covert attention, nor is there a known measure of covert attention. Thus, visual saliency models capture the likelihood of a region in the scene being attended to, but cannot explain whether the information gathered there is acquired through covert or overt attention. Most models address very specific tasks, such as locating humans, which requires detecting human faces [81, 47], skin color [82], or skeletal structure and posture. There have been approaches to detect specific features such as skin, faces, horizontal and vertical lines, curvature, corners and crosses, shape, texture, and depth; these features make it possible to differentiate salient regions and to group similar regions. Other approaches use specific object detection and scene classification techniques to identify images of interest, and still other models predict the gaze of subjects for a very specific task within a controlled setup [53, 83]. However, there are no known models that use eye movement data to predict the task being performed by the user on similar images.

4.4 Approach

Task inference using only eye movements is a very complex problem, and it has been shown to be extremely difficult to differentiate task-related fixations from other fixations in the scene. The comprehensive computational model of human visual attention proposed here uses saliency, gist, local image-based features, and eye movement data, and it also encodes the task at hand to predict the viewer's gaze. The proposed model will narrow down the regions of interest in the scene further than a saliency map or eye movement data alone. When the task-based saliency map of the model


is generated using eye movement data from multiple subjects, the task-relevant regions are isolated. Predicting the task-relevant gaze position and comparing it with real-time eye movement data then enables the model to run as a closed feedback loop that infers the task being performed by the user. For example, in a driving scenario where the task is to locate speed signs, fixations will fall on the speed signs (if present) in the scene; if there are multiple speed signs, attention will shift from one sign to another. Figure 4.3 shows eye-tracked images of a person driving a virtual truck. In image A the subject was given the task of monitoring traffic using the rear-view mirrors, whereas in image B the task was to monitor the gauges and other instruments while driving. In both images it can be clearly seen that there are fixations on the task-relevant regions as well as other fixations made to gather additional information. The proposed model will predict the task-relevant regions for every task specified for the image and will infer the task performed by the user based on real-time eye movement data.
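The feedback-loop behavior described above might be approximated by accumulating per-task evidence as fixations arrive and stopping once one task clearly dominates. The sketch below is speculative: the evidence update rule, the confidence margin, and the assumption of at least two candidate tasks with NumPy-style saliency maps are ours, not part of the proposed model.

```python
def infer_task_online(task_maps, fixation_stream, margin=0.15):
    """
    task_maps: dict task -> task-based saliency map (2-D array, values in [0, 1]);
               assumes at least two candidate tasks.
    fixation_stream: iterable yielding (x, y) fixations in arrival order.
    Returns the inferred task once its mean evidence exceeds the runner-up
    by `margin`, or the current best guess when the stream ends.
    """
    totals = {t: 0.0 for t in task_maps}
    count = 0
    for (x, y) in fixation_stream:
        count += 1
        for task, smap in task_maps.items():
            h, w = smap.shape
            totals[task] += smap[min(int(y), h - 1), min(int(x), w - 1)]
        ranked = sorted(totals, key=totals.get, reverse=True)
        best, second = ranked[0], ranked[1]
        if count >= 5 and (totals[best] - totals[second]) / count > margin:
            return best  # confident enough to report the inferred task early
    return max(totals, key=totals.get)
```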

4.5 Evaluation

Eye tracking has been used extensively to validate visual attention and gaze guidance models; researchers manually record and inspect eye movement data to understand the cognitive processes involved in performing a task. The proposed model will predict the task being performed by the user from the image saliency map, gist, local image-based features, and pre-computed eye movement data, so it is necessary to evaluate the model's performance on task inference. For a given trial, the model's accuracy is 100% if it predicts the performed task correctly and 0% if it fails to do so.

4.5.1 User Study

The goal of the user study is to test the accuracy of the model in predicting the task performed by the user given the image saliency map, gist, local image-based


features, and pre-computed eye movement data. The model will already have been evaluated on gaze prediction as described in Chapter 2. A user study will be conducted to evaluate how quickly and accurately the model performs task inference. Participants will be chosen randomly and eye tracked while viewing a collection of static images that were not used to train the model. All participants will have normal or corrected-to-normal vision with no cases of color blindness. Each participant will be assigned a specific task while viewing an image for a specified period of time, and the model will attempt to infer the task being performed. At the end of the study the task assigned to the subject will be compared with the task inferred by the model, and the speed and accuracy of the model will be assessed for each image and for each image group overall. The subject's eye movement data will be recorded and the fixation map compared to the task-based saliency map computed by the model. The evaluation measures discussed in Chapter 2, Section 2.6 can be used to compare the fixation distribution to the task-based saliency map, and a binary classification test can also be performed as described in Chapter 3, Section 3.5.
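One common way to compare a fixation distribution against a saliency map is the normalized scanpath saliency (NSS), the mean of the z-scored map sampled at fixation locations. Whether NSS is among the Chapter 2 measures is not restated here, so the following sketch is offered purely as an illustration of such a comparison; the map, fixations, and names are placeholders.

```python
import numpy as np

def normalized_scanpath_saliency(saliency_map, fixations):
    """
    NSS: mean value of the z-scored saliency map sampled at fixation locations.
    Higher values mean the fixations fall on regions the map considers salient.
    """
    smap = np.asarray(saliency_map, dtype=float)
    z = (smap - smap.mean()) / (smap.std() + 1e-8)   # z-score the map
    h, w = z.shape
    samples = [z[min(int(y), h - 1), min(int(x), w - 1)] for (x, y) in fixations]
    return float(np.mean(samples))

# Hypothetical usage: a task-based saliency map and recorded fixations.
rng = np.random.default_rng(1)
task_map = rng.random((480, 640))
fixations = [(120, 80), (400, 300), (610, 450)]
print(normalized_scanpath_saliency(task_map, fixations))
```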


Figure 4.3: Image (A) shows the eye movements of a subject given the task of checking the rear-view mirrors. Image (B) shows the eye movements of the subject given the task of checking the gauges on the dashboard. The red/green circles indicate the fixations made by the subject while performing the task, and the number inside each circle shows the order of the fixations. Note that the subject also gathers information about the road and the GPS while performing the task at hand.

Chapter 5

Timeline

Chapter 6

Conclusion

A comprehensive model of human visual attention prediction was proposed that incorporates scene context, a bottom-up saliency map, the task at hand, and eye movement data across multiple subjects. The model's predicted gaze could be used to actively and adaptively guide the viewer's attention in static images. Finally, the interesting and challenging problem of predicting the task at hand using the subject's eye movement data together with image saliency, gist, and local image features was discussed. The following future ideas can be seen as natural extensions of the work discussed in this proposal:

• Specialized visual search tasks: The proposed computational model can be used in fields where visual search is the primary task, such as medical imaging, airport security, and remote sensing. For example, consider a radiologist searching for abnormal regions in a mammogram: at the end of the task, our prediction system can suggest other regions to examine that he or she might have missed. In this manner, the technology provides assistance rather than attempting to replace the expert. The model can also be used to understand differences in visual attention between subjects.

• Dynamic images: The proposal currently deals only with static natural images, and an important extension would be to handle dynamic images.


There are very few visual attention models that work accurately and in real time on video. Extending the model to dynamic images would be a valuable future addition and would pose new algorithmic and engineering challenges in processing video frames in real time and predicting eye movements on video.

• Real-world environments: With the availability of head-mounted eye trackers, gaze data can be obtained while performing complex outdoor tasks such as driving, cooking, navigating, and playing sports. With the ability to predict gaze in real time on video, the model and augmented reality solutions could be used to guide the viewer in the real world in a similar fashion to what we have already demonstrated for desktop applications.

• Virtual reality environments: There is a great amount of research on understanding visual perception within virtual reality (VR) environments, where researchers study visual search behavior and human interaction. The proposed model can be used to predict eye movements during a visual search task in a head-mounted display with eye tracking capability, enabling researchers to better understand search strategies and behavior in VR systems.

Appendix A

Eye Tracking Datasets

Table A.1: Eye tracking datasets over still images. The D, T, and d columns stand for viewing distance in centimeters, stimulus presentation time in seconds, and screen size in inches, respectively. Reproduced from [1].

The table reports, for each dataset, the number of subjects and scenes, the stimulus resolution, presentation time T, viewing distance D, screen size d, eye tracker model, sampling frequency f [Hz], and head restraint (chin rest, head rest, bite bar, head mount, or none). The datasets covered are FiFA [81], GazeCom Image [84], IRCCyN Image 1 [85], IRCCyN Image 2 [86], KTH [87], LIVE DOVES [88], McGill ImgSal [89], MIT Benchmark [90], MIT CSAIL [19], MIT CVCL [83], MIT LowRes [91], NUSEF [92], Toronto [35], TUD Image 1 [93], TUD Image 2 [94], TUD Interactions [95], and VAIQ [96]; the complete column entries are given in [1].

Bibliography

[1] Stefan Winkler and Subramanian Ramanathan. Overview of eye tracking datasets. In QoMEX, pages 212–217, 2013.

[2] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[3] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489–1506, May 2000.
[4] S.P. Liversedge, I.D. Gilchrist, and S. Everling, editors. The Oxford Handbook on Eye Movements, chapter Visual Cognition and Eye Movements, pages 511–614. Oxford University Press, Oxford, England, 2011.
[5] A. L. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.
[6] J. M. Henderson and A. Hollingworth. Eye movements during scene viewing: An overview. In G. Underwood, editor, Eye Guidance in Reading and Scene Perception, pages 269–293. Oxford: Elsevier, 1998.
[7] Benjamin W. Tatler, Nicholas J. Wade, Hoi Kwan, John M. Findlay, and Boris M. Velichkovsky. Yarbus, eye movements, and vision. i-Perception, 1, 2010.
[8] N. H. Mackworth and A. J. Morandi. The gaze selects informative details within pictures. Perception and Psychophysics, 2:547–552, 1967.


[9] S. K. Mannan, K. H. Ruddock, and D. S. Wooding. The relationship between the locations of spatial features and those of fixations made during visual examination of briefly presented images. Spatial Vision, 10:165–188, 1996.
[10] D. Parkhurst and E. Niebur. Scene content selected by active vision. Spatial Vision, 16:125–154, 2003.
[11] Xin Chen and Gregory J. Zelinsky. Real-world visual search is dominated by top-down guidance. Vision Research, 46(24):4118–4133, November 2006.
[12] Geoffrey Underwood and Tom Foulsham. Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Q J Exp Psychol (Hove), 59(11):1931–49, Nov 2006.
[13] Antonio Torralba, Aude Oliva, Monica S. Castelhano, and John M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113(4):766–786, October 2006.
[14] James R. Brockmole and John M. Henderson. Recognition and attention guidance during contextual cueing in real-world scenes: evidence from eye movements. Q J Exp Psychol (Hove), 59(7):1177–87, Jul 2006.
[15] L. Itti. Models of Bottom-Up and Top-Down Visual Attention. Ph.D. thesis, California Institute of Technology, Pasadena, California, Jan 2000.
[16] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259, November 1998.
[17] Antonio Torralba. Modeling global scene factors in attention. JOSA A, 20(7):1407–1418, 2003.
[18] Tie Liu, Zejian Yuan, Jian Sun, Jingdong Wang, Nanning Zheng, Xiaoou Tang, and Heung-Yeung Shum. Learning to detect a salient object. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):353–367, 2011.


[19] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2106–2113. IEEE, 2009.
[20] Ali Borji, Majid N Ahmadabadi, and Babak N Araabi. Cost-sensitive learning of top-down modulation for attentional control. Machine Vision and Applications, 22(1):61–76, 2011.
[21] V. Navalpakkam and L. Itti. Modeling the influence of task on attention. Vision Research, 45(2):205–231, Jan 2005.
[22] Mary Hayhoe and Dana Ballard. Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4):188–194, 2005.
[23] Dana H. Ballard, Mary M. Hayhoe, and Jeff B. Pelz. Memory representations in natural tasks. J. Cognitive Neuroscience, 7(1):66–80, January 1995.
[24] K Chajka, M Hayhoe, B Sullivan, J Pelz, N Mennie, and J Droll. Predictive eye movements in squash. Journal of Vision, 6(6):481, 2006.
[25] J. N. Bailenson and N. Yee. Digital chameleons: automatic assimilation of nonverbal gestures in immersive virtual environments. Psychol Sci, 16(10):814–819, Oct 2005.
[26] Vidhya Navalpakkam, Christof Koch, Antonio Rangel, and Pietro Perona. Optimal reward harvesting in complex perceptual environments. Proceedings of the National Academy of Sciences, 107(11):5232–5237, 2010.
[27] Victor A Mateescu, Hadi Hadizadeh, and Ivan V Bajic. Evaluation of several visual saliency models in terms of gaze prediction accuracy on video. In Globecom Workshops (GC Wkshps), 2012 IEEE, pages 1304–1308. IEEE, 2012.
[28] Umesh Rajashekar, Lawrence K Cormack, and Alan C Bovik. Image features that draw fixations. In Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, volume 3, pages III–313. IEEE, 2003.


[29] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, Jan 2013.
[30] Olivier Le Meur, Patrick Le Callet, and Dominique Barba. Predicting visual fixations on video based on low-level visual features. Vision Research, 47(19):2483–2498, 2007.
[31] Aude Oliva, Antonio Torralba, Monica S Castelhano, and John M Henderson. Top-down control of visual attention in object detection. In Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, volume 1, pages I–253. IEEE, 2003.
[32] Dashan Gao and Nuno Vasconcelos. Discriminant saliency for visual recognition from cluttered scenes. In Advances in Neural Information Processing Systems, pages 481–488, 2004.
[33] Dashan Gao, Sunhyoung Han, and Nuno Vasconcelos. Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(6):989–1005, 2009.
[34] Ruth Rosenholtz. A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39(19):3157–3163, 1999.
[35] Neil Bruce and John Tsotsos. Saliency based on information maximization. In Advances in Neural Information Processing Systems, pages 155–162, 2005.
[36] Albert Ali Salah, Ethem Alpaydin, and Lale Akarun. A selective attention-based method for visual pattern recognition with application to handwritten digit recognition and face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(3):420–425, 2002.
[37] Xiaodi Hou and Liqing Zhang. Saliency detection: A spectral residual approach. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007.


[38] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. Frequency-tuned salient region detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1597–1604. IEEE, 2009.
[39] Peng Bian and Liming Zhang. Biological plausibility of spectral domain approach for spatiotemporal visual saliency. In Advances in Neuro-Information Processing, pages 251–258. Springer, 2009.
[40] H. E. Egeth and S. Yantis. Visual attention: control, representation, and time course. Ann. Rev. Psychol., 48:269–297, 1997.
[41] L. Itti and C. Koch. Computational modelling of visual attention. Nat. Rev. Neurosci., 2(3):194–203, Mar 2001.
[42] A. Borji, D. N. Sihite, and L. Itti. Computational modeling of top-down visual attention in interactive environments. In Proc. British Machine Vision Conference (BMVC 2011), pages 85.1–85.12, Sep 2011.
[43] Jeremy M Wolfe. Guided search 4.0. Integrated Models of Cognitive Systems, pages 99–119, 2007.
[44] V. Navalpakkam and L. Itti. An integrated model of top-down and bottom-up attention for optimal object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2049–2056, New York, NY, Jun 2006.
[45] Lior Elazary and Laurent Itti. A Bayesian model for efficient visual search and recognition. Vision Research, 50(14):1338–1352, 2010.
[46] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.


[47] Paul Viola and Michael J Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[48] Olivier R Joubert, Denis Fize, Guillaume A Rousselet, and Michèle Fabre-Thorpe. Early interference of context congruence on object processing in rapid visual categorization of natural scenes. Journal of Vision, 8(13):11, 2008.
[49] Wolfgang Einhäuser, Merrielle Spain, and Pietro Perona. Objects predict fixations better than early saliency. Journal of Vision, 8(14):18, 2008.
[50] Alex D Hwang, Hsueh-Cheng Wang, and Marc Pomplun. Semantic guidance of eye movements in real-world scenes. Vision Research, 51(10):1192–1205, 2011.
[51] Laura Walker Renninger and Jitendra Malik. When is scene identification just texture recognition? Vision Research, 44(19):2301–2311, 2004.
[52] C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):300–312, Feb 2007.
[53] Robert J Peters and Laurent Itti. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007.
[54] A. Borji, D. N. Sihite, and L. Itti. Computational modeling of top-down visual attention in interactive environments. In Proc. British Machine Vision Conference (BMVC 2011), pages 85.1–85.12, Sep 2011.
[55] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, November 2004.
[56] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Comput. Vis. Image Underst., 110(3):346–359, June 2008.


[57] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, pages 36.1–36.10. BMVA Press, 2002. doi:10.5244/C.16.36.
[58] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Hum Neurobiol, 4(4):219–227, 1985.
[59] Laurent Itti and Pierre F Baldi. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems, pages 547–554, 2005.
[60] Robert J Peters, Asha Iyer, Laurent Itti, and Christof Koch. Components of bottom-up gaze allocation in natural images. Vision Research, 45(18):2397–2416, 2005.
[61] Reynold Bailey, Ann McNamara, Nisha Sudarsanam, and Cindy Grimm. Subtle gaze direction. ACM Trans. Graph., 28(4):100:1–100:14, September 2009.
[62] John Jonides. Voluntary versus automatic control over the mind's eye's movement, volume 9, pages 187–203. Erlbaum, 1981.
[63] D. DeCarlo and A. Santella. Stylization and abstraction of photographs. In SIGGRAPH '02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 769–776, New York, NY, USA, 2002. ACM Press.
[64] E.R. Grant and M. J. Spivey. Eye movements and problem solving: guiding attention guides thought. Psychological Science, 14(5):462–466, 2003.
[65] Martin Groen and Jan Noyes. Solving problems: How can guidance concerning task-relevancy be provided? Comput. Hum. Behav., 26:1318–1326, November 2010.
[66] Dirk Walther, Ueli Rutishauser, Christof Koch, and Pietro Perona. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Comput. Vis. Image Underst., 100(1-2):41–63, October 2005.


[67] L.E. Thomas and A. Lleras. Moving eyes and moving thought: on the spatial compatibility between eye movements and cognition. Psychonomic Bulletin and Review, 14(4):663–668, August 2007.
[68] Reynold Bailey, Ann McNamara, Aaron Costello, Srinivas Sridharan, and Cindy Grimm. Impact of subtle gaze direction on short-term spatial information recall. In Proceedings of the Symposium on Eye Tracking Research and Applications, ETRA '12, pages 67–74, New York, NY, USA, 2012. ACM.
[69] Pernilla Qvarfordt, Jacob T Biehl, Gene Golovchinsky, and Tony Dunningan. Understanding the benefits of gaze enhanced visual search. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pages 283–290. ACM, 2010.
[70] Ann McNamara, Reynold Bailey, and Cindy Grimm. Improving search task performance using subtle gaze direction. In APGV '08: Proceedings of the 5th Symposium on Applied Perception in Graphics and Visualization, pages 51–56, New York, NY, USA, 2008. ACM.
[71] Srinivas Sridharan, Reynold Bailey, Ann McNamara, and Cindy Grimm. Subtle gaze manipulation for improved mammography training. In Proceedings of the Symposium on Eye Tracking Research and Applications, ETRA '12, pages 75–82, New York, NY, USA, 2012. ACM.
[72] Ann McNamara, Thomas Booth, Srinivas Sridharan, Stephen Caffey, Cindy Grimm, and Reynold Bailey. Directing gaze in narrative art. In Proceedings of the ACM Symposium on Applied Perception, SAP '12, pages 63–70, New York, NY, USA, 2012. ACM.
[73] P. Vaidyanathan, J. Pelz, Rui Li, S. Mulpuru, Dong Wang, Pengcheng Shi, C. Calvelli, and A. Haake. Using human experts' gaze data to evaluate image processing algorithms. In IVMSP Workshop, 2011 IEEE 10th, pages 129–134, June 2011.


[74] R. Li, P. Vaidyanathan, S. Mulpuru, J. Pelz, P. Shi, C. Calvelli, and A. Haake. Human-centric approaches to image understanding and retrieval. IEEE Western New York Image Processing Workshop, pages 62–65, 2010.
[75] Thomas Booth, Srinivas Sridharan, Ann McNamara, Cindy Grimm, and Reynold Bailey. Guiding attention in controlled real-world environments. In Proceedings of the ACM Symposium on Applied Perception, SAP '13, pages 75–82, New York, NY, USA, 2013. ACM.
[76] Tom Fawcett. ROC graphs: Notes and practical considerations for researchers. Technical report, 2004.
[77] V. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8–17, 1965.
[78] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, 1966.
[79] A. Borji and L. Itti. Defending Yarbus: Eye movements reveal observers' task. Journal of Vision, 14(3(29)):1–22, Mar 2014.
[80] M. S. Peterson, A. F. Kramer, and D. E. Irwin. Covert shifts of attention precede involuntary eye movements. Percept Psychophys, 66(3):398–405, Apr 2004.
[81] Moran Cerf, Jonathan Harel, Wolfgang Einhäuser, and Christof Koch. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems, pages 241–248, 2008.
[82] Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva. A survey on pixel-based skin color detection techniques. In Proc. Graphicon, volume 3, pages 85–92. Moscow, Russia, 2003.


[84] Michael Dorr, Thomas Martinetz, Karl R Gegenfurtner, and Erhardt Barth. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision, 10(10):28, 2010.
[85] Olivier Le Meur, Patrick Le Callet, Dominique Barba, and Dominique Thoreau. A coherent computational approach to model bottom-up visual attention. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(5):802–817, 2006.
[86] Junle Wang, Damon M Chandler, and Patrick Le Callet. Quantifying the relationship between visual salience and visual importance. In IS&T/SPIE Electronic Imaging, pages 75270K–75270K. International Society for Optics and Photonics, 2010.
[87] Gert Kootstra, Bart de Boer, and Lambert RB Schomaker. Predicting eye fixations on complex visual stimuli using local symmetry. Cognitive Computation, 3(1):223–240, 2011.
[88] Ian Van Der Linde, Umesh Rajashekar, Alan C Bovik, and Lawrence K Cormack. DOVES: a database of visual eye movements. Spatial Vision, 22(2):161–177, 2009.
[89] Jian Li, Martin D Levine, Xiangjing An, Xin Xu, and Hangen He. Visual saliency based on scale-space analysis in the frequency domain. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(4):996–1010, 2013.
[90] Tilke Judd, Frédo Durand, and Antonio Torralba. A benchmark of computational models of saliency to predict human fixations. 2012.
[91] Tilke Judd, Frédo Durand, and Antonio Torralba. Fixations on low-resolution images. Journal of Vision, 11(4):14, 2011.
[92] Subramanian Ramanathan, Harish Katti, Nicu Sebe, Mohan Kankanhalli, and Tat-Seng Chua. An eye fixation database for saliency detection in images. In Computer Vision–ECCV 2010, pages 30–43. Springer, 2010.


[93] Hantao Liu and Ingrid Heynderickx. Studying the added value of visual attention in objective image quality metrics based on eye movement data. In Image Processing (ICIP), 2009 16th IEEE International Conference on, pages 3097–3100. IEEE, 2009.
[94] Hani Alers, Lennart Bos, and Ingrid Heynderickx. How the task of evaluating image quality influences viewing behavior. In Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on, pages 167–172. IEEE, 2011.
[95] Judith Redi, Hantao Liu, Rodolfo Zunino, and Ingrid Heynderickx. Interactions of visual attention and quality perception. In IS&T/SPIE Electronic Imaging, pages 78650S–78650S. International Society for Optics and Photonics, 2011.
[96] Ulrich Engelke, Anthony Maeder, and H Zepernick. Visual attention modelling for subjective image quality databases. In Multimedia Signal Processing, 2009. MMSP'09. IEEE International Workshop on, pages 1–6. IEEE, 2009.
