
Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008

ISSN: 1135-5948

Articles

Modelling OLIF frame with EAGLES/ISLE specifications: an interlingual approach
Carlos Periñán-Pascual, Francisco Arcas-Túnez (p. 9)

Aggregation in the In-Home Domain
Eva Florencio, Gabriel Amores, Guillermo Pérez, Pilar Manchón (p. 17)

Detección de fármacos genéricos en textos biomédicos
Isabel Segura-Bedmar, Paloma Martínez, Dooa Samy (p. 27)

Bases de Conocimiento Multilíngües para el Procesamiento Semántico a Gran Escala
Montse Cuadros, German Rigau (p. 35)

From knowledge acquisition to information retrieval
Milagros Fernández Gavilanes, Sara Carrera Carrera, Manuel Vilares Ferro (p. 43)

Desarrollo de un Robot-Guía con Integración de un Sistema de Diálogo y Expresión de Emociones: Proyecto ROBINT
J.M. Lucas, R. Alcázar, J.M. Montero, F. Fernández, R. Barra-Chicote, L.F. D'Haro, J. Ferreiros, R. de Córdoba, J. Macías-Guarasa, R. San Segundo, J.M. Pardo (p. 51)

Experiments with an ensemble of Spanish dependency parsers
Roser Morante (p. 59)

Predicción estadística de las discontinuidades espectrales del habla para síntesis concatenativa
Manuel Pablo Triviño, Francesc Alías (p. 67)

Identificación de emociones a partir de texto usando desambiguación semántica
David García, Francesc Alías (p. 75)

InTiMe: Plataforma de Integración de Recursos de PLN
José Manuel Gómez (p. 83)

Non-Parametric Document Clustering by Ensemble Methods
Edgar Gonzàlez Pellicer, Jordi Turmo Borràs (p. 91)

An Innovative Two-Stage WSD Unsupervised Method
Javier Tejada-Cárcamo, Alexander Gelbukh, Hiram Calvo (p. 99)

Applying a culture dependent emotion triggers ...

... syn="S" sem="Agent" /> into onto with

3 Integrating frames into meaning postulates

Most semantic representations of verbs have traditionally taken one of two forms (Levin 1995): semantic role-centred approaches (Fillmore 1968, Gruber 1965), where verb arguments are identified on the basis of their semantic relations with the verb, or predicate decomposition approaches (Jackendoff 1972, Schank 1973), which involve the decomposition of verb meaning by means of a restricted set of primitive predicates. In FunGramKB, both approaches are integrated. As in semantic role-centred approaches, verbs are assigned one or more predicate frames.

Figure 2: Predicate frame of load

The predicate frame is a structural scheme in which the quantitative and qualitative⁶ valencies of the verb are stated: e.g. load has three subcategorized arguments with the semantic functions Agent, Theme and Goal. Moreover, predicate frames are enriched with information about subcategorization patterns describing the phrasal realizations and syntactic behaviour of the arguments which can linguistically co-occur with the verb. On the other hand, and as in predicate decomposition approaches, a lexical unit is linked to a meaning postulate through a conceptual unit in the FunGramKB ontology.⁷ Furthermore, predicate frames assigned to a lexical unit are integrated into the meaning representation to which the lexical unit is linked by means of the “thematic frame”. To illustrate, figure 3 displays both the parenthetic string representation and the XML representation of the thematic frame of +LOAD_00:

(x1: +HUMAN_00 ^ +VEHICLE_00)Agent (x2: +CORPUSCULAR_00)Theme (x3)Origin (x4: +HUMAN_00 ^ +ANIMAL_00 ^ +VEHICLE_00)Goal

Figure 3: Thematic frame of +LOAD_00

Thematic frames are cognitive schemata specifying the type of participants involved in the situation described by an event. These participants can be instantiated in the form of arguments in the predicate frames assigned to the lexical units linked to that event.⁸ Therefore, predicate frames are lexical constructs belonging to a particular language, but they are constructed from the interlingual thematic frames located in the ontology. In FunGramKB, every argument found in the predicate frame of a verb must be referenced through co-indexation in the thematic frame of the event to which the verb is linked. Moreover, every argument found in the thematic frame of an event is referenced through co-indexation in the meaning postulate assigned to that event. To illustrate, figure 4 displays both the parenthetic string representation and the XML representation of the meaning postulate of +LOAD_00:

+(e1: +PUT_00 (x1)Agent (x2)Theme (x3)Origin (x4)Goal (f1: +IN_00 ^ +ON_00)Position (f2: (e2: +TAKE_01 (x4)Agent (x2)Theme (x5)Location (x4)Origin (x6)Goal))Purpose)

Figure 4: Meaning postulate of +LOAD_00

⁶ Selectional preferences on an argument are not really stored in predicate frames, but they are part of thematic frames in the FunGramKB ontology. However, since predicate frames are derived from thematic frames, selectional preferences can definitely take part in full-fledged predicate frames.

⁷ In fact, regularities in the semantic distribution of verbs in FunGramKB are not based on syntactic criteria (cf. Levin 1993) but on the cognitive decompositions of events by means of their meaning postulates.

⁸ The difference between thematic frames and predicate frames is partly influenced by the distinction in Construction Grammar (Goldberg 1995) between argument roles and participant roles respectively, where the former are related to the construction and the latter to the frame of a particular verb.
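The co-indexation requirement stated in Section 3 can be checked mechanically on the two parenthetic strings of Figures 3 and 4. The following is only a minimal sketch: the regular expressions and the function names `parse_thematic_frame` and `unreferenced_arguments` are illustrative assumptions, not part of FunGramKB, and the parser handles only the flat argument notation visible in the figures.

```python
import re

# Parenthetic representations copied from Figures 3 and 4.
THEMATIC_FRAME = (
    "(x1: +HUMAN_00 ^ +VEHICLE_00)Agent (x2: +CORPUSCULAR_00)Theme "
    "(x3)Origin (x4: +HUMAN_00 ^ +ANIMAL_00 ^ +VEHICLE_00)Goal"
)
MEANING_POSTULATE = (
    "+(e1: +PUT_00 (x1)Agent (x2)Theme (x3)Origin (x4)Goal "
    "(f1: +IN_00 ^ +ON_00)Position "
    "(f2: (e2: +TAKE_01 (x4)Agent (x2)Theme (x5)Location (x4)Origin (x6)Goal))Purpose)"
)

# Hypothetical pattern for one argument: "(x1: +A ^ +B)Role" or "(x3)Role".
ARG = re.compile(r"\((x\d+)(?::\s*([^()]*))?\)(\w+)")


def parse_thematic_frame(text):
    """Return a list of (variable, role, selectional preferences)."""
    args = []
    for var, prefs, role in ARG.findall(text):
        # Preferences are concepts joined by "^"; an argument may have none.
        concepts = [c.strip() for c in prefs.split("^") if c.strip()]
        args.append((var, role, concepts))
    return args


def unreferenced_arguments(thematic_frame, meaning_postulate):
    """Thematic-frame variables that never occur in the meaning postulate."""
    frame_vars = {var for var, _, _ in parse_thematic_frame(thematic_frame)}
    postulate_vars = set(re.findall(r"\bx\d+\b", meaning_postulate))
    return frame_vars - postulate_vars
```

An empty result from `unreferenced_arguments` means that every argument of the thematic frame is co-indexed in the meaning postulate, which is the case for +LOAD_00.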

For example, the first predicate frame of load matches the morphosyntactic structure of a sentence such as They loaded all their equipment into backpacks, identifying they as the loaders (Agent), equipment as the thing to be loaded (Theme) and backpacks as the target entity where that thing is placed (Goal). However, the semantic burden of the frame is greater when linked to the thematic frame and the meaning postulate of +LOAD_00, which reveal that “they put the equipment into backpacks because they intended to carry it to another place”.⁹ As has been shown, every argument in the predicate frame of a verb is finally integrated in the meaning postulate of its event through the arguments of its thematic frame, which plays a crucial role in both the semantic role-centred and predicate decomposition approaches to the semantic representation of verbs in FunGramKB.

⁹ Indeed, a lexical unit is associated with much more semantic information than is actually shown in its meaning postulate. In FunGramKB, all this underlying cognitive information is revealed through a multi-level process called MicroKnowing (Periñán-Pascual and Arcas-Túnez 2005), where thematic frames also play a key role in the application of the inheritance and inference mechanisms on meaning postulates.

4 The OLIF frame category

Three OLIF data categories are relevant for the construction of FunGramKB predicate frames:

(i) [...] specifies the type of prototypical transitivity of the verb.

(ii) [...] describes the subcategorization of the lexical entry. A slot-grammar approach is taken for the description of syntactic frames. For example, the frame for the English verb try is as follows (McCormick 2002):

[subj, (dobj-opt | dobj-sent-ing-opt | dobj-sent-inf-opt)]

(iii) [...] specifies the preposition that fills a “prepositional phrase” slot.

The main advantage of the FunGramKB model of predicate frame does not lie just in the further specification of the lexical information, but also in its remarkable conceptualist approach. In this respect, two main differences are observed between OLIF frames and FunGramKB predicate frames. Firstly, OLIF frames are semantically underspecified, since no semantic role is assigned to any slot. Secondly, slot fillers in OLIF are language-specific and not formally represented, whereas in FunGramKB selectional preferences are represented by concepts. Selectional preferences should not be lexicalized; rather, they should be part of human beings’ cognitive knowledge. The benefit of this approach is twofold: (i) the use of concepts as the building blocks of predicate frames removes the problem of lexical semantic ambiguity, and (ii) the inferential power of the reasoning engine is more robust if predictions are based on cognitive expectations. The following section highlights the influence of the EAGLES/ISLE standard on the construction of both predicate and thematic frames in FunGramKB.

5 Taking into account EAGLES/ISLE recommendations

EAGLES/ISLE proposes two types of frame: the syntactic frame, which describes the surface structure, and the semantic frame, which describes the deep structure. On the one hand, the syntactic (or subcategorization) frame is expressed as a list of slots, where each slot is described in terms of phrasal realization, grammatical function, restricting features and optionality. Indeed, EAGLES/ISLE proposes a FrameSet to be included in the syntactic entry with the aim of collecting surface regular alternations associated with the same deep structure by explicitly linking the slots of the alternating frames by means of rules. Frames involved in a FrameSet are considered to be at the same level, i.e. no alternating frame has a privileged status from which the other frames are derived through some lexical rule. Surprisingly, the EAGLES/ISLE approach is not as descriptively economical as the traditional approach, where, given two alternating frames, one of them is deemed to be basic and the other derivative. In comparison with the EAGLES/ISLE proposal of syntactic frame, FunGramKB predicate frames make a limited use of restricting features, because only lexical features can be used to refine the information

specified in the arguments: e.g. the preposition that introduces a prepositional phrase. Moreover, the optional realization of an argument is not stated in FunGramKB predicate frames, because it is thought that context can admit the omission of any traditionally obligatory argument. Concerning frame alternations, FunGramKB can reflect all those syntactic phenomena in which no satellite is involved in the shift. By contrast, satellite-oriented alternations such as locative alternations or material/product alternations are disregarded, since satellites are excluded from predicate frames.

On the other hand, the EAGLES/ISLE semantic frame (or argument structure) is defined in the form of a predicate and a list of arguments, which are described in terms of thematic role and semantic preferences. In general, the type of information in the FunGramKB thematic frame matches that of the EAGLES/ISLE semantic frame; however, differences are found in their approaches to the syntax-semantics interface within a multilingual dimension. EAGLES/ISLE preferably recommends a transfer architecture,¹⁰ where monolingual syntactic and semantic frames are put into correlation between L1 and L2; in addition, this approach requires the specification of a set of transformational operations to go from L1 to L2. By contrast, an interlingual model is adopted by FunGramKB, where thematic frames serve as the bridge between L1 predicate frames and those in L2. Transfer rules are not required since thematic frames are not linked to any particular lexicon but to the ontology, which is shared by all languages.

As a result, the FunGramKB interlingual approach gives a more cognitive view to the EAGLES/ISLE semantic frame. Firstly, EAGLES/ISLE recommends that both the predicate and its arguments should be instantiated with language-dependent lexical units, so that complexity in the linkage of the syntactic and semantic frames is dramatically reduced. By contrast, sub-elements in FunGramKB thematic frames are not lexically driven, since predicates and semantic preferences on arguments are chosen from concepts of the ontology. Therefore, the notion of thematic frame is more abstract than that of semantic frame. Secondly, EAGLES/ISLE proposes that the choice of the number of arguments for a predicate should be determined on purely semantic grounds; thus it is possible that (a) a syntactic position cannot be mapped to any semantic argument (i.e. reduced correspondence), or (b) a semantic argument cannot be mapped to any syntactic position (i.e. augmented correspondence). In FunGramKB, any decision on the type and number of arguments in thematic frames is guided by cognitive criteria. However, the FunGramKB architecture is so marked by the conceptualist approach that, for example, reduced correspondences in the syntax-semantics interface are not permitted because predicate frames are built out of their thematic frames, but not conversely.

6 Conclusions and future work

This paper presents the modifications and extensions to the OLIF model of frame by taking into account some of the EAGLES/ISLE recommendations. The result is that FunGramKB is provided with predicate frames in the lexicon (lexical frames) and thematic frames in the ontology (cognitive frames). We have also described how the two most important approaches to lexical semantic representation are fully integrated in FunGramKB: verbs are assigned one or more predicate frames, whose arguments play an active role in the construction of the meaning postulates to which those verbs are linked. In short, the FunGramKB interlingual approach, which gives a more cognitive view to the EAGLES/ISLE semantic frame, contributes to the large-scale development of deep-semantic NLP resources, mainly for natural language understanding.

We intend to develop a more robust characterization of predicate frames by exploring linguistically annotated corpora. Thus, and guided by some other suggestions proposed by EAGLES/ISLE, predicate frames could also include:

(i) an index indicating the frequency of the frame,¹¹
(ii) a wider range of participants, i.e. satellites together with arguments,
(iii) morphosyntactic restrictions on participants, e.g. whether the phrasal realization in a slot must be instantiated via plural word form,
(iv) conditional optionality of participants, i.e. when the absence of a participant excludes or requires the presence of another participant,
(v) lexical collocations as selectional preferences on participants.

¹⁰ Although other approaches to translation are also considered, the EAGLES/ISLE multilingual layer is inspired mostly by the transfer-based model.

¹¹ Frame probability can be particularly useful in natural language generation. For example, the current model of FunGramKB stores a default translation equivalent for every lexical unit, but it could be possible to use statistical information to address the translation of an L1 lexical unit to the most probable equivalent in L2.

Bibliography

Calzolari, N., R. Grishman, and M. Palmer. eds. 2001. Survey of major approaches towards bilingual/multilingual lexicons. ISLE Deliverable D2.1-D3.1. ISLE Computational Lexicon Working Group.

Calzolari, N., F. Bertagna, A. Lenci, and M. Monachini. eds. 2003. Standards and best practice for multilingual computational lexicons and MILE. Deliverable D2.2-D3.2. ISLE Computational Lexicon Working Group.

Calzolari, N., A. Lenci, and A. Zampolli. 2001a. The EAGLES/ISLE computational lexicon working group for multilingual computational lexicons. Proceedings of the First International Workshop on Multimedia Annotation. Tokyo (Japan).

Calzolari, N., A. Lenci, and A. Zampolli. 2001b. International standards for multilingual resource sharing: the ISLE Computational Lexicon Working Group. Proceedings of the ACL 2001 Workshop on Human Language Technology and Knowledge Management. 71-78, Morristown (USA).

EAGLES Lexicon Interest Group. 1993. EAGLES: Computational Lexicons Methodology Task. EAGLES Document EAG-CLWG-METHOD/B.

EAGLES Lexicon Interest Group. 1996a. EAGLES: synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora. A common proposal and applications to European languages. EAGLES Document EAG-CLWG-MORPHSYN/R.

EAGLES Lexicon Interest Group. 1996b. EAGLES: preliminary recommendations on subcategorisation. EAGLES Document EAG-CLWG-SYNLEX/P.

EAGLES Lexicon Interest Group. 1999. EAGLES: preliminary recommendations on lexical semantic encoding. Final report LE34244.

Fillmore, C.J. 1968. The case for case. E. Bach and R.T. Harms. eds. Universals in Linguistic Theory. Holt, Rinehart & Winston, New York, 1-88.

Goldberg, A.E. 1995. Constructions: A Construction Grammar Approach to Argument Structure. The University of Chicago Press, Chicago.

Gruber, J.S. 1965. Studies in Lexical Relations. Doctoral dissertation. MIT.

Jackendoff, R.S. 1972. Semantic Interpretation in Generative Grammar. MIT Press, Cambridge (Mass.).

Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press, Chicago.

Levin, B. 1995. Approaches to lexical semantic representation. D.E. Walker, A. Zampolli, and N. Calzolari. eds. Automating the Lexicon: Research and Practice in a Multilingual Environment. Oxford University Press, New York.

Lieske, C., S. McCormick, and G. Thurmair. 2001. The Open Lexicon Interchange Format (OLIF) comes of age. Proceedings of the Machine Translation Summit VIII: Machine Translation in the Information Age. 211-216, Santiago de Compostela (Spain).

McCormick, S. 2002. The Structure and Content of the Body of an OLIF v.2.0/2.1. The OLIF2 Consortium.

McCormick, S., C. Lieske, and A. Culum. 2004. OLIF v.2: A Flexible Language Data Standard. The OLIF2 Consortium.

Monachini, M., F. Bertagna, N. Calzolari, N. Underwood, and C. Navarretta. 2003. Towards a Standard for the Creation of Lexica. ELRA European Language Resources Association.

Periñán-Pascual, C. and F. Arcas-Túnez. 2005. Microconceptual-Knowledge Spreading in FunGramKB. 9th IASTED International Conference on Artificial Intelligence and Soft Computing, 239-244, ACTA Press, Anaheim-Calgary-Zurich.

Schank, R.C. 1973. Identification of conceptualizations underlying natural language. R.C. Schank and K.M. Colby. eds. Computer Models of Thought and Language. W.H. Freeman, San Francisco, 187-247.

Underwood, N. and C. Navarretta. 1997. Towards a standard for the creation of lexica. Center for Sprogteknologi. Copenhagen.
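As a companion to the slot-grammar frame quoted in Section 4 for the verb try, the sketch below turns that bracketed string into a list of slots. It rests on two assumptions that the paper does not spell out: that "|" separates alternative realizations of one slot, and that the suffix "-opt" marks an optional realization. The function name `parse_slot_frame` is hypothetical and this is not the OLIF reference implementation.

```python
import re


def parse_slot_frame(frame):
    """Parse '[slot, (alt | alt)]' into a list of slots.

    Each slot is a list of alternatives; each alternative is a dict with
    the realization name and an 'optional' flag (assumed from '-opt').
    """
    text = frame.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("frame must be enclosed in square brackets")
    slots = []
    # A slot is either a parenthesised group of alternatives or a bare name.
    for group, name in re.findall(r"\(([^()]*)\)|([\w-]+)", text[1:-1]):
        alternatives = group.split("|") if group else [name]
        slot = []
        for alt in alternatives:
            alt = alt.strip()
            optional = alt.endswith("-opt")
            slot.append({"name": alt[:-4] if optional else alt,
                         "optional": optional})
        slots.append(slot)
    return slots


TRY_FRAME = "[subj, (dobj-opt | dobj-sent-ing-opt | dobj-sent-inf-opt)]"
slots = parse_slot_frame(TRY_FRAME)
```

Under these assumptions the frame for try has two slots: an obligatory subject and an optional object slot with three alternative realizations. No semantic role appears anywhere in the structure, which is the underspecification the paper points out.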

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 17-26

Received 28-01-08, accepted 03-03-08

Aggregation in the In-Home Domain∗
Agregación en el entorno domótico

Eva Florencio, Gabriel Amores, Guillermo Pérez, Pilar Manchón
Grupo de Investigación Julietta
Universidad de Sevilla
Palos de la Frontera, s/n
41004 Sevilla, Spain
{evaflorencio,jgabriel,gperez,pmanchon}@us.es

Resumen: Este artículo describe experimentos realizados con vistas a determinar las preferencias de agregación léxica y sintáctica en inglés y español. El objetivo final es la implementación de dichas estrategias en el módulo de generación de lenguaje natural de un sistema de diálogo multimodal para el entorno domótico.
Palabras clave: Agregación, Generación de Lenguaje Natural, Sistemas de Diálogo

Abstract: This paper describes experiments carried out in order to determine syntactic and lexical aggregation preferences by English and Spanish users. The final goal of this work is the implementation of such strategies in the NLG module of a multimodal dialogue system in the in-home domain.
Keywords: Aggregation, Natural Language Generation, Dialogue Systems

∗ This work has been funded by the Spanish Ministry of Education and Science under the project GILDA: Natural Language Generation for Dialogue Systems (TIN2006-14433-C02-02).

© Sociedad Española para el Procesamiento del Lenguaje Natural

1 Introduction

Describing the state of the different devices in a scenario such as the one in Figure 1, where information can be presented and expressed in multiple ways, involves great complexity for Natural Language Generation (NLG) systems, and even for human beings.

Figure 1: Virtual House Example

Thus, the house in Figure 1 could be described by focussing on those devices which are switched on, or we could group them according to their location or type, as shown in examples 1, 2 and 3, respectively:

(1) The TV, the lights in the sitting room and the light in the kitchen are on.
(2) In the sitting room, the light is on. The light is on in the kitchen and the TV is on in the bedroom.
(3) The lights in the sitting room and kitchen are on, and the TV in the bedroom is on.

Moreover, not only can elements be grouped in several ways, but information can also be aggregated differently. Thus, the state of each individual device could be described by single independent clauses without combining them, as shown in example 4:

(4) The light in the bedroom is off. The blinds in the bedroom are rolled down. The TV in the bedroom is off. The lights in the patio are off ...

Although this way of presenting information is perfectly grammatical, it results in very monotonous and machine-like outputs. An NLG system which is capable of performing different aggregation strategies will produce a more natural output. This paper describes experiments carried out in order to determine aggregation preferences by English and Spanish users. The final goal of this work is the implementation of such strategies in the NLG module of a multimodal dialogue system in the in-home domain.

The paper is organised as follows. Section 2 introduces the process of aggregation and its relevance in natural language generation. Next, section 3 describes the MIMUS multimodal dialogue system in which the aggregation strategies will be implemented. Section 4 outlines the initial working hypothesis to be confirmed by the experimental results. The experiments carried out are described in section 5. Sections 6 and 7 review the results obtained and the conclusions to be drawn from the experiments. Finally, section 8 advances some of the lines of work to be pursued from this point on in the context of the project.

2 Aggregation

A review of the literature on aggregation (Dalianis, 1999; Wilkinson, 1995; Shaw, 1998; Cheng, 2000) clearly shows that there is no agreement on its definition or on where to place it in the generation process. Although thorough attempts have been made to come up with a core definition (Reape and Mellish, 1999) and a standard architecture (Cahill and Reape, 1999), conceptual problems arise. For the purpose of this project, aggregation is conceived of as a process which removes redundant information from a text because it can be inferred or retrieved from linguistic sources (the remaining text), from computational sources (ontology), or pragmatically (using common knowledge).

In this work, we will focus on syntactic aggregation, understanding it as the process of combining phrases by means of syntactic rules, such as coordination, ellipsis or subordination. There are, however, some cases of lexical aggregation covered in this study too. Lexical aggregation is understood as the process of mapping several lexical predicates/lexemes into fewer lexical predicates/lexemes. Pronominalisation is considered a special case of lexical aggregation on the basis of Quirk et al. (1985)’s analysis of pro-form reduction. The theoretical motivation for it is that, indeed, it reduces the number of lexemes or predicates, but it is done by means of a pronoun, unlike other cases of reduction.

We claim that all these phenomena have a linguistic motivation and, consequently, they should be linguistically grounded. As noted by Reape and Mellish (1999), most NLG systems lack a linguistic foundation to account for aggregation strategies.

3 The MIMUS Dialogue System

The context for this project is MIMUS, a multimodal and multilingual dialogue system based on the Information State Update (ISU) Approach (Larsson and Traum, 2000). The system has a symmetric architecture that allows both the input and the output to be presented in graphical, voice or mixed (voice plus graphical) modalities. Besides, as it is a multilingual system, the user may interact dynamically in English and Spanish (Solar et al., 2007).

MIMUS is made up of a series of collaborative agents (Pérez, Amores, and Manchón, 2006) that cooperate and communicate among themselves under the Open Agent Architecture framework (OAA; Martin, Cheyer, and Moran, 1999). The core module is the Dialogue Manager (DM), a collaborative agent that is linked to a Natural Language Understanding (NLU) module and to a Generation Module. Dialogues are driven both by the semantic information provided by the user and by the dialogue expectations generated by the dialogue manager. MIMUS incorporates its own specification language for dialogue structures that allows for the representation of the dialogue history, the control of expectations and the treatment of ambiguity.

The current version of MIMUS contains a hybrid NLG module in which sentence planning takes the form of predefined templates, as described in (Amores, Pérez, and Manchón, 2006). Utterances are elaborated from the mapping of abstract content representations to linguistic ones. In addition, some canned texts are used for common invariable expressions such as Hello, Thank you, or Bye-bye.
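The contrast between the unaggregated description of example 4 and the coordinated subject of example 1 can be sketched in a few lines. This is an illustrative toy, not the MIMUS generator: the fact tuples `(noun_phrase, is_plural, state)` and the function names are assumptions introduced here, and only subject coordination over a shared state is covered.

```python
from collections import defaultdict


def coordinate(phrases):
    """Join noun phrases with commas and a final 'and'."""
    if len(phrases) <= 2:
        return " and ".join(phrases)
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]


def clause(subjects, plural, state):
    """Build one clause; the copula agrees with the (coordinated) subject."""
    verb = "are" if plural or len(subjects) > 1 else "is"
    text = f"{coordinate(subjects)} {verb} {state}."
    return text[0].upper() + text[1:]


def describe_separately(facts):
    """One independent clause per device, in the style of example 4."""
    return " ".join(clause([phrase], plural, state)
                    for phrase, plural, state in facts)


def aggregate_by_state(facts):
    """Coordinate all subjects that share a state, in the style of example 1."""
    groups = defaultdict(list)  # state -> [(noun phrase, is_plural)]
    for phrase, plural, state in facts:
        groups[state].append((phrase, plural))
    return " ".join(
        clause([p for p, _ in members], any(pl for _, pl in members), state)
        for state, members in groups.items()
    )


facts = [
    ("the TV", False, "on"),
    ("the lights in the sitting room", True, "on"),
    ("the light in the kitchen", False, "on"),
]
print(describe_separately(facts))
print(aggregate_by_state(facts))
```

With the three facts above, the aggregated output reproduces the sentence given as example 1, while the separate description yields three monotonous clauses.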

4 Working Hypothesis

The final goal of this work is to implement aggregation strategies in our NLG system. Namely, the final NLG module will be required to produce coordinated messages as well as sentences containing other linguistic phenomena, such as ellipsis, gapping or stripping. For instance, sentence 5 below shows an example of how the system should be able to concatenate the light’s locations, either by juxtaposition or coordination, and produce ellipsis or contribute with cue words such as also.

(5) The lights are on in the sitting room, in the bedroom, and in the kitchen. The hall is also on.

4.1 Location in the overall system

This section discusses where aggregation strategies could be placed in the NLG module of MIMUS. Our first hypothesis is that both syntactic and lexical aggregation in the generation process in MIMUS will be located in the sentence planner. That is, sentence planning templates will be expanded with linguistic information so that they can perform syntactic and lexical aggregation. As explained in the previous section, sentence planning templates map conceptual representations into linguistic ones that will later be passed on to the surface realiser. Therefore, the type of syntactic construction should be specified in the sentence planner so that the surface realiser transforms it into a linguistic unit by means of syntactic rules. The form that terminal nodes will have if lexical aggregation has taken place should also be specified. For instance, some items may have been lexically aggregated by employing a hypernym (e.g., device) instead of their hyponyms (e.g., light, TV, fan and/or blind). The proposed architecture including aggregation is shown in Figure 2.

Figure 2: Proposed location of aggregation strategies in the NLG module

4.2 Linguistic constructions expected

With a view to implementing aggregation in the NLG module of our system, it is important to have some understanding of the grammatical coverage needed in the in-home domain. In addition, the linguistic coverage of the expected texts to be generated is also conditioned by the type of application being implemented (a multimodal dialogue system), and the type of interactions supported (requests about the state of devices in the in-home domain). Taking into account possible questions that users may formulate when interacting with the system, answers may reply to questions about:

a. Quantity: the number of device(s) satisfying a specified condition(s).
b. State: the state (on or off) of the devices will be requested. Two subtypes may be found:
   • Replies about the state of devices (How is the light in the kitchen?)
   • Confirm the state of devices (Is the light in the kitchen on?)
c. Devices: information about which devices are in a specific state or location, e.g. Which devices are on in the house?
d. Location: obtain information about the location of devices, e.g. Where is the TV?

As discussed in Section 1, the information gathered may be grouped according to some common feature, for example, the type of device, the state they are in, or the location. As a first hypothesis, our prediction is that the grouping will mainly be done by location (see example 6 below), perhaps as a consequence of the distribution of the house, which is clearly separated into rooms, as seen in Figure 1.

(6) In the sitting room, the light is on, the fan is off, and the TV is on. In the bedroom, all the devices are on. In the patio, one light is on.

Nevertheless, the description could also hinge on the type of device or on their state. In those situations in which one of these characteristics (state, device or location) is explicitly mentioned in the question, it is foreseen that:

Eva Florencio, Gabriel Amores, Guillermo Pérez, Pilar Manchón

1. If the device is explicitly mentioned, then the grouping is done by location:
Sys: Please, tell me the state of the lights.
Usr: In the sitting room, there is one light on. In the hall, the light is on. In the kitchen, the light is off. In the bathroom, it is on. In the patio, two lights are on and two are off.

2. If the location is explicitly mentioned, then the grouping is done by device type:
Sys: How are the devices in the sitting room?
Usr: There is one light on and the other one is off; the TV is on and the fan is off.

3. If the state is the only feature mentioned, then it is considered a non-specific situation in which the general prediction applies (i.e., grouping will be done by location):
Sys: Which devices are on?
Usr: In the sitting room, only the fan is on. In the bedroom, the light and the TV are on. In the hall, two lights are on.

4.3 Types of aggregation required

Concerning the types of syntactic and lexical aggregation that will be necessary in the MIMUS dialogue system, what follows is a list of the ones that should be implemented. The system should be able to produce them, but also to combine them when necessary. Besides, the insertion of some cue words or discourse markers would also be desirable.

4.3.1 Syntactic Aggregation

The following syntactic aggregation processes are required:

• Paratactic constructions: linking units of the same rank (sentences, clauses or phrases; the latter case will be referred to as constituent coordination). They are used whenever we need to go through a list of references.
– Coordination: [The light in the kitchen is on] and [the blind is rolled up].
– Constituent coordination: [ [The light in the kitchen] and [the light in the garage] ] are on.

• Reduction: This is probably the most common understanding of aggregation in the literature, and one of the most controversial aspects of its definition. Reduction is the process of removing information that can be inferred or retrieved from the remaining text. Different kinds are distinguished, depending on the type of information elided.
– Ellipsis: In our domain, we expect it to be performed mainly when asking about a particular device or when there is only one type of device in a location. (7) The (light in the) patio is on.
– Gapping: It is prone to happen when the main verb is understood, because it has just been mentioned, or when it is a copulative verb. In this domain, the main verb will be the copulative estar/to be in almost every sentence. (8) In the sitting room, the TV is on and the fan (is) off.
– Stripping: It will take place when describing a device that shares the same state as the one previously mentioned. (9) The light is off and the stove [is off] too.

• Multiple aggregation: more than one aggregation process, possibly including lexical aggregation, takes place. For instance, (10) In the patio, there are two lights on and [constituent coor] one [pronominalisation: light] off. The [ellipsis: light in] kitchen is on and [coor] the bathroom [gapping: is] off.

4.3.2 Lexical Aggregation

Reducing the number of lexemes or predicates is required when all the devices in the same location have the same state, for instance: En el dormitorio, todo está apagado/In the bedroom, everything is off; or when describing the same device, such as Hay una luz encendida en el baño y otra en la cocina/There is one light on in the bathroom and another one in the kitchen. Apart from these pronominalisations, we also expect users to make use of other types of lexical aggregation, such as the use of a hypernym instead of its hyponyms, as in The devices are on (instead of The light and hob in the kitchen are on)/Los aparatos están encendidos (instead of La luz y la vitrocerámica están encendidas en la cocina).

4.3.3 Cue Words

Finally, the following cue words may contribute fluency, cohesion and coherence to the output messages: también, así como, tanto... como..., sin embargo, salvo and pero in Spanish; and too, also, both, but and however in English. This will also result in more varied and less repetitive sentences.

5 Experiments

This section describes the experiments carried out in order to corroborate or refute the working hypotheses.

5.1 Goals

The main goal of these experiments has been the study of syntactic and lexical aggregation in the in-home domain, both in English and Spanish. Experiments were carried out in both languages in order to determine, in the first place, if they differ in the way information is aggregated. In doing so, aggregation per se will be studied (how do speakers aggregate?, how often?, in which order?) with the aim of obtaining a pattern which may serve as a model of behaviour for its subsequent implementation in the system.

5.2 Design

The experiment consisted in showing the informants fifteen screenshots of the house in which the devices were in different state configurations. Informants were then asked to describe the state of the devices. The questions to be answered were in the range of possible requests that users can formulate to the system in the real application. Our final goal is to achieve a natural, human-like virtual butler for the house. The scenarios were distributed as follows:

• 3 scenarios asked about quantity.
• 1 about location.
• 4 about devices.
• 2 about devices and location.
• 3 about description.
• 2 asked for confirmation of state.

The users' profile was not specific; the only feature they had in common was that they were naïve, in the sense that they did not have any previous knowledge of the overall functioning of the system. The role of the users was to describe what they saw in a natural manner. In other words, they had to reply as information came to their minds, without elaborating the utterances beforehand. They were provided with some information prior to the experiments, such as the type of devices they may come across (lights, televisions...) as well as the state they may be in (on, off...) and the number of them in each location. There are nineteen devices available in the house, distributed as follows:

Sitting room: two lights, a TV, a fan and a blind.
Bedroom: one light, a TV, a blind and a fan.
Kitchen: one light and the ceramic hob.
Bathroom: one light.
Garage: two lights.
Patio: four lights.
Hall: one light.

The first settings were considered an initial contact with the system, in which only basic information could be obtained, with aggregation being either basic or non-existent. As the experiment moves on, the difficulty increases. Different states, devices and locations are combined to see how the user aggregates information:

• simple enumeration,
• use of cue words, and
• preferences either by location, type of device or state.
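The grouping predictions of Section 4.2, which these settings are meant to test, can be written down as a small decision rule. The sketch below is only an illustration under stated assumptions: `grouping_feature` and `group_facts` are hypothetical names, facts are represented as plain dictionaries, and the case where a question mentions both a device and a location is not covered by the predictions, so treating it like the location case is an assumption made here.

```python
from collections import defaultdict


def grouping_feature(mentioned):
    """Pick the feature used to group a reply, given the features named in the question."""
    if "location" in mentioned:
        # Prediction 2: location mentioned, so group by device type.
        # (Assumption: also applied when device and location are both mentioned.)
        return "device"
    # Predictions 1 and 3, and the general default: group by location.
    return "location"


def group_facts(facts, mentioned):
    """Group fact dictionaries (keys: location, device, state) by the predicted feature."""
    feature = grouping_feature(mentioned)
    groups = defaultdict(list)
    for fact in facts:
        groups[fact[feature]].append(fact)
    return dict(groups)
```

Comparing the groups produced by such a rule with the order in which informants actually present information is one way to check whether the predictions hold.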

5.3 Corpus

The corpus of study was obtained after interviewing twenty-four informants, twelve in Spanish and twelve in English. As mentioned above, since no specific user profile was sought, informants do not share the same characteristics in both languages. Since each informant was presented with 15 screenshots, a corpus of 180 descriptions was obtained for each language.

5.3.1 Spanish Corpus

In the Spanish version of the experiment, twelve users were enrolled. Out of these twelve informants, only four were women; the rest were men. All of them were native speakers of Spanish. Their education level was high: except for one informant, all of them held at least a university degree (holders of Master's degrees, PhD students and PhDs were also interviewed). Their ages ranged from 25 to 44. The average age was 27.1, the median was 26, the mode was 25, and the standard deviation was 5.51.

5.3.2 English Corpus

For the English version, another twelve informants were recruited. As opposed to the Spanish version, the majority of the users were women; only four men took part in the experiment.¹ Two of these informants were bilingual (one in English and French, the other in Tamil and English), but both reside in English-speaking countries. The average education level was degree studies. Except for three users (two with a Master's degree and one with a degree), the rest were college students. Ages ranged from 20 to 62. The average age was 24.3, the median was 21.5, the mode was 20, and the standard deviation was 11.7. The age distribution of the informants in both languages can be seen in Figure 3.

¹ The data survey collection was carried out to determine if personal aspects, such as age, sex, or cultural level, could have an influence on the answers. Since no differences were found, no further comment will be made on these aspects.

Figure 3: Users' age range in years

6 Results

In order to properly analyse the results, we first specified the kind of question being asked. That is, among the questions asking for quantity, for example, we broke down the type of information demanded, determining if users were asked about the number of a specific device with a concrete state or about the number of devices in general, among other possibilities. Then, the different model answers were set and the usage percentages (out of the total answers for that specific kind of question) were given (see (Florencio, 2007) for further details). At the same time, we also analysed the way in which informants grouped information, either by devices, states or location. After that, the lexical and syntactic aggregation found in each of the predominant patterns is pointed out, as well as the cue words used.

6.1 Spanish Results

6.1.1 Types of Syntactic and Lexical Aggregation Performed

The most common syntactic structures employed in Spanish were ellipsis, gapping and coordination (including constituent coordination), which were found in almost every reply. Coordination was the most frequent aggregation strategy (147 times), above all when enumerating. Besides, since there were many questions demanding a description, it took place at least once in almost every reply (either sentence coordination or constituent coordination). Ellipsis was the second most frequent type of aggregation (104 times); it was mostly used when the question specified the device. In such cases, most users elided the device in the reply.

Sys: ¿Qué luces están apagadas? (setting 3)
Usr: Las del salón, una del garage, la cocina, el baño, dos del patio y el dormitorio.

Ellipsis also occurred when describing the state of a particular device.

Sys: Dígame qué luces están encendidas. (setting 6)
Usr: Una (luz) en el salón, una (luz) en el dormitorio, dos (luces) en el garaje.

As expected, users avoided repetition when they deemed the information was inferable. Gapping was also used very frequently (81 times). Some informants omitted the main verb in 90% of their productions. This pattern was used regularly by a few users but not very often by the rest. The reason may reside in the copulative nature of the verb estar.

Sys: ¿Me puede describir el estado de todos los dispositivos (luces, aparatos y persianas)? (setting 5)
Usr: En el salón, las dos luces apagadas, televisión apagada, y ventilador en movimiento, la persiana del salón bajada, la luz de la entrada apagada. Las dos luces del garaje apagadas. La luz de la cocina encendida, la vitrocerámica encendida...

Stripping was not used very frequently: only a couple of users performed it (4 times in total, an average of twice per user). When used, it occurred when a location had more than one device, especially two, and both of them were in the same state, for example: La luz de la cocina está encendida y la vitrocerámica (está encendida) también.

Concerning lexical aggregation, todo/a, ninguno/a, nada (15 times), and otro/a (16 times) were often used when describing the same state or when all the devices shared the same state. Otro/a was often employed when enumerating the same device in different locations. No use of the hypernym dispositivo(s), for instance, was made to refer to all lights, blinds, and so on; instead, todo/ninguno was preferred.

6.1.2 Use of cue words

The most commonly used cue word was también (15 times), an average of at least once per user. It was mostly used in enumeration. Some users alternated it with other cue words such as así como (1 time) or tanto... como... (2 times). Other markers used were adversative conjunctions, such as sin embargo (1 time), pero (1 time), salvo (1 time), and some distributive ones: uno... otro... (10 times). The words sólo and el resto were used once each.

6.2 English Results

6.2.1 Types of Syntactic and Lexical Aggregation Performed

An analysis of the syntactic and lexical aggregations performed on the English productions was carried out. With respect to syntactic aggregation, the most frequent strategies were again ellipsis and coordination. Coordination, both sentence and constituent coordination, was employed in almost every utterance, adding up to a total of 151 times. This phenomenon was employed when listing the types of devices and/or their locations. Coordination was mostly found in the settings in which a description was required.

Concerning reduction, ellipsis was highly employed as well. Ellipsis was realised 72 times in all. In the majority of cases, the type of device was the element elided in the sentence, particularly when it appeared in the question in hand. Another form of reduction used was gapping, which appeared 10 times. Only a couple of informants generally omitted the main verb in the sentence, even though it was a copulative verb. No other syntactic strategies were found.

With regard to lexical aggregation, we should point out the use of pronominal forms such as one(s) (16 times), other/another (5 times), everything (5 times) and nothing (2 times). They appeared mostly in descriptions, such as Everything is off in the sitting room or The fan is off in the bedroom, but the one in the sitting is on. Finally, all (7 times) and both (15 times) were also employed in the descriptions when the same state applied to all the devices, either in the house or in a specific location: All of the lights are on or Both of the lights are off in the sitting room.

6.2.2 Use of cue words

It should be pointed out that English informants did not make use of many cue words in their replies. The most common cue words used were also (7 times) and the adversative but (9 times), which were used when enumerating or describing the state of all the devices in the house. Other additive phrases employed were as well as (2 times), so is... (3 times), or as is... (1 time). For instance, The light in the living room is on, so is the one in the patio. As for other adversative phrases, the following ones were also mentioned: except for (1 time), all the rest (1 time), or all the other (3 times). An example would be The light in the kitchen is on, all the rest are off. The highly formal as far as was also used once when listing all the devices in the house (e.g. As far as TV's, there are two). The adverb only was employed just once to make a contrast: On the patio, only one of the lights is off.
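To put the raw counts reported in Sections 6.1.1 and 6.2.1 on a common footing, they can be normalised by the 180 descriptions collected per language (12 informants × 15 settings, Section 5.3). The sketch below does only that arithmetic; the `COUNTS` table and the function names are illustrative, and the English stripping count of 0 is inferred from the statement that no other syntactic strategies were found.

```python
# Descriptions per language: 12 informants x 15 settings (Section 5.3).
DESCRIPTIONS_PER_LANGUAGE = 12 * 15

# Occurrences reported in Sections 6.1.1 (Spanish) and 6.2.1 (English).
# English stripping = 0 is inferred from "No other syntactic strategies were found".
COUNTS = {
    "coordination": {"Spanish": 147, "English": 151},
    "ellipsis": {"Spanish": 104, "English": 72},
    "gapping": {"Spanish": 81, "English": 10},
    "stripping": {"Spanish": 4, "English": 0},
}


def per_description(strategy, language):
    """Average number of occurrences of a strategy per description."""
    return COUNTS[strategy][language] / DESCRIPTIONS_PER_LANGUAGE


def reduction_total(language):
    """Total occurrences of the three reduction strategies."""
    return sum(COUNTS[s][language] for s in ("ellipsis", "gapping", "stripping"))


if __name__ == "__main__":
    for strategy in COUNTS:
        es = per_description(strategy, "Spanish")
        en = per_description(strategy, "English")
        print(f"{strategy:12s} Spanish {es:.2f}  English {en:.2f}")
```

Under these figures, coordination occurs at a similar rate in both corpora, while the reduction strategies together add up to 189 occurrences in Spanish against 82 in English.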

7 Comparison and conclusions

By and large, the predictions and working hypotheses advanced in Section 4.2 were correct.

Grouping of information. With regard to the grouping of information, it was clearly done by location in both English and Spanish. This can be considered a general preference on how to present the data, as can be seen from Figure 4.²

Figure 4: Preference for starting descriptions with location

² As we previously mentioned, this might be due to the graphical interface of the house.

Information was not only grouped by location, though; it was also presented in a hierarchical way. This hierarchy was not the same for both languages. In Spanish, the most common way to present the data follows a [State – Device – Location] pattern (Está encendida la luz de la cocina), while in English the most popular pattern was [Device – State – Location] (The light is on in the kitchen).

Dialogue alignment. Another interesting result from the experiment was that sentence structuring in the replies aligned with the structure of the question. In both languages users were prone to reply following a pattern similar to the one employed in the question whenever a full sentence was provided. In both cases the end-weight and end-focus principles applied.

Long vs short answers. However, concerning the patterns established for the several questions, it should be highlighted that different models were obtained for English and Spanish. English speakers tended to construct full sentences, while Spanish speakers were more economical and provided only the minimum information requested. For example, 53% of the Spanish informants replied to the quantity questions by just giving the number of devices, while only around 11% did so in English. Another divergence is found in the patterns obtained for the reply location scenario. Nearly 70% of the Spanish users just provided the location, as opposed to 75% of English speakers, who provided full sentences (The lights are on in the sitting room, in the bathroom, and in the hall). This shows a preference for short incomplete sentences in Spanish and full sentences in English.

Syntactic aggregation. Another conclusion related to the preference for short or full sentences is the type of aggregation performed. As illustrated in Figure 5, Spanish users used more aggregation strategies than English informants, although not many aggregation strategies have been observed in the in-home domain overall. Apart from coordination, which was frequently employed in both languages, we could find other forms of syntactic aggregation in the Spanish corpus, such as ellipsis, gapping, and a few cases of stripping. Nevertheless, in the English data just ellipsis was found, and it was not commonly used. No other types of reduction were observed.

Figure 5: Syntactic aggregation performed

Lexical aggregation. As far as lexical aggregation is concerned, the results were very similar in English and Spanish. Pronominalisation was the most frequent strategy in both languages. We should emphasise the use of pronominalisation forms such as todo/a, ninguno/a, nada, otro/a in Spanish, and one(s), other/another, everything or nothing in English.

Use of cue words. Finally, with respect to cue words, no remarkable differences can be found between the two languages. Also/también obtained the highest frequency in both languages. The only point worth mentioning is that in English fewer cue words seem to have been employed, but the ones employed were more varied. However, the difference is not significant.

Is aggregation language-dependent? Finally, although a much broader analysis should be performed, a comparison of the corpora in English and Spanish seems to suggest that aggregation is language-dependent rather than language-independent. Besides, the enormous differences found between the patterns established in each language, plus the different aggregation strategies employed, open the possibility of relocating the aggregation process to a later stage (i.e., not in the Sentence Planner, but in the Surface Realiser), or of considering that the generation module as a whole should be language-dependent.

8 Future Work

At this point in the project, a new specification language is being created in collaboration with the TAP (a Text Arranging Pipeline) project (Gervás, 2007) in an effort to create a set of interfaces which define generic functionality for a pipeline of tasks oriented towards natural language generation. The DTAC representation obtained by our dialogue system is currently being integrated with the TAP system so that different aggregation strategies for both languages can be compared on the basis of the results obtained by the experiments. In addition, the new integrated prototype will incorporate preference strategies for lexical alignment (e.g. if a user preferred the term bombilla instead of luz to refer to the lights in the house, the system should align accordingly in the reply) and for fragmentary vs. verbose replies depending on the context.

References

Amores, G., G. Pérez, and P. Manchón. 2006. Reusing MT Components in Natural Language Generation for Dialogue Systems. Procesamiento del Lenguaje Natural, 37:215–221.

Cahill, L. and M. Reape. 1999. Component tasks in applied NLG Systems. Technical report, Information Technology Research Institute Technical Report Series.

Cheng, H. 2000. Experimenting with the Interaction between Aggregation and Text Structuring. In Proceedings of the ANLP-NAACL 2000 Student Research Workshop, pages 1–6, Seattle, Washington, USA.

Dalianis, H. 1999. Aggregation in Natural Language Generation. Computational Intelligence, 15(4):384–414, November.

Florencio, E. 2007. A study on syntactic and lexical aggregation in the in-home domain. Master's thesis, University of Seville, Spain, May.

Gervás, P. 2007. TAP: a Text Arranging Pipeline. Technical report, Natural Interaction based on Language Research Group, Facultad de Informática, Universidad Complutense de Madrid, May. Working draft.

Larsson, S. and D. Traum. 2000. Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, 6(3-4):323–340.

Martin, D. L., A. J. Cheyer, and D. B. Moran. 1999. The Open Agent Architecture: A Framework for Building Distributed Software Systems. Applied Artificial Intelligence, 13(1-2):91–128.

Pérez, G., G. Amores, and P. Manchón. 2006. A Multimodal Architecture for Home Control by Disabled Users. In Proceedings of IEEE/ACL Workshop on Spoken Language Technology (SLT), pages 134–137, Aruba, December.

Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik. 1985. A Comprehensive Grammar of the English Language. Longman Group Limited.

Reape, M. and C. Mellish. 1999. Just what is aggregation anyway? In Proceedings of the 7th European Workshop on Natural Language Processing, pages 20–29, Toulouse (France), May.

Shaw, J.C. 1998. Clause Aggregation Using Linguistic Knowledge. In Proceedings of the 9th International Workshop on Natural Language Generation, pages 138–147, Niagara-on-the-Lake, Canada, August.

Solar, C. Del, G. Pérez, E. Florencio, D. Moral, G. Amores, and P. Manchón. 2007. Dynamic Language Change in MIMUS. In Proceedings of the Eighth Interspeech Conference (INTERSPEECH 2007 Special Session: Multilingualism in Speech and Language Processing), pages 2141–2144, Antwerp, Belgium, August 27-31.

Wilkinson, J. 1995. Aggregation in natural language generation: Another look. Technical report, Co-op work term report, Department of Computer Science, University of Waterloo, September.


Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 27-34

received 31-01-08, accepted 03-03-08

Detección de fármacos genéricos en textos biomédicos
Detecting generic drugs in biomedical texts

Isabel Segura Bedmar, Paloma Martínez, Doaa Samy
Universidad Carlos III de Madrid
Avda. Universidad 30, 28911 Leganés, Madrid
[email protected] [email protected] [email protected]

Abstract: This paper presents a system for the recognition and classification of generic drug names in biomedical texts.¹ The system combines information from the UMLS Metathesaurus² with nomenclature rules for generic drugs recommended by the United States Adopted Names (USAN) Council,³ which allow drugs to be classified into pharmacological families. The initial hypothesis is that the USAN rules can detect candidate drug names that are not included in UMLS (version 2007AC), thereby increasing the coverage of the system. Using UMLS alone, the system achieves 100% precision and 97% recall on a collection of 1481 abstracts of scientific articles from PubMed. Combining the USAN rules with UMLS slightly improves the coverage of the system.
Keywords: Biomedical Named Entities, Generic Drugs, UMLS.

¹ This work has been partially funded by the projects FIT-350300-2007-75 (Interoperabilidad basada en semántica para la Sanidad Electrónica) and TIN2007-67407-C03-01 (BRAVO: Búsqueda de respuestas avanzada multimodal y multilingüe).
² http://www.nlm.nih.gov/research/umls/
³ http://www.ama-assn.org/ama/pub/category/2956.html

ISSN 1135-5948 © Sociedad Española para el Procesamiento del Lenguaje Natural

1 Introduction

This work is a first step towards a system for the automatic extraction of drug interactions from biomedical texts. An interaction occurs when the effects of one drug are modified by the presence of another drug, or of a food, a drink or some environmental chemical agent (Stockley, 2004).

The consequences can be harmful if the interaction increases the toxicity of the drug. For example, patients receiving warfarin may start to bleed if they are given azapropazone or phenylbutazone without a reduction of the warfarin dose. Likewise, a reduction in a drug's efficacy caused by an interaction can be just as dangerous: patients receiving warfarin who are given rifampicin will need a higher dose of the former to maintain adequate anticoagulation. On some occasions, however, the combined use of drugs can be beneficial. The combination of antihypertensive drugs and diuretics achieves antihypertensive effects that would not be obtained by administering either drug alone (Stockley, 2004).

The more drugs a patient takes, the greater the likelihood of an adverse interaction. A hospital study found that the rate was 7% among patients taking between 6 and 10 drugs, but rose to 40% among those taking between 16 and 20 drugs, a disproportionate increase (Smith et al., 1969).

Researchers and health professionals use various resources, such as online databases and tools,⁴,⁵ to identify and prevent drug interactions. However, the biomedical literature is the best way of keeping up to date with information on new interactions. Recent advances in biomedicine have led to a dizzying growth in the number of scientific publications. PubMed,⁶ an online search engine for articles in the MEDLINE database, holds more than 16 million abstracts. Researchers and health professionals are overwhelmed by this avalanche of information. It is therefore essential to develop systems that ease knowledge extraction and provide efficient access to information in the biomedical domain. Natural language processing resources and technologies can contribute to this.

The recognition and classification of biomedical terms is a crucial phase in the development of this kind of system. It is impossible to understand an article without precise identification of its terms (genes, proteins, active ingredients, chemical compounds, etc.). Detecting generic drug names is a complex task because of the difficulties involved in processing pharmacological text. New drugs are introduced daily while others are withdrawn. Terminological resources, although frequently updated, cannot keep pace with this constantly changing terminology. Systems able to detect new drugs automatically can thus contribute to the automatic updating of their knowledge bases.

The system presented in this article aims to recognise and classify generic drug names by combining information from UMLS with a module that implements the rules recommended by the USAN Council for naming pharmacological substances. This phase is an essential preliminary step for the automatic extraction of drug interactions from the biomedical literature. Combining both resources yields high precision and coverage. UMLS guarantees precision, while the rules extend coverage of the domain by detecting new drug names not yet registered in UMLS. In addition, the rules allow a more specific classification of drugs into pharmacological families, which UMLS cannot provide. We believe that a drug's family can be a valuable clue when detecting drug interactions in biomedical texts. Drugs in the same family share a base chemical structure; therefore, if an interaction is known for a given drug, it is quite likely that another drug of the same family will exhibit the same interaction.

The article is organised as follows: Section 2 reviews work on biomedical entity recognition. Section 3 briefly describes the main information resources used in the system: UMLS and the USAN rules. Section 4 describes the system architecture and the corpus used. The evaluation is presented in Section 5. Finally, Section 6 includes some conclusions and future work.
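The combination of dictionary lookup and naming rules described in this introduction can be sketched in a few lines. The example below is an illustration under stated assumptions, not the system presented in the paper: `KNOWN_DRUGS` is a toy stand-in for a UMLS lookup, `USAN_STEMS` lists only a handful of well-known USAN stems rather than the paper's rule set, and `classify_token` is a hypothetical helper.

```python
# Toy stand-in for a UMLS Metathesaurus lookup (assumption: a real system
# would query the Metathesaurus rather than a hard-coded set).
KNOWN_DRUGS = {"warfarin", "rifampicin", "phenylbutazone"}

# A few well-known USAN stems and the families they indicate.
# Illustrative subset only; the paper's actual rules are not reproduced here.
USAN_STEMS = {
    "cillin": "penicillins",
    "olol": "beta-blockers",
    "pril": "ACE inhibitors",
    "sartan": "angiotensin II receptor antagonists",
    "vastatin": "HMG-CoA reductase inhibitors",
    "azepam": "benzodiazepine-type anxiolytics",
}


def classify_token(token):
    """Return a candidate drug record for a token, or None if nothing matches.

    The dictionary supplies known names; the suffix rules propose names
    missing from the dictionary and assign a pharmacological family.
    """
    word = token.lower().strip(".,;:()")
    family = next(
        (fam for stem, fam in USAN_STEMS.items() if word.endswith(stem)), None
    )
    in_dictionary = word in KNOWN_DRUGS
    if in_dictionary or family:
        return {"name": word, "in_dictionary": in_dictionary, "family": family}
    return None
```

Suffix matching alone will also fire on ordinary words that happen to end in a stem, which is why such rules tend to raise coverage at some cost in precision unless they are filtered.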

⁴ http://www.micromedex.com/products/
⁵ http://www.ashp.org/ahfs/index.cfm
⁶ http://www.ncbi.nlm.nih.gov/sites/entrez/

2 Related work

The identification of genes, proteins, chemical compounds, drugs, diseases, etc., is crucial for facilitating information retrieval and the identification of relations between those entities, such as drug interactions.


entities by means of lexical and orthographic clues, although morphosyntactic information is also often used. One of their main drawbacks is the high cost in time and effort involved in developing the rules. Moreover, adapting them to recognise other kinds of entities is complex. (Ananiadou, 1994) describes the formation of term patterns with a grammar that combines internal elements such as affixes, roots, and Greek and Latin letters. The PROPER system, developed by (Fukuda et al., 1998), uses lexical patterns and orthographic features to detect protein names, achieving 94.7% precision and 98.8% recall in a small experiment. The PASTA system uses a context-free grammar for protein recognition. Its rules are based on lexical and morphological properties of the domain terms. The system achieves 84% precision and 82% recall in recognising 12 classes of proteins (Gaizauskas et al., 2003). (Narayanaswamy et al., 2003) combines roots and suffixes typical of the chemical domain with contextual information, that is, information about the words surrounding the entity. There is also work on adapting general-purpose entity recognisers, such as the one presented in (Hobbs, 2002), to detect protein names.

Other approaches combine dictionaries and rules to mitigate the problem of terminological variability and thus achieve higher coverage. (Chiang and Yu, 2003) propose a robust term recognition system based on rules and on the Gene Ontology.⁸ The rules consider possible multiword variations generated by permutations and by the insertion or deletion of individual words.

Fewer systems have used supervised learning, mainly because of the lack of annotated corpora in the biomedical domain. Some of these machine-learning-based systems are presented below. In (Zhan et al., 2004), a hidden Markov model was adapted to recognise entities and abbreviations in the domain

El reconocimiento de entidades intenta encontrar términos de interés en el texto y clasificarlos dentro de categorías predefinidas como genes, compuestos químicos, fármacos, etc. El problema consiste en determinar dónde empieza y termina cada término, y la asignación de la clase correcta. Muchos trabajos se han centrado en la identificación de genes (Tanabe y Wilbur, 2002) y proteínas (Fukuda et al., 1998). Menor atención ha recibido la detección de otro tipo de entidades como las sustancias químicas (Wilbur et al., 1999), fármacos (Rindflesch et al., 2000) o enfermedades (Friedman et al., 2004). Se han empleado diferentes enfoques para tratar el problema del reconocimiento de entidades biomédicas: reglas, diccionarios, aprendizaje automático, métodos estadísticos, y una combinación de las distintas técnicas. Los métodos basados en diccionarios utilizan recursos terminológicos para localizar las ocurrencias de los términos en el texto. Su principal desventaja es que no son capaces de tratar adecuadamente la variabilidad terminológica. Normalmente, un mismo concepto puede recibir distintos nombres, y los diccionarios, en numerosas ocasiones, no recogen esta variabilidad. (Hirschman et al, 2002) utiliza patrones para localizar genes en una lista extensa obtenida de la base de datos FlyBase. Muchos nombres de genes comparten su representación léxica con palabras comunes en el idioma inglés (ej: an, by, can, for). Esta homonimia es la responsable de la baja precisión del sistema: un 2% en artículos completos y un 7% en resúmenes. La cobertura varía de 31% en resúmenes a un 84% en artículos completos. En (Tsuruoka y Tsujii, 2003) se describe un método para el emparejamiento aproximado de cadenas en un diccionario de proteínas. Además, este método utilizaba un clasificador Bayesiano entrenado sobre el corpus GENIA7, para filtrar los falsos positivos. 
Este filtrado mejora la precisión (73.5%), al excluir ciertos términos detectados como proteínas según el diccionario, pero que realmente no lo son en el texto. El sistema consigue una cobertura del 67.2%. El principal enfoque de los sistemas basados en reglas consiste en el desarrollo de heurísticas o gramáticas que describan las estructuras comunes de los nombres de determinadas

8 7

http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/

29

http://www.geneontology.org/

Isabel Segura-Bedmar, Paloma Martínez, Dooa Samy

biomédico, mediante el uso de elementos ortográficos, morfológicos, morfosintácticos y semánticos. (Collier y Takeuchi, 2004) utilizan el clasificador Support Vector Machines (SVM) para detectar entidades biomédicas. Los elementos utilizados fueron ortográficos y etiquetas morfosintácticas. Los experimentos demostraron que el uso de información morfosintáctica provocaba un ligero descenso en los resultados. En (Lee et al, 2004), el reconocimiento se divide en dos fases: identificación y clasificación. Esta división permite una selección más apropiada de los elementos utilizados para el entrenamiento del algoritmo SVM en cada una de las fases. El sistema descrito en este artículo combina el uso de reglas y diccionario. Las reglas están basadas en las recomendaciones del consejo USAN para nominar sustancias farmacológicas. Además, la utilización de estándares oficiales, como es el caso de las reglas USAN, garantiza cierta precisión comparada con la que podría obtenerse al aplicar simples heurísticas.

3 System-specific resources

The system uses two sources of information to identify and classify drug names in biomedical texts: the UMLS Metathesaurus and the USAN Council's recommendations for naming generic drugs. Both are described below.

3.1 UMLS Knowledge Sources (UMLSKS)

The Unified Medical Language System (UMLS) is a knowledge base that integrates several resources. One of its main purposes is to facilitate the development of automatic natural language processing systems in the biomedical domain. UMLS has three main resources: the Metathesaurus, the Semantic Network and the SPECIALIST Lexicon.

The Metathesaurus largely solves the problem of terminological variability, because it integrates information from more than 60 biomedical vocabularies and classifications. It is organized around concepts: a concept groups together the possible names that a single meaning can take in the medical literature. In the system presented here, the UMLS Metathesaurus makes it possible to identify drug names in the text.

The Semantic Network consists of 135 semantic types and 54 relationships that represent important relations in the biomedical domain. Figure 1 shows part of the Semantic Network. Every UMLS concept is assigned at least one semantic type. Because of its broad scope, the Semantic Network allows a wide range of terminology to be categorized, which favors the development of natural language processing systems in multiple biomedical domains. For the pharmacological domain, however, this categorization is insufficient. In UMLS, generic drugs are classified as “Pharmacologic Substance” or “Antibiotic”. The type “Clinical Drug” refers to brand names and is outside the scope of our study. While antibiotics are classified under the type “Antibiotic”, for the remaining pharmacological families (analgesics, antivirals, anticoagulants, anti-inflammatories, etc.) UMLS provides too general a classification, labeling them all as “Pharmacologic Substance” without distinguishing between families.

Figure 1. A subset of the UMLS Semantic Network

The third UMLS resource, the SPECIALIST Lexicon, comprises numerous biomedical terms and contains syntactic, morphological and orthographic information.

These resources can be accessed in three ways: through a client-server interface using a standard web browser, through a program that uses the UMLSKS API, or through a TCP/IP interface. It is also possible to work with a local copy of the UMLS resources, distributed free of charge by the U.S. National Library of Medicine (NLM, http://www.nlm.nih.gov/). In the architecture described here, a Java program embedding the UMLSKS API was implemented to access the information on the remote server.

3.2 Naming rules recommended by the USAN Council

A drug has three names: a chemical name based on its structure; a generic (non-proprietary) name, which is the drug's official name throughout its existence; and a brand name, which is the name given by the pharmaceutical company that markets it. Selecting a name for a new drug is a complex process. In the United States, the United States Adopted Names (USAN) Council is the institution responsible for creating and assigning a generic name to a new drug. The following aspects are considered when selecting a name: patient safety, ease of pronunciation, absence of conflicts with trademarks, and usefulness for health professionals.

Current drug-naming practice relies on affixes. These affixes classify drugs according to their chemical structure, indication or mechanism of action. For example, the name of an analgesic may contain one of the following affixes: -adol, -adol-, -butazone, -fenine, -eridine and -fentanil. In this work, drug classification is based on the affixes recommended by USAN (footnote 10). The list used is not exhaustive, since it includes neither all the affixes approved by the USAN Council nor those recommended by other organizations. Table 1 shows some of the affixes used in the classification.

The categorization into pharmacological families provided by the affixes is more specific and detailed than the one provided by the UMLS semantic types. Moreover, the affixes make it possible to identify drug names that have not yet been registered in the UMLS Metathesaurus.

Affix           Definition                                                     Example
-ast            antiasthmatics/antiallergics
-cromil         antiallergics (cromoglicic)                                    nedocromil
-atadine        tricyclic antiasthmatics                                       olopatadine
-tibant         antiasthmatics (bradykinin antagonists)                        icatibant
-adol, -adol-   analgesics (mixed opiate receptor agonists/antagonists)        tazadolene
-butazone       anti-inflammatory analgesics                                   mofebutazone
-eridine        analgesics (meperidine)                                        anileridine
-fenine         analgesics (fenamic)                                           floctafenine
-fentanil       narcotic analgesics                                            alfentanil
-adox           antibacterials (quinoline dioxide)                             carbadox
-ezolid         oxazolidinone antibacterials                                   eperezolid
-mulin          antibacterials (pleuromulin)                                   retapamulin
-penem          antibacterial antibiotics                                      tomopenem
-oxacin         antibacterials (quinolone)                                     difloxacin
-planin         antibacterials (Actinoplane)                                   mideplanin
-prim           antibacterials (trimethoprim type)                             ormetoprim
-pristin        antibacterials (pristinamycin)                                 quinupristin
-arol           anticoagulants (dicumarol)                                     dicumarol
-irudin         anticoagulants (hirudin)                                       desirudin
-rubicin        antineoplastic antibiotics (daunorubicin)                      esorubicin
-fungin         antifungal antibiotics                                         kalafungin

Table 1: Some affixes used by USAN
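The affix-based classification described in this section can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' Java module: the dictionary holds a handful of rows copied from Table 1, and the names `AFFIX_FAMILIES`, `affix_matches` and `classify` are hypothetical. It assumes that a leading hyphen marks a suffix and that hyphens on both sides mark an infix.

```python
from typing import Dict, List

# Illustrative subset of Table 1 (affix -> pharmacological family).
# Assumption: "-xyz" is a suffix, "-xyz-" is an infix, "xyz-" is a prefix.
AFFIX_FAMILIES: Dict[str, str] = {
    "-cromil": "antiallergics (cromoglicic)",
    "-adol": "analgesics (mixed opiate receptor agonists/antagonists)",
    "-adol-": "analgesics (mixed opiate receptor agonists/antagonists)",
    "-fentanil": "narcotic analgesics",
    "-mulin": "antibacterials (pleuromulin)",
    "-penem": "antibacterial antibiotics",
    "-irudin": "anticoagulants (hirudin)",
}


def affix_matches(name: str, affix: str) -> bool:
    """Return True if `name` contains `affix` in the position its hyphens indicate."""
    stem = affix.strip("-")
    if affix.startswith("-") and affix.endswith("-"):
        # Infix: the stem must occur strictly inside the name.
        return stem in name[1:-1]
    if affix.startswith("-"):
        return name.endswith(stem)
    if affix.endswith("-"):
        return name.startswith(stem)
    return stem in name


def classify(name: str) -> List[str]:
    """Return the possible families of a drug name (empty list if no affix matches)."""
    lowered = name.lower()
    families: List[str] = []
    for affix, family in AFFIX_FAMILIES.items():
        if affix_matches(lowered, affix) and family not in families:
            families.append(family)
    return families
```

A name may match several affixes, so the function returns a list rather than a single family; a name with no known affix yields an empty list.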

10 http://www.ama-assn.org/ama/pub/category/4782.html

4 System description

We worked with a collection of 1481 abstracts of scientific articles from PubMed, retrieved by searching for the names of pharmacological families such as “antiallergics”, “antiasthmatics”, “analgesics”, “antibacterials”, “anticoagulants”, etc. The collection was obtained with a web crawler implemented to retrieve the abstracts.

The system architecture (Figure 2) consists of three modules that run sequentially: (1) a module that processes the abstracts, (2) a module that identifies the terms that are drugs, and (3) the module responsible for classification and for detecting new drugs not yet registered in UMLS. For each abstract in the collection, each module outputs an XML file with the information it has obtained.

Figure 2. System architecture

First, the abstracts are split into sentences, tokenized and part-of-speech tagged. This module uses the Sentence Splitter, Tokenizer and POS Tagger components of the GATE framework (http://www.gate.ac.uk/). Part-of-speech tagging is needed to identify the tokens whose category is noun (common, proper or plural). Each of these nouns is then looked up in WordNet in order to discard the nouns that are not specific to the biomedical domain, since WordNet is a general-purpose lexicon. The initial list of candidates consists of the nouns not found in WordNet.

The second module looks up in the UMLS Metathesaurus each term that was not found in WordNet. The search is implemented with the Java API provided by UMLSKS, which allows information to be queried on its remote server. The server returns an XML file with the search results. If one or more concepts are found, the module processes the response and retrieves their possible semantic types. If none of them is “Pharmacologic Substance” or “Antibiotic”, the term belongs to another type of entity (genes, proteins, etc.). Although such entities are outside the scope of this study, the information about their semantic types, together with the concept name, language, source vocabulary and UMLS identifier, is recorded in the XML file that the module outputs. If, on the other hand, one of the semantic types is “Pharmacologic Substance” or “Antibiotic”, the term is tagged as a drug, together with the rest of the information obtained from UMLS. Terms not found in UMLS are tagged as candidates for new drugs not registered in UMLS.

Finally, the module that implements the USAN Council's recommendations classifies the terms tagged as drugs by the previous module. For each term, the module returns the list of affixes contained in the name, and thus the list of its possible pharmacological families.

Some affixes are too ambiguous, such as -ac, -vin-, -vir-, -vin, -mab-, -kin, glil-, -dil, -sal-, etc. Such affixes could lower the system's precision by classifying terms into incorrect families. For this reason, the implementation of the module discards affixes with fewer than three letters. Clearly, the classification is not exhaustive, both because these ambiguous affixes were removed and because the list considered initially was not complete. In addition, in order to detect possible candidates for new drugs not yet registered in the Metathesaurus, the module processes the set of terms that were not found in UMLS. As discussed in the next section, the number of new candidates detected exclusively by the rules is very small.
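The decision logic of the first two modules can be sketched as follows. This is a rough illustration under strong assumptions, not the system itself: the real pipeline relies on GATE, WordNet and the remote UMLSKS server, whereas here `GENERAL_LEXICON` and `THESAURUS` are hypothetical in-memory stand-ins filled with toy values (they are not real WordNet or UMLS output), and `label_terms` is an invented name.

```python
from typing import Dict, List, Set

# Semantic types that mark a term as a generic drug, as named in the text.
DRUG_TYPES: Set[str] = {"Pharmacologic Substance", "Antibiotic"}

# Hypothetical stand-in for WordNet: general-domain nouns to be discarded.
GENERAL_LEXICON: Set[str] = {"patient", "study", "dose"}

# Hypothetical stand-in for the Metathesaurus: term -> semantic types (toy values).
THESAURUS: Dict[str, Set[str]] = {
    "alfentanil": {"Organic Chemical", "Pharmacologic Substance"},
    "cholesterol": {"Lipid"},
}


def label_terms(nouns: List[str]) -> Dict[str, str]:
    """Label each noun as 'drug', 'other' or 'candidate'; drop general-domain nouns."""
    labels: Dict[str, str] = {}
    for noun in nouns:
        term = noun.lower()
        if term in GENERAL_LEXICON:
            continue  # module 1: found in the general lexicon, so not a candidate
        types = THESAURUS.get(term)
        if types is None:
            labels[term] = "candidate"  # not in the thesaurus: possible new drug
        elif types & DRUG_TYPES:
            labels[term] = "drug"       # at least one drug semantic type
        else:
            labels[term] = "other"      # registered, but another kind of entity
    return labels
```

Terms labeled "drug" would then go to the affix classifier, and terms labeled "candidate" would be checked against the same affixes to propose new drugs.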

5 System evaluation

Once the 1481 abstracts had been processed and the general-domain nouns (those found in WordNet) discarded, the initial list of candidates contained 10,743 tokens. Each of these terms was looked up in the UMLS Metathesaurus. Of these, 10.5% (1,129) are registered in the Metathesaurus, but none of their semantic types is “Pharmacologic Substance” or “Antibiotic”; that is, they belong to other semantic types such as “Organic Chemical”, “Lipid”, “Carbohydrate”, etc. As mentioned above, this subset is outside the scope of this study. 75.4% (8,103) of the 10,743 initial candidates correspond to pharmacological substances or antibiotics. The module that implements the USAN rules manages to classify 35% (2,893) of them. Table 2 shows part of the distribution of pharmacological families in the collection of abstracts.

Family                            Affixes                                                                                       % (num)
Antineoplastics                   -abine, -antrone, -bulin, -platin, -rubicin, -taxel, -tinib, -tecan, -trexate, -vudine        7% (205)
Anticoagulants                    -arol-, -grel-                                                                                0.8% (24)
Antihistaminics                   -tadine, -astine                                                                              1.3% (37)
Antiasthmatics or antiallergics   -azoline, -cromil                                                                             1.5% (44)
Anxiolytic sedatives              -azenil, -azepam, -bamate, -peridone, -perone                                                 2.1% (61)
Antibacterials                    -ezolid, -mulin, -oxacin, -penem, -planin, -prim, -pristin                                    5% (146)
Antifungals                       -conazole, -fungin                                                                            1.8% (53)
Antivirals                        -cavir, -ciclovir, -navir, -vudine, -virenz                                                   4.7% (137)
Anti-inflammatory                 -bufen, -butazone, -icam, -nidap, -profen                                                     4.9% (141)
Immunomodulators                  -imod, -leukin                                                                                5.3% (154)
Antidiabetics                     -glinide, -glitazone                                                                          0.7% (22)
Vasodilators                      -dipine, -pamil                                                                               2.4% (71)
Analgesics                        -adol, -butazone, -coxib, -eridine, -fentanil                                                 3.9% (115)

Table 2. Distribution of pharmacological families in the corpus

UMLS found no concept for 14% (1,511) of the initial candidates (10,743). Although UMLS is a frequently updated resource with high coverage of the pharmacology domain, we thought that the USAN rules might detect drugs not yet registered in the Metathesaurus. For this reason, the classification module was run on this set, detecting 102 new candidates. A domain expert manually evaluated the set of candidates and concluded that only 82 of them were really drugs not included in UMLS (version 2007AC). Some examples of these drugs are spiradolene, mideplanin, efepristin and tomopenem. Of the remaining candidates, 579 corresponded to general-domain entities such as organizations, person names, etc. This is because the abstracts, besides the article title, also contained information about the authors and their affiliation that had not been filtered out beforehand. The remaining 830 are biomedical terms not registered in UMLS, such as nonherbal, suboptimal, thromboprophylaxis, interpatient, coadministration, etc.

Finally, the overall results of the evaluation are shown in Table 3. The system achieves 97% recall and 100% precision when only UMLS information is used. Combining UMLS with the USAN rules slightly increases recall but lowers the system's precision.

                Recall    Precision
UMLS            97%       100%
UMLS + Rules    99.8%     99.3%

Table 3. System results

6 Conclusions

Implementing the USAN rules can improve the detection of new drugs not yet registered in the UMLS Metathesaurus. However, the results show that the improvement is really small. It is therefore reasonable to conclude that UMLS has high coverage of the pharmacology domain. On the other hand, the categorization of drugs provided by UMLS is insufficient for developing automatic information extraction systems. The USAN rules can help complete the UMLS classification. Knowing the class or family of a given drug is a valuable clue when determining whether an interaction is really present. This preliminary approach is the first step towards an information extraction system in the field of pharmacology. Extending the coverage of the classification by including more affixes, handling multi-word terms, and resolving acronyms and abbreviations are some of the next steps in our work plan.

The system was evaluated by a pharmacist, owing to the lack of annotated corpora for the pharmacological domain. This manual process, besides being tedious, requires a great deal of time and effort. For this reason, in order to reduce the burden on our expert, we assumed that the information provided by UMLS is correct. However, a manual review of a small sample of the concepts classified as pharmacological substances in UMLS showed that some of them were not substances but pharmacological actions or functions. This semantic inconsistency was also reported by Schulze-Kremer and colleagues (Schulze-Kremer et al., 2004). We are therefore aware that the set of concepts classified by UMLS must be evaluated manually to obtain a true estimate of the system's precision and recall. Integrating a module for recognizing general-domain entities, as well as a list of biomedical terms not included in UMLS, are some of the future measures for reducing the cost of the evaluation.

Acknowledgements

The authors thank María Segura Bedmar, head of the drug information center of the Hospital de Móstoles, for her valuable help in evaluating the system.

References

Ananiadou, S. 1994. A Methodology for Automatic Term Recognition. In: Proceedings of COLING-94. Kyoto, Japan. 1034-1038.

Chiang, J.-H. and Yu, H.-C. 2003. Meke: Discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics, Vol. 19(11): 1417–1422.

Collier N, Takeuchi K. 2004. Comparison of character-level and part of speech features for name recognition in biomedical texts:423–35.

The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 2003;31(1):172–5.

Friedman, C., Shagina, L., Lussier, Y. and Hripcsak, G. 2004. Automated encoding of clinical documents based on natural language processing. J. Am. Med. Inform. Assoc. 11, 392–402.

Fukuda, K., A. Tamura, T. Tsunoda, and T. Takagi. 1998. Toward information extraction: identifying protein names from biological papers. In: Proceedings of Pac Symp Biocomput.: 707-718.

Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P. 2003. Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics;19(1):135–43.

Hirschman L, Morgan AA, Yeh AS. 2002. Rutabaga by any other name: extracting biological names. J Biomed Inform;35(4):247–59.

Hobbs JR. 2002. Information extraction from biomedical text. J Biomed Inform;35(4):260–4.

Lee KJ, Hwang YS, Kim S, Rim HC. 2004. Biomedical named entity recognition using two phase model based on SVMs. J Biomed Inform. 37(6):436–47.

Narayanaswamy M, Ravikumar KE, Vijay-Shanker K. 2003. A biological named entity recognizer. In: Proceedings of Pacific Symposium on Biocomputations. pp. 427–38.

Rindflesch, T.C., Tanabe, L., Weinstein, J.N. and Hunter, L. 2000. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac. Symp. Biocomput. 5, 517–528.

Schulze-Kremer S, B. Smith, A. Kumar. 2004. Revising the UMLS Semantic Network. In: Fieschi M, Coiera E, Li YC, editors. Proceedings of Medinfo. San Francisco, CA; 2004. p. 1700.

Smith JW, Seidl LG and Cluff LE. 1969. Studies on the epidemiology of adverse drug interactions. V. Clinical factors influencing susceptibility. Ann Intern Med: 65, 629.

Stockley, I. 2004. Stockley Interacciones farmacológicas. Pharma Editores. Barcelona.

Tanabe, L. and Wilbur, W.J. 2002. Tagging gene and protein names in biomedical text. Bioinformatics 18, 1124–1132.

Tsuruoka Y, Tsujii J. 2003. Boosting precision and recall of dictionary-based protein name recognition. In: Proceedings of NLP in Biomedicine, ACL. Sapporo, Japan; 41–8.

Wilbur WJ, Hazard GF Jr, Divita G, Mork JG, Aronson AR, Browne AC. 1999. Analysis of biomedical text for chemical names: a comparison of three methods. Proc. AMIA Symp. 176–180.

Zhang J, Shen D, Zhou G, Su J, Tan CL. 2004. Enhancing HMM-based biomedical named entity recognition by studying special phenomena. J Biomed Inform. 37(6):411–22.


Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 35-42

received 29-01-08, accepted 03-03-08

Bases de Conocimiento Multilíngües para el Procesamiento Semántico a Gran Escala∗
Multilingual Knowledge Resources for wide–coverage Semantic Processing

Montse Cuadros [email protected] TALP Research Center, UPC Barcelona, Spain

German Rigau [email protected] IXA Group, UPV/EHU Donostia-San Sebastian, Spain

Resumen: Este artículo presenta el resultado del estudio de un amplio conjunto de bases de conocimiento multilíngües actualmente disponibles que pueden ser de interés para un gran número de tareas de procesamiento semántico a gran escala. El estudio incluye una amplia gama de recursos derivados de forma manual y automática para el inglés y castellano. Con ello pretendemos mostrar una imagen clara de su estado actual. Para establecer una comparación justa y neutral, la calidad de cada recurso se ha evaluado indirectamente usando el mismo método en dos tareas de resolución de la ambigüedad semántica de las palabras (WSD, del inglés Word Sense Disambiguation). En concreto, las tareas de muestra léxica del inglés del Senseval-3.
Palabras clave: Adquisición y Representación del Conocimiento Léxico, WSD

Abstract: This report presents a wide survey of publicly available multilingual Knowledge Resources that could be of interest for wide–coverage semantic processing tasks. We also include an empirical evaluation in a multilingual scenario of the relative quality of some of these large-scale knowledge resources. The study includes a wide range of manually and automatically derived large-scale knowledge resources for English and Spanish. In order to establish a fair and neutral comparison, the quality of each knowledge resource is indirectly evaluated using the same method on a Word Sense Disambiguation task (Senseval-3 English Lexical Sample Task).
Keywords: Acquisition and Representation of Lexical Knowledge, WSD

1. Introduction

The use of wide-coverage knowledge bases such as WordNet (Fellbaum, 1998) has become a frequent, and often necessary, practice in current Natural Language Processing (NLP) systems. Even now, building knowledge bases large and rich enough for wide-coverage semantic processing requires a large and costly manual effort involving large research groups over long development periods. In fact, hundreds of person-years have been invested in developing wordnets for various languages (Vossen, 1998). For example, in more than ten years of manual construction (from 1995 to 2006, that is, from version 1.5 to 3.0), WordNet has grown from 103,445 to 235,402 semantic relations (symmetric relations are counted only once). That is, around a thousand new relations per month.

However, these knowledge bases do not seem rich enough to be used directly by advanced concept-based applications. It seems that such applications will not be effective in open domains (nor in specific domains) without more detailed and richer wide-coverage semantic knowledge built by automatic procedures. Obviously, this has been an obstacle to progress in the state of the art in NLP.

Fortunately, in recent years the research community has developed a wide range of innovative methods and tools for the large-scale automatic acquisition of lexical knowledge from structured and unstructured sources. Among others, we can mention eXtended WordNet (Mihalcea and Moldovan, 2001), large collections of semantic preferences acquired from SemCor (Agirre and Martinez, 2001) or from the British National Corpus (BNC) (McCarthy, 2001), and Topic Signatures (footnote 2) for each synset acquired from the web (Agirre and de la Calle, 2004) or from the BNC (Cuadros, Padró and Rigau, 2005).

All these semantic resources have been acquired using very different processes, tools and corpora, giving rise to a very large and varied set of new semantic relations between synsets. In fact, each of these resources has a very different size and accuracy when evaluated in a common, controlled framework (Cuadros and Rigau, 2006). To our knowledge, no empirical study has been carried out to examine how these large semantic resources complement one another. Moreover, since this knowledge is language-independent (it is represented at the semantic level, that is, as relations between concepts), no empirical evaluation to date has shown: a) to what extent these semantic resources acquired for one language (in this case English) could be useful for another (in this case Spanish), and b) how these resources complement one another.

This article is organized as follows. After this brief introduction, we present the multilingual semantic resources that we analyze. Section 3 presents the multilingual evaluation framework used in this study. Section 4 describes the results of evaluating these large-scale semantic resources for English, and Section 5 for Spanish. Finally, Section 6 presents some concluding remarks and future work.

∗ This work has been partially funded by the IXA group of the UPV/EHU and the projects KNOW (TIN2006-15049-C03-01) and ADIMEN (EHU06/113).

ISSN 1135-5948
© Sociedad Española para el Procesamiento del Lenguaje Natural

2. Multilingual Semantic Resources

The evaluation presented here covers a wide variety of large semantic resources: WordNet (WN) (Fellbaum, 1998), eXtended WordNet (Mihalcea and Moldovan, 2001), large collections of semantic preferences acquired from SemCor (Agirre and Martinez, 2001) or from the BNC (McCarthy, 2001), and Topic Signatures for each synset acquired from the web (Agirre and de la Calle, 2004). Although these resources were built using different versions of WN, most of them have been integrated into a common resource called the Multilingual Central Repository (MCR) (Atserias et al., 2004), using technology for automatically aligning wordnets (Daudé, Padró and Rigau, 2003). In this way, we maintain compatibility among all the knowledge bases that use a particular version of WN as their sense repository. Moreover, these links allow the knowledge associated with a particular WN to be ported to the other WN versions.

2.1.

MCR

El Multilingual Central Repository3 (MCR) sigue el modelo propuesto por el proyecto EuroWordNet. EuroWordNet (Vossen, 1998) es una base de datos l´exica multiling¨ ue con wordnets de varias lenguas europeas, que est´an estructuradas como el WordNet de Princeton. El WordNet de Princeton contiene informaci´on sobre los nombres, verbos, adjetivos y adverbios en ingl´es y est´a organizado en torno a la noci´ on de un synset. Un synset es un conjunto de palabras con la misma categor´ıa morfosint´actica que se pueden intercambiar en un determinado contexto. La versi´on actual del MCR (Atserias et al., 2004) es el resultado del proyecto europeo MEANING del quinto programa marco4 . El MCR integra siguiendo el modelo de EuroWordNet, wordnets de cinco idiomas diferentes, incluido el castellano (junto con seis versiones del WN ingl´es). Los wordnets est´ an vinculados entre s´ı a trav´es del Inter-LingualIndex (ILI) permitiendo la conexi´on de las 3

http://adimen.si.ehu.es/cgibin/wei5/public/wei.consult.perl 4 http://nipadio.lsi.upc.es/˜nlp/meaning

2

Topic Signatures es el t´ermino en ingl´es para referirse a las palabras relacionadas con un t´ opico o tema.

36

Bases de Conocimiento Multilíngües para el Procesamiento Semántico a Gran Escala

palabras en una lengua a las palabras equivalentes en cualquiera de las otras lenguas integradas en el MCR. De esta manera, el MCR constituye un recurso ling¨ u´ıstico multil´ıng¨ ue de gran tama˜ no u ´til para un gran n´ umero de procesos sem´anticos que necesitan de una gran cantidad de conocimiento multil´ıng¨ ue para ser instrumentos eficaces. Por ejemplo, el synset en ingl´es est´a vinculado a trav´es del ILI al synset en castellano . El MCR tambi´en integra WordNet Domains (Magnini y Cavagli`a, 2000), nuevas versiones de los Base Concepts y la Top Con´ cept Ontology (Alvez et al., 2008), y la ontolog´ıa SUMO (Niles y Pease, 2001). La versi´on actual del MCR contiene 934.771 relaciones sem´anticas entre synsets, la mayor´ıa de ellos adquiridos autom´aticamente5 . Esto representa un volumen casi cuatro veces m´as grande que el de Princeton WordNet (235.402 relaciones sem´anticas u ´nicas en WordNet 3.0). En lo sucesivo, nos referiremos a cada recurso sem´antico de la siguiente forma: WN (Fellbaum, 1998): Este recurso contiene las relaciones directas y no repetidas codificadas en WN1.6 y WN2.0 (por ejemplo, tree#n#1–hyponym–>teak#n#2). Tambi´en hemos estudiado WN2 utilizando las relaciones a distancia 1 y 2, WN3 utilizando las relaciones a distancias 1 a 3 y WN4 utilizando las relaciones a distancias 1 a 4. XWN (Mihalcea y Moldovan, 2001): Este recurso contiene las relaciones directas codificadas en eXtended WN (por ejemplo, teak#n#2–gloss–>wood#n#1). WN+XWN: Este recurso contiene las relaciones directas incluidas en WN y XWN. Tambi´en hemos estudiado (WN+XWN)2 (utilizando relaciones de WN o XWN a distancias 1 y 2). spBNC (McCarthy, 2001): Este recurso contiene 707.618 preferencias de selecci´on con los sujetos y objetos t´ıpicos adquiridos del BNC. spSemCor (Agirre y Martinez, 2001): Este recurso contiene las preferencias de selecci´on con los sujetos y los objetos t´ıpicos adquiridos de SemCor (por ejemplo, read#v#1–tobj–>book#n#1). 
MCR (Atserias et al., 2004): the direct relations included in the MCR. In the experiments described below, however, spBNC was excluded because of its poor performance. Thus, MCR contains the direct relations of WN, XWN and spSemCor. Note that MCR does not include the indirect relations of (WN+XWN)2. Nevertheless, we also evaluated (MCR)2 (relations at distances 1 and 2), which does integrate the relations of (WN+XWN)2.
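The distance-based variants (WN2, WN3, WN4, (WN+XWN)2, (MCR)2) can be read as a bounded graph expansion over the direct relations. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `expand` and the dictionary layout are hypothetical, relations are treated as directed, and the toy graph contains only the two example relations quoted above.

```python
from collections import deque


def expand(relations, source, max_distance):
    """Breadth-first search: synsets reachable from `source` in at most
    `max_distance` relation hops, mapped to their distance."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distances[node] == max_distance:
            continue  # do not expand beyond the requested distance
        for neighbour in relations.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[source]  # the source synset is not listed as its own relative
    return distances


# Toy graph (assumption): only the two example relations from the text.
relations = {
    "tree#n#1": ["teak#n#2"],  # WN hyponym relation
    "teak#n#2": ["wood#n#1"],  # XWN gloss relation
}

print(expand(relations, "tree#n#1", 1))  # {'teak#n#2': 1}
print(expand(relations, "tree#n#1", 2))  # {'teak#n#2': 1, 'wood#n#1': 2}
```

With a distance of 1 only the direct relation is returned; raising the distance to 2 also reaches wood#n#1 through the gloss relation, which is the kind of indirect relation that (WN+XWN)2 adds.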

2.2. Topic Signatures

Topic Signatures (TS) are vectors of words related to a topic (Lin and Hovy, 2000). A TS can be built by searching a large corpus for the contexts of a target topic. In our case, the topic is a word sense. For this study we used two different sets of TS. The first is one of the largest semantic resources currently available, with around 100 million semantic relations (between synsets and words), acquired automatically from the web (Agirre and de la Calle, 2004). The second was obtained directly from SemCor.

TSWEB⁶: Inspired by the work of Leacock, Chodorow, and Miller (1998), these Topic Signatures were acquired in three steps. The query for each topic (a WN sense, in our case) was built from the monosemous senses close to it in WordNet (that is, synonyms, hypernyms, direct and indirect hyponyms, and siblings). The query was sent to Google, retrieving up to one thousand text snippets per query (that is, per sense or topic). Finally, the words with distinctive frequencies were extracted from the snippets using TFIDF. For these experiments we used at most the first 700 distinctive words of each resulting TS. Because this resource relates senses to words, its relations cannot be ported to the Spanish wordnet without introducing a large number of errors. Table 1 shows an example of TSWEB for the first sense of the word party.

TSSEM: These TS were built from SemCor, an English corpus in which every word has been annotated

5 We do not consider the selectional preferences acquired from the BNC (McCarthy, 2001).

6 http://ixa.si.ehu.es/Ixa/resources/~sensecorpus

semantically. This corpus has a total of 192,639 words, lemmatized and tagged with their part of speech and sense according to WN1.6. For each target sense (or topic), we collected all the sentences in which that sense occurs, thereby deriving a subcorpus of sentences for the target sense. We then obtained the TS of senses for each subcorpus using TFIDF. Table 2 shows the first senses obtained for party#n#1. Although we tried other measures, the best results were obtained with the TFIDF formula (Agirre and de la Calle, 2004).

$$\mathrm{TFIDF}(w, C) = \frac{wf_w}{\max_w wf_w} \times \log \frac{N}{Cf_w} \qquad (1)$$

where w is a context word, wf is the frequency of the word, C is the collection (the whole corpus gathered for a given sense), and Cf is the frequency in the collection. The total number of relations between WN synsets acquired from SemCor is 932,008. In this case, because the Spanish wordnet is smaller, the total number of ported relations is only 586,881.

democratic     0.0126
tammany        0.0124
alinement      0.0122
federalist     0.0115
missionary     0.0103
anti-masonic   0.0083
nazi           0.0081
republican     0.0074
alcoholics     0.0073

Table 1: Topic Signature of party#n#1 obtained from the web (9 of the 15,881 total words)

political party#n#1   2.3219
party#n#1             2.3219
election#n#1          1.0926
nominee#n#1           0.4780
candidate#n#1         0.4780
campaigner#n#1        0.4780
regime#n#1            0.3414
government#n#1        0.3414
authorities#n#1       0.3414

Table 2: Topic Signature for party#n#1 obtained from SemCor (9 of the 719 total senses)

3. Evaluation framework

To compare the semantic resources described in the previous section, we evaluated all of them as Topic Signatures (TS). That is, for each synset (or topic) we have a simple vector of words with associated weights. This vector is built by gathering all the words that appear directly related to a synset. This simple representation is intended to be as neutral as possible with respect to the resources used.

All resources were evaluated on the same WSD task. In particular, in Section 4 we used the noun set of the Senseval-3 English Lexical Sample task, which consists of 20 nouns, and in Section 5 the noun set of the Senseval-3 Spanish Lexical Sample task, which consists of 21 nouns. Both tasks consist of determining the correct sense of a word in context. For the English task, the senses of WN1.7.1 were used for annotation. For Spanish, however, the MiniDir dictionary was developed specifically for the task. Most MiniDir senses have links to WN1.5 (which in turn is integrated in the MCR and therefore linked to the Spanish wordnet). All results were evaluated on the test data using the fine-grained scoring system provided by the organizers. For the evaluation we used only the set of tagged nouns, because TSWEB was built only for nouns, and because the English lexical sample task takes its verb senses from the WordSmyth dictionary (Mihalcea, Chklovski, and Kilgarriff, 2004) rather than from WordNet.

The same WSD method was thus applied to all semantic resources. It simply counts the words shared between the Topic Signature of each sense of the target word and the test text fragment⁷. The synset with the highest count is selected. This is in fact a very simple WSD method that only considers the context information around the word to be disambiguated. Finally, we should point out that the results are not biased (for example, to break ties between senses) by the most frequent sense in WN or by any other statistical knowledge. As an example, Table 3 shows one of the Senseval-3 test texts corresponding to the first sense of the word party. The words that appear in the TS for the sense party#n#1 in TSWEB are shown in bold.

7 We also consider the multiword terms that appear in WN.

4. Evaluation for English

4.1. Baselines for English

We designed a set of baselines to establish an evaluation framework that lets us compare the performance of each semantic resource on the English WSD task.

RANDOM: For each word, this method selects a sense at random. This baseline can be considered a lower bound.

SEMCOR-MFS: This baseline selects the most frequent sense of the word according to SemCor.

WN-MFS: This baseline selects the most frequent sense according to WN (that is, the first sense in WN1.6). Word senses in WN were ordered using frequencies from SemCor and other sense-annotated corpora. WN-MFS and SEMCOR-MFS are therefore similar, but not identical.

TRAIN-MFS: This baseline selects the most frequent sense of the target word in the training corpus.

TRAIN: This baseline uses the training corpus provided by Senseval-3 for each sense, directly building a TS from the words in its context with the TFIDF measure. Note that in WSD evaluation frameworks this is a very basic system. In our evaluation framework, however, this "baseline" system could be considered an upper bound: we do not expect to obtain better words for a sense than those from its own corpus.

4.2. Evaluation of each resource in English

Table 4 presents, ordered by F1, the baselines and the performance of each of the resources presented in Section 2, together with the average TS size per word sense. The average TS size of a resource is the average number of words associated with a synset. Obviously, the best resources are those that obtain the best results with fewer words associated with each synset. The best precision, recall and F1 results are shown in bold, and the results of the baseline systems in italics.

The best results are obtained by TSSEM (F1 of 52.4). The lowest result is obtained by the knowledge taken directly from WN, mainly because of its low coverage (R of 18.4 and F1 of 26.1). It is also interesting that the knowledge integrated in the MCR, although partly derived by automatic means, obtains much better precision, recall and F1 than each of its component resources on its own (F1 is 18.4 points higher than WN, 9.1 higher than XWN and 3.7 higher than spSemCor). Despite their small size, the resources derived from SemCor obtain better results than their counterparts built from much larger corpora (TSSEM vs. TSWEB and spSemCor vs. spBNC).

As for the baselines, all resources beat RANDOM, but none beats WN-MFS, TRAIN-MFS or TRAIN. Only TSSEM obtains better results than SEMCOR-MFS, and it is very close to the most frequent sense of WN (WN-MFS) and of the training corpus (TRAIN-MFS).

As for expansions and other combinations, the performance of WN improves when using words at distances of up to 2 (F1 of 30.0) and up to 3 (F1 of 34.8), but decreases at distances of up to 4 (F1 of 33.2). Interestingly, none of these WN extensions reaches the results of XWN (F1 of 35.4). Finally, (WN+XWN)2 performs better than WN+XWN, and (MCR)2 slightly better than MCR⁸.

8 No extensions beyond these were tested.
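The evaluation procedure of Section 3 (weighting words with formula (1), then choosing the sense whose Topic Signature shares most words with the test fragment) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, N is taken to be the number of collections and Cf_w the number of collections containing w (the text does not define N), and the sense party#n#2 with its words is invented for the example. The party#n#1 words come from Table 1 and the context from the test example in Table 3.

```python
import math
from collections import Counter


def tfidf_signature(tokens, collection_freq, n_collections):
    """Weight the words of one sense's collection C following formula (1).

    Assumption: N is the number of collections and Cf_w the number of
    collections that contain w; the text does not define N explicitly.
    """
    wf = Counter(tokens)
    max_wf = max(wf.values())
    return {
        w: (f / max_wf) * math.log(n_collections / collection_freq[w])
        for w, f in wf.items()
    }


def disambiguate(signatures, context_tokens):
    """Return the sense whose signature shares most words with the context,
    together with the overlap count of every sense."""
    context = set(context_tokens)
    overlap = {
        sense: len(context & set(words)) for sense, words in signatures.items()
    }
    return max(overlap, key=overlap.get), overlap


# party#n#1 words are taken from Table 1; party#n#2 is hypothetical.
signatures = {
    "party#n#1": {"democratic": 0.0126, "federalist": 0.0115, "republican": 0.0074},
    "party#n#2": {"birthday": 1.0, "guests": 1.0},
}
context = (
    "then there were the republican parties who focused their attention "
    "on westminster elections"
).split()

print(disambiguate(signatures, context))
# ('party#n#1', {'party#n#1': 1, 'party#n#2': 0})
```

Note that `max` resolves ties by dictionary order, which is an arbitrary choice of this sketch; the text only states that ties are not resolved with most-frequent-sense information.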

Up to the late 1960s , catholic nationalists were split between two main political groupings . There was the Nationalist Party , a weak organization for which local priests had to provide some kind of legitimation . As a party , it really only exercised a modicum of power in relation to the Stormont administration . Then there were the republican parties who focused their attention on Westminster elections . The disorganized nature of catholic nationalist politics was only turned round with the emergence of the civil rights movement of 1968 and the subsequent forming of the SDLP in 1970 .

Table 3: Test example number 00008131 for party#n, whose correct sense is the first.

KB           P     R     F1    Size
TRAIN        65.1  65.1  65.1  -
TRAIN-MFS    54.5  54.5  54.5  -
WN-MFS       53.0  53.0  53.0  -
TSSEM        52.5  52.4  52.4  103
SEMCOR-MFS   49.0  49.1  49.0  -
MCR2         45.1  45.1  45.1  26,429
MCR          45.3  43.7  44.5  129
spSemCor     43.1  38.7  40.8  56
(WN+XWN)2    38.5  38.0  38.3  5,730
WN+XWN       40.0  34.2  36.8  74
TSWEB        36.1  35.9  36.0  1,721
XWN          38.8  32.5  35.4  69
WN3          35.0  34.7  34.8  503
WN4          33.2  33.1  33.2  2,346
WN2          33.1  27.5  30.0  105
spBNC        36.3  25.4  29.9  128
WN           44.9  18.4  26.1  14
RANDOM       19.1  19.1  19.1  -

Table 4: Results of the resources evaluated individually for English, by P, R and F1.

4.3. Combination of resources

To evaluate the contribution of each resource in more detail, we provide a brief analysis of their combined contribution. The combinations were evaluated using three different basic strategies (Brody, Navigli, and Lapata, 2006).

DV (Direct Voting): Each semantic resource casts one vote for the predominant sense of the word to be disambiguated. The sense with the most votes is chosen.

PM (Probability Mixture): Each semantic resource provides a probability distribution over the senses of the word to be disambiguated. These (normalized) probabilities are added up, and the sense with the highest probability is chosen.

Rank: Each semantic resource provides a ranking of the senses of the word to be disambiguated. For each sense, its positions in the rankings of the evaluated resources are added up. The sense with the lowest total (closest to the first position) is chosen as the correct one.

Table 5 presents the F1 figures for the best combinations of two, three and four resources using the three combination methods. Probability Mixture (PM) and the rank-based combination (Rank) give better results than Direct Voting (DV), and the rank-based method gives the best results overall.

The combination of the four semantic resources obtains better results than using only three, two or one resource. The combination of resources seems to provide knowledge that the individual resources lack. In this case, it is 19.5 points above TSWEB, 17.2 points above (WN+XWN)2, 11.0 points above MCR and 3.1 points above TSSEM.

Looking at the baselines, this combination beats the most frequent sense of SemCor (SEMCOR-MFS, F1 of 49.1), of WN (WN-MFS, F1 of 53.0) and of the training set (TRAIN-MFS, F1 of 54.5). This indicates that the resulting combination of large-scale resources encodes the knowledge needed for an English sense tagger that performs like a most-frequent-sense tagger. It is worth noting that the most frequent sense of a word according to the WN sense ordering is a difficult challenge to beat in WSD tasks (McCarthy et al., 2004).
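The three combination strategies of Section 4.3 (DV, PM and Rank) can be illustrated with a small sketch. It rests on assumptions that the text does not confirm: each resource is represented by a dictionary of per-sense scores (for instance overlap counts), probabilities are obtained by normalizing those scores, and rankings by sorting them. The function names, the senses s1 to s4 and all the scores are hypothetical.

```python
from collections import Counter


def direct_voting(resources):
    """DV: every resource votes for its top-scoring sense."""
    votes = Counter(max(scores, key=scores.get) for scores in resources)
    return votes.most_common(1)[0][0]


def probability_mixture(resources):
    """PM: sum the normalized score distributions of all resources."""
    total = Counter()
    for scores in resources:
        norm = sum(scores.values())
        for sense, value in scores.items():
            total[sense] += value / norm if norm else 0.0
    return max(total, key=total.get)


def rank_combination(resources):
    """Rank: sum the rank positions of each sense; the lowest total wins."""
    total = Counter()
    for scores in resources:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, sense in enumerate(ordered, start=1):
            total[sense] += position
    return min(total, key=total.get)


# Hypothetical scores of three resources for four senses of one word.
resources = [
    {"s1": 4, "s2": 3, "s3": 2, "s4": 1},
    {"s1": 4, "s2": 3, "s3": 2, "s4": 1},
    {"s1": 0, "s2": 6, "s3": 3, "s4": 1},
]

print(direct_voting(resources))        # s1
print(probability_mixture(resources))  # s2
print(rank_combination(resources))     # s2
```

In this toy case two resources narrowly prefer s1, so DV selects it, whereas PM and Rank also use the second-choice information and select s2. This only illustrates how the strategies can diverge and says nothing about the figures reported in Table 5.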

KB                                          PM    DV    Rank
2.system-comb: MCR+TSSEM                    52.3  45.4  52.7
3.system-comb: MCR+TSSEM+(WN+XWN)2          52.6  37.9  54.6
4.system-comb: MCR+(WN+XWN)2+TSWEB+TSSEM    53.1  32.7  55.5

Table 5: Combinations of 2, 3 and 4 systems, by F1.

5. Evaluation in Spanish

As for English, we defined a set of baselines to establish a complete evaluation framework and to compare the relative behaviour of each semantic resource on the Spanish WSD task.

RANDOM: For each word, this method selects a sense at random. This baseline can be considered a lower bound.

MiniDir-MFS: This baseline selects the most frequent sense of the word according to the MiniDir dictionary. MiniDir is a dictionary built for the WSD task. Its ordering of word senses corresponds exactly to the frequency of the word senses in the training set. MiniDir-MFS is therefore the same as TRAIN-MFS.

TRAIN: This baseline uses the training set to directly build a Topic Signature for each word sense using the TFIDF measure. As for English, in our setting this baseline can be considered an upper bound.

Note that the Spanish WN does not encode the frequency of word senses, and that no sufficiently large sense-tagged corpus is available for Spanish, comparable to the one for Italian⁹. Moreover, only relations between senses can be ported from one language to another without introducing too many errors¹⁰. Since TSWEB relates English words to a synset, it was neither ported to Spanish nor evaluated.

9 http://multisemcor.itc.it/

10 That is, synset-to-synset semantic relations.

5.1. Evaluating each Spanish resource separately

Table 6 presents the precision (P), recall (R) and F1 figures for the different baselines and semantic resources, ordered by F1. Baselines appear in italics and the best results in bold. For Spanish, the TRAIN resource was evaluated with a maximum vector size of 450 words. As expected, RANDOM obtains the lowest result, and the most frequent sense obtained from MiniDir (MiniDir-MFS, which equals TRAIN-MFS) is considerably lower than the TS obtained from the training corpus (TRAIN).

Among the knowledge resources, WN obtains the highest precision (P of 65.5) but, given its low coverage (R of 13.6), the lowest F1 (22.5). It is interesting that, in terms of precision, recall and F1, the knowledge integrated in the MCR outperforms TSSEM. This possibly indicates that the knowledge currently contained in the MCR is more robust than TSSEM. It also seems to indicate that topical knowledge obtained from a sense-annotated corpus in one language cannot be ported directly to another language. Other possible reasons for the low results are the smaller size of the Spanish resources (compared with those available for English) and the different evaluation frameworks, including the dictionary (sense distinctions and links to WN).

As for the baselines, all knowledge resources beat RANDOM, but none of them reaches MiniDir-MFS (which equals TRAIN-MFS) or TRAIN. Nevertheless, the knowledge contained in the MCR (F1 of 43.5), partially derived by automatic means and ported to the Spanish WN from English, almost doubles the results of the original Spanish WN (F1 of 22.5).

Knowledge Bases  P     R     F1    Size
TRAIN            81.8  68.0  74.3  -
MiniDir-MFS      67.1  52.7  59.2  -
MCR              46.1  41.1  43.5  66
WN2              56.0  29.0  42.5  51
(WN+XWN)2        41.3  41.2  41.3  1,892
TSSEM            33.6  33.2  33.4  208
XWN              42.6  27.1  33.1  24
WN               65.5  13.6  22.5  8
RANDOM           21.3  21.3  21.3  -

Table 6: Results of the resources evaluated individually for Spanish, by P, R and F1.

6. Conclusions

We believe that broad-coverage semantic processing (such as WSD) should rely not only on sophisticated algorithms but also on approaches based on large knowledge bases. The results presented in this work suggest that much more research is needed on the acquisition and use of large-scale semantic resources. Moreover, because these resources represent semantic relations at the conceptual level, the relations can be ported to and evaluated in other languages. To our knowledge, this is the first time an empirical study shows that automatically acquired knowledge bases obtain better results than manually derived resources, and that the combination of the knowledge contained in these resources surpasses the most-frequent-sense classifier for English. We plan to validate this hypothesis empirically on all-words tasks, in which every word of a text is disambiguated.

References

Agirre, E. and O. Lopez de la Calle. 2004. Publicly available topic signatures for all wordnet nominal senses. In Proceedings of LREC, Lisbon, Portugal.

Agirre, E. and D. Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of CoNLL, Toulouse, France.

Álvez, J., J. Atserias, J. Carrera, S. Climent, A. Oliver, and G. Rigau. 2008. Consistent annotation of eurowordnet with the top concept ontology. In Proceedings of Fourth International WordNet Conference (GWC'08).

Atserias, J., L. Villarejo, G. Rigau, E. Agirre, J. Carroll, B. Magnini, and P. Vossen. 2004. The meaning multilingual central repository. In Proceedings of GWC, Brno, Czech Republic.

Brody, S., R. Navigli, and M. Lapata. 2006. Ensemble methods for unsupervised wsd. In Proceedings of COLING-ACL, pages 97–104.

Cuadros, M., L. Padró, and G. Rigau. 2005. Comparing methods for automatic acquisition of topic signatures. In Proceedings of RANLP, Borovets, Bulgaria.

Cuadros, M. and G. Rigau. 2006. Quality assessment of large scale knowledge resources. In Proceedings of EMNLP.

Daudé, J., L. Padró, and G. Rigau. 2003. Validation and Tuning of Wordnet Mapping Techniques. In Proceedings of RANLP, Borovets, Bulgaria.

Fellbaum, C., editor. 1998. WordNet. An Electronic Lexical Database. The MIT Press.

Leacock, C., M. Chodorow, and G. Miller. 1998. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–166.

Lin, C. and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of COLING, Strasbourg, France.

Magnini, B. and G. Cavaglià. 2000. Integrating subject field codes into wordnet. In Proceedings of LREC, Athens, Greece.

McCarthy, D. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

McCarthy, D., R. Koeling, J. Weeds, and J. Carroll. 2004. Finding predominant senses in untagged text. In Proceedings of ACL, pages 280–297.

Mihalcea, R. and D. Moldovan. 2001. extended wordnet: Progress report. In Proceedings of NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh, PA.

Mihalcea, R., T. Chklovski, and A. Kilgarriff. 2004. The senseval-3 english lexical sample task. In Proceedings of ACL/SIGLEX Senseval-3, Barcelona.

Niles, I. and A. Pease. 2001. Towards a standard upper ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), pages 17–19. Chris Welty and Barry Smith, eds.

Vossen, P., editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers.

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 43-49

received 30-01-08, accepted 03-03-08

From knowledge acquisition to information retrieval∗

De la adquisición del conocimiento a la recuperación de información

M. Fernández Gavilanes, S. Carrera Carrera, M. Vilares Ferro
Computer Science Department, University of Vigo
Campus As Lagoas s/n, 32004 Ourense, Spain
{mfgavilanes,sccarrera,vilares}@uvigo.es

Resumen: Introducimos una propuesta en recuperación de información basada en la consideración de recursos sintácticos y semánticos complejos y automáticamente generados a partir de la propia colección documental. Se describe una estrategia donde el lenguaje y el dominio de documentos son independientes del proceso.

Palabras clave: adquisición del conocimiento, análisis sintáctico, extracción de términos, recuperación de información, representación del conocimiento

Abstract: We introduce a proposal for information retrieval based on complex syntactic and semantic resources that are automatically generated from the document collection itself. The paper describes a strategy in which the process is independent of the language and the domain of the documents.

Keywords: information retrieval, knowledge acquisition, knowledge representation, parsing, term extraction

1 Introduction

The efficiency of information retrieval (IR) tools depends on the availability of relevant semantic data describing the terms and concepts of the specific domain considered. Such resources are often taken from an external, generic module (Aussenac-Gilles and Mothe, 2004), which means that we probably lose a number of interesting properties that we could recover if semantic processing were performed directly on the text collection at hand. To address this and produce practical, understandable results, we should allow easy integration of background knowledge from possibly complex document representations, fully exploiting linguistic structures. We could thus compensate for missing domain-specific knowledge, which is a significant advantage when redeploying the system where no external resources are yet available. Access to a concept hierarchy generated in this way also allows information to be structured into categories, fostering its search and reuse, and makes it possible to integrate an interesting strategy for relating languages, using the hierarchy as a semantic pipeline between them (Bourigault, Aussenac-Gilles, and Charlet, 2004; Aussenac-Gilles, Condamines, and Szulman, 2002).

In the state of the art, methods for automatically deriving a concept hierarchy from text can be grouped into similarity-based and set-theoretical approaches. The former use a distance to compute the pairwise similarity between the vectors of two words and decide whether they can be clustered (Faure and Nédellec, ; Grefenstette, 1994). The latter partially order the objects according to the inclusion relations between their attribute sets (Petersen, 2001). Both adopt a vector-space model and represent a term as a vector of attributes derived from a corpus. Typically, syntactic features are used to identify which attributes are used for this purpose.

Our proposal aims to facilitate knowledge acquisition through a hybrid approach that combines natural language processing (NLP) strategies, such as shallow parsing and semantic markers, with statistical techniques and term extraction. A modular architecture allows textual sources on different topics and in different languages to be added, providing the basis for multilingual IR. A collection of parallel texts on the economy in French and Spanish is used as a running corpus to illustrate our proposal.

TERM                                          head                      expansion
sociedad de gestión ("management society")    sociedad ("society")      de gestión ("management")
inversión directa ("direct investment")       inversión ("investment")  directa ("direct")
fondo luxemburgués ("Luxembourg fund")        fondo ("fund")            luxemburgués ("Luxembourg")
sesión de subida ("rise session")             sesión ("session")        de subida ("rise")
dólar por euro ("dollar for euro")            dólar ("dollar")          por euro ("for euro")

Table 1: Example of terms extracted

∗ Work partially supported by the Spanish Government from research projects TIN2004-07246-C03-01 and HUM2007-66607-C04-02, and by the Autonomous Government of Galicia from projects PGIDIT05PXIC30501PN, 07SIN005206PR and the Galician Network for NLP and IR.

ISSN 1135-5948 © Sociedad Española para el Procesamiento del Lenguaje Natural

2

Once the extractor has provided all the base terms and, possibly, associated their syntactic and/or morpho-syntactic variations; we can differentiate between the head and the expansion of each term, often a nominal syntagm. The former is the kernel of the syntagm, usually a noun, around which we assume the meaning of the term is structured. The expansion is the complement of the head, modifying it and defining the context where it appears. This set of identified heads provides a local look around the meaning of the text, focused on the syntagms recognized as terms. In order to extend these primary semantic links to the full text, we apply a simple recursive process by generating a hash table whose entries we baptize as main elements. Mains elements are all heads whose pos-tag is a noun. The key of each entry is a main element, to which we associate the list of contexts where it appears either as an expansion or as an head. As a result, we obtain a simple graph structure capturing the essential meaning of the text, as seen in Table 1. The next step consists of grouping terms in semantic classes, filtering out non-relevant features. To deal with in practice, we go through the hash table generated, comparing different contexts by applying as a similarity1 measure the dice coefficient (Bourigault and Lame, 2002):

Knowledge acquisition

Intuitively, we are interested in strategies allowing semantic relations to emerge from text, which implies grouping relevant terms in classes according to their similarity and establishing semantic links between them. We approach this task from two different points of view. The former is a classic termbased strategy, that only takes into account lexical data. For the second, we incorporate explicit semantic hypotheses. In both cases, our framework is based on two general principles: the distributional semantic model (Harris, 1968) establishing that words whose meaning is close often appear in similar syntactic contexts, and the assumption that terms shared by these contexts are usually nouns and adjectives (Bouaud et al., 1995). As a general purpose, our work has an experimental interest as a testing frame for comparing different knowledge acquisition strategies, but also considering about the possibility of complementing capabilities. In effect, as we shall see, a term-based approach allows the acquisition task to be performed automatically. Although the results so obtained cannot compare with the quality of the semi-automatic dependency-based proposal introduced later, it could serve as a starting point for this function, generating the initial set of semantic classes we need to initialize an iterative process in order to establish more complex relationships.

2.1

dice(C1 , C2 ) =

|C1 ∩ C2 | (|C1 | + |C2 |)/2

where C1 and C2 are contexts, and |Ci | represents the cardinal of Ci , i = 1, 2. Intuitively, we are computing the common terms between C1 and C2 , and then applying normalization. At this point, the generation of classes is an iterative process. In each iteration we join

A term-based approach

Our starting point here is the information provided by a classic term extractor running on a tagging environment. No particular architecture has been considered at this point.

1 we can define a similarity between entities as the number of common properties shared by them.

44

From knowledge acquisition to information retrieval

E_CN de E_CN de CN de

bajada

la

de

CN a

deuda

la

a

E_CN a

CN de

largo

plazo

de

Japon

deuda:nc el:det

bajada:nc

de:prep

el:det

a:prep deudo:nc

SA

CN de SUJ

...

CN a

CN de

dejar:v

a

frio:adj

de:prep

bolsa el:det

bolsa:nc

Japon:np

E_CN a CN de

la a:prep

plazo:nc

E_SA

E_SUJ/CN dejar a

ha dejado fria

...

largo:adj

de de:prep

E_CN de

Tokio Tokio:np

CN de

CC a

Parser Dependencies

E_CN de

Extracted Dependencies

Figure 1: Graph of dependencies from a parse to detect and delete these useless structures. We first introduce, from the sentence ”la bajada de la deuda a largo plazo de Jap´ on ... ha dejado fria a la bolsa de Tokio” in Fig. 1, some simple notations to describe parses. So, rectangular shapes, called clusters, show positions in the input string. Lemmas with their corresponding lexical categories are represented by ellipses baptized as nodes. Green arcs represent binary dependencies between words through some syntactic construction. The parsing frame provides the mechanisms to deal with a posterior semantic phase of analysis, by avoiding the elimination of syntactic data until we are sure it is unnecessary for knowledge acquisition. So, the lexical ambiguity illustrated in Fig. 1 should be decided in favor of the first alternative4 , because we have the intuitive certainty that the word ”deuda” is related to ”debt” and not to ”relative”. Given that we are dealing with a specialized corpus, we should confirm this by exploring the corpus in depth. That is, in order to solve the ambiguity we only need the information we are looking for, which leads us to consider an iterative learning process to attain our goal. In particular, we are more interested in dependencies between nouns and adjectives. This justifies filtering those dependencies, as shown in Fig. 1, following the dotted lines. So, the word ”plazo” (”term”) is connected to ”largo” (”long”), the latter being an adjective. Furthermore, we are also interested in extracting dependencies between nouns through, for example, prepositions such as ”bolsa de Tokio” (”Tokyo Stock Exchange”) and through verbs such as ”bajada dejar a

the pair of main elements whose Dice value turns out to be the highest computed from the hash. So, in each step the hash table is reduced by one element, and the process finishes when only Dice coefficients equal to zero can be computed; in other words, when no more context sharing is possible. Once the iteration loop stops, the entries in the hash are semantically related words together with their associated unified contexts. This outcome is stored in an XML (footnote 2) file, in such a way that similar elements are grouped together, representing a new and previously undefined semantic class. This file is later converted to OWL (footnote 3) format (Szulman and Biébow, 2004), in order to facilitate later retrieval tasks.
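The merging loop described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the representation of the hash as a Python dict from word groups to context sets, and the choice of set union to "unify" contexts are ours, not taken from the paper.

```python
from itertools import combinations


def dice(a, b):
    """Dice coefficient between two context sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster(contexts):
    """Greedily merge the pair of entries with the highest Dice value.

    contexts: dict mapping a word to the set of contexts it occurs in.
    Returns a dict mapping each group of words to its unified contexts
    (assumption: contexts are unified by set union).
    """
    groups = {frozenset([word]): set(ctx) for word, ctx in contexts.items()}
    while len(groups) > 1:
        (g1, g2), best = max(
            ((pair, dice(groups[pair[0]], groups[pair[1]]))
             for pair in combinations(groups, 2)),
            key=lambda item: item[1],
        )
        if best == 0.0:  # no more context sharing is possible
            break
        merged = groups.pop(g1) | groups.pop(g2)
        groups[g1 | g2] = merged  # the table shrinks by one entry per step
    return groups


# Hypothetical toy data, not taken from the corpus used in the paper.
toy = {
    "deuda": {"de Japón", "a largo plazo"},
    "acción": {"de Japón", "subyacente"},
    "Tokio": {"bolsa de"},
}
result = cluster(toy)
```

On this toy input, "deuda" and "acción" share one context and are merged, while "Tokio" shares none and stays alone.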

2.2 A dependency-based approach

We now start from a robust parse based on a cascade of finite automata (Vilares, Alonso, and Vilares, 2004). This lets us identify relevant terms in nominal and verbal phrases, namely those nouns and verbs conveying essential semantic information, as well as local relationships between them. As a result, we obtain a graph of dependencies of the governor/governed type, shown in Fig. 1 by dotted lines going from the governor term to the governed one.

2.2.1 Filtering out dependencies

Once these primary syntactic dependencies have been established, possibly including a number of lexical and syntactic ambiguities that generate useless dependencies, we try to extract the latent semantics of the document. The idea consists of compiling additional information from the corpus in order

2 see http://www.w3.org/XML/

3 see http://www.w3.org/TR/owl-features/

4 which corresponds to "The long-term debt descent of Japan has left cold to the Stock Exchange of Tokyo".


bolsa" ("descent leave Stock Exchange"). In order to identify the most pertinent dependencies, again marked with dotted lines, we focus on detecting and then eliminating those dependencies that are found to be less probable in sentences because they include low-frequency terms. Nodes and arcs in the resulting graph are called pivot terms and strong dependencies, respectively, as shown in Fig. 1.

A supplementary simplification phase applies a simple syntactic constraint: a governed word can have only one governor. For example, as indicated with a simple line in the sentence of Fig. 1, "Japón" ("Japan") is governed by "deuda" ("debt"), but also by "deuda" ("relative"); consequently, we should eliminate one of these dependencies. No other topological restrictions are considered, so a governor word can have more than one governed word, as in the second interpretation of Fig. 1 ("long-term debt descent of Japan"), where "bajada" ("descent") is the governor of both "plazo" ("term") and "Japón" ("Japan"), also indicated with a simple line. The same word can also be governor and governed at the same time; this is the case of "plazo" ("term"), which governs "largo" ("long") but is governed by "deuda" ("debt") in the first interpretation.

1.
\[
P(\text{deuda:nc:money}, [\text{CNde}], \text{Japón:np:country})_{\text{local}(0)} =
\frac{P(\text{deuda:nc}, [\text{CNde}], \text{Japón:np})_{\text{local}(0)} \; P(\text{deuda:nc:money})_{\text{local}(0)} \; P(\text{Japón:np:country})_{\text{local}(0)}}
     {\sum_{X,Y} P(\text{deuda:nc:}X)_{\text{local}(0)} \; P(\text{Japón:np:}Y)_{\text{local}(0)}}
\]

2.1
\[
P(\text{deuda:nc:money}, [\text{CNde}], X)_{\text{global}(n+1)} =
\frac{\sum_{X} P(\text{deuda:nc:money}, [\text{CNde}], X)_{\text{local}(n)}}
     {\#\text{dep}_{\text{local}(n)}(\text{deuda})}
\]

2.2
\[
P(Y, [\text{CNde}], \text{Japón:np:country})_{\text{global}(n+1)} =
\frac{\sum_{Y} P(Y, [\text{CNde}], \text{Japón:np:country})_{\text{local}(n)}}
     {\#\text{dep}_{\text{local}(n)}(\text{Japón})}
\]

2.3
\[
P(\text{deuda:nc:money}, [\text{CNde}], \text{Japón:np:country})_{\text{global}(n+1)} =
P(\text{deuda:nc:money}, [\text{CNde}], X)_{\text{global}(n+1)} \;
P(Y, [\text{CNde}], \text{Japón:np:country})_{\text{global}(n+1)}
\]

3.
\[
P(\text{deuda:nc:money}, [\text{CNde}], \text{Japón:np:country})_{\text{local}(n+1)} =
\frac{P(\text{deuda:nc:money}, [\text{CNde}], \text{Japón:np:country})_{\text{local}(n)} \;
      P(\text{deuda:nc:money}, [\text{CNde}], \text{Japón:np:country})_{\text{global}(n+1)}}
     {\sum_{X,Y} P(\text{deuda:nc:}X, [\text{CNde}], \text{Japón:np:}Y)_{\text{local}(n)} \;
      P(\text{deuda:nc:}X, [\text{CNde}], \text{Japón:np:}Y)_{\text{global}(n+1)}}
\]

Table 2: Extraction of classes for "deuda de Japón"
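The class-assignment updates of Table 2 can be sketched in a few lines of Python. This is a rough illustration under our own assumptions, not the authors' implementation: the class list, the user-fixed preference of 0.9, the toy occurrences, and the reading of #dep as "number of occurrences of the term in that slot of the dependency" are all hypothetical.

```python
from collections import defaultdict
from itertools import product

CLASSES = ["money", "country"]            # hypothetical class list
SEEDS = {"deuda": ("money", 0.9),         # hypothetical user-fixed preferences
         "Japón": ("country", 0.9)}


def prior(term, cls):
    """Initial preference P(term:cls) at local(0): user-fixed or uniform."""
    if term in SEEDS:
        seeded, value = SEEDS[term]
        if cls == seeded:
            return value
        return (1.0 - value) / (len(CLASSES) - 1)
    return 1.0 / len(CLASSES)


def init_local(occurrences):
    """Row 1: one {(gov_class, dep_class): prob} table per occurrence."""
    tables = []
    for gov, _label, dep, weight in occurrences:
        norm = sum(prior(gov, x) * prior(dep, y)
                   for x, y in product(CLASSES, CLASSES))
        tables.append({(x, y): weight * prior(gov, x) * prior(dep, y) / norm
                       for x, y in product(CLASSES, CLASSES)})
    return tables


def update(occurrences, local):
    """Rows 2.1 to 2.3 (global estimates) followed by row 3 (new local ones)."""
    gov_sum, dep_sum = defaultdict(float), defaultdict(float)
    gov_n, dep_n = defaultdict(int), defaultdict(int)
    for (gov, label, dep, _w), table in zip(occurrences, local):
        gov_n[gov, label] += 1
        dep_n[label, dep] += 1
        for (x, y), p in table.items():
            gov_sum[gov, label, x] += p   # numerator of row 2.1
            dep_sum[label, dep, y] += p   # numerator of row 2.2
    new_local = []
    for (gov, label, dep, _w), table in zip(occurrences, local):
        scores = {}
        for (x, y), p in table.items():
            g = gov_sum[gov, label, x] / gov_n[gov, label]   # row 2.1
            d = dep_sum[label, dep, y] / dep_n[label, dep]   # row 2.2
            scores[x, y] = p * g * d                         # local(n) * row 2.3
        norm = sum(scores.values())                          # denominator of row 3
        new_local.append({k: v / norm for k, v in scores.items()} if norm else table)
    return new_local


# (governor, label, governed, syntactic weight at local(0)); toy values.
occurrences = [("deuda", "CNde", "Japón", 1.0), ("deuda", "CNde", "Tokio", 1.0)]
local = init_local(occurrences)
for _ in range(3):
    local = update(occurrences, local)
best = max(local[0], key=local[0].get)
```

With these toy values, the pair (money, country) dominates for "deuda de Japón", and the preference for "deuda" as money carries over to the unseeded "deuda de Tokio" occurrence through the global step.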

2.2.2 Term clustering

The generation of semantic classes is inspired by an error-mining proposal originally designed to identify missing and erroneous information in parsing systems (Sagot and Villemonte de La Clergerie, 2006). This technique combines two complementary iterative processes. In a given iteration, the first computes, for each governor/governed pair in a sentence, the probability of the corresponding dependency, taking as its starting point the statistical data provided by the original error-mining strategy and related to the lexical category of the pivot terms. The second process computes, from the first, the most probable semantic class to be assigned to the terms involved in the dependency. So, in each iteration we look for both semantic and syntactic disambiguation, each profiting from the other. A fixed point assures the convergence of the strategy (Sagot and Villemonte de La Clergerie, 2006).

We illustrate term clustering on our running example in Fig. 1, focusing on the dependency labeled [CNde] relating "deuda" ("debt") and "Japón" ("Japan"). We introduce both iterative processes for this particular case, using weight, probability and preference interchangeably to refer to the same statistical concept. So, from Table 2, we have that:

1. To begin with, we compute the local probability of the dependency in each sentence, which depends on the weight of each word, this in turn depending on the word having the correct lexical category. To start the process, the first category assumptions are provided by the error-mining algorithm (Sagot and Villemonte de La Clergerie, 2006). We also take into account the initial probability for the dependency considered, a simple ratio over all possible dependencies involving the lexical categories concerned. The normalization is given by the preferences for the possible lexical categories involving each of the terms considered.

2. We reintroduce the local probabilities into the whole corpus locally in the sentences, in order to re-compute the weights of all possible dependencies, after which we estimate globally the most probable ones. The normalization is given by the number of dependencies connecting the terms considered.

3. The local value in the new iteration should take into account both the global preferences and the local injection of these preferences in the sentences, reinforcing the local probabilities. The normalization is given by the previous local and global weights for the dependency involving all possible lexical categories associated to each of the terms considered.

In dealing with semantic class assignment, the sequence of steps is shown in Table 2, which illustrates the computation of the probability that "deuda" ("debt") refers to the class money and "Japón" ("Japan") to the class country, again taking the dependency labeled [CNde] in Fig. 1. Both the money and country classes were defined in a list of semantic classes before the process was launched:

1. In each sentence, we compute the local probability of this dependency if "deuda" ("debt") and "Japón" ("Japan") refer to money and a country, respectively. We start from the local weight previously computed in Table 2, and also from the initial preferences of the terms involved for the classes considered (footnote 5). The normalization is given by the probabilities of the possible classes for each of the terms considered, without fixing any particular class; this is represented here by variables X and Y.

2. We then calculate this preference at global level, by re-introducing it to the whole corpus locally in the sentences in order to re-compute the weights of all the possible classes in the sentence. To do so, we first compute the probability in the whole corpus (2.1 and 2.2) for each term and semantic class, disregarding the right and left context, represented by variables X and Y respectively. The final probability (2.3) is a combination of the two previous ones.

3. After each iteration, we re-inject the previous global weight to obtain a new local one, reinforcing the local probabilities. The normalization is done by adding the preferences corresponding to the terms and classes involved in the dependency, over all the possible semantic classes considered.

After applying these last two approaches, a hierarchy can be built according to the different elements obtained in all classes.

5 this is fixed by the user if the term is in a list associated to that class. Otherwise, this probability is obtained as a ratio of the total number of classes considered.

3 Information retrieval

Work in the field of IR increasingly aims to improve text indexing or query formulation with the help of different kinds of knowledge structures, such as hierarchies or ontologies. These structures are expected to bring different targeted gains (Masolo, 2001), for example improving recall and precision or helping users to express their needs more easily.

3.1 A general approach

Generally, users have no precise idea of what they can find in a document collection, and using a hierarchical structure as a guideline to describe and organize contents could facilitate the two essential IR tasks, information indexing and retrieval. We propose an approach where hierarchies, built up from the semantic relations emerging from text, are used in a more unusual and promising way, in combination with visualization tools for guided exploration of the information space.

In dealing with IR, concept hierarchies and documents can be related in a simple way through the indexing task, by associating each document to those concepts matching its content. So, in our running corpus the hierarchy is structured according to classes such as money or dates, and is automatically connected to documents after projection of the terms where they occur. We also consider a graphical interface to show these structures to the user, as is shown in Figs. 2 and 3 for our running example.

Figure 2: Sub-hierarchy for the query "acción" ("share") using a term-based strategy
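As a minimal sketch of the indexing idea in Section 3.1, the snippet below associates each document with the classes whose terms occur in it. The whitespace tokenization, the function name and the toy data are our own assumptions; the paper performs this step inside its own prototype on top of Lucene.

```python
from collections import defaultdict


def build_concept_index(documents, term_classes):
    """Map each class to the ids of the documents containing one of its terms.

    documents:    dict doc_id -> raw text
    term_classes: dict term -> class name (hypothetical, single-word terms)
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        for term, cls in term_classes.items():
            if term in tokens:
                index[cls].add(doc_id)
    return index


# Toy collection, not taken from the paper's corpus.
documents = {"d1": "la deuda de japón", "d2": "el 3 de febrero"}
term_classes = {"deuda": "money", "febrero": "dates"}
concept_index = build_concept_index(documents, term_classes)
```

Here document d1 is attached to the class money and d2 to the class dates, so a query resolved to either class can retrieve the documents projected onto it.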

3.2 A practical approach

In practice, a major factor affecting the viability of such an approach is the knowledge acquisition process itself. We have described two different techniques, a term-based approach and a dependency-based one, which we have integrated in a single prototype in order to define a common testing frame that allows us to compare them effectively. Although the tool can combine several domains of knowledge in a variety of languages, we focus on our running corpus, using Lucene (footnote 6) as a standard text search engine. That is, the system identifies, at the parsing stage, the set of indexes to be considered for the retrieval task, using Lucene. Once we have located the indexes, we apply what we call an expansion phase. This process enlarges the set of relevant terms identified from the conceptual structure, which are later sent to the search engine. To facilitate understanding, we illustrate the proposal through queries limited to single words. In dealing with general query sentences, these are first parsed to locate possible and/or-like operators and, in this case, we transfer them to Lucene, which can run this kind of query directly. In other cases, we first eliminate stop-words, then look for physical proximity and order criteria between words and, finally, re-send the query to the search engine, also after expansion.

Independently of the approach used to generate the conceptual hierarchy, once a single-word query is introduced, we locate the corresponding class in the knowledge hierarchy we are dealing with. From this, we can identify the set of related classes, which also allows us to introduce a simple relevance criterion for the answers so obtained, based on the distance from the initial class. Given that indexing was previously performed using the terms in these classes, we recover all the documents associated to them, assuming that they are related to the query. At this point, the choice of strategy affects both the type and number of the semantic relations involved in the process described. To illustrate this, we study the answer given by the system for the query "acción" ("share"), first using the term-based strategy and then the dependency-based one.

Focusing on the term-based approach, Fig. 2 shows the sub-hierarchy for the query, from which the system will search for the answers. The strategy groups in a class (footnote 7) the words (footnote 8) "moneda" ("currency"), "deuda" ("debt"), "acción" ("share"), "fondo" ("fund") and "inversión" ("investment"), because their similarities are considered high enough. Round blue shapes are heads whose expansions are indicated by arrows, as in "deuda de Japón" ("Japan debt"), where the head "deuda" ("debt") points to "de Japón" ("of Japan"). The new class, called "grupo 41", shows the way to identify the answers included at the bottom, with the documents classified according to the information retrieved, organized by relevance and placed in different tabs related to the word.

Applying now the dependency-based strategy, Fig. 3 shows the sub-hierarchy considered for retrieval purposes. Classes are already defined and separated into domain concepts such as "dineros" ("money"), "entidades" ("entities") or "países" ("countries"), whilst properties are similarly treated as concept features such as "tipo" ("type"), "tiempo" ("time") or "tamaño" ("size"). The hierarchy represents the organization of the relations between the concepts and their features. Here, the governors are represented by round blue shapes, as is the word "acción" ("share"), which is pointed to by the concept (footnote 9) "dineros" ("money") and is related, for example, to that of "entidades" ("entities"). Some of the properties related to it are "subyacente" ("underlying"), which is a "tipo" ("type") property, and "de febrero" ("of February"), which is a "tiempo" ("time") one.

A particular case occurs when the governor and governed words are both concepts in an extracted parse dependency. We then represent these in the same rectangular shape using a tag governor governed. So, "acción de Standard and Poor's" ("Standard and Poor's share") is associated to "dineros entidades" ("money entities"), the governor being "dineros" ("money") and the governed "entidades" ("entities"). In this way, the query word "acción" ("share") is a "dineros" ("money") concept which is related by means of arrows to "Standard and Poor's", which is an "entidades" ("entities") one. If the governor is a concept and the governed is a property, only the property is represented in the rectangular shape, without indicating the class of the concept. In this case, the query word "acción" ("share") is related to different kinds of properties, such as "de febrero" ("of February"), which is a "tiempo" ("time") property, and "subyacente" ("underlying"), which is a "tipo" ("type") one.

Figure 3: Sub-hierarchy for the query "acción" ("share") using a dependency-based strategy

6 see http://lucene.apache.org/

7 here represented by a rectangular yellow shape.

8 here represented by round blue shapes.

9 here represented in a rectangular shape.
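The expansion phase for a single-word query can be sketched as a breadth-first walk over related classes, with the distance from the initial class serving as the relevance criterion. The data structures, names and the max_distance parameter are hypothetical; handing the expanded terms to Lucene is deliberately left out, since the paper does not detail that interface.

```python
from collections import deque


def expand_query(word, word_class, class_links, class_terms, max_distance=1):
    """Return {term: distance} for terms in classes near the query word's class.

    word_class:  dict term -> class it belongs to
    class_links: dict class -> iterable of related classes
    class_terms: dict class -> iterable of terms indexed under that class
    """
    start = word_class[word]
    distance = {start: 0}
    queue = deque([start])
    while queue:
        cls = queue.popleft()
        if distance[cls] == max_distance:
            continue  # do not walk further than the allowed distance
        for neighbour in class_links.get(cls, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[cls] + 1
                queue.append(neighbour)
    expanded = {}
    for cls, dist in distance.items():  # classes are visited nearest first
        for term in class_terms.get(cls, ()):
            expanded.setdefault(term, dist)
    return expanded


# Toy hierarchy loosely modelled on the example of Fig. 3.
word_class = {"acción": "dineros"}
class_links = {"dineros": ["entidades"]}
class_terms = {"dineros": ["acción", "deuda"],
               "entidades": ["Standard and Poor's"]}
expanded = expand_query("acción", word_class, class_links, class_terms)
```

Terms at distance 0 share the class of the query word, and larger distances mark terms from related classes, so documents retrieved through them can be ranked lower.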

4 Conclusion

We have introduced an IR strategy based on intelligent indexing that benefits from semantic relations between concepts in the text collection. In contrast with previous work, we dynamically generate the conceptual structure serving as a basis for the IR module, which appears to be a promising approach for exploring new knowledge domains as well as providing the user with a more flexible technique. Although the primary purpose of this kind of hierarchy is not to classify documents but rather to order global concepts, linking them through linguistic expressions, deductions can nevertheless be made about the texts, and index creation is facilitated. This is important because it removes the human factor from decision-making, which is also reflected in the ability to specify the queries launched. In effect, these structures make it possible to infer correlations between notions present in the source text. This is crucial for refining queries so as to avoid the mistakes that classical search engines make because of phenomena such as polysemy or synonymy.

References

Aussenac-Gilles, Nathalie, Anne Condamines, and Sylvie Szulman. 2002. Prise en compte de l'application dans la constitution de produits terminologiques. In 2e Assises Nationales du GDR I3, Nancy (F).

Aussenac-Gilles, Nathalie and Josiane Mothe. 2004. Ontologies as background knowledge to explore document collections. In RIAO 2004, Avignon.

Bouaud, J., B. Bachimont, J. Charlet, and P. Zweigenbaum. 1995. Methodological principles for structuring an ontology.

Bourigault, D. and G. Lame. 2002. Analyse distributionnelle et structuration de terminologie, application à la construction d'une ontologie documentaire de droit. In TAL: Traitement automatique des langues, pages 129–150, vol 43, n 1, Paris, France. Hermès.

Bourigault, Didier, Nathalie Aussenac-Gilles, and Jean Charlet. 2004. Construction de ressources terminologiques ou ontologiques à partir de textes : un cadre unificateur pour trois études de cas. Revue d'Intelligence Artificielle (RIA), Numéro spécial sur les techniques informatiques de structuration de terminologies, M. Slodzian (Ed.), 18(1/2004):87–110.

Faure, D. and C. Nédellec. A corpus-based conceptual clustering method for verb frames and ontology acquisition. In Paola Velardi, editor, LREC workshop on Adapting lexical and corpus ressources to sublanguages and applications, pages 5–12.

Grefenstette, Gregory. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Norwell, MA, USA.

Harris, Z.S. 1968. Mathematical Structures of Languages. J. Wiley & Sons, USA.

Masolo, C. 2001. Ontology driven information retrieval. Report of the IKF (Information and Knowledge Fusion). Eureka project E!2235.

Petersen, Wiebke. 2001. A set-theoretical approach for the induction of inheritance hierarchies. Electr. Notes Theor. Comput. Sci., 53.

Sagot, B. and É. Villemonte de La Clergerie. 2006. Error mining in parsing results. In Proc. of the 21st Int. Conf. on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 329–336, Australia.

Szulman, S. and B. Biébow. 2004. Owl et terminae. In IC: Journées Francophones d'Ingénierie des connaissances, pages 41–52.

Vilares, J., M.A. Alonso, and M. Vilares. 2004. Morphological and syntactic processing for text retrieval. Lecture Notes in Computer Science, 3180:371–380.

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 51-58

received 30-01-08, accepted 03-03-08

Desarrollo de un Robot-Guía con Integración de un Sistema de Diálogo y Expresión de Emociones: Proyecto ROBINT

Development of a Tour-Providing Robot Integrating Dialogue System and Emotional Speech: ROBINT Project

Juan Manuel Lucas Cuesta, Rosario Alcázar Prior, Juan Manuel Montero Martínez, Fernando Fernández Martínez, Roberto Barra-Chicote, Luis Fernando D'Haro Enríquez, Javier Ferreiros López, Ricardo de Córdoba Herralde, Javier Macías Guarasa, Rubén San Segundo Hernández, José Manuel Pardo Muñoz

Grupo de Tecnología del Habla, UPM
Avenida Complutense s/n. 28040. Madrid

Abstract. This paper describes the integration of a spoken dialogue system into an autonomous robot conceived as an interactive element of a science museum, able to give guided tours and to hold simple dialogues with the visitors. To make its behaviour more attractive, the robot has been given features that humanize its interventions, namely gestural expressivity and emotional speech synthesis. The speech recognition module is speaker-independent (it can recognize utterances spoken by any person) and uses confidence measures to improve recognition performance, since they filter out a large share of spurious speech. The language understanding module uses a rule-based learning approach that allows the system to infer explicit information from a set of example utterances, so that no grammar or rule set has to be written beforehand to guide the understanding module. Both modules were previously evaluated on a task involving dialogues between our robot and a human speaker, namely the voice control of a HI-FI system with the robot as the interface, obtaining 95.9% Word Accuracy and 92.8% Concept Accuracy. For text-to-speech conversion, a set of segmental and prosodic modifications is applied to a neutral voice, which lets the robot synthesize speech with emotions such as happiness, anger, sadness or surprise. The reliability of these emotions has been measured in several perceptual experiments, which show identification rates above 70% for most of the tested emotions (87% for sadness, 79.1% for surprise).

Keywords: speech recognition, confidence measures, emotional speech synthesis.

ISSN 1135-5948

© Sociedad Española para el Procesamiento del Lenguaje Natural

1. Introduction

Interaction between humans and machines has gone from being a research paradigm to being a reality that takes place at several levels. The most basic level of interaction, closer to the machine than to the person, has been in use for decades (through devices such as keyboards, which generate commands that the machine must interpret). However, the most interesting field is the development of platforms that allow interaction at levels closer to those that humans use intuitively, such as voice or body language.

If human-machine interaction is understood as establishing communication between a human being and a robot, we find robots that carry out tasks involving a large number of interactions with humans other than their programmers. Thus, (Fong, Nourbakhsh and Dautenhahn, 2003) define social robots as those in which human-machine interaction plays a relevant role. At present such robots are still at a research stage, although they have already been deployed in certain contexts, notably as museum guides (Willeke, Kunz and Nourbakhsh, 2001), (disam, 2008) or for the rehabilitation of hospitalized children (Plaisant et al., 2000), (Saldien et al., 2006).

Depending on the complexity of the scenario in which the interaction takes place, (Breazeal, 2003) classifies social robots into four groups: socially evocative, social interface, socially receptive, and sociable. According to this classification, our robot is of the socially receptive type, since it must allow natural interaction with the museum visitors and respond differently to their different interventions.

The robot is targeted at one of the largest groups of museum visitors, namely school-age children. There are several reasons for considering this group. First, it is a population whose spoken interventions are more spontaneous. In addition, school groups usually make this kind of visit on a compulsory basis, which makes it difficult to hold their attention throughout the visit, especially if it includes excessively long presentations (Willeke, Kunz and Nourbakhsh, 2001).

Robots able to interact with children already exist. They are mostly therapy systems for children who are hospitalized (Plaisant et al., 2000, Saldien et al., 2006, Shibata et al., 2001) or who have behavioural conditions such as autism (Dautenhahn and Werry, 2000). These robots usually take the form of pets, with a set of sensors and actuators that let them respond to the stimuli produced by the children's activity.

As for systems able to tell a story, (Silva, Vala and Paiva, 2001) develop a virtual agent, while (Druin et al., 1999) and (Plaisant et al., 2000) analyse a storytelling robot applied to paediatric rehabilitation. In our case, the storytelling system will be more expressive, thanks to its expression of emotions such as happiness, sadness or anger through the voice, so that the children can perceive the emotion throughout the robot's interventions.

The aim is therefore to give the robot the ability to recognize the speech of any person and to generate expressive synthetic speech. At a higher level, the robot should be able to tell stories, modifying its voice according to the context of the narration or to the interventions of its human interlocutors.

This paper is structured as follows. Section 2 presents the physical platform that supports the face and arm structures, as well as the robot's localization system. Section 3 is devoted to the blocks that make up the dialogue system and the tests carried out on them. Section 4 presents the conclusions drawn from the work and outlines possible future lines of research.

2. Physical architecture and guidance system

The robot consists of a mobile platform on which a structure has been built to support the robot's face and arm. The displacement applied to the eyelids, lips and arm can be modified according to the emotion to be expressed, for example by raising the eyebrows to indicate surprise or pursing the lips to denote sadness. The structure carries two embedded processors. The first handles guidance, map building and robot motion. To do so, it uses a technique known as SLAM (Simultaneous Localization and Mapping), developed in (Rodríguez-Losada, 2004) and (drodri, 2008), which lets it determine its position in real time. The second handles part of the dialogue tasks. The synthesized speech is output through two loudspeakers built into the platform. In addition, a laptop with a microphone attached runs the speech recognition module. The laptop and the robot communicate through sockets over a radio Ethernet link.

3. Dialogue system

The goal of a dialogue system is to establish a spoken interaction with a human interlocutor for a twofold purpose: on the one hand, to interpret the user's intervention in order to identify the services requested and, on the other, to provide those services and give the user information about their outcome. In our case, the aim is for the museum visit to proceed satisfactorily, not being limited to the tour itself but including other educational activities, such as games or educational stories.

Figure 1: architecture of a dialogue system

The blocks that make up a dialogue system are: the speech recognition module, which determines the written transcription of the sentence uttered by the speaker and evaluates it by estimating a set of confidence measures; the natural language understanding system, which extracts the relevant concepts from that text; the dialogue manager, which decides the actions to perform from the extracted concepts and generates the output concepts for the user; the response generation block, which produces a comprehensible text from the dialogue manager's concepts; and the text-to-speech converter, which generates an utterance reproducing the text delivered by the response generator.

3.1. Speech recognition

The speech recognition module can recognize speech in Spanish and English, but only the Spanish system is used in this project. First, it must be determined whether a valid acoustic signal is present at the system input, that is, whether the microphone is receiving something other than ambient noise. If so, the significant parameters of the signal are extracted (Huang, Acero and Hon, 2001) by analysing it frame by frame and computing the perceptual linear prediction (PLP) coefficients and the signal energy in each frame, plus their first- and second-order derivatives, yielding a vector of 39 parameters per frame.
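To make the arithmetic of the 39-parameter vector concrete, the sketch below stacks static features with their first- and second-order derivatives. It assumes 13 static values per frame (for instance 12 PLP coefficients plus energy, a split the paper does not state) and approximates derivatives with a simple central difference; the actual front end may use a different regression window.

```python
def deltas(frames):
    """Central-difference derivative of each coefficient along time.

    frames: list of equal-length lists of floats; edge frames are replicated.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_features(static):
    """Concatenate static, first-order and second-order features per frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Five toy frames of 13 static parameters each (hypothetical values).
static = [[float(t)] * 13 for t in range(5)]
features = stack_features(static)
```

Each resulting frame has 13 x 3 = 39 values, matching the vector size given in the text.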

53

J.M. Lucas, R. Alcázar, J. M. Montero, F. Fernández, R.Barra-Chicote, L.F. D'Haro, J. Ferreiros, R. de Córdoba, J. Macías-Guarasa, R. San Segundo, J.M. Pardo

etapas del sistema, pero la más empleada es la basada en medidas de confianza, es decir, valores de mérito que informan al propio sistema del grado de bondad que alcanzan sus hipótesis. Siguiendo el trabajo presentado en (Ferreiros et al., 2005), la medida de confianza empleada se basa en la obtención de un grafo de palabras y la evaluación de la pureza de cada una de las mismas, entendida como la fracción de hipótesis en el grafo que incluyen una palabra concreta en un instante dado. Mediante el establecimiento de un umbral de confianza, se fija un primer nivel de control de corrección de palabras reconocidas: si una palabra ha sido reconocida con una confianza inferior al umbral, no se tendrá en cuenta en etapas posteriores del sistema de diálogo (como, por ejemplo, en el módulo de comprensión). Además de la confianza de cada palabra, se calcula el valor de la confianza media para toda la frase. Este valor se obtiene mediante la ponderación de la contribución de cada palabra por el número de tramas que ocupa, valor que da una idea de la duración de dicha palabra. Este cálculo se ha planteado teniendo en cuenta que las palabras más largas suelen incluir información importante (y, por tanto, son de especial relevancia para etapas posteriores del sistema de diálogo). Las pruebas realizadas muestran una mejora significativa en el sistema de comprensión de lenguaje cuando se adopta esta modificación en el sistema de reconocimiento (Ferreiros et al., 2005; Sama et al., 2005).

The speech recognizer was developed in-house and is based on hidden Markov models (HMM) with three states per allophone. A language model helps to limit the number of hypotheses among which the recognizer must choose at each instant in order to determine the most probable word sequence being received. The model currently used is based on bigrams, that is, it models the probability of each word conditioned on the preceding word. An important advance over the URBANO project is the use of close-talk microphones to capture the acoustic signal. This has brought, on the one hand, a significant reduction in background noise (from about 45 dB to about 30 dB) and, on the other, fewer "false match" errors (deciding that an acoustic signal is present at the input when there is only background noise), which make the recognizer assume that a word has been uttered and thus add to the confusion of the system. The recognizer is evaluated by computing, as figures of merit over a set of test utterances, the fraction of correctly recognized words, the fraction of wrong words (percentage of substitutions), and the fractions of inserted and deleted words. The sum of substitutions, insertions and deletions is known as the error rate (ER) of the recognizer, and its complement (that is, 100% - ER) is known as Word Accuracy (WA). To estimate the WA of our system, the robot was used as the interface for controlling a simple home-automation system, namely a Hi-Fi set (Fernández et al., 2005). This guarantees a small vocabulary (around 500 different words), which makes recognition more reliable than with larger vocabularies, since the system has to decide among fewer hypotheses.
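The definition of ER and WA given above can be made concrete with a small sketch. It counts substitutions, insertions and deletions with a standard minimum-edit alignment between a reference and a recognized word sequence; the function names are hypothetical, the reference is assumed to be non-empty, and this is not the evaluation tool used by the authors.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, insertions, deletions) of a minimum-edit alignment."""
    # cost[i][j] = (total errors, subs, ins, dels) for ref[:i] against hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, n, d)          # correct word
            else:
                diagonal = (e + 1, s + 1, n, d)  # substitution
            e, s, n, d = cost[i][j - 1]
            insertion = (e + 1, s, n + 1, d)
            e, s, n, d = cost[i - 1][j]
            deletion = (e + 1, s, n, d + 1)
            cost[i][j] = min(diagonal, insertion, deletion)
    _, subs, ins, dels = cost[-1][-1]
    return subs, ins, dels


def word_accuracy(ref: List[str], hyp: List[str]) -> float:
    """WA = 100% - ER, with ER = (S + I + D) / number of reference words."""
    subs, ins, dels = align_counts(ref, hyp)
    error_rate = 100.0 * (subs + ins + dels) / len(ref)
    return 100.0 - error_rate


# Hypothetical utterance: one substitution in five reference words.
ref = "sube el volumen del equipo".split()
hyp = "sube el volumen de equipo".split()
wa = word_accuracy(ref, hyp)
```

Because insertions count as errors, WA computed this way can be lower than the plain fraction of correctly recognized words.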
Tests on a reference set of 1200 sentences comprising 6185 words in total yield WA values of about 95.86%. Although this value is useful to a human evaluator, the error rate gives the dialogue system itself very little information. Several sources of information have been proposed across the different

3.2. Natural language understanding

The language understanding module receives as input the hypothesis that the speech recognizer has selected as the most likely utterance of the speaker, from which it must extract the key concepts it contains. To determine which concepts a given utterance contains, different word categories must be established, that is, groups of words with common characteristics, drawn from a set of training sentences. Moreover, the classification of a word depends not only on the word itself but also on the context in which it occurs. Words can either be categorized manually by an expert or


maintaining a frame with two types of fields, called attribute and value. In the attribute fields, the system keeps track of the concepts of interest for the task it is performing at that moment. In the value fields, the manager stores the words that the understanding module has labelled as one of the concepts in the attribute list. If the system cannot fill all the fields from a single utterance by the speaker, the dialogue manager sends the response generator one or more concepts that still have no associated value, so that the user is asked for that information. The response generator applies the appropriate templates to those concepts to build an utterance the user can understand, and passes it to the text-to-speech converter to be synthesized, thereby establishing a dialogue with the human interlocutor. The dialogue continues until the robot has all the data it needs to perform the desired action.
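The attribute/value frame and the request for missing concepts can be sketched as follows. This is a minimal illustration, not the dialogue manager of the project: the class, the attribute names and the templates are all hypothetical.

```python
from typing import Dict, List, Optional


class Frame:
    """Attribute/value frame: one slot per concept of interest for the task."""

    def __init__(self, attributes: List[str]) -> None:
        self.slots: Dict[str, Optional[str]] = {a: None for a in attributes}

    def fill(self, concepts: Dict[str, str]) -> None:
        """Store values for concepts that appear in the attribute list."""
        for attribute, value in concepts.items():
            if attribute in self.slots:
                self.slots[attribute] = value

    def missing(self) -> List[str]:
        """Attributes that still have no associated value."""
        return [a for a, v in self.slots.items() if v is None]


# Hypothetical response templates, one per attribute.
TEMPLATES = {
    "action": "What would you like me to do?",
    "device": "Which device do you mean?",
    "level": "Which level would you like?",
}


def next_prompt(frame: Frame) -> Optional[str]:
    """Return a question for the first unfilled slot, or None when the frame is complete."""
    missing = frame.missing()
    if not missing:
        return None
    return TEMPLATES.get(missing[0], f"Please tell me the {missing[0]}.")


frame = Frame(["action", "device", "level"])
frame.fill({"action": "set", "device": "volume"})  # concepts from a first utterance
prompt = next_prompt(frame)                        # asks for the missing level
```

Once `next_prompt` returns `None`, every slot has a value and the action can be carried out.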

be classified automatically by means of a set of rules. The first method has the advantage of classifying each word accurately, whereas the second allows a fixed number of classes to be set and is much faster, although it makes it harder for the classification to follow the semantics of the language, which the first method allows. Once the categories to which each word may belong are known, the understanding module evaluates the recognized utterance and obtains a series of concepts that are passed to the dialogue manager. As figures of merit, concept-level confidence measures are obtained, together with the concept hit rate, or Concept Accuracy (CA). To avoid ambiguities in the most common words of the vocabulary, the notion of unreliable words was included in the computation of confidence measures: these are words that have no category of their own but help to define the category of the words they accompany. This group includes determiners, prepositions and conjunctions. When the confidence of a set of concepts is estimated, unreliable words are excluded from the calculation, so that only categorized words are taken into account. This gives a better estimate of the confidence measures, since it removes words that not only carry no information but are also more easily confused with one another. The complete understanding module, like the speech recognizer, was evaluated using the robot as the interface for home-automation control of a Hi-Fi set. The CA obtained was 92.78%.

3.4. Text-to-speech converter

The text-to-speech converter generates an utterance from the text supplied by the response generator. To do so, it uses a set of prosodic parameters: pitch, that is, the frequency perceived as the fundamental frequency of vocal-fold vibration; intensity, or signal energy; and the duration of each sound. One of the goals set at the start of this project was to make the behaviour of the robot as human-like as possible. An essential means to that end is to give it a more expressive voice that can convey emotions, accompanied by facial and arm gestures that reinforce the expression conveyed by the voice. The highest-quality emotional speech synthesis is obtained by concatenating acoustic units (usually diphones) taken from a large corpus of recorded speech in which actors express different emotions. However, we have chosen to synthesize by modifying the formants of the neutral voice, for several reasons. First, because mathematical modelling of the voice makes it possible to apply any kind of

3.3. Dialogue manager

The dialogue manager has two tasks. On the one hand, starting from the concepts extracted by the understanding module, it must generate a series of actions to be carried out by the system (in our case, the robot). On the other, the manager has to determine the concepts of any spoken response from the robot, which is rendered through the text-to-speech system. The dialogue manager is frame-based. This approach consists of


Simulated emotion   Identified emotion
                    Joy      Cold anger   Surprise   Sadness   Neutral   Other
Joy                 53.9%    9.6%         20.9%      -         7.8%      7.8%
Cold anger          7%       70.4%        14.8%      2.6%      3.5%      1.7%
Surprise            17.4%    2.6%         79.1%      -         -         0.9%
Sadness             -        1.7%         -          87%       10.4%     0.9%
Neutral             1.7%     3.5%         2.6%       7.8%      83.5%     0.9%

Table 1: Confusion matrix of synthesized emotions.

modification to the generated signal, so that a voice expressing a given emotion can be obtained from a neutral speech signal. In addition, this method does not need as large a corpus as the previous one: it only requires a set of neutral-voice sentences, on which the relevant modifications are performed, and a small group of sentences with the emotions to be synthesized, from which the parameters for adapting the neutral voice to the target emotion are obtained. It is then enough to apply a series of modifications to the prosodic elements of the original voice. (Barra et al., 2006) analyses the characteristics of four basic emotions (joy, sadness, surprise and anger) and identifies the features that make it possible to synthesize an emotion from neutral speech. The modifications applied to the neutral voice depend on the emotion to be synthesized:

utterance, and a longer duration of stressed syllables. Finally, anger is an emotion with a strong non-vocal component, since it is almost always accompanied by body gestures. The proposed modification consists of increasing the intensity of stressed syllables and widening the pitch range. In addition, to simulate the restrained, trembling voice typical of cold anger, an additive noise source synchronous with the pitch has been added.

This synthesis system was evaluated by presenting a group of listeners with a set of sentences synthesized with different emotions and asking them to identify the emotion that, in their opinion, the speaker was expressing. The emotion had to be chosen from a closed set comprising the synthesized emotions plus the neutral voice. The results of this evaluation are shown in Table 1. Confusion is especially high between joy and surprise. The reason is that surprise is a brief, transitory state: to convey surprise over a long utterance, the modifications to the original voice have to be sustained throughout, and those modifications are very similar to the ones applied to synthesize joy, so the mutual confusion between the two emotions increases significantly. The voice expressing sadness, in turn, is judged by listeners to be very well achieved, since it is hardly ever confused with other emotions.

Joy requires a modification of the bandwidth of the original signal, together with a rise in pitch and in its range of variation, and a faster speaking rate. Sadness requires a slower delivery of the synthesized sentence and a lower signal intensity, as well as a narrower effective bandwidth. A further improvement consists of modifying the pitch by adding jitter, a small variability in pitch, so as to simulate the trembling voice of someone close to tears. Surprise is especially difficult to synthesize, because it is a transitory emotion that quickly evolves into another one. The modifications made consist of an increase in both the pitch and its range of variation, to a greater degree than for joy. In addition, a fundamental-frequency contour is proposed that rises towards the end of the

4. Conclusions

In the light of the results presented in this work, and of the subjective results obtained when using the robot in a real setting, carrying out the proposed activities with several groups of schoolchildren aged 3 to 11, we can state that the performance of the different modules that


contribute to making it more expressive, changing position simultaneously with speech synthesis and thus humanizing its interventions. Tests with several groups of schoolchildren showed that emotion identification is enhanced when the emotion is expressed not only through the voice but also through body gestures. In short, the robot has succeeded in generating greater interest in the setting of a science museum.

make up our robot makes it ideal for a highly interactive role in a science museum, not as a substitute for a human guide but as one more element of the museum, endowed with a high capacity for interaction with visitors. The robot performs optimally in a controlled environment (such as one of the rooms of the museum) thanks to the navigation system. Controlling the environment also allows a small vocabulary to be used, which keeps the number of alternatives in the language model of the speech recognizer under control. The basic confidence measure has been modified by defining weighted confidences and unreliable words. All these confidence measures are independent of the task at hand, so they can remain active in any environment in which the speech recognizer is to be used. Tests on the system show that the modified computation of confidence measures, together with the use of a close-talk microphone, has contributed substantially to improving the rates of the speech recognizer and of the understanding system, which lets the robot respond to human turns more effectively, without having to ask the interlocutor again. The ability of the understanding module to learn gradually from the examples presented at its input ensures very high Concept Accuracy rates in controlled environments, and requires no prior grammar or set of rules for inferring the concepts of a sentence. Adding emotions to the synthesized voice has proved very effective in making the interactions of the robot with groups of children more appealing. The modifications to the synthesizer parameters (mean values and ranges of pitch, amplitude, and so on) have produced a speech signal capable of expressing emotions.
The evaluation of this synthetic voice shows that the proposed modifications lead to significant rates of emotion recognition by untrained listeners. The movements of the arm and face of the robot (eyelids and lips) have been made to

Acknowledgements

This work has been partially funded by the Spanish Ministry of Education and Science (Ministerio de Educación y Ciencia) under contracts DPI2007-66846C02-02 (ROBONAUTA) and DPI2004-07908C02 (ROBINT), and by UPM_CAM under contract CCG06-UPM/CAM-516 (ATINA). The authors wish to thank Nuria Pérez Magariños for her collaboration, as well as Ramón Galán and Diego Rodríguez-Losada, who are responsible for the structure and guidance of the robot, for their work.

References

Barra, R., Montero, J.M., Macías, J., D’Haro, L.F., San Segundo, R. and Córdoba, R., ‘Prosodic and Segmental Rubrics in Emotion Identification’. Proceedings of the IEEE International Conference in Acoustics, Speech and Signal Processing (ICASSP’06). pp. 1085-1088. 2006.

Breazeal, C., ‘Toward Sociable Robots’. Robotics and Autonomous Systems, n 42. pp. 167-175. 2003.

Dautenhahn, K. and Werry, I., ‘Issues of Robot-Human Interaction Dynamics in the Rehabilitation of Children with Autism’. Proceedings of the Sixth International Conference on the Simulation of Adaptive Behavior (SAB2000). pp. 519-528. 2000.

Druin, A., Montemayor, J., Hendler, J., McAlister, B., Boltman, A., Fiterman, E., Plaisant, A., Kruskal, A., Olsen, H., Revett, I., Plaisant Schwenn, T., Sumida, L. and Wagner, R., ‘Designing PETS: a Personal Electronic Teller of Stories’. Human Factors in Computing Systems (CHI 99). ACM Press. pp. 326-329. May 1999.

Fernández, F., Ferreiros, J., Sama, V., Montero, J.M., San Segundo, R., Macías, J. and García, R., ‘Speech Interface for Controlling an Hi-Fi Audio System based on a Bayesian Belief Networks Approach for Dialog Modeling’. Proceedings of the 9th Conference on Speech Communications and Technology (INTERSPEECH 2005). pp. 3421-3424. September 2005.

Ferreiros, J., San Segundo, R., Fernández, F., D’Haro, L.F., Sama, V., Barra, R. and Mellén, P., ‘New Word-Level and Sentence-Level Confidence Scoring using Graph Theory Calculus and its Evaluation on Speech Understanding’. In Proceedings of the 9th Conference on Speech Communication and Technology (INTERSPEECH 2005). pp. 3377-3380. September 2005.

Fong, T., Nourbakhsh, I. and Dautenhahn, K., ‘A Survey of Socially Interactive Robots’. Robotics and Autonomous Systems, n 42. pp. 143-166. 2003.

Huang, X., Acero, A. and Hon, H., ‘Spoken Language Processing. A Guide to Theory, Algorithm and System Development’. Prentice Hall. New Jersey. 2001.

Plaisant, C., Druin, A., Lathan, C., Dakhane, K., Edwards, K., Maxwell Vice, J. and Montemayor, J., ‘A Storytelling Robot for Pediatric Rehabilitation’. Proceedings of the Fourth International ACM Conference on Assistive Technologies. pp. 50-55. 2000.

Rodríguez-Losada, D., ‘SLAM Geométrico en Tiempo Real para Robots Móviles en Interiores basado en EKF’. PhD Thesis (Unpublished). Escuela Técnica Superior de Ingenieros Industriales. Universidad Politécnica de Madrid. 2004.

Saldien, J., Goris, K., Vanderborght, B., Verrelst, B., Van Ham, R. and Lefeber, D., ‘ANTY: The Development of an Intelligent Huggable Robot for Hospitalized Children’. Vrije Universiteit Brussel (http://anty.vub.ac.be). 2006.

Sama, V., Ferreiros, J., Fernández, F., San Segundo, R., Pardo, J.M., ‘Utilización de medidas de confianza en sistemas de comprensión del habla’. Procesamiento del Lenguaje Natural Nº 35, pp. 229-234, ISSN 1135-5948. September 2005.

Shibata, T., Mitsui, T., Wada, K., Touda, A., Kumasaka, T., Tagami, K. and Tanie, K., ‘Mental Commit Robots and its Application to Therapy of Children’. Proceedings of the IEEE/ASME International Conference on Advanced Intelligence Mechatronics. pp. 1053-1058. 2001.

Silva, A., Vala, M. and Paiva, A., ‘Papous: the Virtual Storyteller’. Intelligent Virtual Agents. Springer. 2001.

Willeke, T., Kunz, C. and Nourbakhsh, I., ‘The History of the Mobot Museum Robot Series: An Evolutionary Study’. American Association for Artificial Intelligence (www.aaai.org). 2001.

disam http://www.disam.upm.es/control/, 2008.

drodri http://www.disam.upm.es/~drodri/, 2008.

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 59-66

received 30-01-08, accepted 03-03-08

Experiments with an ensemble of Spanish dependency parsers∗
Experimentos con un sistema combinado de analizadores sintácticos de dependencias para el español

Roser Morante Vallejo
Tilburg University
Postbus 90153, 5000 LE Tilburg, The Netherlands

Abstract: This article presents an ensemble system for dependency parsing of Spanish that combines three machine-learning-based dependency parsers. The system operates in two stages. In the first stage, each of the three parsers analyzes an input sentence and produces a dependency graph. In the second stage, a voting system distills a final dependency graph out of the three first-stage dependency graphs.

Keywords: Dependency parsers, ensemble system, MaltParser, memory-based learning.

1 Introduction

This article presents the results of experiments with an ensemble system for dependency parsing of Spanish. The system has been developed as part of the project Técnicas semiautomáticas para el etiquetado de roles semánticos en corpus del español, which focuses on researching semiautomatic techniques for semantic role labeling. The final goal of the project is to annotate a seventy-million-word corpus with semantic roles, starting from an eighty-thousand-word training corpus. It is well known that semantic role labelers that use syntactic information perform better, which is why the project needs a parser that performs as accurately as possible. Since parser combination has been shown to improve on the performance of individual parsers (Henderson and Brill, 1999; Zeman and Žabokrtský, 2005; Sagae and Lavie, 2006), experimenting with an ensemble of parsers that integrates one of the best dependency parsers for Spanish (MaltParser) seemed to be an appropriate first step.

The system combines three machine-learning-based dependency parsers: Nivre's MaltParser (Nivre, 2006; Nivre et al., 2006), Canisius' memory-based constraint-satisfaction inference parser (Canisius and Tjong Kim Sang, 2007), and a new memory-based parser that operates with a single word-pair relation classifier. As in Sagae and Lavie (2006), the ensemble system operates in two stages. In the first stage, each of the three parsers analyzes an input sentence and produces a dependency graph. The unlabeled attachment scores in this stage range from 82 to 87%, according to the evaluation metrics used in the CoNLL Shared Task 2006 (Buchholz and Marsi, 2006). In the second stage, a voting system distills a final dependency graph out of the three first-stage dependency graphs. The system achieves a 4.44% error reduction over the best parser.

∗ This research has been funded by the postdoctoral grant EX2005–1145 awarded by the Ministerio de Educación y Ciencia of Spain to the project Técnicas semiautomáticas para el etiquetado de roles semánticos en corpus del español.

© Sociedad Española para el Procesamiento del Lenguaje Natural


The results presented here are preliminary. Because MaltParser performs substantially better than the other two parsers, the results of the ensemble do not improve significantly over those of MaltParser. Consequently, more parsers will have to be added to the ensemble, and additional combination techniques will have to be explored. The article is structured as follows. The corpus used is described in Section 2. Section 3 presents the parsers that were integrated in the ensemble, which is introduced in Section 4. The results are reported in Section 5, and compared to related work in Section 6. Finally, some conclusions are put forward in Section 7.

2 The Cast3LB–CoNLL corpus of Spanish

The experiments described in this paper were carried out on the Cast3LB–CoNLL Corpus of Spanish (Morante, 2006), which is a revised version of the Cast3LB treebank (Civit, Martí, and Bufí, 2006; Civit, 2003; Navarro et al., 2003) used in the CoNLL Shared Task 2006 (Buchholz and Marsi, 2006). It contains 89199 words in 3303 sentences. As for verbs, it contains 11023 forms and 1443 lemmas, and not all verbs are equally frequent¹. Table 1 shows an example sentence of the corpus.

N.  FORM           LEMMA          CPOS  POS  FEATS                     HEAD  DEP.REL
1   Asimismo       asimismo       r     rg                             2     MOD
2   defiende       defender       v     vm   num=s|per=3|mod=i|tmp=p   0     ROOT
3   la             el             d     da   num=s|gen=f               4     ESP
4   financiación   financiación   n     nc   num=s|gen=f               2     CD
5   pública        pública        a     aq   num=s|gen=f               4     CN
6   de             de             s     sp   for=s                     4     CN
7   la             el             d     da   num=s|gen=f               8     ESP
8   investigación  investigación  n     nc   num=s|gen=f               6
9   básica         básico         a     aq   num=s|gen=f               8     CN
10  y              y              c     cc                             2     CTE
11  pone           poner          v     vm   num=s|per=3|mod=i|tmp=p   10    CDO
12  de             de             s     sp   for=s                     11    CC
13  manifiesto     manifiesto     n     nc   gen=m|num=s               12
14  que            que            c     cs                             18
15  las            el             d     da   gen=f|num=p               16    ESP
16  empresas       empresa        n     nc   gen=f|num=p               18    SUJ
17  se             él             p     p0   per=3                     18
18  centran        centrar        v     vm   num=p|per=3|mod=i|tmp=p   11    CD
19  más            más            r     rg                             20
20  en             en             s     sp   for=s                     18    CREG
21  la             el             d     da   num=s|gen=f               22    ESP
22  I+D            I+D            n     np                             20
23  con            con            s     sp   for=s                     18    CC
24  objetivos      objetivo       n     nc   gen=m|num=p               23
25  de             de             s     sp   for=s                     24    CN
26  mercado        mercado        n     nc   gen=m|num=s               25
27  .              .              F     Fp                             2     PUNC

Table 1: Example sentence of the revised Cast3LB–CoNLL corpus of Spanish.

¹ 1369 verbs appear fewer than 20 times; 54 verbs, from 20 to 50 times; 12 verbs, 50 to 100 times: tratar (51), dejar (53), acabar (55), pasar (59), parecer (62), seguir (62), quedar (67), encontrar (68), llevar (68), poner (68), deber (75), querer (78), dar (86); 6 verbs, from 100 to 300 times: saber (101), llegar (107), ver (121), ir (132), decir (210), tener (243), hacer (253), poder (282), estar (296); and 2 verbs appear more than 800 times: ser, 1348 times, and haber, 812 times.

As in the CoNLL Shared Task 2006, sentences are separated by a blank line and fields are separated by a single tab character. A sentence consists of tokens, each one starting on a new line. A token consists of the following 8 fields that contain information about morphosyntactic features and non-projective dependencies:

1. ID: token counter, starting at 1 for each new sentence.
2. FORM: word form or punctuation symbol.
3. LEMMA: lemma of word form.
4. CPOSTAG: coarse-grained part-of-speech tag.
5. POSTAG: fine-grained part-of-speech tag.
6. FEATS: unordered set of syntactic and/or morphological features, separated by a vertical bar. If features are not available, the value of the feature is an underscore. The complete description of the CPOSTAG, POSTAG, and FEATS tags can be found in Civit (2002).
7. HEAD: head of the current token, which is either a value of ID or zero ('0') for the sentence root.
8. DEPREL: dependency relation to the HEAD. The set of tags is described in Morante (2006).
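The eight-field layout listed above can be read with a few lines of code. The sketch below assumes exactly the eight tab-separated fields named in the list, blank lines between sentences, and well-formed `name=value` pairs in FEATS; the function names are hypothetical, and the sample is truncated to two tokens purely for illustration.

```python
from typing import Dict, List

FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "FEATS", "HEAD", "DEPREL"]


def parse_feats(value: str) -> Dict[str, str]:
    """Split 'num=s|gen=f' into a dictionary; an underscore means no features."""
    if value == "_":
        return {}
    return dict(item.split("=", 1) for item in value.split("|"))


def read_sentences(text: str) -> List[List[Dict[str, object]]]:
    """Parse tab-separated tokens; a blank line closes the current sentence."""
    sentences: List[List[Dict[str, object]]] = []
    current: List[Dict[str, object]] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token: Dict[str, object] = dict(zip(FIELDS, line.split("\t")))
        token["ID"] = int(str(token["ID"]))
        token["HEAD"] = int(str(token["HEAD"]))  # 0 marks the sentence root
        token["FEATS"] = parse_feats(str(token["FEATS"]))
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Illustrative two-token fragment, not a complete corpus sentence.
sample = (
    "1\tAsimismo\tasimismo\tr\trg\t_\t2\tMOD\n"
    "2\tdefiende\tdefender\tv\tvm\tnum=s|per=3|mod=i|tmp=p\t0\tROOT\n"
    "\n"
)
sentences = read_sentences(sample)
```

Keeping HEAD as an integer makes it straightforward to follow a token to its head by ID, with 0 reserved for the root.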

3 Single parsers

This section describes the parsers that were integrated into the ensemble system and their results.

3.1 MaltParser 0.4 (MP)

MaltParser 0.4² (Nivre, 2006; Nivre et al., 2006) is an inductive dependency parser that, according to Nivre et al. (2006), uses four essential components: a deterministic algorithm for building labeled projective dependency graphs; history-based feature models for predicting the next parser action; support vector machines for mapping histories to parser actions; and graph transformations for recovering non-projective structures. MaltParser participated in the CoNLL-X Shared Task on multilingual dependency parsing, obtaining the second-best results for Spanish (81.29% labeled attachment score). In these experiments we used the following model for Spanish. The learner type was support vector machines (LIBSVM (Chang and Lin, 2005)), with the same parameter options used by Nivre et al. (2006) in the CoNLL Shared Task 2006. The parsing algorithm was Nivre's, with the options arc order eager, shift before reduce, and allow reduction of unattached tokens.

[Table 2: feature model with one row per feature, giving the feature type (POS, FEATS, DEP, LEX, LEMMA or CPOS), the data structure it is read from (STACK or INPUT), and numeric values.]

Table 2: Model of the MaltParser used.

² Web page of MaltParser 0.4: http://w3.msi.vxu.se/~nivre/research/MaltParser.html.

3.2 Memory-based constraint satisfaction parser (MB1)

The memory-based constraint satisfaction parser (Canisius and Tjong Kim Sang, 2007) uses three memory-based classifiers that predict weighted soft-constraints on the structure of the parse tree. Each predicted constraint covers a small part of the complete dependency tree, and overlap between them ensures that global output structure is taken into account. A dynamic programming algorithm for dependency parsing is used to find the optimal solution to the constraint satisfaction problem thus obtained.

3.3 Memory-based single classifier parser (MB2)

The memory-based single classifier parser is a new parser developed for the experiments reported here. It consists of a single classifier that predicts the relation between two words in a sentence, and a decision heuristic that chooses among the dependency relations that the classifier has predicted for one word, based on information from the classifier output. Given two words, w1 and w2, the classifier predicts the direction of the dependency and the type of dependency at the same time. A dummy class NONE represents absence of


relation. For a sentence like El gato come pescado, the instances in the training corpus would be:

distance metric with global feature weights that account for relative differences in discriminative power of the features. The IB1 algorithm was parametrized by using Overlap as the similarity metric, Information Gain for feature weighting, k = 11 nearest neighbors, and weighting the class vote of neighbors as a function of their inverse linear distance (Daelemans et al., 2007). Because the classifier might predict more than one dependency relation for one word, a decision heuristic is applied in order to disambiguate. The decision heuristic uses information about the class distribution and the distance to the nearest neighbor produced by TiMBL.
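The classification rule described above can be sketched in a few lines. This is not TiMBL's implementation: the feature weights are supplied by hand instead of being computed as Information Gain, k is interpreted as the number of nearest distinct distances, and the inverse-linear vote uses the formula (d_k - d) / (d_k - d_1), which is an assumption about what "inverse linear distance" means here. The function names and the memory contents are hypothetical, and the memory is assumed to be non-empty.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Instance = Tuple[Sequence[str], str]  # (feature values, class label)


def overlap_distance(a: Sequence[str], b: Sequence[str], weights: Sequence[float]) -> float:
    """Weighted Overlap: add the weight of every feature whose values differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(memory: List[Instance], query: Sequence[str],
             weights: Sequence[float], k: int = 11) -> Dict[str, float]:
    """Return the weighted class votes of the neighbours at the k nearest distances."""
    scored = sorted(
        ((overlap_distance(feats, query, weights), label) for feats, label in memory),
        key=lambda pair: pair[0],
    )
    distances = sorted({d for d, _ in scored})[:k]
    nearest, farthest = distances[0], distances[-1]
    votes: Dict[str, float] = defaultdict(float)
    for d, label in scored:
        if d > farthest:
            break
        # Inverse-linear weighting: 1 at the nearest distance, 0 at the farthest.
        weight = 1.0 if farthest == nearest else (farthest - d) / (farthest - nearest)
        votes[label] += weight
    return dict(votes)


# Hypothetical memory of (POS of w1, POS of w2, direction) instances.
memory = [
    (("nc", "vm", "left"), "SUJ"),
    (("nc", "vm", "right"), "CD"),
    (("da", "nc", "left"), "ESP"),
    (("nc", "vm", "left"), "SUJ"),
]
weights = [0.5, 0.3, 0.2]  # stand-ins for Information Gain weights
votes = classify(memory, ("nc", "vm", "left"), weights, k=2)
predicted = max(votes, key=votes.get)
```

The returned dictionary plays the role of the class distribution that the decision heuristic consults.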

w1:el w2:gato features class
w1:el w2:come features class
w1:el w2:pescado features class
w1:gato w2:come features class
w1:gato w2:pescado features class
w1:come w2:pescado features class

An instance is composed of the following features:

• Lemma, POS, CPOS, gender, number, person, mode, tense of the focus word w1 and focus word w2, and of the two previous and two next words to the focus words.

Algorithm 1 Heuristics to filter the output of the classifier in MB1.

• Number of coordinative conjunctions, subordinate conjunctions, prepositions, punctuation signs, main verbs, auxiliary verbs, pronouns, relative pronouns, nouns, and adjectives.

if the predicted class is different than NONE then
    if there is not a NONE class among the nearest neighbors then
        if the distance is bigger than 6 then
            turn the prediction into NONE;
        else
            keep the predicted class and tag it with a “not-none” flag;
        end if
    else if there is a NONE class among the nearest neighbors then
        if its class distribution is bigger than 0.70, and the difference between the probability of the predicted class and the NONE class is lower than 3 then
            turn the prediction into NONE;
        else
            keep the predicted class and tag it with a “possible-none” flag;
        end if
    end if
else
    keep the NONE prediction;
end if

We performed 10-fold cross-validation experiments. Instances with the NONE class in the training corpus were downsampled in a 1:1 proportion. We use the IB1 classifier as implemented in TiMBL (version 6.0) (Daelemans et al., 2007), a supervised inductive algorithm for learning classification tasks based on the k-nearest neighbor classification rule (Cover and Hart, 1967). In IB1, similarity is defined by a feature-level distance metric between a test instance and a memorized example. The metric combines a per-feature value

In the first step the output of the classifier is filtered according to Algorithm 1. In the second step the dependency tree is reconstructed and the dependency relations are disambiguated, if more than one dependency is predicted for a word. The system gives preference to the class tagged with a “not-none” flag that has the lowest distance to the nearest neighbor. If no classes are tagged with the “not-none” flag, the system gives preference to the class tagged with a “possible-none” flag that has the lowest distance to the nearest neighbor.
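The two steps can be sketched as follows. The sketch follows Algorithm 1 literally and makes several assumptions: the class distribution is available as a dictionary of per-class values, the "difference" in the second test is taken as an absolute difference, and the units of the thresholds (6, 0.70, 3) are whatever the classifier reports, since the text does not specify them. Tree reconstruction is not covered, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

NONE = "NONE"


@dataclass
class Prediction:
    label: str                      # class predicted by the classifier
    distance: float                 # distance to the nearest neighbour
    distribution: Dict[str, float]  # class distribution among the nearest neighbours
    flag: Optional[str] = None      # "not-none" or "possible-none" once filtered


def filter_prediction(p: Prediction, max_distance: float = 6.0,
                      none_share: float = 0.70, max_margin: float = 3.0) -> Prediction:
    """First step: keep, flag or discard a prediction (after Algorithm 1)."""
    if p.label == NONE:
        return p
    if NONE not in p.distribution:
        if p.distance > max_distance:
            return Prediction(NONE, p.distance, p.distribution)
        return Prediction(p.label, p.distance, p.distribution, "not-none")
    margin = abs(p.distribution.get(p.label, 0.0) - p.distribution[NONE])
    if p.distribution[NONE] > none_share and margin < max_margin:
        return Prediction(NONE, p.distance, p.distribution)
    return Prediction(p.label, p.distance, p.distribution, "possible-none")


def choose(candidates: List[Prediction]) -> Optional[Prediction]:
    """Second step: prefer "not-none", then "possible-none", by lowest distance."""
    kept = [filter_prediction(c) for c in candidates]
    for flag in ("not-none", "possible-none"):
        flagged = [c for c in kept if c.flag == flag]
        if flagged:
            return min(flagged, key=lambda c: c.distance)
    return None


# Hypothetical competing predictions for the same word.
a = Prediction("SUJ", 2.0, {"SUJ": 5.0})
b = Prediction("CD", 1.0, {"CD": 4.0, "NONE": 0.5})
c = Prediction("CC", 8.0, {"CC": 3.0})
best = choose([a, b, c])
```

Here `b` is closer than `a`, but `a` wins because "not-none" predictions take precedence over "possible-none" ones.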

• Features that express if w2 is placed between w1 and the first coordination / main verb / preposition / noun / adjective to the right of w1.
• Features that express if w2 is placed between w1 and the second coordination / main verb / preposition / noun / adjective to the right of w1.
• Features that express if w1 is placed between w2 and the first coordination / main verb / preposition / noun / adjective to the left of w2.
• Features that express if w1 is placed between w2 and the second coordination / main verb / preposition / noun / adjective to the left of w2.
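The example instances given for "el gato come pescado" suggest that one instance is created for every pair of words (w1, w2) with w1 preceding w2. The sketch below reproduces that enumeration; the function name is hypothetical, and the rule "all pairs in sentence order" is inferred from the example rather than stated in the text.

```python
from itertools import combinations


def word_pairs(tokens):
    """Enumerate (w1, w2) focus-word pairs with w1 before w2 in the sentence."""
    return list(combinations(tokens, 2))


for w1, w2 in word_pairs("el gato come pescado".split()):
    print(f"w1:{w1} w2:{w2}")
```

A sentence of n words therefore yields n(n-1)/2 instances under this assumption, most of which would carry the NONE class.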


Experiments with an ensemble of Spanish dependency parsers

                     MP                      MB1                     MB2
DEPREL     n.train   rec     prec    F1      rec     prec    F1      rec     prec    F1
AP              64   45.31   54.72   49.57   40.62   50.00   44.82   51.56   55.00   53.22
ATR            142   79.58   84.96   82.18   79.58   75.33   77.39   80.28   73.55   76.76
AUX             92   95.65   93.62   94.62   86.96   93.02   89.88   90.22   86.46   88.29
CA             152   72.37   72.37   72.37   63.16   66.67   64.86   69.08   61.05   64.81
CAG              4   50.00   66.67   57.14   50.00   50.00   50.00   50.00   66.67   57.14
CC             660   71.67   63.15   67.14   53.64   54.29   53.96   48.48   61.19   54.09
CD             450   78.89   71.43   74.97   74.44   70.38   72.35   72.44   69.81   71.10
CDO            326   70.86   66.38   68.54   66.56   58.49   62.26   71.47   54.31   61.71
CI              67   56.72   79.17   66.09   50.75   68.00   58.12   53.73   60.00   56.69
CN            1171   82.49   80.10   81.27   81.81   72.80   77.04   83.60   73.33   78.12
CPRED.CD         9   33.33   75.00   46.15    0.00    0.00    0.00    0.00    0.00    0.00
CPRED.SUJ       28   57.14   72.73   63.99   42.86   70.59   53.33    0.00    0.00    0.00
CREG            83   57.83   67.61   62.33   33.73   66.67   44.79   51.81   55.13   53.41
CTE            263   61.22   62.65   61.92   55.13   54.51   54.81   55.51   54.28   54.88
ENUM             3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
ESP           1313   94.59   92.89   93.73   95.05   92.10   93.55   93.60   91.58   92.57
ET              68   41.18   54.90   47.06   41.18   65.12   50.45   29.41   31.75   30.53
IMPERS          11   81.82   69.23   75.00   63.64   87.50   73.68   81.82   75.00   78.26
MOD             50   42.00   72.41   53.16   36.00   66.67   46.75   42.00   48.84   45.16
NEG             76   84.21   88.89   86.48   85.53   89.04   87.24   85.53   82.28   83.87
PASS            35   85.71   90.91   88.23   48.57   68.00   56.66   85.71   88.24   86.95
PER             64   73.44   75.81   74.60   65.62   76.36   70.58   89.06   55.34   68.26
ROOT           331   91.54   71.46   80.26   76.74   74.05   75.37   61.03   85.59   71.25
SUJ            532   75.75   80.12   77.87   68.80   72.76   70.72   65.79   74.63   69.93
-             1896   82.70   90.64   86.48   80.80   84.69   82.69   81.75   84.19   82.95

Table 3: Precision, recall and F1 of MP, MB1 and MB2 per dependency relation.
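The F1 columns of Table 3 are consistent with the usual harmonic mean of precision and recall. The short sketch below recomputes one cell as a sanity check; the function name is hypothetical, and because the published precision and recall are already rounded, some cells may differ from the recomputed value in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0          # e.g. the ENUM row, where nothing is predicted correctly
    return 2 * precision * recall / (precision + recall)


# MP, relation AP: precision 54.72, recall 45.31
print(round(f1(54.72, 45.31), 2))
```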

3.4 Results of the individual parsers

Table 3 shows precision, recall, and F1 of each of the single parsers per syntactic function. The n.train column contains the number of instances that have a certain dependency relation in the train corpus. The MP has the best F1 for 19 of the 25 dependency relations. This fact indicates that it is difficult to improve over the MP results with the ensemble system. MB1 has the best F1 for dependency relations ET and NEG, and MB2 for AP and IMPERS.

       MP         MB1        MB2
LAS    80.45 %    75.74 %    75.44 %
UAS    87.42 %    82.44 %    82.75 %
LAc    85.12 %    81.95 %    81.35 %

Table 4: Results of the individual parsers.

The global results of the three parsers are shown in Table 4 in terms of Labeled Attachment Score (LAS), Unlabeled Attachment Score (UAS), and Label Accuracy (LAc) according to the evaluation metrics used in the CoNLL Shared Task 2006 (Buchholz and Marsi, 2006). The MP performs significantly better than MB1 and MB2, whereas MB1 and MB2 perform similarly in spite of the fact that their approach to memory-based learning is different: MB1 applies constraint satisfaction, and MB2 is based on only one classifier and heuristics that rely on the distance of the predicted class to the nearest neighbor and on the class distribution.

4 Ensemble dependency parser

The ensemble system operates in two stages. In the first stage, each of the three parsers analyzes an input sentence and produces a dependency graph. The results of the individual parsers were presented in Table 4 in the previous section. In the second stage, a voting system distills a final dependency graph out of the three first-stage dependency graphs. Voting techniques have been previously applied to dependency parsing (Sagae and Lavie, 2006; Zeman and Žabokrtský, 2005). We provide results of four different voting systems that take into account agreement among classifiers and/or the normalized F1 value of each classifier for each dependency relation:

• VS1: the system votes for the solution of the single classifier that has the higher F1 for the dependency relation that the single classifier predicts.

• VS2: the system votes for the solution of the MP, unless MB1 and MB2 agree, in which case the MB1 and MB2 solution is chosen.

• VS3: the system votes for the solution of the MP, unless MB1 and MB2 agree or the three parsers disagree. In the first case, the MB1 and MB2 solution is chosen, and in the second, the system votes for the solution of the single classifier that has the higher F1 for the syntactic function that the single classifier predicts.

• VS4: the system votes for system VS1 unless two single systems agree. In this case, the system votes for the solution agreed by them.

As Sagae and Lavie (2006) point out, “This very simple scheme guarantees that the final set of dependencies will have as many votes as possible, but it does not guarantee that the final voted set of dependencies will be a well-formed dependency tree”. We are aware of this limitation. Future research will focus on converting the resulting graph into a well-formed tree.

5 Results

The results of the different versions of the ensemble system are presented in Tables 5, 6, 7, and 8, as well as the improvement over the MP. Results show that combined systems VS1, VS2 and VS3 perform better than the best parser, although the difference is insignificant, since the error of the MP is reduced by less than 5% (4.44%). Combined system VS4 improves only in label accuracy over the results of the best system.

       VS1       VS1 vs MP
LAS    80.53%    +0.08
UAS    87.43%    +0.01
LAc    85.22%    +0.10

Table 5: LAS, UAS, and LAc of VS1.

       VS2       VS2 vs MP
LAS    81.04%    +0.59
UAS    87.68%    +0.26
LAc    85.71%    +0.59

Table 6: LAS, UAS, and LAc of VS2.

       VS3       VS3 vs MP
LAS    81.09%    +0.64
UAS    87.68%    +0.26
LAc    85.78%    +0.66

Table 7: LAS, UAS, and LAc of VS3.

       VS4       VS4 vs MP
LAS    79.71%    -0.74
UAS    86.07%    -1.35
LAc    85.92%    +0.80

Table 8: LAS, UAS, and LAc of VS4.

VS1 is the system that improves the least because the MP has the better F1 scores for 19 of the 25 dependency relations. That VS2 and VS3 do not improve significantly might be due to the fact that some agreement cases between MB1 and MB2 can be errors. VS3 is the voting system that performs best: by voting for the agreement between MB1 and MB2, or for the system with higher F1 in case of complete disagreement, more errors are eliminated than are introduced. For further research it would be interesting to analyze whether it is possible to eliminate more errors by introducing specific voting strategies per dependency relation. Table 9 shows that precision and recall in VS3 increase for some dependency relations (AP, ATR, CD, NEG, PASS, PER, SUJ), as compared to precision and recall per dependency relation of the MaltParser, although they also decrease for others (AUX, CC, ET).

             MP               VS3 (difference to MP)
DEPREL       rec     prec     rec      prec
AP           45.31   54.72    +7.81    +1.60
ATR          79.58   84.96    +4.93    +2.20
AUX          95.65   93.62    -1.08    -0.07
CA           72.37   72.37    +0.66    -4.69
CAG          50.00   66.67     0.00     0.00
CC           71.67   63.15    -5.76    -1.97
CD           78.89   71.43    +0.84    +3.99
CDO          70.86   66.38    +3.68    -1.23
CI           56.72   79.17    +2.98    -2.25
CN           82.49   80.10    +1.71    -2.03
CPRED.CD     33.33   75.00   -11.11    +0.25
CPRED.SUJ    57.14   72.73    -3.57    +6.22
CREG         57.83   67.61    -2.41    +1.05
CTE          61.22   62.65     0.00    -0.96
ENUM          0.00    0.00     0.00     0.00
ESP          94.59   92.89    +0.99    +0.06
ET           41.18   54.90    -2.94    -3.92
IMPERS       81.82   69.23     0.00   +12.59
MOD          42.00   72.41     0.00    +2.59
NEG          84.21   88.89    +2.63    +2.78
PASS         85.71   90.91    +5.72    +0.52
PER          73.44   75.81    +6.25    +0.31
ROOT         91.54   71.46    -1.21    +6.81
SUJ          75.75   80.12    +2.82    +2.65
-            82.70   90.64    +0.69    +0.33

Table 9: Recall and precision of VS3 compared to precision and recall of MP per dependency relation.

6 Related work

The related work we are aware of deals with languages other than Spanish. Zeman and Žabokrtský (2005) tested several approaches for combining dependency parsers for Czech. They found that the best method was accuracy-aware voting, which reduced the error of the best parser by 13%. Differences between their approach and ours are that they experiment with seven parsers, they perform stacking, and they check that the resulting structure is a well-formed tree.

Sagae and Lavie (2006) experiment with six parsers on the Wall Street Journal corpus. They apply a two-stage reparsing procedure focusing on unlabeled dependencies. In the first stage, m different parsers analyze an input sentence. In the second stage, a parsing algorithm is applied taking into account the analyses produced by each parser in the first stage. They reparse the sentence based on the output of the m parsers in order to maximize the number of votes for a well-formed dependency structure. Their experiments increase the accuracy of the best parser by 1.7%.

Nivre et al. (2007) combined the outputs of the parsers participating in the CoNLL Shared Task 2007 on dependency parsing using the method of Sagae and Lavie (2006). They show that accuracy never falls below the performance of the top three systems, although it degrades after ten different parsers have been added.

7 Conclusions and future research

In this paper we presented an ensemble system for dependency parsing of Spanish that combines three machine-learning-based dependency parsers. As far as we know, this is the first attempt to combine dependency parsers for Spanish. The results of the ensemble of parsers are only slightly better than the results of the best parser; the error reduction of the label accuracy score reaches 4.44%. This is due to the fact that there are only three parsers, one of which performs clearly better than the other two, which perform very similarly. The best results were obtained by the voting system that gives priority to the decisions of the best parser, unless the other two parsers agree, in which case their solution is chosen, or the three parsers disagree, in which case the system votes for the solution of the single classifier that has the higher F1 for the dependency relation that the single classifier predicts. We consider the results to be promising enough to continue our research. In the future we will integrate more parsers in the ensemble and we will explore additional combination techniques, like meta-learning, and additional voting strategies that allow us to build well-constructed trees.

Acknowledgements

We would like to thank Sander Canisius and Joakim Nivre for making their parsers available and for being very helpful. Thanks also to the three anonymous reviewers for their valuable comments.

References

Buchholz, S. and E. Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the X CoNLL Shared Task. SIGNLL.

Canisius, S. and E. Tjong Kim Sang. 2007. A constraint satisfaction approach to dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 1124–1128.

Chang, C.C. and C.J. Lin. 2005. LIBSVM: A library for support vector machines. URL: http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf.

Civit, M. 2002. Guía para la anotación morfológica del corpus CLiC-TALP (versión 3). X-TRACT-II WP-00-06, CliC–UB.

Civit, M. 2003. Guía para la anotación sintáctica de Cast3LB: un corpus del español con anotación sintáctica, semántica y pragmática. X-TRACT-II WP-03-06 y 3LB-WP-02-01, CliC–UB.

Civit, M., M.A. Martí, and N. Bufí. 2006. Advances in Natural Language Processing (LNAI, 4139), chapter Cat3LB and Cast3LB: from Constituents to dependencies, pages 141–153. Springer Verlag, Berlin.

Cover, T. M. and P. E. Hart. 1967. Nearest neighbor pattern classification. Institute of Electrical and Electronics Engineers Transactions on Information Theory, 13:21–27.

Daelemans, W., J. Zavrel, K. van der Sloot, and A. van den Bosch. 2007. TiMBL: Tilburg memory based learner, version 6, reference guide. Technical Report Series ILK 07-03, Tilburg University, Tilburg, The Netherlands.

Henderson, J. and E. Brill. 1999. Exploiting diversity in natural language processing: combining parsers. In Proceedings of the Fourth Conference on Empirical Methods in Natural Language Processing (EMNLP), College Park, Maryland.

Morante, R. 2006. Semantic role annotation in the Cast3LB-CoNNL-SemRol corpus. Induction of Linguistic Knowledge Research Group Technical Report ILK 0603, Tilburg University, Tilburg.

Navarro, B., M. Civit, M.A. Martí, R. Marcos, and B. Fernández. 2003. Syntactic, semantic and pragmatic annotation in cast3lb. In Proceedings of the Shallow Processing of Large Corpora (SProLaC) Workshop of Corpus Linguistics 2003, Lancaster, UK.

Nivre, J. 2006. Inductive Dependency Parsing. Springer.

Nivre, J., J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. 2007. The CoNLL-2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932, Prague.

Nivre, J., J. Hall, J. Nilsson, G. Eryigit, and S. Marinov. 2006. Labeled pseudo-projective dependency parsing with support vector machines. In Proceedings of the Tenth Conference on Computational Natural Language Learning, CoNLL-X, New York City, NY, June.

Sagae, K. and A. Lavie. 2006. Parser combination by reparsing. In Proceedings of the Human Language Technology Conference on the North American Chapter of the ACL, pages 129–132, New York. ACL.

Zeman, D. and Z. Žabokrtský. 2005. Improving parsing accuracy by combining diverse dependency parsers. In Proceedings of the International Workshop on Parsing Technologies, Vancouver, Canada.
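As a supplementary illustration of the voting schemes VS1 to VS4 described in Section 4, the following Python sketch applies them to the predictions of the three parsers for a single token. It is a minimal sketch under stated assumptions and not the system used in the experiments: a "solution" is taken to be a (head, label) pair, two parsers "agree" when their pairs are identical, ties in F1 fall back to the MP, and the F1 dictionary holds only a handful of values copied from Table 3.

```python
# Per-relation F1 of each parser (subset of Table 3, for illustration only).
F1 = {
    "MP":  {"SUJ": 77.87, "CD": 74.97, "ET": 47.06, "NEG": 86.48},
    "MB1": {"SUJ": 70.72, "CD": 72.35, "ET": 50.45, "NEG": 87.24},
    "MB2": {"SUJ": 69.93, "CD": 71.10, "ET": 30.53, "NEG": 83.87},
}
PARSERS = ("MP", "MB1", "MB2")   # iteration order makes ties favor the MP


def vote_vs1(pred):
    """VS1: follow the parser with the highest F1 for the label it predicts."""
    best = max(PARSERS, key=lambda p: F1[p].get(pred[p][1], 0.0))
    return pred[best]


def vote_vs2(pred):
    """VS2: follow the MP unless MB1 and MB2 agree."""
    return pred["MB1"] if pred["MB1"] == pred["MB2"] else pred["MP"]


def vote_vs3(pred):
    """VS3: like VS2, but fall back to VS1 when all three parsers disagree."""
    if pred["MB1"] == pred["MB2"]:
        return pred["MB1"]
    if len({pred[p] for p in PARSERS}) == 3:
        return vote_vs1(pred)
    return pred["MP"]


def vote_vs4(pred):
    """VS4: follow any solution shared by two parsers, otherwise VS1."""
    solutions = [pred[p] for p in PARSERS]
    for s in solutions:
        if solutions.count(s) >= 2:
            return s
    return vote_vs1(pred)
```

With `pred = {"MP": (3, "SUJ"), "MB1": (2, "NEG"), "MB2": (4, "ET")}` the three parsers disagree completely, so VS2 keeps the MP solution while VS3 and VS4 defer to VS1, which selects the MB1 solution because MB1 has the highest F1 for the label it predicts.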

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 67-74

received 14-02-08, accepted 03-03-08

Predicción estadística de las discontinuidades espectrales del habla para síntesis concatenativa
Statistical prediction of spectral discontinuities of speech in concatenative synthesis

Manuel Pablo Triviño and Francesc Alías
GTAM – Grup de Recerca en Tecnologies Audiovisuals i Multimèdia
Enginyeria i Arquitectura La Salle. Universitat Ramon Llull
Quatre Camins, 2. 08022 Barcelona, Spain
{st08726, falias}@salle.url.edu

Abstract: The estimation of spectral discontinuities is one of the major problems in concatenative speech synthesis. This paper introduces a methodology based on analysing the statistical behaviour of objective measures over natural concatenations. The goal is to define an automatic process for selecting which measures to use as concatenation cost so as to synthesize speech that sounds as natural as possible. The paper presents the objective and subjective results that validate the proposal.

Keywords: Objective measure, spectral discontinuity, standardization, correlation.

ISSN 1135-5948
© Sociedad Española para el Procesamiento del Lenguaje Natural

1 Introduction

This work belongs to the field of synthetic speech generation from text, or text-to-speech (TTS) conversion. Several techniques exist for obtaining speech from arbitrary text. One of them is unit-concatenation synthesis, in which the synthesized speech is generated by joining speech segments previously recorded in a corpus. One of the problems inherent to this kind of concatenative synthesis is the appearance of audible discontinuities when the acoustic units (phonemes, diphones, etc.) are joined. In this context, unit-selection TTS works with speech corpora of considerable size (more than 1 hour of speech) (Hunt, 1996). As its name indicates, this technique is based on selecting the corpus segments that yield the most natural synthesized speech. The selection process takes into account the goodness of the join between the candidate units in order to minimize the presence of discontinuities in the synthetic speech, using cost criteria based on objective measures (Hunt, 1996). The quality of these measures is determined by their ability to detect perceptible spectral discontinuities.

So far, the difficulty of mapping this subjectivity means that no single objective measure has been defined that can estimate how audible the discontinuity produced by concatenating any two acoustic units is. Consequently, the literature contains several studies with divergent results. (Wouters, 1998) concludes that the best distance is the Euclidean distance applied to MFCC coefficients (optionally including their derivatives). However, (Klabbers, 2001) argues that the best prediction is achieved by combining the Kullback-Leibler distance with LPC coefficients, whereas (Stylianou, 2001) favours the same distance with FFT coefficients. In turn, (Donovan, 2001) defines a measure based on the Mahalanobis distance that improves on the results reported in the literature up to that point. Later, in (Vepa, 2006), the best result is obtained with a cost based on LSF (Line Spectral Frequencies) coefficients, a proposal that is completed with a linear-interpolation method for unit concatenation that also uses LSFs.

From another point of view, there are works that, besides studying objective measures, incorporate classification or regression methods for the acoustic units. In (Syrdal, 2005), linear regression and CART (Classification and Regression Trees) are applied using the phonetic and spectral labelling of the corpus. The conclusion is that grouping by phonetic variables allows a better prediction of discontinuities.

Given the difficulty of reliably detecting discontinuities through objective measures, new proposals with a different focus have appeared recently: the use of harmonic models and AM-FM components (Pantzanis, 2005), the study of the influence of window size and phase discontinuities (Kirpatrick, 2006), and the analysis of the influence of the variation of the spectral characteristics of the formants (Klabbers, 2007).

2 Problem approach

From the analysis of the works cited above, it can be seen that no measure has yet been defined that stands out from the rest, and research seems to be moving in other directions. In this context, this work presents a new methodology for selecting which measure-parameter combination best detects spectral discontinuities. The methodology starts from the hypothesis that the distances with the most homogeneous behaviour (i.e. with mean closest to 0 and smallest standard deviation) when evaluated on natural joins will be the most efficient at detecting discontinuities. The methodology proceeds in several phases (see Figure 1). First, the speech corpus is analysed by means of: phonetic and spectral clustering (to compute the mean and deviation of the parameters), computation of the measures under study using the information extracted from the clustering, and statistical analysis and standardization (also known as z-score) of the behaviour of the objective measures.

[Figure 1: flow diagram titled "Methodology" with the boxes: MRT list (Modified Rhyme Test), Phonetic cluster, Spectral cluster, Distance computation, Statistical analysis, Standardization, Distance selection, Test definition, Subjective tests, Method evaluation.]

Figure 1: Diagram of the process followed in the study of objective measures for estimating discontinuities.

Once the measures with the most homogeneous behaviour, statistically speaking, have been selected, the starting hypothesis is evaluated through a series of subjective tests on a reduced set of CVC monosyllables, where C denotes a consonant and V a vowel, taken from a rhyme test (Stylianou, 2001; Syrdal, 2005). The aim is to determine which objective distances correlate best with the listeners when estimating the naturalness of CVC joins. The distances considered in the study are: i) Itakura-Saito, with FFT coefficients, and ii) Euclidean, Mahalanobis and Donovan, with LPC coefficients, LSF coefficients, information on the first three formants (frequency, bandwidth and energy), denoted C3F, MFCC, and MFCC with delta coefficients (MFCC D), with energy (MFCC E), or with both (MFCC DE). This set of distance-coefficient pairs covers most of the cases presented in the classic literature on the subject. The study also considers the phonetic and spectral characteristics of the corpus used.

3 Corpus clustering

Given the difficulty of defining a single distance as join cost for all the phonetic contexts in which an acoustic unit can occur, the usual choice is to organize them by phonetic and/or spectral clustering (Donovan, 2001; Syrdal, 2005). This work uses a neutral female-voice corpus in Catalan, provided by the UPC, with a duration of 1.5 h. Note that a female voice allows a higher detection rate of audible discontinuities than a male voice (Syrdal, 2001). The results of the clustering process on the corpus under study are presented next.

3.1 Phonetic cluster

According to (Syrdal, 2005), phonetic context has more influence on the detection of discontinuities than spectral information, so this work organizes the analysis of acoustic discontinuities by phonetic context. As a first step, the phonemes of the corpus are grouped into CVC structures according to their vowel phoneme, over a total of 21654 stimuli. As Figure 2 shows, the largest group is the one whose vowel nucleus is /@/¹, which is present in almost half of the CVC stimuli of the corpus.

[Figure 2: bar chart. Vertical axis: number of CVC stimuli (0 to 12000); horizontal axis: vowel phonemes /a/, /e/, /E/, /i/, /o/, /O/, /@/, /u/. Bar labels, listed in extraction order: 9701, 3295, 2748, 1321, 809, 2669, 382, 729.]

Figure 2: Histogram of the distribution of the CVC stimuli by vowel phoneme.

When a general-purpose corpus is designed, the statistical characteristics of the language (i.e. phoneme frequency) are usually taken into account, so the corpus tends to correlate well with the statistical distribution of the phonemes of the working language. In this case, the correlation between the frequency of the vowel phonemes in the CVCs extracted from the corpus and that of the Catalan language given in (Rafel, 1979) is 0.99 (see Table 1).

        CVCs    Rafel
/@/     0.45    0.44
/a/     0.13    0.11
/e/     0.06    0.07
/E/     0.04    0.03
/i/     0.15    0.17
/o/     0.02    0.05
/O/     0.03    0.03
/u/     0.12    0.12

Table 1: Frequency of the vowel phonemes in the CVC stimuli compared with (Rafel, 1979).

On the other hand, previous work concludes that the appearance of spectral discontinuities in vowels depends on their preceding and following phonetic context (Syrdal, 2001). For this reason, the stimuli are grouped by the manner of articulation of their consonantal context (Syrdal, 2005) and by voicing, since the detection of discontinuities is higher in voiced consonantal contexts (Syrdal, 2001). This is because voiced consonants have a strong coarticulatory influence on the vowel that precedes them. Therefore, 8 CVC categories are established according to the prevocalic consonant (the silence context is not included) and 9 according to the postvocalic one. The phonetic contexts under study are: approximant, voiced and voiceless fricative, lateral, nasal, voiced and voiceless plosive, trill, and silence (postvocalic only). Figure 3 shows their distribution in the corpus.

[Figure 3: bar chart with two series, prevocalic and postvocalic phonetic contexts. Vertical axis: number of CVC stimuli (0 to 7000); horizontal axis: consonantal phonetic contexts (approximant, voiced fricative, voiceless fricative, lateral, nasal, voiced plosive, voiceless plosive, trill, silence).]

Figure 3: Histogram of the distribution of the CVC stimuli for prevocalic and postvocalic phonetic contexts.
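The correlation reported for the vowel frequencies of Table 1 can be checked with a few lines of Python. This is a sketch under two assumptions: the source does not name the coefficient, so Pearson's correlation is assumed, and the two lists follow the row order /@/, /a/, /e/, /E/, /i/, /o/, /O/, /u/ of Table 1. The function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Relative vowel frequencies: CVC stimuli vs. (Rafel, 1979), Table 1.
cvcs = [0.45, 0.13, 0.06, 0.04, 0.15, 0.02, 0.03, 0.12]
rafel = [0.44, 0.11, 0.07, 0.03, 0.17, 0.05, 0.03, 0.12]
print(round(pearson(cvcs, rafel), 2))
```

Under these assumptions the value rounds to the 0.99 reported in the text.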

If the correlation between the percentages of consonant phonemes in the CVCs under study and those given in (Rafel, 1979) is computed, a correlation of 0.9 is obtained (see Table 2). From these correlation results it can therefore be concluded that the stimuli considered are representative of the working language (i.e. the study uses phonetically balanced information).

               CVCs    Rafel
Approximant    0.14    0.10
Fricative      0.21    0.20
Lateral        0.11    0.12
Nasal          0.20    0.19
Plosive        0.27    0.37
Trill          0.07    0.11

Table 2: Frequency of the consonant phonemes in the CVCs and in (Rafel, 1979).

¹ This article uses SAMPA notation. See www.phon.ucl.ac.uk/home/sampa/home.htm

4 Analysis of the distributions of the objective measures over natural joins

When the spectral distance between two CV-VC diphones taken from natural speech is computed, its value should in theory be zero (or very close to zero). However, not all distance-parameter combinations behave this way. In order to determine which objective measures have a distribution of values with mean closest to zero and smallest standard deviation, the shape of the distributions of the objective measures over natural joins is studied. This work starts from the hypothesis that the less the distance values oscillate around the mean on natural joins (ideally a Dirac delta), the more likely the objective measure is to be a good discontinuity detector. From the result of this analysis, the distance-parameter combinations whose behaviour is closest to the desired one will be chosen for the subjective experiments.

4.1 Mean of the distributions

As the first part of the study, the mean of the distributions of the objective measures considered is analysed. This study focuses on the CVC stimuli with vowel /@/, since this is the largest group in the corpus and therefore the most statistically robust. In terms of distance-coefficient combination, the distance with the lowest mean is the Euclidean distance applied to LSF parameters (see Figure 4). At the other extreme is the Itakura distance, which has the highest mean of the set of objective measures studied. The Donovan distance has the most stable behaviour, regardless of the coefficient used, and usually has a mean close to zero (1).

Figure 4: Distribution of the mean of the Euclidean-LSF and Itakura-Saito measures over the /C@C/ stimuli.

4.2 Deviation of the distributions

Besides the mean of the distribution, its deviation is also studied (which should also tend to 0). The problem arises when trying to compare the distributions, since they differ greatly in shape depending on the objective measure considered, for all the phonetic contexts analysed. It is therefore necessary to homogenize the distributions in order to compare them properly. In this work, the central limit theorem (CLT) is applied to the original distributions to obtain a sampling distribution of the mean of the original distribution. The settings used for the CLT are: 1000 cycles, which guarantees that the third and fourth moments can be computed reliably, and 40 samples/cycle for all phonetic contexts (a single value, chosen to even out the existing disparity in sizes). Since the minimum number of samples is not reached for all contexts in all the vowel phonemes under study, the data of the vowels /e/+/E/ and /o/+/O/ were pooled, given their spectral similarity, as in (Syrdal, 2005), where the influence of vowel openness is not taken into account in the study of discontinuities. Figure 6 presents the mean and the deviation of the skewness (S) and the kurtosis (K) of the distributions obtained after applying the CLT. Two different types of distribution can be observed. On the one hand, the distributions of the vowels /@/ and /i/ are leptokurtic (K>3) with a mean stretched to the left (S1). On the other hand, the remaining vowels have values of K and S close to those typical of Gaussian distributions, which is corroborated by applying the Kolmogorov-Smirnov test, with p […]

[…] |M2(D)| or M(D) = M2(D) if |M1(D)| < |M2(D)|. Note that, in this case, M does not guarantee an improvement over both methods: even though |M1(D)| > |M2(D)|, we could obtain E(M1) < E(M2). This is possible if M1 has a decision criterion from which false positive cases take advantage. Besides, if congruence is obtained, i.e. |M1(D)| < |M2(D)| implies E(M1) < E(M2), the methods would be complementary and we can raise their performance: E(M) ≥ E(M1) and E(M) ≥ E(M2). That is, when there is a significant improvement, M1 and M2 are considered complementary. We are interested in knowing whether methods based on different strategies have inherently different results. This may suggest that they are complementary, provided that combining their results yields a significant improvement. Now, we give an overview of the applied techniques.

Combination of methods

In this work, we apply three approaches to sentence extraction and combine some of them to observe possible relationships among them. Our goal is to analyse these approaches in order to strengthen a simple algorithm without losing their efficacy. We establish three possible levels at which to combine methods: (1) high level, joining the results of the methods; (2) middle level, combining partial results; and (3) low level, embedding one method in another one. Some examples of these levels follow. In (1), iterative algorithms which refine their results in each step may be considered; Brill’s POS tagger (Brill, 1994) may be seen, at least, as the application of two methods: tag assignment and correction assignment. For (2), scores are combined to choose a partial result, as in the voting algorithms used, for instance, in text categorization (Montejo, Urena, and Steinberg, 2005). In (3), some approaches are: merging, a clear example of which is quick-sort, which can use another sort algorithm to end the recursive process; resources, where each method works on some kind of data and provides a step within the whole method (word sense disambiguation has some examples of this approach (Ng and Lee, 1996)); and fusion, a class in which any improvement of an algorithm may be considered. In our context, high level combination could require combining sentences in a way similar to what text generation does for summarization. Low level would imply formulating a new method. As we can see, middle level is the simplest one, and according to the results we can investigate other combining strategies. The power of a combination

3 Sentence Extraction Methods

In this section we give some details on the methods used. Let T be a text and [o1 , . . . , on ] the sentences that make up T .

3.1 Text-rank

The page-rank algorithm and its derivatives (Kleinberg, 1999) operate on a graph. Broadly speaking, page-rank begins by assigning a value to each node and then updates the values iteratively. Once the iteration has converged to its fixed point within a margin ε, every node has a score, which expresses the importance of the node as a function of the role it plays in the paths of the graph. These algorithms belong to the class of iterative algorithms that look for a fixed point, similar to the Gauss-Seidel algorithm for solving simultaneous equations.

The edges can be arranged in one of the following ways: a directed graph with forward edges (earlier sentences pointing to later ones); a directed graph with backward edges (later sentences pointing to earlier ones); or an undirected graph. Let G = (V, E) be the graph that we have constructed, where V is the set of nodes and E ⊂ V × V is the set of edges. For each vi ∈ V, let In(vi) be the set of nodes pointing to vi, and let Out(vi) be the set of nodes pointed to by vi (for undirected graphs, In(vi) = Out(vi)). The graph is weighted from a text: each sentence labels a node of the graph, and the similarity between two sentences is the weight of the edge that links the corresponding nodes. The similarity between sentences can be computed in different ways; for example, by using the following formula:

$$\mathrm{sim}(o_1, o_2) = \frac{\mathrm{inter}(o_1, o_2)}{\log(|o_1|) + \log(|o_2|)} \qquad (1)$$

where o1 and o2 are the sentences under consideration, inter(o1, o2) is the number of words belonging to both o1 and o2, and |oi| is the number of words of oi. The text-rank method (TR) converges with margin of error ε. The score of each node is computed as follows:

$$TR(o_i) = (1 - d) + d \cdot \sum_{o_j \in In(o_i)} \frac{w_{ji}}{\sum_{o_k \in Out(o_j)} w_{jk}} \, TR(o_j) \qquad (2)$$

where wij is the weight of the edge joining oi and oj (sim(oi, oj)), and d is a fixed value between 0 and 1. After the initial scores have been assigned, TR is iterated until a fixed point is reached within ε; see (Mihalcea, 2004) for more details.

3.2 Extracting keywords

Two methods to get keywords from a text are presented. Both obtain the score of a sentence by computing the similarity between the set of keywords of the text and the sentence (formula (1)). The following algorithm makes this precise:

Algorithm: Ordering of sentences;
input T : list of sentences; kywr : list of words;
output T' : list of sentences; // ordered
begin
    foreach oi ∈ T do
        si = sim(oi, kywr)
    T' = project2(sort([(s1, o1), . . . , (sn, on)]))
end

Now, we will see two methods which obtain one input of the algorithm, namely kywr.

3.2.1 Text-rank

In this case (Kw), an undirected and unweighted graph is constructed taking lexical units as nodes. The edges between nodes are defined by the co-occurrence of both terms in a window of N units (Mihalcea and Tarau, 2004). We select the 10 terms with the highest score.

3.2.2 Transition rank

Another method used in this work takes mid-frequency terms as the basis for an extract. It has been observed (Urbizagástegui-Alvarado, 1999) that such terms have high semantic content. We use the transition point (TP) method to get mid-frequency terms. The TP is a frequency that divides the vocabulary of a text into words of high and low frequency. In this way, the terms with a frequency around the TP are candidates for important terms; therefore, to choose mid-frequencies, a threshold must be given. This method was used in (Bueno-Tecpanecatl, Pinto, and Jiménez-Salazar, 2005) to get extracts. TP has also been used in text clustering (Jiménez-Salazar, Pinto, and Rosso, 2005). In the present work, we use the transition rank method (see (Pérez et al., 2006)) because it does not need a threshold around the TP in order to select terms. Once the mid-frequency terms have been found, they are used to compute the score of each sentence, taking into account the mid-frequency terms contained in the sentence. An analogous procedure may be followed taking the keywords provided by the text-rank algorithm (Mihalcea, 2004). Essentially, the procedure (TPR) is to choose terms with a frequency in a range from the lowest non-repeated frequencies to the highest repeated frequencies. The terms with such frequencies presumably have high semantic content, and they are taken as the keywords of the text.
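As an illustration of Section 3, the similarity of formula (1), the iteration of formula (2) and the Ordering of sentences algorithm can be sketched in Python. This is a minimal sketch under assumptions that the paper does not state: sentences are given as lists of word tokens, inter is computed on sets of words, the graph is undirected, sentences are sorted in descending order of score, and a max_iter cap and zero-denominator guards are added for safety. The defaults d = 0.85, initial score 1 and ε = 0.001 are the values reported in the experiments.

```python
import math


def sim(o1, o2):
    """Formula (1): number of shared words over the sum of log lengths."""
    if not o1 or not o2:
        return 0.0  # assumption: empty sentences have no similarity
    inter = len(set(o1) & set(o2))
    denom = math.log(len(o1)) + math.log(len(o2))
    return inter / denom if denom > 0 else 0.0


def text_rank(sentences, d=0.85, eps=1e-3, max_iter=100):
    """Formula (2) on an undirected graph whose edge weights are sim()."""
    n = len(sentences)
    if n == 0:
        return []
    # w[j][i] is the weight of the edge between sentences j and i.
    w = [[sim(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n  # initial value of every node
    for _ in range(max_iter):
        new = []
        for i in range(n):
            acc = sum(w[j][i] / out_sum[j] * scores[j]
                      for j in range(n) if w[j][i] > 0)
            new.append((1 - d) + d * acc)
        delta = max(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < eps:  # fixed point reached within eps
            break
    return scores


def order_sentences(sentences, kywr):
    """Ordering of sentences: score each sentence against kywr, then sort."""
    scored = [(sim(o, kywr), o) for o in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for _, o in scored]  # project2: keep only the sentences
```

Passing the full token list of the text as kywr turns order_sentences into the RI variant described in Section 3.3.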

Alberto Bañuelos-Moro, Héctor Jiménez-Salazar, José de Jesús Lavalle-Martínez

3.3 Representation index

In (Marcu, 1999) a simple method to generate the extract of a text was proposed. The key idea of this method is the representativeness index of a sentence, determined in the following way: the importance of a sentence oi is inversely related to the similarity between T and the text obtained by removing oi from T, since if oi is important, removing it makes the remaining text less similar to T. The sentences are then ordered according to this index: o1, . . . , on, where sim(T − [oi], T) ≤ sim(T − [oi+1], T), 1 ≤ i < n. We made a small variant of this method, using the sentence itself instead of the text diminished by the sentence: o1, . . . , on, where sim([oi], T) ≥ sim([oi+1], T), 1 ≤ i < n. This method (RI) directly computes the score of each sentence oi by applying formula (1) to the sentence and the full text: sim(oi, T). RI uses the same code as above (Ordering of sentences), with T in place of kywr in the similarity function.

4 Experiments

A description of the data used, its preprocessing, and an evaluation of the results is now given.

4.1 Dataset

The experiments were made on 533 news articles in English from the DUC 2002 collection¹; the texts have no formatting at all. Each text was converted to lower case, and spaces were inserted to separate punctuation symbols. The texts were divided into sentences (taking the period as a separator), and empty lines and stopwords were deleted.

¹ Document Understanding Conference, http://duc.nist.gov/.

4.2 Applied procedure

The methods described above were applied: TPR, transition rank; Kw, keywords using text-rank; RI, representation index; and TR, text-rank. In the case of the text-rank algorithm, once the text had been preprocessed, a graph was constructed applying formula (2) with d = 0.85 (Mihalcea, 2004). The initial value assigned to each node was 1, and the convergence error was ε = 0.001. It took an average of 18 iterations to reach the fixed point. To produce the extract of each text, the 7 sentences with the highest score were taken, independently of the method considered. Some method combinations were made in order to examine the possible relationship between them. The combination consisted of computing the score of each sentence as the maximum of the scores of two methods M1, M2: max(score(M1), score(M2)).

4.3 Evaluation

To evaluate the results, the automatic summary evaluation package ROUGE, which is based on N-gram statistics, was used. ROUGE was run with: ROUGE-L, a 95% confidence interval, without reserved words, the model-average score formula, equal importance assigned to precision and recall, and averaging of the unit scores. Table 1 shows the values obtained when evaluating the results with ROUGE. The representation index method had the highest value (0.6284).

Method          Score
TR              0.5761
TPR             0.4711
Kw              0.4969
RI              0.6284
max(TR,TPR)     0.5416
max(TR,Kw)      0.5498
max(Kw,TPR)     0.4813
max(TR,RI)      0.6148

Table 1: Evaluation of the methods and some combinations.

5 Discussion

Three approaches to sentence extraction were applied to the DUC 2002 collection: keyword-based (TPR, Kw), representation-based (RI) and graph-based (TR). The best method was RI. When the results of the methods were combined through score maximization, the evaluation revealed that they are not complementary: one of them cannot help the other. Since they share the score function and data from the text, the combination improved only one method: E(M1) < E(M) < E(M2). Table 1 shows that the higher scores belong to the methods that use full sentences to determine the score. The methods whose parameter was a reduced set of words, i.e. the keyword-based ones, obtained the lowest evaluation, and how the keywords were calculated was not important, because the difference between their scores was very small. This result is explained by the loss of information, since they only worked with isolated terms.

Among the high-scoring methods, the differences are mainly due to the parameters used in the similarity function. The RI method used the whole text as a parameter to calculate the score, while the TR method extends the similarity between sentences to all sentences indirectly through iteration. Although using the whole text could introduce noise into the computation of similarity, RI obtained the highest performance. It seems that the information used in the graph-based method cannot be incorporated through iteration as effectively as it was in the representation-based method. The strength of TR is the iteration², which refines the scores of sentences, whilst the strength of RI is the use of the full text. These features may help to formulate a better algorithm that considers a deeper representation of the sentence, for instance using the relative position of terms in the sentence, and a richer class of nodes in the graph-based method, such as applying TR to connected components instead of nodes. These issues, as well as tests of combination at the high or low level and variations of the dataset and evaluation system, will be considered as future work.
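The middle-level combination and the complementarity check discussed above can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, the ROUGE-L values are copied from Table 1 (read with RI at 0.6284, as stated in Section 4.3), and the criterion E(M) ≥ E(M1) and E(M) ≥ E(M2) is the one given in the definition of complementary methods.

```python
def combine_max(scores_m1, scores_m2):
    """Per-sentence maximum of the scores given by two methods."""
    return [max(a, b) for a, b in zip(scores_m1, scores_m2)]


def top_k(sentences, scores, k=7):
    """Return the k highest-scoring sentences, best first."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]


# ROUGE-L scores from Table 1.
SINGLE = {"TR": 0.5761, "TPR": 0.4711, "Kw": 0.4969, "RI": 0.6284}
COMBINED = {("TR", "TPR"): 0.5416, ("TR", "Kw"): 0.5498,
            ("Kw", "TPR"): 0.4813, ("TR", "RI"): 0.6148}


def is_complementary(m1, m2):
    """True when the combination is at least as good as both methods."""
    e = COMBINED[(m1, m2)]
    return e >= SINGLE[m1] and e >= SINGLE[m2]


def lies_between(m1, m2):
    """True when the combined score falls strictly between the two."""
    low, high = sorted((SINGLE[m1], SINGLE[m2]))
    return low < COMBINED[(m1, m2)] < high
```

For every pair in Table 1, is_complementary is false and lies_between is true, which matches the observation that the combination improved only one of the two methods.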

References

Brill, Erick. 1994. Some advances in rule-based part of speech tagging. In AAAI, editor, Proceedings of the AAAI Conference.

Brin, Sergey and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30:1–7.

Bueno-Tecpanecatl, Claudia, D. Pinto, and Héctor Jiménez-Salazar. 2005. El párrafo virtual en la generación de extractos. Research on Computing Science, 13:85–90.

Hovy, Eduard. 2005. Text summarization. In R. Mitkov, editor, The Oxford Handbook of Computational Linguistics. Oxford University Press, 1st edition, pages 583–598.

Jiménez-Salazar, Héctor, David Pinto, and Paolo Rosso. 2005. Uso del punto de transición en la selección de términos índice para agrupamiento de textos cortos. Procesamiento del Lenguaje Natural, 35:383–390.

Kleinberg, J.M. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.

Marcu, Daniel. 1999. The automatic construction of large-scale corpora for summarization research. In Proceedings of the SIGIR of ACM 99, pages 137–144.

Mihalcea, Rada. 2004. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In The Companion Volume to the Proc. of 42nd Annual Meeting of the ACL, pages 170–173, Barcelona, Spain, July. Association for Computational Linguistics.

Mihalcea, Rada and Paul Tarau. 2004. Textrank: bringing order into text. In The Companion Volume to the Proc. of 42nd Annual Meeting of the ACL, pages 190–193, Barcelona, Spain, July. Association for Computational Linguistics.

Mihalcea, Rada, Paul Tarau, and Elizabeth Figa. 2004. PageRank on Semantic Networks, with application to Word Sense Disambiguation. In Proc. of the 20th International Conference on Computational Linguistics.

Montejo, Arturo Ráez, Alfonso Urena, and Ralf Steinberg. 2005. Text categorization using bibliographic records: beyond document content. Procesamiento del Lenguaje Natural, 35:119–126.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar Based Approach. In Proc. the 34th Annual Meeting of the ACL.

Pérez, David, José Tepacuacho, Héctor Jiménez, and Grigori Sidorov. 2006. A term frequency range for text representation. Research on Computing Science, 20:113–118.

Urbizagástegui-Alvarado, Rubén. 1999. Las posibilidades de la ley de Zipf en la indización automática. Technical report, Universidad de California Riverside, California, USA.

² Actually TR outperformed (HITS, 0.5023) the top systems of DUC 2002 (0.5011) (Mihalcea and Tarau, 2004).

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 121-127

received 06-02-08, accepted 03-03-08

Categorización de textos biomédicos usando UMLS∗
Biomedical text categorization using UMLS

José Manuel Perea Ortega, María Teresa Martín Valdivia, Arturo Montejo Ráez, Manuel Carlos Díaz Galiano

Universidad de Jaén, Campus Las Lagunillas, Edificio A3. E-23071 {jmperea,maite,amontejo,mcdiaz}@ujaen.es

Resumen: En este artículo se presenta un sistema automático de categorización de texto multi-etiqueta que hace uso del metatesauro UMLS (Unified Medical Language System). El sistema ha sido probado sobre un corpus biomédico que incluye textos muy cortos pertenecientes a expedientes de niños con enfermedades respiratorias. El corpus ha sido enriquecido utilizando las ontologías que incluye UMLS y los resultados obtenidos demuestran que la expansión de términos realizada mejora notablemente al sistema de categorización tradicional.
Palabras clave: Categorización de texto, Ontologías, UMLS, Integración de conocimiento, Expansión de términos

Abstract: In this paper we present an automatic system for multi-label text categorization which makes use of UMLS (Unified Medical Language System). Our approach has been tested on a biomedical corpus which includes very short texts belonging to the records of children with respiratory diseases. The corpus has been enriched by using the ontologies integrated in UMLS, and the results obtained show that the proposed term expansion approach greatly improves on the traditional categorization system.
Keywords: Text categorization, Ontology, UMLS, Knowledge integration, Term expansion

1. Introduction

There is no doubt that information is one of the fundamental resources in any professional or personal setting. However, in recent years the amount of information generated electronically every day has been growing exponentially. In fact, access to this information is becoming a major problem. This information overload is steering much of the research in new technologies toward the efficient retrieval and use of that information. Part of this research relies on techniques and tools from Natural Language Processing (NLP). NLP is a discipline that has shown over the years that it is essential for improving the accuracy of information systems (Mitkov, 2003), such as document categorization systems, monolingual and multilingual information retrieval systems, knowledge extraction systems and automatic summarization systems.

This paper presents a multi-label text categorization system that has been trained in a biomedical setting. Text categorization is one of the fundamental and most widely studied tasks in NLP (Sebastiani, 2002). Categorization consists of determining whether a given document belongs to a set of predetermined categories. In addition, one of the techniques used to increase the accuracy of such systems is the integration of external resources that provide higher-quality information. For instance, integrating knowledge through ontologies has achieved very good results in numerous systems. WordNet¹ (Miller, G.A. et al., 1993), for example, has been used successfully in many works related to information retrieval, disambiguation and even text categorization (Martín Valdivia, Ureña López, and García Vega, 2007).

In the biomedical domain, many information systems are being developed that make use of external resources such as ontologies. The work carried out shows that integrating knowledge can help improve these systems. For example, the GO² ontology (Gene Ontology) has been an invaluable source of information for many researchers working on topics related to the human genome (Bontempi, 2007). The MeSH ontology (Medical Subject Headings) has been applied successfully to expand query terms in information retrieval systems (Díaz Galiano et al., 2007). However, most of the works that integrate information from ontologies have been oriented toward information retrieval and extraction rather than text categorization. In a previous work (Martín Valdivia et al., 2007) we used the MeSH ontology, but the results obtained were not very promising. The system developed performed a term expansion that took into account the MeSH concept hierarchy, using parent, child and/or sibling nodes. In this paper we propose using the UMLS metathesaurus, which includes several medical ontologies (MeSH among them), to perform term expansion on the CCHMC document collection. The aim is to achieve better multi-label text categorization by integrating the knowledge contained in UMLS into the CCHMC corpus.

The paper is organized as follows: first, the multi-label text categorization task and the categorization system used are briefly described. Next, the biomedical corpus used (the CCHMC corpus) is presented. The UMLS metathesaurus is described in the following section, together with the way the terms of the corpus are expanded. Section five presents the experiments and the results obtained. Finally, conclusions and future work are discussed.

∗ This work has been funded by the Ministerio de Ciencia y Tecnología through the TIMOM project (TIN2006-15265-C06-03).
ISSN 1135-5948 © Sociedad Española para el Procesamiento del Lenguaje Natural

2. Multi-label text categorization

The automatic assignment of keywords to documents opens up new possibilities in document exploration (Montejo Ráez and Steinberger, 2004), and interest in it has spurred the scientific community to propose solutions. The discipline of Information Retrieval (IR), together with Natural Language Processing (NLP) techniques and Machine Learning (ML) algorithms, is the substrate from which Automatic Text Categorization tasks emerge (Sebastiani, 2002). The learning algorithms employed range from linear classifiers, probabilistic classifiers and regression methods (Joachims, 1998), (Friedman, Geiger, and Goldszmidt, 1997), (Lewis et al., 1996) to neural networks (Martín Valdivia, García Vega, and Ureña López, 2003; Li et al., 2002), by way of voting and boosting techniques (Li et al., 2002; Bauer and Kohavi, 1999).

Three cases are distinguished in document classification: binary categorization, where the classifier must return one of two possible categories; multi-class categorization, where the classifier must provide one category among several proposed ones; and, finally, the most complex case, multi-label categorization, where the classifier must determine an undefined number of classes from a wide variety of candidates. In any case, automatic categorization systems usually consist of two main modules: a document processor and a training and classification algorithm. The former transforms the texts into representations that the latter can handle, generally following the vector space model. The latter applies machine learning algorithms to model the classifiers.

The biomedical domain has been one of the most interested in the development and progress of this type of system, given its long tradition of using ontologies and controlled vocabularies to manage documents, with multi-labelling being the problem that is generally posed.

¹ http://wordnet.princeton.edu
² http://www.geneontology.org

BIOSIS categorized documents using a vocabulary of 15,000 biological terms that could be summarized into 600 concepts (Vieduts-Stokolo, 1987). This classification was hierarchical, and if only the primary level was considered, around 75% of the concepts were covered by the system. Precision was close to 65%. Medical Subject Headings (MeSH) is a taxonomy of medical concepts used to categorize documents in the MEDLINE database. The system developed by Bruno Pouliquen (Pouliquen, Delamarre, and Beux, 2002), called Nomindex, is one of the first proposals for automating its labelling. It mainly applied statistical measures typical of Information Retrieval, resulting in a more than acceptable system. We can also cite the work of Wright et al. (Wright et al., 1999) on the development of a tool for indexing documents in UMLS (Unified Medical Language System). This system also makes intensive use of linguistic resources such as the recognition of noun phrases or synonyms. A combination of the information in the title, the abstract and the content makes it possible to assign each concept of the MeSH thesaurus.

Our approach has focused on using medical ontologies as a resource for improving categorization systems through term expansion in the query. Compared with previous work (Martín Valdivia et al., 2007), we have modified the expansion method, moving from using MeSH exclusively, with an expansion based on traversals of the term hierarchy, to an expansion over UMLS through the MetaMap Transfer³ interface. The dataset used does not differ, nor do the categorization and evaluation systems: we applied the TECAT⁴ tool to the CCHMC corpus (detailed below) using cross-validation.

Although the earlier results were discouraging, we considered that the problem must lie in the ontology used as well as in the way it was applied. For this reason, studying a change of approach was necessary before passing judgement on the effects that integrating these resources has on the categorization of biomedical texts.

3. The CCHMC collection

This collection of 978 documents was prepared by The Computational Medicine Center⁵. The corpus includes anonymized medical records collected in the radiology department of the Cincinnati children's hospital (the Cincinnati Children's Hospital Medical Center's Department of Radiology, CCHMC) (cmc, 2007). These documents are radiology reports labelled with ICD-9-CM codes (International Classification of Diseases, 9th Revision, Clinical Modification). ICD-9-CM is a catalogue of diseases coded with a number of 3 to 5 digits, with a decimal point after the third digit. The ICD-9-CM codes are a subgroup of the ICD-9 codes. They are organized hierarchically, with several consecutive codes grouped together at the upper levels. These codes relate only to diseases of the respiratory system, and their values lie in the range 460 to 519⁶.

Each document contains two text fields from which the body to be processed is built: CLINICAL_HISTORY and IMPRESSION. Both fields are generally very brief; here is an example:

CLINICAL_HISTORY: Eleven year old with ALL, bone marrow transplant on Jan. 2, now with three day history of cough.
IMPRESSION: 1. No focal pneumonia. Likely chronic changes at the left lung base. 2. Mild anterior wedging of the thoracic vertebral bodies.

The brevity of the content suggests that term expansion should help improve the categorization system by increasing the number of representative features of each document. The process followed for this expansion is described below.

³ http://mmtx.nlm.nih.gov/index.shtml
⁴ http://sinai.ujaen.es/wiki/index.php/TeCat
⁵ http://www.computationalmedicine.org
⁶ The catalogue of ICD-9-CM codes can be consulted at http://www.cs.umu.se/~medinfo/ICD9/icd9cm_group8.html

4. UMLS

UMLS⁷ is a repository of several biomedical ontologies developed by the United States National Library of Medicine. UMLS integrates more than 2 million names for some 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among those concepts (Bodenreider, 2004). UMLS is a system that guarantees cross-references among more than thirty vocabularies and classifications. Most of these cross-references are made through lexical analysis of the terms, hence its inclusion in the category of lexical classification systems in the biomedical domain (Ceusters et al., 1997). Some examples of ontologies incorporated in UMLS are ICD-9-CM, ICD-10, MeSH, SNOMED CT, LOINC, MEDLINE, WHO Adverse Drug Reaction Terminology, UK Clinical Terms, RxNORM, Gene Ontology, and OMIM. UMLS consists of three main components:

The Metathesaurus. This is the core database of UMLS, a collection of concepts, terms and their relations. The Metathesaurus is organized by concepts, and each concept has specific attributes that define its meaning and link it to the corresponding concept names in the different ontologies that make up UMLS. Numerous relations between concepts are also represented, such as "is a", "is part of", "is caused by", etc.

The Specialist Lexicon. This is a database of lexicographic information for use in Natural Language Processing. It contains information about common vocabulary, biomedical terms, and terms found in MEDLINE and in the Metathesaurus itself. Each entry contains syntactic, morphological and orthographic information.

The Semantic Network. This is a set of categories and relations used to classify and relate the entries of the Metathesaurus. Each concept in the Metathesaurus is assigned to at least one semantic type or category. There are 135 defined semantic types and 54 relations among them.

UMLS has several supporting software tools, such as MetaMap. MetaMap is an online tool used to find the relevant Metathesaurus concepts for an arbitrary text. MetaMap Transfer (MMTx) provides the same functionality as MetaMap, but as a Java program. We used this interface for the experiments in this work.

4.1. Expansion of CCHMC using UMLS

To expand each text file of the CCHMC collection with UMLS we used the MetaMap Transfer (MMTx) tool. The text of each file is processed through a series of modules. First, the text is divided into components such as paragraphs, sentences, phrases, lexical elements and tokens. Next, variants are generated from the detected phrases. Candidate concepts from the UMLS Metathesaurus are then retrieved and evaluated against these phrases. The candidate concepts most similar to the phrase are organized into a final mapping, which is the one used for term expansion. The processing that the text of a document undergoes with MetaMap can be seen in Figure 1.

Figure 1: Processing of a text with MetaMap

The pseudocode followed in the experiments to perform term expansion on a document of the CCHMC collection is explained below:

1. For each sentence found in the document, we obtain the detected phrases.
2. For each phrase, we obtain its final mapping (best candidate concepts).
3. For each candidate concept:
   We obtain its UMLS name and add it to the set of expanded terms (if it has not already been added).
   We also add to the expansion set the group of synonym terms that make up the concept, that is, those terms that appear in different UMLS ontologies and belong to the concept in question, making sure that no term is repeated.

Figure 2 shows several examples of expansion performed with the MetaMap Transfer (MMTx) tool on a document of the CCHMC collection, following the strategies explained in Section 5.

⁷ http://www.nlm.nih.gov/research/umls
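The three expansion steps of Section 4.1 can be sketched in Python as follows. This is a hedged illustration, not the authors' code and not the MMTx API: the Concept class and the two callables phrases_of and final_mapping_of are hypothetical stand-ins for whatever MMTx returns, and the toy concept in the usage example is invented for demonstration only.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical container for one candidate concept of a final mapping."""
    name: str
    synonyms: list = field(default_factory=list)


def expand_document(sentences, phrases_of, final_mapping_of):
    """Collect concept names and synonyms for a document, without repeats.

    phrases_of(sentence) -> list of phrases (stand-in for step 1)
    final_mapping_of(phrase) -> list of Concept (stand-in for step 2)
    """
    expanded, seen = [], set()

    def add(term):
        if term not in seen:  # keep each expansion term only once
            seen.add(term)
            expanded.append(term)

    for sentence in sentences:
        for phrase in phrases_of(sentence):
            for concept in final_mapping_of(phrase):  # step 3
                add(concept.name)
                for synonym in concept.synonyms:
                    add(synonym)
    return expanded


# Toy usage with invented data (not real MMTx output).
toy_mapping = {"cough": [Concept("Coughing", ["Cough", "Coughing"])]}
terms = expand_document(
    ["cough", "cough"],
    phrases_of=lambda sentence: [sentence],
    final_mapping_of=lambda phrase: toy_mapping.get(phrase, []),
)
```

Passing the two lookups as parameters keeps the sketch independent of any particular MMTx binding.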

5. Experiments and results

For this work, several experiments were carried out with different types of UMLS expansion and different machine learning algorithms. Specifically, we used the SVM (Support Vector Machine) algorithm and a perceptron-type neural network called PLAUM. Both algorithms were used with their default configurations, without varying any parameter. Term expansion was also used, drawing on a medical ontology such as UMLS to add quality information to the documents of the collection that helps improve their categorization. The results show that SVM performs better than PLAUM when no term expansion is applied, whereas PLAUM performs better when expansion is used. In all cases, term expansion with UMLS improves on the baseline.

The expansion of the documents of the CCHMC collection was carried out using the UMLS medical ontology. The procedure followed is the one described in Section 4.1. On some occasions, the expansion terms obtained from the ontology consisted of more than one word or token. This allowed us to use two strategies in the expansion process:

Joint strategy. Expansion terms of more than one word are treated as a single token. To do so, we replaced the spaces between the words of the term with the underscore symbol. In this way, more distinct terms are introduced for the subsequent classification process.

No-joint strategy. The tokens of expansion terms made up of more than one word are separated and added to the expansion individually, checking that no token is repeated. With this strategy, in contrast to the previous one, the total number of terms added to the documents of the collection is considerably lower.

Figure 2: Examples of UMLS expansion of a document from the CCHMC collection
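The two strategies can be illustrated with a short Python sketch. The function names and the sample terms are hypothetical; the only behaviour taken from the text is the replacement of spaces by underscores in the joint strategy and the splitting into non-repeated tokens in the no-joint strategy.

```python
def expand_joint(terms):
    """Joint strategy: each multi-word term becomes one underscore token."""
    seen, tokens = set(), []
    for term in terms:
        token = "_".join(term.split())
        if token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


def expand_no_joint(terms):
    """No-joint strategy: split terms into words, dropping repeated tokens."""
    seen, tokens = set(), []
    for term in terms:
        for token in term.split():
            if token not in seen:
                seen.add(token)
                tokens.append(token)
    return tokens


# Invented sample terms, for demonstration only.
sample = ["lung disease", "chronic lung disease", "lung", "disease"]
joint_tokens = expand_joint(sample)
no_joint_tokens = expand_no_joint(sample)
```

On this sample the joint strategy yields four distinct tokens and the no-joint strategy three, since the separated words repeat across terms.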

Figure 2 shows the result of applying both expansion strategies to a document from the collection.

With regard to the evaluation of the results, the measures considered are precision (P), recall (R) and F1, the last of which gives the most complete view of the system's behaviour. These measures were obtained by micro-averaging over 10-fold cross-validation, that is, by repeating the experiment 10 times with different training and evaluation collections, accumulating each time the hits and misses in each class, and computing the final values from those accumulated counts. The micro-averaged results of the different experiments are shown in Tables 1, 2 and 3.

If we analyse the results from the point of view of document expansion, the integration of UMLS notably improves on the results without expansion. Specifically, for the PLAUM algorithm, F1 improves by 6.54 points with no-joint expansion and by 7.64 points with joint expansion. The same happens for SVM, but with a smaller difference than for PLAUM (1.75 points with no-joint expansion and 3.84 points with joint expansion).

As for the learning algorithms, expansion works for both PLAUM and SVM, but it should be noted that SVM performs better than PLAUM when no term expansion is applied (2.6 points better). In contrast, PLAUM obtained better results than SVM when UMLS term expansion was used, although the differences are not very large (2.33 points for the no-joint strategy and 1.38 points for joint expansion).

         P          R          F1
PLAUM    80.91 %    64.08 %    71.52 %
SVM      90.48 %    61.79 %    73.43 %

Table 1: Micro-averaging without expansion

         P          R          F1
PLAUM    85.17 %    69.49 %    76.53 %
SVM      92.04 %    62.92 %    74.74 %

Table 2: Micro-averaging with no-joint expansion

         P          R          F1
PLAUM    84.97 %    71.13 %    77.44 %
SVM      92.98 %    64.80 %    76.37 %

Table 3: Micro-averaging with joint expansion

6. Conclusions and future work

This paper has presented a study on the integration of medical knowledge into the multi-label categorization of biomedical documents. To that end, the corpus used in the multi-label categorization process (CCHMC) was expanded with the UMLS medical thesaurus. Two learning algorithms, SVM and PLAUM, were used in the study. Although the differences found between the two algorithms are not decisive, PLAUM seems to work better with either of the two expansion strategies described. Nevertheless, we do not consider the differences relevant. The results corroborate the value of integrating external knowledge from a specific ontology, in this case UMLS: regardless of the algorithm used, term expansion with UMLS improves the results considerably. In the future we will try to apply these UMLS expansion techniques to other biomedical corpora to check their performance. We also plan to apply the strategies followed in this work to other NLP tasks such as text mining or biomedical information retrieval.
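To make the evaluation procedure described in the results discussion concrete, the following minimal sketch shows how micro-averaged precision, recall and F1 can be computed from hits and misses accumulated over the folds of a cross-validation. The `Counts` class, the `micro_average` function and the example counts are illustrative assumptions, not the authors' code or data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-class outcome counts for one fold (hypothetical container)."""
    tp: int = 0  # true positives: label assigned and correct
    fp: int = 0  # false positives: label assigned but wrong
    fn: int = 0  # false negatives: label missed


def micro_average(fold_counts):
    """Accumulate counts over all folds and classes, then compute P, R, F1.

    fold_counts: iterable of dicts mapping class label -> Counts, one dict per fold.
    """
    total = Counts()
    for per_class in fold_counts:
        for c in per_class.values():
            total.tp += c.tp
            total.fp += c.fp
            total.fn += c.fn

    assigned = total.tp + total.fp   # every label the classifier assigned
    expected = total.tp + total.fn   # every label in the gold standard
    precision = total.tp / assigned if assigned else 0.0
    recall = total.tp / expected if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two folds and two classes with made-up counts.
folds = [
    {"A": Counts(tp=8, fp=2, fn=4), "B": Counts(tp=2, fp=0, fn=1)},
    {"A": Counts(tp=5, fp=1, fn=0), "B": Counts(tp=5, fp=2, fn=5)},
]
p, r, f1 = micro_average(folds)
```

Because the counts are summed before the ratios are taken, frequent classes weigh more than rare ones, which is the defining property of micro-averaging as opposed to macro-averaging.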

Bibliography

CMC. 2007. The Computational Medicine Center's 2007 Medical Natural Language Processing Challenge.

Bauer, Eric and Ron Kohavi. 1999. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36(1-2):105–139, August.

Bodenreider, Olivier. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32.

Bontempi, Gianluca. 2007. A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(2):293–300.

Ceusters, W., F. Buekens, G. De Moor, and A. Waagmeister. 1997. The distinction between linguistic and conceptual semantics in medical terminology and its implications for NLP-based knowledge acquisition. In IMIA Working Group 6, Jacksonville, Florida.

Díaz Galiano, M.C., M.A. García Cumbreras, M.T. Martín Valdivia, A. Montejo Ráez, and L.A. Ureña López. 2007. Using Information Gain to Improve the ImageCLEF 2006 Collection. In CLEF, volume 4730 of Lecture Notes in Computer Science, pages 711–714. Springer.

Friedman, Nir, Dan Geiger, and Moises Goldszmidt. 1997. Bayesian Network Classifiers. Mach. Learn., 29(2-3):131–163.

Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning. Springer Verlag, (1398):137–142.

Lewis, David D., Robert E. Schapire, James P. Callan, and Ron Papka. 1996. Training algorithms for linear text classifiers. In Hans-Peter Frei, Donna Harman, Peter Schäuble, and Ross Wilkinson, editors, Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 298–306, Zürich, CH. ACM Press, New York, US.

Li, Y., H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola. 2002. The Perceptron Algorithm with Uneven Margins. In Proceedings of the International Conference of Machine Learning (ICML'2002).

Martín Valdivia, M.T., M. García Vega, and L.A. Ureña López. 2003. LVQ for Text Categorization using Multilingual Linguistic Resource. Neurocomputing, 55:665–679.

Martín Valdivia, M.T., A. Montejo Ráez, M.C. Díaz Galiano, and L.A. Ureña López. 2007. Integración de conocimiento en un dominio específico para la categorización multietiqueta. Procesamiento del Lenguaje Natural, 38.

Martín Valdivia, M.T., L.A. Ureña López, and M. García Vega. 2007. The learning vector quantization algorithm applied to automatic text classification tasks. Neural Networks, 20(6):748–756.

Miller, G.A., R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1993. Introduction to WordNet: An On-line Lexical Database.

Mitkov, Ruslan, editor. 2003. The Oxford Handbook of Computational Linguistics. Oxford University Press.

Montejo Ráez, A. and R. Steinberger. 2004. Why keywording matters. High Energy Physics Libraries Webzine, (Issue 10), December.

Pouliquen, Bruno, Denis Delamarre, and Pierre Le Beux. 2002. Indexation de textes médicaux par extraction de concepts, et ses utilisations. In A. Morin and P. Sébillot, editors, 6th International Conference on the Statistical Analysis of Textual Data, JADT'2002, volume 2, pages 617–628, March.

Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47.

Vieduts-Stokolo, Natasha. 1987. Concept recognition in an automatic text-processing system for the life sciences.

Wright, Lawrence W., Holly K. Grossetta Nardini, Alan R. Aronson, and Thomas C. Rindflesch. 1999. Hierarchical concept indexing of full-text documents in the Unified Medical Language System® Information Sources Map. Journal of the American Society for Information Science, 50(6):514–523.

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 129-136

received 08-02-08, accepted 03-03-08

Sistemas de Recuperación de Información Geográfica multilingües en CLEF ∗
Multilingual Geographical Information Retrieval systems in CLEF

José Manuel Perea Ortega, Miguel Angel García Cumbreras, Manuel García Vega, L. Alfonso Ureña López
Universidad de Jaén, Campus Las Lagunillas, Edificio A3. E-23071
{jmperea,magc,mgarcia,laurena}@ujaen.es

Resumen: En este artículo se presenta un estudio comparativo de las distintas estrategias y técnicas de procesamiento del lenguaje natural más utilizadas en la actualidad para abordar la tarea de la recuperación de información geográfica (Geographical Information Retrieval, GIR). Este trabajo se ha basado fundamentalmente en el análisis de los mejores sistemas presentados a la tarea de búsqueda del GeoCLEF, un marco de evaluación para recuperación de información geográfica que pertenece al foro internacional Cross Language Evaluation Forum (CLEF). Las conclusiones obtenidas reflejan que es imprescindible hacer uso de recursos externos de información geográfica, tales como gazetteers y tesauros o reconocedores de entidades. Así mismo es necesario realizar una indexación por separado de la información geográfica y de la no geográfica antes del proceso de recuperación.

Palabras clave: Recuperación de Información Geográfica, GeoCLEF, Procesamiento del Lenguaje Natural, Recuperación de Información

Abstract: This paper presents a comparative study of the natural language processing strategies and techniques most widely used at present to address the geographical information retrieval (GIR) task. The work is based on an analysis of the best systems submitted to the search task of GeoCLEF, an evaluation framework for geographical information retrieval that belongs to the international Cross Language Evaluation Forum (CLEF). The main conclusions show that it is essential to use external geographic information resources such as gazetteers, thesauri and named entity recognizers, and that geographic and non-geographic information should be indexed separately before the retrieval process.

Keywords: Geographical Information Retrieval, GeoCLEF, Natural Language Processing, Information Retrieval

∗ This work has been funded by the Ministerio de Ciencia y Tecnología through the TIMOM project (TIN2006-15265-C06-03) and by the RFC/PP2006/Id514 project funded by the Universidad de Jaén.

ISSN 1135-5948. © Sociedad Española para el Procesamiento del Lenguaje Natural

1. Introduction

Geographical Information Retrieval (GIR from now on) is a specialized branch of traditional Information Retrieval (IR). It includes all the research areas that traditionally form the core of IR, with an additional emphasis on geographical and spatial information. GIR is concerned with retrieving information that involves some kind of spatial awareness. Many documents contain some kind of spatial reference that is relevant to the search (Mandl et al., 2007).

There are conferences and evaluation forums, such as the Text REtrieval Conference [1] (TREC) and CLEF [2], that do not explicitly evaluate relevance in the geographical information retrieval task. The aim of GeoCLEF [3] is to provide the framework needed to evaluate GIR systems on information searches, taking geo-referential and multilingual aspects into account. It is a CLEF task that has been held since 2005.

The main contribution of this paper is an overview of the natural language processing (NLP) strategies and techniques most widely used in the systems submitted to the GeoCLEF task over the last three years to solve information retrieval based on geographical content.

The paper is organized as follows. First, the geographical information retrieval task is briefly described. Next, the resources used in GeoCLEF are presented. The main strategies used in a geographical information retrieval system are described in the following sections. Section 6 analyses the results obtained within GeoCLEF. Finally, the conclusions are discussed.

[1] http://trec.nist.gov
[2] http://www.clef-campaign.org

[3] http://ir.shef.ac.uk/geoclef

2. The geographical information retrieval task

The geographical information retrieval task can be defined as the retrieval of relevant documents in response to a query with a prescribed format, in which the spatial relationship may either implicitly imply containment or be explicitly selected from a set of possible topological, directional or proximity options (Bucher et al., 2005).

The most important task defined in GeoCLEF is the geographical information search task. However, GeoCLEF does not only evaluate geographical search systems. It is also proposing new subtasks within this area, such as query parsing, whose goal is to identify geographical aspects in a query, and the pilot subtasks proposed for 2008 related to Wikipedia [4] and geographical image search. For the main search task, GeoCLEF in turn organizes two subtasks: the monolingual one, in which the same language must be used for both the queries and the collections (English, German or Portuguese in 2007), and the bilingual one, which involves translation, since the language of the query must differ from that of the collection used.

There is a wide variety of approaches to the GIR task, ranging from simple information retrieval approaches with no indexing of geographical terms to architectures that use natural language processing techniques to extract locations and topological information from documents and queries. Some of the techniques currently in use include geographical entity extraction, semantic analysis, geographical knowledge bases (such as ontologies, thesauri or gazetteers), query expansion techniques and geographical disambiguation.

Figure 1 shows the basic architecture of the GeoUJA GIR system (Perea Ortega et al., 2007). This system was developed by our research group SINAI [5] to solve the geographical information retrieval task, and different versions of it were submitted to the GeoCLEF 2006 (García Vega et al., 2007) and 2007 competitions.

Figure 1: Basic architecture of the GeoUJA GIR system

[4] http://www.wikipedia.org
[5] http://sinai.ujaen.es

3. Resources

The document collections used in GeoCLEF consist of newspaper stories from 1994 and 1995. The English collection contains stories, news items and events of national and international coverage that represent a wide variety of geographical regions and locations. It comprises 169,477 documents drawn from the British newspaper The Glasgow Herald (1995) and the American newspaper Los Angeles Times (1994). In addition to the English collection, GeoCLEF 2007 provided collections in German and Portuguese. GeoCLEF 2006 even supplied a collection of documents in Spanish. All these collections share a common structure: newspaper-specific information such as date, page, topic, title and author, plus the text of the news item. The collections have not been geographically tagged and contain no specific semantic information about locations (Mandl et al., 2007).

A total of 25 queries were generated for GeoCLEF 2007. These queries try to reflect a reasonable user's point of view, either asking about tourist sites (for example, St. Paul's Cathedral), defining specific areas ("northern Italy"), or taking a journalistic point of view ("human rights violations in Myanmar" or "deaths in the Himalayas"). They also try to reflect several difficulties related to tasks addressed by natural language processing:

- Geographical ambiguity. For example, there is a St. Paul's Cathedral in London and another in São Paulo.
- Ill-defined geographical regions ("Near East").
- Complex geographical relations such as "near Russian cities" or "along the Mediterranean coast".
- Multilingual aspects. "Greater Lisbon" in English is the same as "Grande Lisboa" in Portuguese or "Großraum Lissabon" in German.
- Granularity in references to countries. For example, "northern Italy".

The query format used in 2006 and 2007 differs slightly from the one used in 2005, since it does not provide the geographical entities already tagged. As Figure 2 shows, a query consists of three tags: title, description and narrative. Experiments normally use the text of the title and description tags, although for some queries the text of the narrative tag is useful, since it contains detailed geographical descriptions that help the search engine define its relevance criterion more precisely and sometimes even lists locations or regions relevant to the search.

Figure 2: Format of a GeoCLEF 2007 query

4. Main NLP techniques applied in a GIR system

Our study of the main NLP techniques applied in a GIR architecture is based on the systems submitted to GeoCLEF 2005, 2006 and 2007 for the English monolingual task. In general, all the architectures preprocess both the document collections and the queries. This linguistic analysis consists of applying a stemmer, a stop-word list to remove words with no semantic content, and a Named Entity Recognizer (NER) to detect and recognize possible entities in any text.

According to the study, the most widely used stemmer is the Porter Stemmer [6]. Snowball Tartarus [7] is also used in several systems, though less frequently. As for English stop words, the most widely used list is the one created by Salton and Buckley [8], which contains 571 words. Regarding named entity recognizers, some systems implemented their own using various geographical knowledge bases and thesauri (Ferrés and Rodríguez, 2007; Larson, 2007), but most used LingPipe [9] as their NER tool. In the GIR system we submitted to the last two editions of GeoCLEF we used the NER module included in GATE (General Architecture for Text Engineering) [10], with good results. According to the analysis of the systems, POS (Part Of Speech) tagging tools are rarely used, although some systems, such as (Ferrés and Rodríguez, 2007), use a statistical POS tagger called TnT.

Finally, another important tool in NLP is machine translation (MT). In the GIR task it is needed when the query and the collection to be indexed are in different languages (multilingual task). The LEC Power Translator is used in (Larson, 2007). Our GeoUJA GIR system uses its own machine translation system, SINTRAM (SINai TRAnslation Module) (García Cumbreras et al., 2007).

[6] http://tartarus.org/martin/PorterStemmer
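The preprocessing common to these systems (stop-word removal followed by stemming) can be sketched as below. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list standing in for the 571-word list mentioned in the text, and `crude_stem` is a deliberately naive suffix stripper standing in for the Porter Stemmer, not an implementation of it.

```python
import re

# Toy stop-word list (assumption); a real system would load a full list.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def crude_stem(token):
    """Strip one common English suffix; a naive stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left untouched.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


terms = preprocess("Deaths in the Himalayas")
```

Applying the same function to both documents and queries keeps the index terms and the query terms comparable, which is why the systems surveyed preprocess both sides.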

[7] http://snowball.tartarus.org
[8] ftp://ftp.cs.cornell.edu/pub/smart/english.stop
[9] http://www.alias-i.com/lingpipe
[10] http://gate.ac.uk

5. Most widely used approaches to the GIR task

In general, the architecture of any GIR system starts from a basic information retrieval model. An essential element of all the systems submitted is therefore the tool used as search engine. Among the most widely used are Lucene [11], Terrier [12] and, to a lesser extent, Lemur [13]. Some participants chose to implement their own search engine, as in (Toral et al., 2006) with the passage-based IR-n system, which obtained good results in GeoCLEF 2006.

According to the study, the weighting schemes and relevance feedback techniques most widely used in the IR systems are: TF·IDF, Okapi (Robertson and Walker, 1999), DFR (Divergence From Randomness) (Ounis et al., 2006), BRF (Blind Relevance Feedback) (Chen, 2003), PRF (Pseudo Relevance Feedback) (Buckley et al., 1995) and LR (Logistic Regression) (Cooper, Gey, and Dabney, 1992). There are other, less common schemes, such as the inverse document frequency model with Laplace after-effect and normalization 2 (InL2), used in (Guillén, 2007).

[11] http://lucene.apache.org
[12] http://ir.dcs.gla.ac.uk/terrier
[13] http://www.lemurproject.org

5.1. GeoCLEF 2005

In the first edition of GeoCLEF, unlike the two later ones, the organizers added to the queries information about the main concept, the locations and their spatial relations. All this information was extracted manually and placed in tags right after the main ones of each topic. For this reason, some approaches relied solely on classic information retrieval, with no geographical processing. In fact, three of the four top-scoring systems in this edition relied solely on an IR system with no treatment of geographical information. The architecture that obtained the best results in the English monolingual task was the one submitted by the University of California, Berkeley (Gey and Petras, 2005), which used a classic information retrieval system with a document ranking algorithm based on logistic regression.

Most systems opted for named entity recognizers specialized in the geographical domain as an initial approach to the task (Cardoso et al., 2005). Other architectures also used external geographical knowledge resources such as ontologies and gazetteers, as well as social statistics and physical characteristics of the places they describe. Specifically, they used gazetteers such as GNIS [14] (Geographic Names Information System) and GNS [15] (GEOnet Names Server). The XLDB group of the University of Lisbon built its own geographical ontology from external resources such as Wikipedia and World Gazetteer [16] (Cardoso et al., 2005).

In addition, several systems used query expansion (Buscaldi, Rosso, and Sanchis Arnal, 2005). The architecture submitted by the Universidad Politécnica de Valencia used the non-geographical ontology WordNet [17] to perform this expansion, relying on the synonymy and meronymy relations.

[14] http://www.usgs.gov
[15] http://www.nga.mil
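As a rough illustration of query expansion through synonymy and meronymy, the sketch below appends related terms to a query. The `RELATED` dictionary is a hand-made toy resource and `expand_query` is a hypothetical helper. Neither reproduces WordNet itself nor the method of any participating system.

```python
# Toy lexical resource (assumption): for each term, its synonyms and its
# meronyms (parts of the place named by the term).
RELATED = {
    "italy": {
        "synonyms": ["italia"],
        "meronyms": ["lombardy", "sicily"],
    },
}


def expand_query(terms):
    """Return the original terms followed by their synonyms and meronyms.

    Order is preserved and duplicates are skipped, so the original query
    terms always come first.
    """
    expanded = list(terms)
    seen = set(terms)
    for term in terms:
        entry = RELATED.get(term, {})
        for related in entry.get("synonyms", []) + entry.get("meronyms", []):
            if related not in seen:
                seen.add(related)
                expanded.append(related)
    return expanded


query = expand_query(["floods", "italy"])
```

A real system would also have to decide how much weight the added terms receive relative to the original ones, a choice that can decide whether expansion helps or hurts retrieval.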

[16] http://world-gazetteer.com
[17] http://wordnet.princeton.edu

5.2. GeoCLEF 2006

In GeoCLEF 2006 the variety of architectures submitted increased considerably with respect to the first edition. The approaches ranged from basic IR with no geographical indexing to deep natural language processing to extract places and topological terms from both the collections and the queries. Some of the specific techniques used were:

- Ad-hoc techniques (BRF, word decompounding, manual query expansion).
- In-house construction of geographical knowledge resources from external resources (gazetteers such as GNIS or World Gazetteer).
- Query expansion based on gazetteers and WordNet.
- Question-answering modules using passage retrieval.
- Geographical entity extraction.
- Geographical ambiguity resolution.

The system submitted by the XLDB group of the University of Lisbon (Martins et al., 2006) obtained the best results in the English monolingual task. They again used the geographical ontology they had created for the previous edition, this time to expand the queries. This ontology is organized into concepts that they map to geographic scopes, so they also used query expansion based on geographic scopes. Another interesting feature of their system is the use of geographical reference (toponym) disambiguation and of geographical similarity between scopes.

Our research group SINAI, in its first participation in GeoCLEF (García Vega et al., 2007), chose to expand the queries using geographical information from a NER, from a gazetteer such as Geonames [18] and from a thesaurus generated from the GeoCLEF collections themselves. This approach did not give better results than the baseline (no query expansion), so we concluded that the expansion was not being done correctly. The same happened to the University of Alicante, which came second in the English monolingual task. The basic approach used by this group was the one followed by most of the systems submitted to this second edition of GeoCLEF (Toral et al., 2006).

[18] http://www.geonames.org

5.3. GeoCLEF 2007

GeoCLEF 2007 introduced a new task, query classification, whose goal was to identify geographical components in queries. The main task kept the monolingual and bilingual subtasks. The organizers continued their effort to propose a set of queries that are difficult from a geographical point of view (see Section 3).

The best system in the English monolingual search task was the one submitted by the Universidad Politécnica de Cataluña (Ferrés and Rodríguez, 2007). In this approach, two indexes are built from the text of the collections:

- Geographical index. Contains all the geographical information extracted from the text of the collections (entities, variants of entity names to resolve possible ambiguities, geographical coordinates, etc.).
- Textual index. Stores the lemmas of the content words of the collection, without any geographical information.

To extract the geographical information from both the collections and the queries, they use a geographical knowledge base that they built themselves and that consists of three components:

- A geographical thesaurus. This component was in turn built by merging four gazetteers: GNS, GNIS, GeoWorldMap [19] and World Gazetteer. Since each gazetteer has different classes and concepts, they mapped these classes onto the set of feature types provided by the ADL Feature Type Thesaurus [20] (ADLFTT).
- A feature type thesaurus. They used the ADL Feature Type Thesaurus.
- A database containing sets of non-overlapping regions (represented by polygons) for each country (Pouliquen et al., 2004). This database solves tasks such as obtaining the boundaries of any country or detecting whether given coordinates belong to a given area.

Before retrieval, an important phase in this system is query analysis. It is divided into a linguistic analysis of the topics (POS tagging, lemma extraction and entity extraction) and a geographical analysis, which is applied to the locations and organizations detected during the linguistic analysis and uses the geographical knowledge base described above. With all these ingredients, they launch document retrieval using as query the lemmas of the topic (without geographical information). For this they use Terrier as search engine with several weighting schemes (TF·IDF, Okapi and DFR). Separately, they obtain another list of retrieved documents using the geographical information extracted from the topic and the geographical index created earlier. As search engine over this index they use a question-answering based IR system.

The last phase of the architecture is a filtering process over the documents retrieved by the IR system and those retrieved using the geographical knowledge base and the geographical index. In the final ranking, the documents that appear in both lists are placed first. Figure 3 outlines the approach followed by the Universidad Politécnica de Cataluña.

Figure 3: Basic architecture of the TALP system presented by the Universidad Politécnica de Cataluña at GeoCLEF 2007

The remaining systems basically followed the same philosophy of using external geographical resources, gazetteers, thesauri, ontologies such as WordNet and even Wikipedia. Worth mentioning is the proposal of the Universidad Politécnica de Valencia (Buscaldi and Rosso, 2007), which used query expansion with WordNet and three indexes: one for geographical terms (toponyms), another for non-geographical terms, and a third for terms extracted from WordNet that are holonyms and synonyms of the toponyms found in the first index.
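The final ranking step described for the TALP system, in which documents found by both the textual and the geographical retrieval are placed first, could look roughly like the sketch below. The text only states that documents in both lists come first. Ordering them by textual rank, appending the remaining textual results and ignoring documents found only by the geographical search are assumptions made for illustration, as is the `merge_rankings` name.

```python
def merge_rankings(textual, geographic):
    """Combine two ranked lists of document ids.

    Documents present in both lists are placed first, in textual rank order.
    The remaining textual results follow (assumption). Documents found only
    by the geographical search are not added (assumption).
    """
    geo = set(geographic)
    in_both = [doc for doc in textual if doc in geo]
    text_only = [doc for doc in textual if doc not in geo]
    return in_both + text_only


final = merge_rankings(["d3", "d1", "d7", "d2"], ["d2", "d9", "d1"])
```

Under these assumptions the geographical list acts as a filter that promotes documents rather than as a second source of candidates, which matches the description of the last phase as a filtering process.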

[19] http://www.geobytes.com
[20] http://www.alexandria.ucsb.edu/gazetteer

6. Analysis of results

This section analyses the results obtained by the participants in the last three editions of GeoCLEF for the English monolingual task (see Table 1). In general, results decline in terms of Mean Average Precision (MAP) from 2005 to 2007. This is mainly due to the greater innovation and diversity introduced when generating the 2006 and 2007 queries. For example, the GeoCLEF 2007 topics introduced added difficulties such as complex geographical relations ("the Mediterranean coast"), political regions ("Bosphorus") or difficult geographical places such as lakes, airports, Formula One circuits or cathedrals. All this has made the task harder to solve and has lowered the precision obtained by the systems.

Year    University               MAP
2005    Berkeley2                0.3936
2005    San Marcos               0.3613
2005    Alicante                 0.3495
2006    Lisboa                   0.3034
2006    Alicante                 0.2723
2006    San Marcos               0.2637
2007    Politécnica Cataluña     0.2850
2007    Berkeley1                0.2642
2007    Politécnica Valencia     0.2636

Table 1: Main GeoCLEF results in the English monolingual task

7. Conclusions

This paper has presented a study of the strategies used to solve the geographical information retrieval (GIR) task and of the NLP techniques most widely used for it. The study focused on the systems submitted to GeoCLEF, a GIR evaluation framework that CLEF has organized since 2005. The conclusions drawn from the study are summarized below:

- It is essential to use external geographical information resources, such as gazetteers and thesauri. Some of the most widely used are GNIS, GNS, Geonames, World Gazetteer and GeoWorldMap.
- It is advisable to create at least two indexes for the retrieval process: one containing the non-geographical information (textual index) and another containing only the geographical information (entities, georeferences, spatial relations, etc.).
- Basic NLP techniques should be applied to both the collections and the queries: a named entity recognizer (NER), a lemmatizer, a stop-word list and a POS tagger. A toponym disambiguator would also be useful to resolve geographical ambiguities.
- It is not clear whether query expansion is advisable. Some systems obtained worse results with this technique, such as (García Vega et al., 2007) or (Toral et al., 2006), while others improved them, such as (Buscaldi and Rosso, 2007) or (Ferrés and Rodríguez, 2007).
- Other resources such as WordNet or Wikipedia may also be of interest.

Bibliography

Bucher, B., P. Clough, H. Joho, R. Purves, and A. K. Syed. 2005. Geographic IR Systems: Requirements and Evaluation. In Proceedings of the 22nd International Cartographic Conference.

Buckley, C., G. Salton, J. Allan, and A. Singhal. 1995. Automatic query expansion using SMART: TREC 3. Proceedings of TREC3. NIST, Gaithersburg, MD, pages 69–80.

Buscaldi, D. and P. Rosso. 2007. The UPV at GeoCLEF 2007. In Working Notes of the Cross Language Evaluation Forum (CLEF 2007).

Buscaldi, D., P. Rosso, and E. Sanchis Arnal. 2005. A WordNet-based Query Expansion method for Geographical Information Retrieval. In Working Notes of the Cross Language Evaluation Forum (CLEF 2005).

Cardoso, N., B. Martins, M. Silveira Chaves, L. Andrade, and M.J. Silva. 2005. The XLDB Group at GeoCLEF 2005. In Working Notes of the Cross Language Evaluation Forum (CLEF 2005).

Chen, Aitao. 2003. Cross-Language Retrieval Experiments at CLEF 2002, volume 2785 of LNCS Series. Springer-Verlag.


Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 137-143

received 12-02-08, accepted 03-03-08

PPIEs: Protein-Protein Interaction Information Extraction system∗
PPIEs: Sistema de Extracción de Información sobre interacciones entre proteínas

Roxana Danger, Paolo Rosso, Ferran Pla, Antonio Molina
Technical University of Valencia
Cam. Vera, s/n 46022 (Spain)
(rdanger; prosso; fpla; amolina)@dsic.upv.es

Abstract: More than three million research articles have been written about proteins and Protein-Protein Interactions (PPI). The present work describes a plausible architecture and some preliminary experiments with our Protein-Protein Interaction Information Extraction system, PPIEs. The promising results obtained suggest that the approach deserves further effort. Some important aspects that need to be improved in the future have been identified: entity recognition; lexical data storage and searching (in particular, controlled vocabularies); and knowledge discovery for ontology enrichment.
Keywords: Information Extraction, Protein-Protein Interaction.

Resumen: En la literatura aparecen más de tres millones de artículos acerca de las proteínas y sus interacciones (PPI). En este trabajo se expone una arquitectura plausible y algunos experimentos preliminares de nuestro sistema de extracción de información sobre interacciones entre proteínas, PPIEs. Los resultados obtenidos son muy prometedores, por lo que el trabajo merece ulteriores desarrollos. Este estudio ha permitido, además, identificar algunos aspectos a mejorar en el futuro: el reconocimiento de entidades y el almacenaje y búsqueda de datos léxicos (en particular, los vocabularios controlados) y el descubrimiento de conocimiento para el enriquecimiento de ontologías.
Palabras clave: Extracción de información, Interacción entre proteínas.

1 Introduction

The goal of Information Extraction Systems (IES) is to enrich knowledge bases with information from texts. None of the methodologies used to solve this problem has clearly demonstrated its superiority (Reeve and Han, 2005). On the one hand, many of them are based on learning processes. In such cases, the quality of Information Extraction (IE) depends on how representative the training data are and on the systems' ability to generalize. On the other hand, the majority of IES use a complete syntactic and semantic analysis. Here the quality is affected by possible errors during Natural Language Processing (NLP).

Background knowledge is an essential element for IES. If the concepts of interest for the task are known, as well as other semantically related concepts (such as their synonyms, antonyms, meronyms, etc.), their identification can be used for effective IE. The methods for instance extraction should be based on the nature of the data to be extracted. This kind of IES, guided by knowledge or, more formally, by an ontology, has proved effective when the domain knowledge is bounded and specific enough. For example, (Danger, 2007) describes an IES that populates an archeology ontology from a text collection of archeological site reports. The system considered both the ontological entities and the complex instances relating them, and obtained 92% precision and 84% recall for an archeology ontology with more than 500 concepts and relations.

Our goal is to propose a general architecture for IES guided by ontologies that makes it possible to enrich both the domain knowledge of ontologies and their instances. This study is part of a research project for the specific biomedical domain¹. The availability of huge amounts of data in text format, the growing interest in the fascinating world of proteins, and the need for biochemistry researchers to arrange all discovered protein features in databases led us to carry out some experiments in the Protein-Protein Interaction (PPI) domain.

The present work summarizes the available resources that make our proposal plausible and shows some preliminary results of the simplest ontology-guided IES we conceive for the PPI domain. Section 2 introduces the role of proteins for life and the importance of PPI. Sections 3 and 4 describe the available resources and our first PPIEs (Protein-Protein Interaction Information Extraction system). The results of some preliminary experiments carried out with our PPIEs are discussed in Section 5. Finally, conclusions and future work are presented in Section 6.

∗ This work has been funded by the projects TIN2006-15265-C06-04 and “Juan de la Cierva” of the Ministry of Education and Science of Spain.

ISSN 1135-5948 © Sociedad Española para el Procesamiento del Lenguaje Natural

Figure 1: Increasing interest of the biomedical community in PPI research (plot, per year since 1950, of the % of papers about "protein" and the % of papers about "protein protein interaction"). Data source: http://dan.corlan.net/medline-trend.html.

2 Proteins and Protein-Protein Interaction

Heredity and variation in living organisms are the subject of study of Genetics. The discoveries made from the pioneering studies of Mendel in 1880 up to the present have made it possible to understand a small but exciting part of the biochemical mechanisms of living bodies.

A very short and shallow summary of genetic discoveries is given below. Each cell (the human body has about 100 billion cells) contains DNA (deoxyribonucleic acid) molecules, which are sequences of nucleotides that “describe” hereditary information, contained in a set of chromosomes (23 pairs for humans). DNA fragments containing this hereditary information are genes; other fragments are involved in the structural definition or in the regulation processes of the cells. At the beginning of a gene there is a promoter, which controls its activity, together with coding and non-coding sequences. Non-coding sequences regulate the conditions necessary for gene expression (the process of converting a gene into a form useful for the cell). The products of gene expression, determined by the coding sequences, are mostly proteins.

Proteins are linear polymers built from 20 amino acids. The majority of chemical reactions occurring inside the cell take place thanks to the capability of proteins to bind other molecules. Bindings between molecules of the same kind form fibers (structural function). If a protein is associated with other proteins, an interaction between proteins is observed. Protein-protein interactions make it possible to catalyze chemical reactions (enzymatic function), control the cell cycle (control function) and assemble protein complexes (complex functions) which, in turn, are involved in cell signaling or in signal transduction functions.

The importance of PPI in living bodies has motivated an increasing interest in their study. Figure 1 shows the proportional increase in published papers about proteins and PPI from the middle of the last century to the present. Up to 2005, more than 3 million papers about proteins had been published, and at least 5% of them were related specifically to PPI. The figure shows the growing interest of the biomedical community in protein research, and the even faster growth of the papers published on PPI. Different points of view are emphasized in the studies about proteins: their structural utility, biochemical signals and/or biochemical reactions. All viewpoints have to be combined in order to obtain a general idea of the influence of a given gene or protein in the organism. Moreover, PPI are important because they may help to discover the functions of other proteins by making them interact and observing the subsequent behaviour.

Considering all the above, the current challenge of bioinformatics is to populate biomedical databases with the essential information in order to allow some basic processing, such as searching or general comparison between proteins or their interactions. Currently, manual and semi-automatic processing is carried out in order to make recent discoveries available to the whole biochemical community. The present work aspires to contribute to this process of information diffusion and interchange.

¹ MIDES: Métodos de aprendizaje para la minería de textos en dominios específicos. http://gplsi.dlsi.ua.es/text-mess/index.php

3 PPI resources

The PPI resources which make it possible to define an IES are enumerated in the three following sections. As explained above, defining an ontology to guide the process is essential. In the literature we have found different ontologies regarding PPI. Their study has allowed us to identify the indispensable information that needs to be extracted. In addition, some biomedical NLP tools have been defined; understanding the methods they use, together with how to improve them, is an important issue. Finally, we describe the available data as well as the textual medical databases over which we work.

3.1 PPI ontologies

The biomedical community has been developing a set of ontologies (the OBO, Open Biomedical Ontologies²) complying with various requirements, including a minimal level of agreement between experts in each domain area. A controlled and consensual vocabulary useful in many tasks may thus be assumed. The most relevant ontologies (structures of databases, in some cases) associated with proteins and their interaction concepts are: IntAct (Interaction Database), InterPro, PO, Uniprot/Swiss-Prot, MI, MGED and Tambis. All the above ontologies share a set of 4 essential concepts, which have been described in (Orchard et al., 2007) as the minimal interesting information for PPI:

• Publications: a research subject together with its authors, institutions, journal of publication, etc., and the experiments which have been carried out;
• Experiments: a description of the experiments which justify the research;
• Interactions: a list of interactions occurring in the experiments;
• Interactors: a list of interacting molecular elements.

An ontology-driven IES for PPI should consider, in an initial stage, at least the above concepts. In successive stages, other related concepts could be added incrementally.

3.2 Biomedical NLP tools

Recognizing bio-entities (proteins, genes, biological functions, diseases, treatments and other biomedical concepts) is the task on which current developments are focusing. Given the huge number of concepts available in the controlled vocabularies which could appear in biomedical texts, some of these recognizers merge Information Retrieval (IR) and IE techniques in order to speed up the recognition process. Table 1 gives an idea of the quality of protein entity recognizers. Four of the available systems were (trained if necessary and) used to extract proteins from the evaluation sentences provided by the BIOCREATIVE’06 challenge³. As may be noticed, at least 44% of the proteins remained undetected. Most of the biomedical recognizers use: rules or dictionary search strategies, as in (Hanisch et al., 2005) and (Kou, Cohen, and Murphy, 2005); or machine learning approaches based on Hidden Markov Models or Conditional Random Fields, as in (Okanohara et al., 2006) and (Sun et al., 2007). Such poor results are due to the terminology problems observed in bio-entities. Although some molecular names provide useful cues (such as the molecular weight, the function or the discoverer's name), many interactors are described by long, compound, ambiguous, common and jargon English words. However, in the BIOCREATIVE’06 challenge (Wilbur, Smith, and Tanabe, 2007) new protein recognizers (not freely available) were described which obtain better results, with a highest F1-score of 87.21. Moreover, combining their results yields a significant improvement, reaching an F1-score of 90.66. This suggests that new bio-entity recognizers, in particular for proteins, could reach high quality by combining different techniques. A similar conclusion was reached in recent comparison studies (Ponomareva et al., 2007), (Sun et al., 2007).

System                    Pr     R      F1
ABNER                     0.57   0.44   0.50
GAPSCORE (Score ≤ 0.3)    0.67   0.52   0.56
NLPROT                    0.57   0.56   0.56
WHATIZIT                  0.82   0.54   0.65

Table 1: Comparison of protein recognizers. Pr=Precision, R=Recall.

A representative set of IES for PPI took part in the BIOCREATIVE’06 challenge (Krallinger, Leitner, and Valencia, 2007). The competition concentrated on detecting pairs of proteins and the kind of interaction between them. The common framework of the systems is to use a complete syntactic and semantic analysis to extract clearly defined interactions. Interactions are extracted by considering a verb joining two proteins or a set of manually built grammatical rules. The systems which detected interactions from raw text obtained an F-score of 30, whereas those that used manual interactor annotations reached an F-score of at most 48.

² http://obo.sourceforge.net
³ http://biocreative.sourceforge.net/biocreative 2.html

3.3 Public PPI data

The biomedical community publishes various databases in which PPI are described, and which are constantly updated and supervised by biologists. The most relevant are: HPRD (Human Protein Reference Database), IntAct (Interaction Database) and DIP (Database of Interacting Proteins). Each of them provides sophisticated search capabilities that allow users to review, compare and search for particular protein features. A large amount of research is publicly available in various formats (pdf, xml, etc.). The Pubmed database⁴ provides access to citations from the biomedical literature of many journals and conferences. Moreover, the data available in the databases refer to Pubmed paper identifiers. Therefore, by combining both sources of information, sets of texts for training and evaluation purposes may be easily defined.

Figure 2: General architecture for a simple IES.

4 Defining our first PPIEs

The simplest approach we can conceive for an IES guided by ontologies is represented in Figure 2. It basically consists of a process which converts raw text into a list of words (using text segmentation, which includes the recognition of simple datatypes such as those that can be described by regular expressions, and a sign remover). Then, the words are stemmed and used by ontology entity recognizers. The ontology entities to be recognized are defined in the form of concepts and relations of a PPI ontology. We assume that the lexical information needed to extract them from text is also specified in the ontology. Therefore, a reasoner should be used to: 1) interpret the ontology, that is, the concepts and their relations; and 2) make available the lexical information needed for the IE task. The instance generator makes use of the algorithm proposed in (Danger, 2007). This algorithm defines a set of rules for complex instance generation which use the ontology interpretation to properly link a list of ontological entities.

The above architecture is useful for studying the complexity of the problem we are facing. In the following sections we describe our PPIEs, including how the lexical information has been linked to the appropriate ontological elements and the inference process used to generate the complex instances.

⁴ http://www.ncbi.nlm.nih.gov/PubMed/

Type of entity           Vocabulary resource
Biological role          psi-mi.obo#biological role
Cell type                cell.obo#cell
Detection method         psi-mi.obo#interaction detection method
Identification method    psi-mi.obo#participant identification method
Interaction type         psi-mi.obo#interaction type
Interactor type          psi-mi.obo#interactor type
Tissue type              http://www.expasy.org/cgi-bin/lists?tisslist.txt
Protein name             Uniprot/Swiss-Prot database⁵

Table 2: PPI controlled vocabulary. Notation: Ontology name#concept base in the Ontology.
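The text-processing chain described in Section 4 (segmentation, sign removal, stemming, dictionary-based entity recognition) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the tiny dictionaries, the toy stemmer and the greedy longest-match lookup are hypothetical stand-ins, not the resources or algorithms actually used in PPIEs, and the recognition of simple datatypes by regular expressions is omitted.

```python
import re

# Hypothetical mini-dictionaries; the paper builds its dictionaries from the
# resources listed in Table 2, which are not reproduced here.
DICTIONARIES = {
    "Detection method": ["co-immunoprecipitation"],
    "Cell type": ["T-cell"],
    "Protein name": ["theta PKC", "p59fyn"],
}

# A token is a run of letters/digits, optionally joined by internal hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def segment(text):
    """Split raw text into word tokens, dropping punctuation signs."""
    return TOKEN_RE.findall(text)


def stem(word):
    """Toy stemmer (assumption): lowercase and strip a plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Segment and stem a text, returning a tuple of normalized tokens."""
    return tuple(stem(token) for token in segment(text))


def build_index(dictionaries):
    """Map each normalized dictionary term to its entity type."""
    index = {}
    for entity_type, terms in dictionaries.items():
        for term in terms:
            index[normalize(term)] = entity_type
    return index


def recognize(text, index):
    """Greedy longest-match dictionary lookup over the stemmed tokens."""
    tokens = normalize(text)
    longest = max((len(key) for key in index), default=0)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tokens[i:i + n]
            if candidate in index:
                found.append((" ".join(candidate), index[candidate]))
                i += n
                break
        else:
            # No dictionary term starts at this token; move on.
            i += 1
    return found


INDEX = build_index(DICTIONARIES)
SENTENCE = "Co-immunoprecipitation from T-cells of theta PKC and p59fyn."
ENTITIES = recognize(SENTENCE, INDEX)
for surface, entity_type in ENTITIES:
    print(f"{entity_type}: {surface}")
```

Normalizing both the dictionary terms and the input text with the same functions keeps the lookup a plain exact match on token tuples, which is why such a recognizer depends so heavily on the coverage of its dictionaries.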

4.1 PPI ontology

We have defined an ontology in OWL (Web Ontology Language) for PPI, based on the recommendations about the minimal interesting information for PPI (Orchard et al., 2007). We include other important and well classified concepts related to this domain knowledge, such as: interaction and interactor types, the biological role of a host in the experiments, the cell type on which the experiment was carried out or applied, and the interaction detection and interactor identification methods. The ontology we defined, PPIO, contains 19 concepts and 21 relations. Moreover, it has been enriched with lexical information in two annotation properties, lex and lexValue. They describe the lexical methods for identifying ontological elements (concepts and properties) and property values. In the current implementation lex and lexValue are limited to lists of entity examples.

Entity recognizers are simply dictionary searchers. Table 2 describes the resources from which the dictionaries have been created. Almost all of them are ontologies from the Open Biomedical Ontologies⁶.

4.2 Ontology Reasoner and instance generation

The Pellet reasoner⁷, one of the most popular reasoners for OWL, has been used to recover, from PPIO, the instance models (general descriptions of the concepts and their relations) and the lexical information which will be used to generate complex instances describing protein-protein interactions. For simplicity, the reader should assume that we obtain, for each concept, the other concepts and relations associated with it, its position in the hierarchy with respect to the other concepts, and how to recognize it in a text. Using all this information, the ontology entities in texts may be discovered. It is easy to infer the compositions of relations linking two concepts and the semantic distances between them. These two aspects make it possible, using the algorithm introduced in (Danger, 2007), to infer the complex ontological instances described in texts.
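Section 4.2 states that the compositions of relations linking two concepts, and the semantic distances between them, can be inferred from the ontology. The paper does not define the distance measure, so the sketch below makes an explicit assumption: the ontology is viewed as a graph of concepts, and the distance is the number of relations on the shortest path between two concepts, found by breadth-first search. The triples are a hypothetical fragment loosely modelled on the relation names of the paper's example instance, not the actual PPIO ontology.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples; names are illustrative only.
TRIPLES = [
    ("Interaction", "has_been_produced_by", "Experiment"),
    ("Interaction", "has_participant", "Interactor"),
    ("Interaction", "has_interaction_type", "InteractionType"),
    ("Experiment", "has_tissue_type", "TissueType"),
    ("Experiment", "detect_method", "DetectionMethod"),
    ("Interactor", "interactorType", "InteractorType"),
]


def build_graph(triples):
    """Build an adjacency list; inverse relations are marked with '^-1'."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
        graph.setdefault(obj, []).append((rel + "^-1", subj))
    return graph


def relation_path(graph, source, target):
    """Return the shortest composition of relations from source to target.

    Returns a list of relation names, or None when no path exists.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        concept, path = queue.popleft()
        if concept == target:
            return path
        for rel, neighbour in graph.get(concept, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [rel]))
    return None


GRAPH = build_graph(TRIPLES)
PATH = relation_path(GRAPH, "TissueType", "Interactor")
DISTANCE = len(PATH)  # assumed semantic distance: number of relations on the path
print(" / ".join(PATH), "distance:", DISTANCE)
```

Under this assumption, a tissue type and an interactor recognized in the same paragraph are linked through the experiment and the interaction, which is the kind of chain an instance generator would need in order to attach both entities to one complex instance.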

5 Preliminary experiments

Experiments have been carried out on two resources developed and maintained by EBI⁸. The first resource is IntAct, the previously mentioned database, and the second one is a set of 3422 paragraphs extracted from PPI research papers along with the interaction identification number (Accession number, AC) in the IntAct database which represents the interaction described in the paragraph. Each paragraph represents a complex interaction instance: there are 3422 interaction instances which include a total of 87186 relations. For example, given a typical paragraph such as: “Co-immunoprecipitation from T-cells of theta PKC and p59fyn.”, ontological entities are recognized using dictionary searchers, as in the example:

Co-immunoprecipitation from T-cells of theta PKC and p59fyn.

Finally, the corresponding instance is reconstructed using the instance generator as follows. Indentation is used to identify relations with previously defined instances. As may be noticed, the complex instance is created using the list of recognized entities. The appropriate relations are selected and used to link the corresponding instances. Some instances (such as experiment) and data (such as interaction type) are inferred using the ontology information.

interaction
  has been produced by :: experiment
    found in source :: ncbiTaxId=9606
    has tissue type :: Peripheral blood T-lym.
    detect method :: anti bait coimmunoprecipit.
  has participant :: Concrete interactor
    name :: Proto-oncog. tyros.-protein kin. Fyn
    interactorType :: protein
  has participant :: Concrete interactor
    name :: Protein kinase C theta type
    interactorType :: protein
  has interaction type :: physical interaction

Table 3 shows, for each type of entity mentioned in the paragraphs, the percentage of paragraphs in which it has been found and the precision and recall obtained by the particular ontology entity recognizer.

Type of entity           % of Parag.   Precision   Recall
Biological role          100           90          46
Cell type                32            92          69
Detection method         100           70          23
Identification method    100           98          85
Interaction type         100           99          83
Interactor type          100           100         78
Tissue type              9             58          35
Protein name             100           95          78

Table 3: Entities in text paragraphs.

⁶ http://obo.sourceforge.net
⁷ http://www.mindswap.org/2003/pellet/
⁸ http://www.ebi.ac.uk/

High recall values were obtained for proteins, but these results are due to the completeness of the protein dictionary, which also includes protein synonyms. In the future, we should use a molecular (protein) recognizer based on morpho-syntactic features of protein names, and protein synonyms should be discovered and matched to the corresponding most common protein names. We limit the analysis to protein interactor types; therefore, the precision is 100% and the recall coincides with the recall of protein names. Other entities behave differently. The interaction type, identification method and cell type concepts are well recognized due to the stability of their vocabulary, whereas only a low proportion of detection methods and tissue types are recognized. We plan to perform a thorough study of the dynamism of biomedical terminology in order to recognize new terms, as well as to improve the entity disambiguation mechanism. A process for identifying typing errors will also be included, because we noticed a high frequency of such mistakes in the processed text.

With respect to the instance generation process, a precision of 72% and a recall of 67% were obtained considering all paragraphs. We consider that an instance is well recognized if it refers to the correct concept and all its relations are well formed. In spite of the rather simple linguistic processing, the precision and recall values obtained by the system are satisfactory. We will try to keep the linguistic processing complexity as low as possible in future developments. Moreover, we plan to improve the entity recognition process to make it less dictionary-dependent. Two other issues will be considered in the future: the learning of new terms, synonyms, acronyms and metonyms to enrich the controlled vocabulary, and the efficient recognition of such terms in texts. The latter aspect includes the use of efficient indexing strategies for searching terms appearing in texts.
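The precision and recall figures reported in this section can be related to the F-scores quoted for other systems with a short calculation. The sketch below uses the standard set-based definitions; the gold and predicted entity sets are hypothetical examples, not data from the experiments, and the F1 value derived from the reported 72% precision and 67% recall is our own arithmetic rather than a figure given in the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold, predicted):
    """Compare a set of predicted items against a gold-standard set."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical annotations for one paragraph: (surface form, entity type).
GOLD = {
    ("theta PKC", "Protein name"),
    ("p59fyn", "Protein name"),
    ("T-cell", "Cell type"),
    ("co-immunoprecipitation", "Detection method"),
}
PREDICTED = {
    ("theta PKC", "Protein name"),
    ("p59fyn", "Protein name"),
    ("PKC", "Protein name"),  # spurious match, counted as a false positive
}

P, R, F1 = precision_recall_f1(GOLD, PREDICTED)
print(f"precision={P:.2f} recall={R:.2f} F1={F1:.2f}")

# F1 implied by the instance generation results reported above (72% / 67%).
INSTANCE_F1 = f1_score(0.72, 0.67)
print(f"instance generation F1 ~ {INSTANCE_F1:.3f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, which is why recognizers with high precision but low recall, such as those for detection methods in Table 3, would still score poorly on it.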

6 Conclusions and further work

In this paper we have introduced an architecture for an information extraction system for protein-protein interactions, PPIEs. The most important resources available regarding PPI have been summarized. These resources have been used to perform information extraction on relevant papers. A domain ontology on PPI has been defined which includes lexical information regarding ontological entities. Preliminary experimental results are encouraging. They indicate that the proposed set of tools is suitable for PPI identification, although a more sophisticated mechanism for entity identification should be used in the future. Furthermore, we plan to study the dynamism of the biomedical vocabulary (including the recognition and evolution of new terms, synonyms, acronyms and metonyms), the disambiguation process and the extension of the PPIO ontology.

References

Danger, Roxana. 2007. Extraction and analysis of information from the Semantic Web perspective (in Spanish: Extracción y análisis de información desde la perspectiva de la Web Semántica). Ph.D. thesis.

Hanisch, Fundel, Mevissen, Zimmer, and Fluck. 2005. Prominer: rule-based protein and gene entity recognition. BMC Bioinformatics, 6 Suppl 1.

Kou, Zhenzhen, William Cohen, and Robert Murphy. 2005. High-recall protein entity recognition using a dictionary. Bioinformatics, 21(1):266–273.

Krallinger, Martin, Florian Leitner, and Alfonso Valencia. 2007. Assessment of the second biocreative ppi task: Automatic extraction of protein-protein interactions. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, pages 41–54.

Okanohara, Daisuke, Yusuke Miyao, Yoshimasa Tsuruoka, and Junichi Tsujii. 2006. Improving the scalability of semi-markov conditional random fields for named entity recognition. Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, pages 465–472.

Orchard, Sandra et al. 2007. The minimum information required for reporting a molecular interaction experiment (mimix). Nature Biotechnology, 25(8):894–898.

Ponomareva, Natalia, Paolo Rosso, Ferran Pla, and Antonio Molina. 2007. Conditional random fields vs. hidden markov models in a biomedical named entity recognition task. In Proc. of Int. Conf. Recent Advances in Natural Language Processing, RANLP, pages 479–483.

Reeve, Lawrence and Hyoil Han. 2005. Survey of semantic annotation platforms. In SAC, pages 1634–1638.

Sun, Chengjie, Yi Guan, Xiaolong Wang, and Lei Lin. 2007. Rich features based conditional random fields for biological named entities recognition. Computers in Biology and Medicine, 37(9):1327–1333.

Wilbur, John, Larry Smith, and Lorrie Tanabe. 2007. Biocreative 2. gene mention task. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, pages 7–16.

Theses

Procesamiento del Lenguaje Natural, Revista nº 40, marzo de 2008, pp. 147-148

received 06-11-07, accepted 03-03-08

Computing Meaning in Interaction
Computación del Significado en Diálogos

Roser Morante Vallejo
Tilburg University
Postbus 90153, 5000 LE Tilburg, The Netherlands

Resumen: Tesis doctoral realizada en la Universidad de Tilburg por Roser Morante Vallejo bajo la dirección de Harry Bunt (Tilburg Univ.). La defensa de la tesis tuvo lugar el 3 de diciembre de 2007 ante el tribunal formado por los doctores David Traum (Univ. of Southern California), Michael McTear (Univ. of Ulster), Reinhard Muskens (Tilburg Univ.), Emiel Krahmer (Tilburg Univ.) y Robbert-Jan Beun (Utrecht Univ.).
Palabras clave: Actos de habla, simulación del diálogo, actualización del contexto, DIT, grounding.

Abstract: PhD thesis written by Roser Morante Vallejo at Tilburg University under the supervision of Harry Bunt (Tilburg Univ.). The thesis defence (viva voce) took place on the 3rd of December 2007 before the committee formed by doctors David Traum (Univ. of Southern California), Michael McTear (Univ. of Ulster), Reinhard Muskens (Tilburg Univ.), Emiel Krahmer (Tilburg Univ.) and Robbert-Jan Beun (Utrecht Univ.).
Keywords: Dialogue acts, dialogue simulation, context update, DIT, grounding.

1. Introduction

The general purpose of our research is to define a model of dialogue context update in the framework of Dynamic Interpretation Theory (DIT) (Bunt, 2000). According to the theory, communicative agents can be modelled as structures of goals, beliefs, preferences, expectations, and other types of information, plus memory and processing capabilities. Part of these structures is dynamic in the sense of changing during a dialogue, as a result of the agents perceiving and understanding each other’s communicative behavior, of reasoning with the outcomes of these processes, and of planning communicative and other acts. A dialogue participant’s beliefs about the domain and about the dialogue partner form a crucial part of his information state, which in DIT is called his context. Dialogue acts are functional units used by the speaker to change the context. Formally, a dialogue act in DIT consists of a semantic content and a communicative function, the latter specifying how the information state of the addressee is to be updated with the former upon understanding the corresponding utterance. Context includes the participant’s state of beliefs and goals, including beliefs about each other’s processing of previous utterances.

2. Contributions

Our main contributions are: (i) applying the theory to the analysis of dialogue, using the DIT taxonomy of dialogue acts to model dialogues; in particular we are concerned with modeling the effects of three groups of dialogue acts in the dialogue context: Information Transfer, Action Discussion, and Dialogue Control Feedback; (ii) assigning the model of beliefs and goals to dialogue acts; (iii) analysing fragments of dialogues by applying this model; (iv) defining a model of context update by defining certain principles and rules.

On the basis of a detailed analysis of the flow of beliefs in a number of simple dialogue fragments, we propose certain mechanisms for modeling the transfer of information: adoption, strengthening, and cancellation of beliefs. This has allowed us to explain in the form of an algorithm how information may be updated in a dialogue (Morante, Keizer, and Bunt, 2007), in particular how information may be grounded. We have proposed that grounding is the side-effect of general communication principles, and mostly the result of addressees giving feedback, implicit or explicit, to speakers (Bunt and Morante, 2007). The context update model has been converted into an algorithm and implemented in a dialogue simulator (Keizer and Morante, 2007).

In sum, our investigation has yielded theoretical and practical results. On the theoretical side, the analysis of dialogues has led to a better understanding of how the dialogue participant’s context is updated as an effect of the utterances being produced. On the practical side, the context update model has been converted into an algorithm and implemented in a dialogue simulator.

Roser Morante

3.

Contents

Chapter 1: Introduction introduces the topic of research, goals, scope, and background. Chapter 2: Dialogue Modelling presents a general view of the main approaches to dialogue modeling, a review of foundational literature on belief modeling, and the information state approach to dialogue management, where DIT can be placed. In Chapter 3: Grounding we review various approaches to grounding, a dialogue phenomenon for which our model of dialogue analysis can give an account. We start by defining some concepts related to grounding. We then introduce the foundational Contribution Model by Clark and Schaefer (1989) and two related proposals: the extension of the Contribution Model to human-computer interaction by Brennan and collaborators (Brennan, 1998; Cahn and Brennan, 1999), and the formal theory of grounding by Paek and Horvitz (2000). The chapter closes with the computational theory of grounding by Traum (1994) and the treatment of grounding from the information state update perspective.

Chapter 4: Dynamic Interpretation Theory is devoted to introducing the theoretical framework of our research. The concepts of dialogue act and context are explained, the DIT dialogue act taxonomy is presented, and the DIT approach to dialogue management is sketched. Chapter 5: Dialogue Analysis Methodology presents the methodology that will be applied to the analysis of dialogues. It consists of defining the effects that an utterance has on the context model, and making explicit the general rules and principles that govern the context update: creation, adoption, and cancellation of beliefs. In Chapter 6: Analysis of Dialogue Patterns (I), General Purpose Communicative Functions we analyse how the context is updated with the General Purpose Communicative Functions of Information Transfer and Action Discussion. In Chapter 7: Analysis of Dialogue Patterns (II), Dialogue Control Communicative Functions we focus our attention on a group of Dialogue Control Functions: Auto-Feedback Functions. Feedback Functions are used by dialogue participants to provide information about their processing of the partner's previous utterances. Feedback can be positive or negative, and can refer to different levels of processing. The goal of this chapter is to provide an analysis for all levels and types of Auto-Feedback communicative functions, as defined in DIT.

In Chapter 8: Context Update in Dialogues: a DIT approach we analyse long dialogues, and we show that the DIT mechanisms for context update can explain how dialogue participants reach a subjective state of grounding, without the need for specific grounding mechanisms. Chapter 9: DISCUS: A dialogue simulator and context update system synthesizes the belief update process as understood in DIT in the form of a general algorithm that is implemented in a tool. The algorithm concentrates the findings of our research and reflects what we understand to be an aspect of computing meaning in interaction, namely updating the beliefs and goals in the participant's context model. The chapter presents the tool in which the algorithm is implemented, DISCUS, a Dialogue Simulation and Context Update System. Finally, Chapter 10: Conclusions and Future Research puts forward some conclusions and suggestions for future research.

References

Brennan, S. E. 1998. The grounding problem in conversations with and through computers. In S. R. Fussell and R. J. Kreuz, editors, Social and cognitive psychological approaches to interpersonal communication. Lawrence Erlbaum, Hillsdale, NJ, pages 201–225.

Bunt, H. 2000. Dialogue pragmatics and context specification. In H. Bunt and W. Black, editors, Abduction, Belief and Context in Dialogue. Studies in Computational Pragmatics. John Benjamins, Amsterdam, pages 81–150.

Bunt, H. and R. Morante. 2007. The weakest link. In Text, Speech and Dialogue, 10th International Conference, TSD 2007, Proceedings. Lecture Notes in Computer Science 4629, pages 591–598, Plzen, Czech Republic.

Cahn, J. E. and S. E. Brennan. 1999. A psychological model of grounding and repair in dialog. In Proceedings AAAI Fall Symposium on Psychological Models of Communication in Collaborative Systems, pages 25–33, North Falmouth, MA. American Association for Artificial Intelligence, AAAI.

Clark, H. H. and E. F. Schaefer. 1989. Contributing to discourse. Cognitive Science, 13:259–294.

Keizer, S. and R. Morante. 2007. Dialogue simulation and context dynamics for dialogue management. In Proceedings of the NODALIDA conference, pages 310–317, Tartu, Estonia.

Morante, R., S. Keizer, and H. Bunt. 2007. A dialogue act based model for context updating. In Proceedings of the 11th Workshop on the Semantics and Pragmatics of Dialogue (DECALOG), pages 9–16, Rovereto, Italy.

Paek, T. and E. Horvitz. 2000. Grounding criterion: toward a formal theory of grounding. Technical report MSR–TR–2000–40, Microsoft Research, Redmond, WA.

Traum, D. R. 1994. A Computational Theory of Grounding in Natural Language Conversation. PhD Thesis. Department of Computer Science, University of Rochester, Rochester.


Procesamiento del Lenguaje Natural, No. 40, March 2008, pp. 149-150

Received 29-01-08, accepted 03-03-08

Recuperación de Pasajes Multilingüe para la Búsqueda de Respuestas∗
Multilingual Passage Retrieval for Question Answering

José M. Gómez
Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia
Camino Vera s/n - 4022 Valencia
[email protected]

Abstract: PhD thesis in Computer Science written by José Manuel Gómez Soriano at the Universidad Politécnica de Valencia (UPV) under the supervision of Dr. Emilio Sanchis Arnal (UPV). The thesis was defended on 28 November 2007 before a committee formed by Drs. Manuel Palomar Sanz and Fernando Llopis Pascual (Univ. Alicante), L. Alfonso Ureña López (Univ. Jaén), and Lidia A. Moreno Boronat and Paolo Rosso (UPV). The grade obtained was Sobresaliente Cum Laude, awarded unanimously.

Keywords: JIRS, information retrieval, passage retrieval, question answering

1.

Introduction

Question Answering (QA) systems return a concrete answer to a question posed by the user. Rather than being a set of terms, as in ad hoc Information Retrieval (IR) tasks, the question is formulated in natural language and is generally well formed both syntactically and semantically. One of the difficulties QA systems face is that they return far less information than classic IR systems: the former return only an answer made up of a few terms, whereas the latter return a list of relevant documents. QA systems commonly use an IR system as a first stage to reduce the amount of information they must process. In general, traditional keyword-based IR systems fail to deliver pieces of text (passages) containing the answer when the question is posed in natural language.

JAVA Information Retrieval System (JIRS) is an IR system that was conceived and specialised from the outset for QA tasks. Unlike traditional IR systems, the goal of JIRS is to find the passages most likely to contain the answer rather than to obtain relevant documents. Moreover, it is designed to retrieve passages directly instead of documents. JIRS is language independent: it has been used with languages as diverse as Spanish, English, French, Italian, Arabic, Urdu and Oromo and, in general, it can be used with hardly any changes for any non-agglutinative language. It has recently also been adapted to Basque, an agglutinative language, by adding a small term-splitting module for Basque.

The hypothesis underlying JIRS is that, in a sufficiently large document collection, there will always be an expression very similar to the question that contains the answer. JIRS searches for these similarities and returns the closest matches at the top of the result list. For example, if the question is "What is the capital of Croatia?", JIRS will try to find the structure Zagreb is the capital of Croatia, or a very similar one. JIRS searches a document collection for n-grams formed by question terms, and the passages whose structures have the greatest weight and are most densely clustered obtain the highest similarity value.

∗ This article was partially funded by the TEX-MESS project, number TIN2006-15265-C0601.

2.

Description of JIRS

JIRS is an Information Retrieval (IR) and Passage Retrieval (PR) system that is highly modular, scalable and configurable. Besides searching with traditional keyword-based methods, it supports n-gram-based search, which makes it especially suitable for multilingual Question Answering (QA) systems. JIRS consists of a core called Java Process Manager (JPM), a set of configuration files, and a set of class libraries. JPM is a process manager that makes it easy to add or modify system functionality and execution parameters without recompiling the whole application, simply by modifying the configuration files. These files have a hierarchical structure based on XML documents, which allows the information to be organised logically. The configuration files do not consist solely of name-value parameters that configure the different actions; they also determine which actions are run and in what order. In this way, the behaviour of the system can be changed completely by modifying only the configuration file.

3.

The N-gram Distance Density model

JIRS incorporates three n-gram models for searching, of which the N-gram Distance Density model (hereafter, the Distance model) gives the best results. This model searches the passages for structures formed by question terms. It then scores these structures according to the weight of the terms they contain and the number of terms separating them from the n-gram of greatest weight. The Distance model thus favours passages formed by structures that contain the heaviest question terms and that are, in addition, more densely clustered.

4.

Conclusions

JAVA Information Retrieval System is a PR system especially oriented towards QA, since it was designed specifically for that task. It does not look for the documents or passages relevant to a query but for the passages most likely to contain the answer. To do so, it searches for structures formed by the question terms and scores them according to the weight of those terms and their distance from the structures of greatest weight. The results presented in the thesis show that JIRS improves the precision, coverage and MRR of the passages, returning more passages that contain the answer than traditional IR systems. The QA systems that used one of the JIRS n-gram models in CLEF 2005 ranked among the top positions, and in CLEF 2006 it was shown that the same QA system improved considerably when JIRS was used instead of Lucene as the PR system. Using JIRS could improve the results of most CLEF participants, since they use Lucene in their respective QA systems. The only condition that must be met for the n-gram models to improve results is that the corpus be sufficiently redundant. Otherwise, JIRS behaves like a traditional IR system. JIRS is a modular and scalable application that adapts readily to new projects without requiring knowledge of code developed by others. It is currently being used by several national and international research groups to develop new Natural Language Processing tools, owing to its qualities and power. JIRS is free software under the GPL licence and can be downloaded free of charge from http://jirs.dsic.upv.es/.
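The idea behind the Distance model (score structures of question terms by their weight and by how far they lie from the heaviest structure) can be sketched in Python as follows. This is an illustration only, not the JIRS implementation: the function names, the hand-picked term weights, the logarithmic distance penalty and the constant `k` are all assumptions, since the text does not give the exact formulas.

```python
import math


def find_ngrams(passage, question_terms):
    """Return maximal runs of consecutive question terms as (start, end, terms)."""
    runs, start = [], None
    for i, token in enumerate(passage + [None]):  # the sentinel closes the last run
        if token in question_terms:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1, passage[start:i]))
            start = None
    return runs


def similarity(passage, weights, k=0.1):
    """Score a tokenised passage against weighted question terms (assumed formula)."""
    runs = find_ngrams(passage, weights)
    if not runs:
        return 0.0

    def run_weight(run):
        # Weight of a structure: sum of the weights of its distinct terms.
        return sum(weights[t] for t in set(run[2]))

    heaviest = max(runs, key=run_weight)
    total = 0.0
    for run in runs:
        # Number of tokens separating this run from the heaviest one.
        if run is heaviest:
            gap = 0
        elif run[0] > heaviest[1]:
            gap = run[0] - heaviest[1] - 1
        else:
            gap = heaviest[0] - run[1] - 1
        # Assumed penalty: structures farther from the heaviest one count less.
        total += run_weight(run) / (1.0 + k * math.log(1 + gap))
    return total / sum(weights.values())


# Hypothetical weights for "What is the capital of Croatia?": content words weigh more.
weights = {"is": 0.1, "the": 0.1, "capital": 1.0, "of": 0.1, "croatia": 1.0}
close = "zagreb is the capital of croatia".split()
scattered = "croatia has many islands and its capital zagreb is old".split()
print(similarity(close, weights) > similarity(scattered, weights))
```

With these assumed weights, the passage that contains the question terms as a single contiguous structure scores higher than the one in which they are scattered, which is the behaviour the model is described as rewarding.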

Procesamiento del Lenguaje Natural, No. 40, March 2008, pp. 151-152

Received 30-01-08, accepted 03-03-08

Desarrollo y evaluación de diferentes metodologías para la gestión automática del diálogo∗
Development and evaluation of different methodologies for automatic dialog management

David Griol Barres
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València. E-46022 València, Spain
[email protected]

Abstract: PhD thesis in Computer Science written by David Griol Barres under the supervision of Dr. Lluís Hurtado Oliver and Dr. Encarna Segarra Soriano (Univ. Politècnica de València). The author was examined on 12 December 2007 by a committee formed by Drs. Eduardo Lleida Solano (Univ. de Zaragoza), Javier Macías Guarasa (Univ. de Alcalá de Henares), María Inés Torres Barañano (Univ. del País Vasco), Emilio Sanchis Arnal (Univ. Politècnica de València) and Fernando García Granada (Univ. Politècnica de València). The grade obtained was Sobresaliente Cum Laude, awarded unanimously.

Keywords: Dialog Management, Statistical Models, User Simulation, Adaptation, Dialog Systems

1.

Introduction

A long-standing interest within the field of Speech Technologies has been to use these technologies in real applications, especially applications that allow a person to use their voice to obtain information through direct interaction with a machine or to control a given system. A dialog system can thus be understood as an automatic system capable of emulating a human being in a dialog with another person, with the aim of fulfilling a certain task (normally providing some information or carrying out a given action). The dialog manager is a central element in the architecture of a dialog system, given the number of modules it interacts with and the tasks it must carry out to decide the actions that respond to the user's turn.

The main objective of the thesis is the study and development of different methodologies for dialog management in spoken dialog systems. The main challenge lies in developing purely statistical methodologies for dialog management, based on learning a model from a corpus of labelled dialogs. In this field, different approaches are presented for performing dialog management, improving the statistical model, and evaluating the dialog system.

For the practical implementation of these methodologies within a specific task, it was necessary to acquire and label a corpus of dialogs. Having a large dialog corpus available facilitated the learning and evaluation of the dialog management model developed. In addition, a complete dialog system was implemented, which makes it possible to evaluate the practical operation of the management methodologies under real conditions of use.

To evaluate the dialog management techniques, different approaches are proposed: evaluation with real users; evaluation with the acquired corpus, in which training and test partitions were defined; and the use of user simulation techniques. The user simulator developed makes it possible to model the complete dialog process statistically. In the approach presented, both obtaining the system response and generating the user turn are modelled as a classification problem: the input encodes a set of variables that represent the current state of the dialog, and the classification yields the probabilities of selecting each of the responses (sequences of dialog acts) defined for the user and the system, respectively. The dialogs generated with this simulation module were used to extend and improve the initially acquired corpus. In addition, different techniques for automatic dialog generation are presented, which facilitate the automatic acquisition of a labelled dialog corpus and the subsequent learning of a dialog manager.

The work was carried out within the DIHANA project, whose main objective was the development of a dialog system for accessing an information system through spontaneous speech. The task defined for the project was voice access to a system that provides information about national train journeys.

Finally, the methodologies proposed in DIHANA for dialog management were adapted to develop a dialog manager within the EDECÁN project. The thesis describes the adaptation carried out and the evaluation of a manager developed for a dialog system that supports the booking of sports facilities. Additionally, different rule-based methodologies for dialog management are presented, as well as various approaches to the development of natural language response generators.

∗ Work partially funded by projects TIN2005-08660-C04-02 and TIC2002-04103-C03-03.

The main lines of research defined for the doctoral thesis were thus set out in the following objectives:

1. Study and development of different statistical methodologies for building dialog managers.
2. Study and implementation of different methodologies for evaluating dialog systems.
3. Study and development of different models for user simulation.
4. Definition of methodologies that allow dialog systems to be standardised and adapted to different tasks.
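The classification view of dialog management described in this introduction (dialog state in, probabilities over responses out) can be illustrated with a minimal Python sketch. It uses a simple relative-frequency estimate learned from labelled (state, response) pairs. The state encoding, the dialog act names, the fallback response and the choice of classifier are assumptions made for the example, not the models developed in the thesis.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count responses per dialog state from (state, response) pairs."""
    counts = defaultdict(Counter)
    for state, response in corpus:
        counts[state][response] += 1
    return counts


def response_probabilities(model, state):
    """Return P(response | state) as a dict; empty if the state was never seen."""
    seen = model.get(state)
    if not seen:
        return {}
    total = sum(seen.values())
    return {response: n / total for response, n in seen.items()}


def select_response(model, state, fallback="ask_repeat"):
    """Pick the most probable response, or a fallback for unseen states."""
    probs = response_probabilities(model, state)
    return max(probs, key=probs.get) if probs else fallback


# Hypothetical state: (origin_known, destination_known, date_known).
corpus = [
    ((1, 0, 0), "ask_destination"),
    ((1, 0, 0), "ask_destination"),
    ((1, 0, 0), "ask_date"),
    ((1, 1, 1), "give_timetable"),
]
model = train(corpus)
print(response_probabilities(model, (1, 0, 0)))
print(select_response(model, (1, 1, 1)))
```

The same scheme could be trained on user turns instead of system turns to obtain a simple user simulator. A count-based estimate cannot generalise to unseen states, which is one reason a learned classifier would be preferred in practice.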

2.

Thesis structure

The thesis comprises ten chapters in total. The first chapter presents the objectives and the context of the thesis. The second chapter addresses in more detail the state of the art in spoken dialog systems. The third and fourth chapters are devoted to describing the DIHANA task and the main characteristics of the dialog system implemented for that project. The fifth chapter presents two rule-based approaches to dialog management. The sixth chapter describes the core of the work carried out in the thesis: the development of statistical models for dialog management. The seventh chapter describes different techniques and measures for evaluating dialog systems and shows the results obtained in the evaluation of the dialog managers developed. The eighth chapter presents the user simulator developed to evaluate and improve the behaviour of the statistical manager. The ninth chapter is devoted to studying how the proposed management methodologies can be adapted to new tasks. The thesis is completed by the conclusions of the work and a series of appendices that expand in greater detail on the information presented in the different chapters. The thesis can be consulted in the Research section of the website of the Departamento de Sistemas Informáticos y Computación of the UPV (www.dsic.upv.es).

General Information

SEPLN'2008
XXIV CONGRESO DE LA SOCIEDAD ESPAÑOLA PARA EL PROCESAMIENTO DEL LENGUAJE NATURAL (24th Conference of the Spanish Society for Natural Language Processing)
Escuela Politécnica Superior, Universidad Carlos III de Madrid (Spain)
10-12 September 2008
http://basesdatos.uc3m.es/sepln2008/web/

1

Presentation

The XXIV edition of the annual conference of the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN) will be held in Madrid (Spain) from 10 to 12 September 2008, organised by the SEPLN together with the Universidad Carlos III de Madrid. As in previous editions, with this event the SEPLN aims to promote the dissemination of the research, development and innovation activities carried out in any area of natural language processing by Spanish and foreign groups and researchers. The conference aspires to offer a forum for discussion and communication that favours the exchange of the information and scientific materials needed to promote the publication of work and collaboration with national and international institutions active in the conference's area of interest.

2

Objectives

The main objective of this conference is to offer the scientific and business community of the sector an ideal forum for presenting the latest research and developments in NLP, as well as for showing real application possibilities and learning about new projects. In this way, the XXIV SEPLN conference aims to be a meeting place for communicating results and exchanging opinions on the current development of this area. In addition, it seeks to meet the objective of previous editions of identifying future directions in basic research and the applications envisaged by professionals, in order to contrast them with the real needs of the market. The conference also aims to be a suitable setting for introducing other interested people to this area of knowledge.

3

Topics

Groups and researchers are encouraged to submit papers, project summaries or demonstrations in any of the following topic areas:
• Linguistic, mathematical and psycholinguistic models of language
• Corpus linguistics
• Monolingual and multilingual information extraction and retrieval
• Grammars and formalisms for morphological and syntactic analysis
• Computational lexicography
• Monolingual and multilingual text generation
• Machine translation
• Speech recognition and synthesis
• Semantics, pragmatics and discourse
• Word sense disambiguation
• Industrial applications of NLP
• Automatic analysis of textual content

4

Conference Format

The conference is expected to last three days, with invited talks and sessions devoted to the presentation of papers and of projects or demonstrations.

5

Programme Committee

Members:
• Prof. José Gabriel Amores Carredano (Universidad de Sevilla)
• Prof. Toni Badia i Cardús (Universitat Pompeu Fabra)
• Prof. Manuel de Buenaga Rodríguez (Universidad Europea de Madrid)
• Prof. Fco. Javier Calle Gómez (Universidad Carlos III de Madrid)
• Prof.ª Irene Castellón Masalles (Universitat de Barcelona)
• Prof.ª Arantza Díaz de Ilarraza (Euskal Herriko Unibertsitatea)
• Prof. Antonio Ferrández Rodríguez (Universitat d'Alacant)
• Prof. Mikel Forcada Zubizarreta (Universitat d'Alacant)
• Prof.ª Ana María García Serrano (Universidad Politécnica de Madrid)
• Prof. Koldo Gojenola Galletebeitia (Euskal Herriko Unibertsitatea)
• Prof. Xavier Gómez Guinovart (Universidade de Vigo)
• Prof. Julio Gonzalo Arroyo (Universidad Nacional de Educación a Distancia)
• Prof. José Miguel Goñi Menoyo (Universidad Politécnica de Madrid)
• José B. Mariño Acebal (Universitat Politécnica de Catalunya)
• Prof.ª M. Antonia Martí Antonín (Universitat de Barcelona)
• Prof.ª Mª Teresa Martín Valdivia (Universidad de Jaén)
• Prof. Patricio Martínez Barco (Universitat d'Alacant)
• Prof. Paloma Martínez Fernández (Universidad Carlos III de Madrid)
• Prof.ª Raquel Martínez Unanue (Universidad Nacional de Educación a Distancia)
• Prof.ª Lidia Ana Moreno Boronat (Universitat Politécnica de Valencia)
• Prof. Lluis Padró (Universitat Politécnica de Catalunya)
• Prof. Manuel Palomar Sanz (Universitat d'Alacant)
• Prof. Ferrán Pla (Universitat Politécnica de Valencia)
• Prof. Germán Rigau (Euskal Herriko Unibertsitatea)
• Prof. Horacio Rodríguez Hontoria (Universitat Politécnica de Catalunya)
• Prof. Kepa Sarasola Gabiola (Euskal Herriko Unibertsitatea)
• Prof. Emilio Sanchís (Universitat Politécnica de Valencia)
• Prof. L. Alfonso Ureña López (Universidad de Jaén)
• Prof.ª Mª Felisa Verdejo Maillo (Universidad Nacional de Educación a Distancia)
• Prof. Manuel Vilares Ferro (Universidade de Vigo)

6

Important Dates

Dates for the submission and acceptance of papers:
• Deadline for paper submission: 28 April 2008
• Notification of acceptance: 13 June 2008
• Deadline for the final version: 27 June 2008
• Deadline for submitting projects and demonstrations: 6 June 2008

Membership Application Form for Individual Members

Personal details: Surname(s); First name; DNI (national identity number); Date of birth; Telephone; E-mail; Address; Town/City; Postal code; Province.

Professional details: Workplace; Address; Postal code; Town/City; Province; Telephone; Fax; E-mail; Research areas or interests.

Preferred mailing address: [ ] Personal address [ ] Professional address

Bank details: Name of bank; Address; Postal code and town/city; Province; Bank code (4 digits); Branch code (4 digits); Control digits (2 digits); Account number (10 digits).

Place, date and signature.

Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN). To the Manager of: Bank; Branch number; Address; Town/City; Postal code; Province; Account type (current/savings).

I request that, from this date and until further notice, you pay to the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN) the annual receipts corresponding to the current membership fees of the society. Yours faithfully. Signed (full name of the signatory) and dated.

Membership fees: 18 € (residents in Spain) or 24 € (members resident abroad). Note: the lower part must be sent to the member's bank or savings bank.

Hoja de Inscripción para Instituciones Datos Entidad/Empresa Nombre : ................................................................................................................................................. NIF : ............................................................ Teléfono : ............................................................ E-mail : ............................................................ Fax : ............................................................ Domicilio : ................................................................................................................................................. Municipio : ................................................... Código Postal : ............ Provincia : .......................... Áreas de investigación o interés: ................................................................................................................... ........................................................................................................................................................................

Mailing Details
Address: ........................................ Postal code: ..........
Municipality: .................... Province: ....................
Telephone: .................... Fax: .................... E-mail: ....................

Bank Details:
Name of the bank: ........................................
Address: ........................................
Postal code and municipality: ........................................
Province: ........................................

Bank code (4 digits): ..........
Branch code (4 digits): ..........
Check digits (2 digits): ..........
Account number (10 digits): ..........
------------------------------------------------------------

Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN).

To the Manager of:
Bank: ........................................
Branch no.: ........................................
Address: ........................................
Municipality: ........................................ Postal code: ..........
Province: ........................................
Account type (current/savings): ........................................
Account no.: ........................................

I request that, from this date and until further notice, you pay to the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN) the annual receipts for the current membership fees of that association.

Yours faithfully,

Signed: ........................................ (full name of the signatory)

Date (day / month / year): .......... / .................... / ..........

------------------------------------------------------------
Institutional membership fee: 300 €.
Note: The lower part must be sent to the member's bank or savings bank.

Information for Authors

Format of Submissions
• The maximum length accepted for contributions is 8 DIN A4 pages (210 x 297 mm), including references and figures.
• Articles may be written in English or Spanish. The title, abstract, and keywords must be written in both languages.
• The format must be Word or LaTeX.

Submission of Papers
• Papers must be submitted electronically through the website of the Sociedad Española para el Procesamiento del Lenguaje Natural (http://www.sepln.org).
• For papers in LaTeX format, the PDF file must be sent together with all the source files needed for LaTeX compilation.
• For papers in Word format, the PDF file must be sent together with the DOC or RTF file.
