EXECUTIVE SUMMARY



Technical achievements

The LE-PAROLE project concentrated on the production of reusable resources in the domain of Language Engineering, more precisely on the construction of lexicons and tagged corpora. On the corpus side, the project has produced a set of comparable corpora for 11 languages of the European Union. The term "comparable" means not only that all of them are of equivalent size, but also that their relative composition (i.e. type of text: newspaper vs. technical documentation vs. commercial fliers, period covered, etc.) is very similar. The availability of such a resource opens new areas for research and commercial applications. The lexicon part of the project has the undeniable merit of having validated a common conceptual representation model for 11 languages of the EU. The resulting lexicons are themselves of a size (20,000 words) sufficient to bootstrap a number of LE applications.
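The notion of "comparable" composition can be illustrated with a small sketch. It is not the project's own procedure: the text-type labels, the word counts and the idea of summarising the difference as the largest gap in share are all assumptions made for illustration.

```python
def composition(corpus):
    """Return each text type's share of the corpus (shares sum to 1)."""
    total = sum(corpus.values())
    return {text_type: words / total for text_type, words in corpus.items()}


def composition_gap(corpus_a, corpus_b):
    """Largest absolute difference in share over all text types.

    A small gap suggests a similar relative composition; where to put
    the threshold is a project decision and is not fixed here.
    """
    shares_a = composition(corpus_a)
    shares_b = composition(corpus_b)
    text_types = set(shares_a) | set(shares_b)
    return max(
        abs(shares_a.get(t, 0.0) - shares_b.get(t, 0.0)) for t in text_types
    )


# Hypothetical word counts per text type for two 20M-word corpora.
corpus_x = {"newspaper": 12_000_000, "technical": 6_000_000, "fliers": 2_000_000}
corpus_y = {"newspaper": 11_000_000, "technical": 7_000_000, "fliers": 2_000_000}

gap = composition_gap(corpus_x, corpus_y)
```

With these made-up figures the two corpora have the same size and differ by at most five percentage points in any text type.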

Validation results

The validation of the resources produced in the course of the project is a three-step process. First, each Partner made periodic in-house checks of the data produced. This kind of validation was of course different for lexica and corpora, and each Partner used a strategy adapted to its particular conditions. The second level of validation was performed periodically by the technical manager of the project. It consisted in checking whether the delivered data conformed to the specified DTD; for lexica, a deeper integrity check was performed using the relevant tool. For legal and practical reasons, the validation by the technical manager was restricted to about 1% of the data produced by the Partners. The third and final validation of the data is being performed by external validators. Because of the cost of such an operation, it is being performed on three languages (Dutch, Italian and Spanish), but access to all the relevant data has been provided. It is important to note that a strict formal validation protocol has been established, based on directives of the EAGLES project. The resources of the PAROLE project are being further extended in a number of national projects, and the effort of producing generic lexica is also pursued in the SIMPLE project, which adds a semantic layer of information. These two factors provide an implicit self-correction mechanism and offer further possibilities to validate the data delivered today.

Impact and future prospects

The main impact is the delivery of reusable Language Engineering resources. A number of interactions between the PAROLE partners and other initiatives illustrate this point. For example, a sample of 1 million words from the Catalan PAROLE corpus has been supplied to the Universitat Politecnica de Catalunya to help with the production of a speech database of the same type as those constructed by SPEECHDAT. Another example, in Portugal, is the Reference Corpus of Contemporary Portuguese. It currently contains 74 million words and includes the PAROLE corpus. It contains samples of oral and written Portuguese, in its varieties of Portugal, Brazil, Macao and lusophone Africa (Guiné-Bissau, Cabo-Verde, Angola, Moçambique and S. Tomé e Príncipe). The PAROLE Portuguese corpus is also used in other works such as:

- Combinatory Dictionary of Portuguese (available for consultation). On a corpus of 12,282,392 words, the user can select any word whose frequency of combination with another is equal to or greater than 2, at distances of +/- 4 relative to the key-word, and obtain information on the frequency of combination, distribution, mutual information and concordances.
- Spoken Portuguese: geographic and social varieties (in pre-publication stage). Four CD-ROMs containing samples of oral Portuguese in its regional and international varieties (audio recordings aligned with their orthographic transcription), and three volumes of studies of oral Portuguese.
- Computerised Multifunctional Lexicon of Portuguese (under way, first year). A lexicon of 30 thousand words is being built. The entries are extracted from a corpus of 15 million words and contain the following information: quantitative information on the word, with its percentage distribution oral/written; phonetic transcription; morphosyntactic classification.
- Novo Dicionário da língua Portuguesa (under way at the Academia das Ciências de Lisboa).
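The kind of lookup described for the combinatory dictionary (co-occurrence within +/- 4 words of a key-word, a minimum frequency of 2, and a mutual information score) can be sketched as follows. This is a minimal illustration and not the dictionary's actual software: the function name is hypothetical, and the score is one common pointwise mutual information estimate, log2(f(x,y) * N / (f(x) * f(y))), which may differ from the formula the dictionary uses.

```python
import math
from collections import Counter


def collocations(tokens, window=4, min_freq=2):
    """Score word pairs that co-occur within +/- `window` tokens.

    Returns {(key_word, other_word): {"freq": ..., "mi": ...}} for every
    pair whose co-occurrence frequency is at least `min_freq`.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, key in enumerate(tokens):
        # Look at up to `window` tokens on each side of the key-word.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                pair_freq[(key, tokens[j])] += 1

    scored = {}
    for (key, other), freq in pair_freq.items():
        if freq < min_freq:
            continue
        # Pointwise mutual information estimate (an assumed formula).
        mi = math.log2(freq * n / (word_freq[key] * word_freq[other]))
        scored[(key, other)] = {"freq": freq, "mi": mi}
    return scored


# Toy token list; a real run would use the tokenised corpus.
sample = ["casa", "grande", "x", "casa", "grande", "y"]
result = collocations(sample)
```

Counting every key-word position separately means a pair can be counted more than once inside overlapping windows; a production tool would have to decide whether that is the desired behaviour.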

Although contractually finished, the PAROLE project has a number of continuations, several of which started during the project itself. They can be classified into two groups:

- national projects, whose scope is to extend the size of the PAROLE deliverables
- EU funded projects.

Concerning the national projects, most Partners had expressed interest in broadening the coverage of their resources at the beginning of the PAROLE project. At the time of writing, 5 languages are covered in such projects: Danish, Greek, Italian, Swedish and Portuguese [1]. A brief description of the first four provides an example of the relationship between PAROLE activities and their national spin-offs.

Danish

For Danish, the increasing need for various products within the field of language technology creates rapidly growing demands for large, computerised, reusable lexical data collections. CST has decided to meet this challenge by initiating the development of such a lexicon for Danish within the framework of the "STO" project (SprogTeknologisk Ordbog: Language Technology Lexicon). For this national project, COP regards the Danish lexical resource produced in the LE-PAROLE project as extremely valuable initial capital. The declared goal of the PAROLE project is to develop relevant and reusable language resources to be complemented in subsequent national projects; the STO project is thus a good example of the intended exploitation of the LE-PAROLE results. The Danish PAROLE lexicon is the point of departure for STO as regards lexical coverage and the levels and depth of the linguistic description. It covers 20,000 morphological units belonging to general language vocabulary; this is the largest and most systematically encoded Danish lexicon for LE purposes. For practical applications, however, the number of entry words in the lexicon must be increased considerably. To this end, a number of tasks have been defined for the enlargement and further development of the existing lexicon.

[1] National projects or funding are expected for other languages. For example, the PAROLE German subconsortium is leading a group of partners in the preparation of a national initiative for linking the PAROLE lexica at a multilingual level.

COP investigated the reusability of the material produced, in preparation for its adaptation and integration into the large national lexicon resource that is to be developed. Furthermore, the encoding guidelines given for the PAROLE work and the encoding tool have been reconsidered. The main actions of the project involving the PAROLE lexicon work are the following:

- the lexical coverage of general language vocabulary will be enlarged
- the lexical coverage will also be extended to comprise language for specific purposes (LSP) within a number of selected domains. This will give rise to some additional features that should be incorporated into the linguistic description
- the levels of description defined (i.e. morphology, syntax and semantics) are useful, especially because of the possibility of modular encoding
- the depth of the linguistic description is in most cases sufficient for general applications, although a few additional features to be captured were defined (e.g. for differentiating between general language, LSP and domain-specific entries, cf. above)
- the properties of the descriptive language will probably be revised and tuned to match Danish linguistic traditions.

Danish is a "less widely used" language; this fact increases the need for language engineering technology to support and maintain international communication. Because Danish is a relatively small language community, cost effectiveness is an essential factor in the development of relevant products and in their competitiveness in the marketplace. These needs can only be met by the development of reusable multipurpose resources produced by means of goal-oriented and coordinated national efforts. To this end, CST is initiating, organising and coordinating a network to be established around a consortium of approx. 10 partners. A feasibility study concerning the following main issues has been carried out, and the concluding report (finished in August 1997) describes:

- guidelines for the composition of the general language vocabulary to be covered (an extension of the PAROLE coverage)
- a proposal for the selection of domains and text types to be covered
- guidelines for the composition of the LSP vocabulary to be covered
- investigations of various corpus tools regarding functionalities, technical requirements, user-friendliness, etc., with a view to their suitability for the chosen purposes
- an outline of the selection and organisation of linguistic information types, in preparation for the final linguistic specifications.

A pilot project (to be completed in March 1998) is following up the main recommendations of the feasibility study. COP is now in the last working phase, preparing documentation that comprises the following issues:

- an overview of potential project partners, users and customers on the basis of contacts established until now
- an outline of relevant cooperation models within a national network with regard to encoding of lexicon entries, delivery of reusable external material, expert assistance, etc.
- the elaboration of linguistic specifications, in which the Partner benefits greatly from the PAROLE specifications for Danish and from the internal validation of the PAROLE lexicon resource
- considerations regarding extended use of the PAROLE model (including e.g. treatment of collocations, etc.)
- assembling a test corpus for LSP, testing the corpus tool chosen, etc.
- a proposal for a user-friendly lexicon coding tool (with an outline of the main linguistic and technical requirements)
- an outline of a language-specific coding manual (to be used also by external partners who want to contribute to the encoding of new entries)
- testing the conversion of external reusable lexical data into a slightly modified PAROLE model (expressed in SGML) as a basis for the elaboration of appropriate conversion methods, routines and formats.

After the completion and evaluation of the pilot project, the main STO project will start. Its duration will be approx. 6 years. The Partner plans the lexicon development to be modular, in the sense that it will produce self-contained parts (e.g. each lexicon covering a given domain vocabulary will be delivered as a complete module for customisation). The main tasks will be of various kinds (organisational, linguistic, technical and legal); a few concrete examples are given below:

- elaborate a consortium agreement
- develop the tool for the encoding work
- build up an LSP corpus of an appropriate size and composition
- carry out corpus-based encoding of LSP vocabulary
- provide stepwise refinement and revision of the linguistic specifications and coding manual when needed
- extend the coverage of general language vocabulary to the size defined
- coordinate external and internal work (including the management of data in the kernel STO database)
- follow up the general user requirements and needs.

The account given above illustrates how an EU-initiative can be followed up by national work that takes advantage of the theoretical and practical experience acquired. The Partner is aware of the increased relevance of effective international communication and of the national responsibility for being a full member of the Information Society. CST and the consortium contribute with the STO-project to the development of language engineering products for Danish quite in parallel to other national initiatives that integrate the results of the PAROLE project into their follow-up projects. Thus also the possibility of future international co-operation on multilingual projects is foreseen.

Greek

The Greek National Project OROSSIMO was approved for funding by the General Secretariat of Research and Technology of the Ministry of Development; its duration is 28 months (April 1996 - July 1998). It aims at the collection of resources (text corpora and terminology) in several scientific and technical domains. During its first phase, ILSP collaborated with Greek scientists from universities and research centres, as well as with publishing houses in the country, and collected corpora in 19 scientific domains. These domains yielded a total of approximately 3 million running words and 23,843 terms. The current phase of the project concerns the development of an integrated environment connecting the texts with the terms, allowing the user to look up the translation of specific terms in a bilingual (English - Greek) glossary, to browse through the texts and to view each term in actual use in its domain(s). This environment is planned to be available to the scientific community for research and development purposes. The OROSSIMO project is a spin-off of the PAROLE project: it adheres to the PAROLE specifications for text encoding and, although the OROSSIMO environment is independently designed and developed, the texts collected by this project constitute part of the PAROLE corpus of ILSP.

Italian

Two National Projects, proposed by Pisa, were approved last year, for the first time in Italy in the field of Language Resources. Although the two Projects are funded under different laws, they can be considered as together constituting one National Project, since the Italian Ministry suggests that they act in synergy. The main objective of the National Project is to build an infrastructure of Language Resources and related tools, for both the written and the spoken areas, and to validate them in a number of industrial applications. The project is defined as a complement to and an enrichment of the PAROLE lexicons and corpora, and its specifications are based on the EAGLES recommendations. The overall perspective is to combine innovative R&D with industrial, market-oriented relevance. It is very important that the field of Language Engineering, and within it the area of Language Resources, is given formal and economic recognition at the national level by the Ministry of Research.

Objectives of the National Project linked to PAROLE

The objectives connected to PAROLE are briefly mentioned here. One of the goals is to extend the PAROLE Corpus, both in overall size and in the amount of tagged text. The encoding model will be the PAROLE one. Part of the Corpus will be linked to a comparable corpus for Arabic (encoded with the same tags), a part of which will also be POS-tagged. Moreover, new types of linguistic encoding will be introduced, with the purpose of creating a first nucleus of an Italian Treebank, i.e. part of the Corpus will be annotated at the syntactic level (phrasal and functional annotation) and also at the semantic level. This is judged to be a strong requirement, coming also from industry, for both training and testing purposes.

Another objective is the extension of the PAROLE lexicon. The linguistic model will be the same as PAROLE for morphosyntax and syntax, and the same as SIMPLE for semantic subcategorisation. The lexicon will be linked to an Italian WORDNET (which will extend the EURO-WORDNET data), so that a completely connected and coherent set of data for Italian will be available. Another extension will concern the addition of some terminological domains, chosen according to the needs of the industrial partners. The lexicon will be accompanied by a lexical tool, with the necessary functionalities for inputting, updating, storing, browsing and exporting data, etc. Another part of the project, which builds on the results of the LE SPARKLE project, will provide a lexical acquisition system to extract lexical information from corpora and enrich the basic generic lexicon on the fly with domain-specific or application-specific data. The lexicon will be linked to, and used by, other components built in the project, e.g. the grammar and the different industrial applications. The validation of the lexicon and the corpus will follow the recommendations of the ELRA Validation Group.

Swedish

In 1997 a second phase of the Swedish program for language technology began; it will continue throughout 1999. The site manager for the Swedish PAROLE partner was one of three people who outlined the direction the whole program would take. The long-range goal of the program is to strengthen the level of competence in language engineering, with a view to transferring such competence to industrial applications. Another goal is to facilitate a network of contacts and information exchange with the international community, in order to enable Swedish partners to participate in projects at the European level. The Swedish participants in PAROLE applied for two projects and both received funding. One of them, "Corpus based lexicon and analysis tools", is a direct continuation of the Swedish PAROLE work and was explicitly described as such in the application. On the basis of the Swedish PAROLE corpus, a lexicon for two-level morphological analysis will be produced, and the syntactic descriptions from the Swedish PAROLE lexicon will be integrated. This project is co-operating with yet another Swedish site that will build up the Swedish equivalent of EUROWORDNET. The second project, "Parallel corpora in Linköping, Uppsala and Göteborg", is a continuation of the informal contacts concerning parallel corpora that were made within the context of PAROLE. The standardisation norms used in PAROLE will be extended to a context involving multilingual, aligned corpora.

A project from an earlier phase of the Swedish language technology program has recently been completed: the Stockholm-Umeå Corpus (SUC) Project. One of the reasons for its delay was problems with SGML and corpus encoding. In January 1997 the Swedish PAROLE partner, represented by Daniel Ridings, became actively involved in the successful conclusion of that project. In the course of the work, the standards for corpus encoding used by PAROLE were introduced and implemented in the SUC corpus. This is immediately visible in the corpus header of that project, where the PAROLE extensions to the TEI system of DTDs are invoked. In addition to the corpus standards, the tagset for the morphosyntactic description of Swedish was harmonised between SUC and Swedish PAROLE, so that there are now one million running words of corpus material tagged and manually checked according to the PAROLE model.

The second type of continuation of PAROLE is represented by two EU funded projects, SIMPLE and ELAN. The SIMPLE project....add semantics to PAROLE lexicons based on PAROLE corpora ... The ELAN project ..distribution of resources .....

PROJECT TIMETABLE (1 P.)

PAROLE workplan, months t1 to t24

Workpackage   Title                          Active months
WP1.1         Management                     t1-t24
WP1.2         Commercial and Legal Aspects   t1-t24
WP2.n         Language Corpus                t1-t24
WP3.n         Language Lexicon               t1-t24
WP4           Lexicon Tool                   t1-t24

The above diagrams show that all the major activities of the project ran throughout the project. They were subdivided into tasks with a shorter span, as explained further on. An important milestone was the delivery of the first-year lexicons and corpora, which occurred in month 12 of the project. There were 13 workpackages WP2.n, for Belgian French, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Irish, Italian, Portuguese and Swedish. Each of these consisted of corpus building and tagging tasks. Similarly, there were 13 workpackages WP3.n, for Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Irish, Italian, Portuguese, Spanish and Swedish. Each included tasks for producing the morphological and syntactic layers of the lexicons. The different language-specific workpackages ran in parallel, but a constant interchange was maintained between them. It focussed on elaborating common lexicon encoding decisions for similar problems in different languages, and on adopting similar corpus encoding conventions for the same types of texts independently of the language. There was also an ongoing exchange of information concerning different contractual solutions with corpus providers.

Contractual (commencement, contract addenda, new partners ...)

The project began on April 2nd 1996. The first contract addendum revised the contents of the intermediate milestones without changing the nature of the final deliverables. This addendum also contained additional technical constraints on the resources produced, which guaranteed more uniform lexical encoding conventions across all the project languages. At the same time, a strict correspondence between the lexicon encoding and the corpus tag system was enforced. The quality of the final results of the PAROLE project is thus much better than originally planned at the beginning of the contract.

Stages of work (preparations, R&D, testing, demonstration, including milestones)

The milestones of the project, as revised by the addendum to the contract, are as follows.

Corpus

Data delivered at each milestone. Milestones (project month): 6, 9, 11, 12, 15, 18, 21, 24, each with an R and a T column.

Participant's short name   R        T
PSA                        20       250
ATH                        20       250
BAR                        20       250
BIR                        20       250
COP1                       20+ 9    250
DUB                        15       150
GOT                        20       250
HEL                        20       250
LEI                        20       250
LIE                        20       250
LIS1                       20       250
MAN                        20       250
PAR                        20       250

Notes:
"R" stands for "running words" and the quantity is expressed in millions.
"T" stands for "tagged and manually checked" and the quantity is expressed in thousands.

Lexicon

Time: t3, t6, t9, t11, t12, t15, t18, t21, t24

Language     Size (entries)
Catalan      M: 20K  S: 20K
Danish       M: 20K  S: 20K
Dutch        M: 20K  S: 20K
English      M: 20K  S: 20K
Finnish      M: 20K  S: 20K
French       M: 50K  S: 20K  Se: 20K
German       M: 20K  S: 20K
Greek        M: 20K  S: 20K
Italian      M: 20K  S: 20K
Portuguese   M: 20K  S: 20K
Spanish      M: 20K  S: 20K
Swedish      M: 20K  S: 20K

Notes:
"K" stands for 1000 words.
"M" stands for "morphological information".
"S" stands for "syntactic information".
"Se" stands for "semantic information".

External reviews

Two reviews, both positive.

Conferences, exhibitions, user group meetings

The following project workshops have taken place:

- encoding in Charenton ...
- syntax in Charenton .....
- corpus in Pisa ....
- syntax 2 in Barcelona ....

Other important events

ACHIEVEMENTS

The PAROLE consortium has started the preparation of a number of books:

- Delcourt C. and Mariani J. (eds.), "Corpus Ecrits et Corpus Oraux", published by AUPELF (Association of the French speaking Universities). This book will include papers from PSA (theoretical issues), from LIE and PAR (French corpora), and from ATH and GOT (parallel corpora including a French one).
- Delcourt C. (ed.), a volume dedicated to PAROLE, titled "PAROLE", to be published by Kluwer both as a book and as a special issue of the review Computers and the Humanities. All the institutional members of the PAROLE Association are preparing their contributions.
- Zampolli A., Calzolari N., Ogonowski A. (eds.), The PAROLE Language Resources (to appear in a special issue of "Linguistica Computazionale") [2]

In addition to presenting the project at several conferences, workshops, etc., PAROLE has contributed substantially to the organisation of some international and national conferences and events focusing on the issue of LR. We give here one example for each of the two types.

International: LREC (First International Conference on Language Resources and Evaluation, 28-30 May 1998, Granada, Spain) will provide an overview of the state of the art, discuss problems and opportunities, exchange information regarding ongoing and planned activities, language resources and their applications, discuss evaluation methodologies and demonstrate evaluation tools, and explore possibilities and promote initiatives for international cooperation. LREC will focus on the following issues: the availability of language resources and the methods for the evaluation of resources, technologies and products, for written and spoken language. Substantial mutual benefits can be expected from addressing issues like these through international co-operation.

It is reasonable to expect that LREC will succeed in contributing to some objectives shared by PAROLE:

- increasing awareness of the relevance of LR
- promoting international cooperation and harmonisation
- promoting links between suppliers of LR, users of LR (mainly developers of LE products and services), and national and international funding agencies.

In fact:

- LREC received more than twice the number of submissions foreseen
- the funding agencies of various continents will take part, both with oral papers and in a panel devoted to an overview of what they foresee for LR, R&D and international cooperation
- the major European and American industries will take part in a panel on the industrial use of LR
- the NSF has sponsored the participation of about 40 Americans, and submitted the proposal for an NSF-sponsored Workshop.

National: TAL, the Italian National Conference dedicated to LE in the Information Society, was organised under the auspices of the PTT Minister, with the participation of the Ministries of Research and of Industries and of many other relevant Italian authorities (Rome, January 13-14, 1997). The final session of the conference was dedicated to the presentation and discussion of a national plan focusing on the production of basic generic resources for Italian, explicitly intended to complement the initial PAROLE nucleus. The proposal was prepared by a national working group comprising representatives of various public administrations (ministries, etc.), industries, service providers, research institutes, universities, etc. The Italian National Project, which has been launched to implement the proposal of this group, will set up a national network (sponsored by the Ministry of Research, CNR, etc.) which will be connected to the PAROLE activities.

[2] Planned table of contents: Foreword, Introduction, Composition of Corpus, Encoding of Corpus, Tagging, Model & Tool/Software for Lexicon, The Morphology Layer, The Syntax Layer, The Validation Model, Examples of Extraction of Information from Corpora.

EVALUATION AND ASSESSMENT

The (still partial) lexical data [3] of three languages will be evaluated by external users, according to the ELRA validation manual, through the ELDA services, as stated in the PAROLE TA.

[3] This number has been agreed with the Commission, given the cost of the validation.

Danish

Online access to the Danish corpus is regularly given to researchers and students, who at present have to work with it on site at DSL. In a semester course given at the Copenhagen Business School by Ole Norling-Christensen, the corpus was used for demonstrations and exercises. In addition, DSL answers questions and makes searches for external clients. Subcorpora and wordlists meeting specific criteria are produced on demand, e.g. for a psycholinguistic research project, for a commercial client, for a machine translation project (METAL), and for the Danish SPEECHDAT group.

Dutch

The Institute for Dutch Lexicology INL (LEI) has since 1994 been providing access to linguistically annotated Dutch text corpora via the Internet. A steadily growing number of users retrieve corpus data (frequency lists, concordances, etc.) using the INL corpus query system (at present, 200 users have submitted over 60,000 queries). PAROLE standards for text classification (topic and medium) have been applied in one of these corpora, the 38 Million Words Corpus 1996. This is described in Kruyt & Dutilh (1997), and it is also accounted for in the retrieval system under the item 'corpus information/text classification', with explicit mention of PAROLE. This corpus is consulted by individual researchers and by lexicographers participating in international lexicon projects.

German

The German PAROLE work, both corpus and lexicon, has been made available within the German subconsortium and to other interested parties. It has also been agreed that, when complete, the German PAROLE corpus will be incorporated into the IDS corpus, where it can then be accessed for academic use by IDS scholars, visitors and external users.

Greek

Samples of the Greek corpus have been supplied on several occasions, inside and outside the country, all of which concerned basic linguistic research. The morphosyntactically tagged and disambiguated corpus of 250,000 running words produced within the PAROLE project is used for the development of a new morphosyntactic annotator for ILSP.

Italian

Lexicon: the Italian Lexicon is being used by the ILC-CNR parser to produce a syntactic analysis of Italian texts. It has also been used within the LE SPARKLE project as a "gold standard" for evaluating the results of the system for lexical acquisition from corpora. The Italian lexical data will be connected to the Italian EUROWORDNET.

Corpus: the Italian Corpus has been used for research purposes by about 30 users. These users represent diverse fields from all over the world: universities, publishing houses, private companies. The corpus can be accessed, through DBT functionality, both on site and on the Internet. It is already used in some international projects. For example, the lexical acquisition system developed within SPARKLE extracts lexical information from the corpus, and this will enrich the basic PAROLE lexicon with domain-specific or application-specific information.

Portuguese

- Computerised Multifunctional Lexicon of Portuguese (national funding; under way, first year), in partnership with CLUL, Verbo and ILC.
- Lexical and grammatical differences between European and Brazilian Portuguese (internal INESC funding).

Swedish 10,000,000 word-class tagged words of the PAROLE Swedish corpus are available on the Web (http://ldb20.svenska.gu.se). This is a subset of the Swedish PAROLE corpus and is explicitly identified as such. All 10,000,000 words have been tagged with the Swedish PAROLE tagset. It is possible to search for words, phrases or grammatical patterns based on the word-class tagging. For example, the normal future construction in Swedish is "kommer" (come) + "att" (infinitive marker) + infinitive. It is possible to formulate a query that returns all constructions in which the infinitive marker "att" is absent, even if the infinitive is 0 to x words distant from "kommer". The query engine is the command-line version of Stuttgart's "Corpus Query Processor", which is called through a CGI script; experiments are also planned with the CUE system from Birmingham.

As agreed between the PAROLE Consortium and the Commission in the TA:

- the PAROLE lexica will be fully available to the R&D community
- the PAROLE corpora will be
  - accessible on the Internet, at the respective partner site, in their totality: 20M words
  - distributable on CD-ROM (or other means) only in part: a minimum of 3M words (including 250,000 tagged and manually checked words).

To implement this agreement, the PAROLE Consortium has taken the following steps:

Lexica and Distributable Corpora

The lexica and the distributable corpora produced by PAROLE for the various languages will be made available through the ELRA channel. ELDA is currently finalising the relevant legal and contractual documents.

Full Corpora

Each partner will take care of the organisational structure necessary for making its full corpora available through the Internet. (As mentioned before, some partners have already started offering this type of service.) In addition, the PAROLE Consortium has promoted the creation of a network that will also include, through TELRI, suppliers from Eastern European countries. The network will connect the site servers through a common management system, able to work as a “server of server sites” and to offer unified access to all of the available data.

The ELAN project, approved in the MLIS framework, plans not only to reinforce or, where necessary, create international standards by designing a common query language and by providing standardised resources but also to operate a user community network with active awareness-raising measures, a clear copyright policy, user support, e-mail user groups, etc. In this respect ELAN can be considered as the natural continuation of LE-PAROLE in the direction of dissemination.

List of public deliverables and reports

Some of the publications listed below are devoted exclusively to the work done within the project; others refer to it alongside other work, thus presenting it in a broader perspective. At the time of preparation of this report some of the items listed below were still in press, but all will be available by the end of the project (copies can currently be obtained from the Partner concerned).

-

All the PAROLE work and results for Dutch are reported in the INL annual "Jaarboek van de Stichting Instituut voor Nederlandse Lexicologie, overzicht van het jaar 1996", INL Leiden. The same holds for 1997, in the annual for 1997 (to be published in autumn 1998).

-

Bacelar Do Nascimento, F.; Bettencourt Gonçalves, J.; Marrafa, P.; Pêgo, T.; Pereira, L.; Ribeiro, R.; Wittmann, L.; "LE-PAROLE - Do corpus à modelização da informação lexical num sistema multifunção", in Actas do XIII Encontro da Associação Portuguesa de Linguística, October 1997, Lisbon (to appear)

-

Bacelar do Nascimento, M. F. (1996) - "A observação e análise de dados reais na investigação e ensino de línguas", Actas do II Encontro da Associação Portuguesa dos Centros de Línguas do Ensino Superior, Universidade de Évora, Évora, Janeiro de 1996 (no prelo).

-

Bacelar do Nascimento, M. F. (1996) "Apresentação da mesa-redonda sobre corpora linguísticos", Actas do XI Encontro Nacional da Associação Portuguesa de Linguística, volume I * Corpora, Bacelar Do Nascimento, M. F., M. C. Rodrigues e J. Bettencourt Gonçalves (orgs.), APL, Lisboa, Setembro de 1996, pp. 19-20.

-

Bacelar do Nascimento, M. F. (1996) "Aspectos da sintaxe do português falado (repetições lexicais e de estruturas sintácticas em produções orais: fenómenos de deslocação)", Actas do Congresso Internacional sobre o Português, volume I, e, I. e I. e(orgs.), APL, Lisboa, Junho de 1996, pp. 203-223.

-

Bacelar do Nascimento, M. F. (1997) "A exploração de corpora linguísticos no ensino/aprendizagem do português", Actas do Seminário Internacional de Português como Língua Estrangeira, Macau, Maio de 1997 (no prelo).

-

Bacelar do Nascimento, M. F. (1997) "Contribuição da análise de corpora para a descrição lexicográfica", Sentido que a vida faz, Estudos para Óscar Lopes, Porto, Ed. Campo das Letras, pp. 737-744.

-

Bacelar do Nascimento, M. F. e J. Bettencourt Gonçalves (1996) (em colaboração) "Corpus de Referência do Português Contemporâneo (CRPC), desenvolvimento e aplicações", Actas do XI Encontro Nacional da Associação Portuguesa de Linguística, volume I * Corpora, Bacelar Do Nascimento, M. F., M. C. Rodrigues e J. Bettencourt Gonçalves (orgs.), APL, Lisboa, Setembro de 1996, pp. 143-149.

-

Bacelar do Nascimento, M. F. e Luísa Alice Santos Pereira (1996) "Dicionário de Combinatórias do Português: associações lexicais frequentes observadas num corpus de português contemporâneo", Actas do XI Encontro Nacional da Associação Portuguesa de Linguística, volume II * Dicionários, FARIA, I. H. e M. CORREIA (orgs.), APL, Lisboa, Setembro de 1996, pp. 43-54.

-

Bacelar do Nascimento, M. F. e Luísa Alice Santos Pereira (1997) "Corpus de Referência do Português Contemporâneo", comunicação apresentada a Rencontres de Linguistique Appliquée, Construction et Utilisation de grands Corpus, Paris, 24-27 de Setembro de 1997.

-

Bilgram Thomas & Keson Britt, "The Construction of a Tagged Danish Corpus", The 11th Nordic Conference on Computational Linguistics (NODALIDA), Proceedings. Center for Sprogteknologi, Copenhagen 1998.

-

Braasch Anna, A large scale lexicon for Danish in the Information Society. ELRA LRE (First International Conference on Language Resources and Evaluation), Granada, Spain, May 1998.

-

Braasch Anna and Norling-Christensen Ole, "En trækbaseret beskrivelse af dansk bøjningsmorfologi", Annual Meeting of Datalingvistisk Forening (DALF) 1996. In: Sprog & Multimedier, Aalborg Universitetsforlag, Denmark, 1997. ISBN: 87-7307-588-4.

-

Braasch Anna, Corpus-supported modelling of syntactic information. The computational PAROLE lexicon for Danish. ALLC/ACH '98 (Association for Literary and Linguistic Computing/Association for Computers and the Humanities), Debrecen, Hungary, July 1998.

-

Costanza Navarretta. Encoding Danish Verbs in the PAROLE Model; In R.Mitkov, N.Nicolov and Nikolai Nikolov (eds.): Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, p. 359-363. 1997.

-

Delcourt C, Delcourt-Angélique J "PAROLE", Construction et utilisation de grands corpus conference (Paris) Association française de Linguistique appliquée, 24-27.09.1997; Proceedings of the meeting in the "Revue française de Linguistique appliquée".

-

Ferrer A., Marti E., Mayoral C., "Una proposta de classificació dels noms comuns" (A proposal for a classification of common nouns), XXVII Simposio de la Sociedad Española de Lingüística, Palma de Mallorca, December 1997.

-

Gavrilidou, M., P. Labropoulou, E. Mantzari and S. Roussou (1997), "Specifications for a Computational Morphological Lexicon of Modern Greek", 3rd International Conference for Greek Linguistics, Athens.

-

Gavrilidou, M., P. Labropoulou, E. Mantzari and S. Roussou (1998), "The macrostructure of a computational lexicon of Modern Greek", Workshop on Computational Lexicography organised by the Greek Human Network on Language Technology, Athens, 14 March 1998.

-

Kruyt J.G. & Dutilh M.W.F. (1997), A 38 Million Words Dutch Text Corpus and its Users. In: Lexikos 7 (Afrilex-reeks/series 7:1997), p. 229-244. Bureau of the WAT, Stellenbosch, RSA.

-

Kruyt J.G. (1995a), Nationale tekstcorpora in internationaal perspectief. In: Forum der Letteren 36 (1995), p. 47-58.

-

Kruyt J.G. (1995b), Technologies in Computerized Lexicography. In: Lexikos 5 (Afrilex-reeks/series 5b:1995), p. 117-137.

-

Kruyt J.G. (1997), Europese corpus- en lexiconprojecten. In: STDH Nieuwsbrief van de Stichting Tekstcorpora en Databestanden in de Humaniora 7 (November 1997), p. 4.

-

Kruyt J.G. (1998), Elektronische woordenboeken en tekstcorpora voor Europese taaltechnologie (Electronic dictionaries and text corpora for European language technology). In: Trefwoord Jaarboek Lexicografie 1997-1998, pp. 28-42. Sdu Uitgevers/Standaard Uitgeverij, Den Haag/Antwerpen.

-

Melero Maite & Villegas Marta, "Issues on the Syntactic Encoding of a Computational Lexicon", First International Conference on Language Resources and Evaluation, Granada (Spain), May 28-30, 1998.

-

Melero Maite & Villegas Marta, 1996. Propuesta Europea de Estandarización de la Codificación Morfosintáctica Aplicada al Español. In Actas del XXVI Simposio de la Sociedad Española de Lingüística (SEL) sobre Morfología y Lenguaje Científico y Técnico. Madrid, 1996.

-

Melero Maite, "PAROLE, un proyecto europeo de creación de recursos lingüísticos" (PAROLE, a European project for the creation of linguistic resources), XXVII Simposio de la Sociedad Española de Lingüística, Palma de Mallorca, December 1997.

-

Melero Maite and Villegas Marta, "Subcategorización no estrictamente local en PAROLE" (Non-strictly local subcategorisation in PAROLE), XXVII Simposio de la Sociedad Española de Lingüística, Palma de Mallorca, December 1997.

-

Pereira, L. A. (1997) "Análise de corpora e dicionários de uso", Actas do XIII Encontro da Associação Portuguesa de Linguística, Lisboa, Outubro de 1997 (no prelo).

-

Pereira, L. A. (1997) "O recurso a corpora linguísticos e o contributo da abonação nos dicionários", Actas do 2.º Encontro Nacional da Associação de Professores de Português (APP), Lisboa, Abril de 1997 (no prelo).

-

Ruimy N., Corazzari O., Gola E., Spanu A., Calzolari N., Zampolli A., "LE-PAROLE Project: The Italian Syntactic Lexicon", EURALEX (EURopean Association for LEXicography - International Congress), Liege, Belgium, August 4-8, 1998.

-

Ruimy N., Corazzari O., Gola E., Spanu A., Calzolari N., Zampolli A., "The European LE-PAROLE Project and the Italian Lexical Instantiations", ALLC/ACH '98 (Association for Literary and Linguistic Computing/Association for Computers and the Humanities), Debrecen, Hungary, July 5-10, 1998.

-

Ruimy N., Corazzari O., Gola E., Spanu A., Calzolari N., Zampolli A., "The European LE-PAROLE Project: The Italian Syntactic Lexicon", ELRA LRE (First International Conference on Language Resources and Evaluation), Granada, Spain, 26 May 1998.

-

Wittmann, Luzia. "O Processamento do Português no Grupo de Linguagem Natural do INESC". Seminário O Impacto das Novas Tecnologias na Comunicação Linguística, organized by the Translation Service of CE, April/97, Lisbon.

-

Wittmann, Luzia. "The Palavroso System and the Parole Corpora Tagging". Workshop Taggers para o Português, ILTEC, 09/05/97, Lisbon.

-

Zampolli A. "The PAROLE project in the General Context of the European Language Resources", in Rūta Marcinkevičienė and Norbert Volz, "Language Applications for a Multilingual Europe", Kaunas, Lithuania, Apr. 17-20 1997.

-

Zampolli A., Calzolari N., "LE-PAROLE: Its history and scope", The ELRA Newsletter, December 1996, Vol.I n.4, pp.5-6.

Submitted Contributions

The articles listed below have been submitted to conferences; at the time of preparation of this document the reviewers' decisions about their acceptance were not yet known.

-

Bacelar do Nascimento, M. F. et alii, Projecto Português falado: variedades geográficas e sociais – DEMO. First International Conference on Language Resources and Evaluation, Granada, Spain, 28 – 30 de Maio:

-

Bacelar do Nascimento, M. F. “Constituição e Exploração de Corpora”.

-

Bacelar do Nascimento, M. F. “LE-PAROLE no contexto dos recursos linguísticos europeus”.

-

Bacelar do Nascimento, M. F. “O Corpus de Referência do Português Contemporâneo e os Projectos de Investigação do Centro de Linguística da Universidade de Lisboa sobre variedades do português falado e escrito”.

-

Bettencourt Gonçalves, J. “Projecto Português falado: variedades geográficas e sociais”.

-

Colóquio Internacional A investigação do português na África, América, Ásia e Europa: balanço crítico e discussão do ponto actual das investigações, Berlim, 23 – 27 de Março de 1998:

-

Garcia Marques, M. L. “Combinatorias Linguísticas”. First International Conference on Language Resources and Evaluation, Granada, Spain, 28 – 30 de Maio.

-

Marrafa, P. “A Sintaxe do Léxico LE-PAROLE”. First International Conference on Language Resources and Evaluation, Granada, Spain, 28 – 30 de Maio.

-

Santos Pereira, L. A. e A. Mendes “Associações lexicais e questões de Informação Mútua”. First International Conference on Language Resources and Evaluation, Granada, Spain, 28 – 30 de Maio.

-

Santos Pereira, L. A. et alii, “Corpus de Referência do Português Contemporâneo”. First International Conference on Language Resources and Evaluation, Granada, Spain, 28 – 30 de Maio:

-

Santos Pereira, L. A. “Dicionário de Combinatórias do Português”

-

Wim Peters, Hamish Cunningham, Yorick Wilks, Clare McCauley, Mark Stevenson, “A New Model for Language Resource Access and Distribution” (submitted to COLING-98)

-

Workshop on Computational Linguistics, Associação Portuguesa de Linguística, Lisboa, 25 – 27 de Maio de 1998

The PAROLE consortium has started the preparation of a number of books: -

Delcourt C and Mariani J (ed. by) Book "Corpus Ecrits et Corpus Oraux" published by AUPELF (Association of the French speaking Universities). This book will include papers from PSA (theoretical issues), from LIE and PAR (French corpora), and from ATH and GOT (parallel corpora including a French one)

-

Delcourt C (ed. by) - volume dedicated to PAROLE title: "PAROLE" - will be published by Kluwer both as a book and as a special issue of the review Computers and the Humanities. All the institutional members of the PAROLE Association are preparing their contribution.

-

Zampolli A., Calzolari N., Ogonowski A. (ed. by) The PAROLE Language Resources (to appear in a special issue of “Linguistica Computazionale”). Planned table of contents: Foreword, Introduction, Composition of Corpus, Encoding of Corpus, Tagging, Model & Tool/Software for Lexicon, The Morphology Layer, The Syntax Layer, The Validation Model, Examples of Extraction of Information from Corpora.
