Natural Language Processing, lecture notes




Centre for Language Technology, Gothenburg

Lecture notes for the course DIT410/TIN171 Artificial Intelligence


Peter Ljunglöf, 16 March 2012

Links to NLP videos and online demos: http://www.clt.gu.se/wiki/nlp-resources


Natural language

A natural language is any language naturally used by humans (http://en.wikipedia.org/wiki/Language). It can be:
- written-only: extinct languages (Egyptian, Old English, Linear B: http://en.wikipedia.org/wiki/Linear_B), languages only used in special circumstances (Latin, Coptic: http://en.wikipedia.org/wiki/Coptic_language), or compromises of dialects (Nynorsk: http://en.wikipedia.org/wiki/Nynorsk)
- spoken-only: most of the ca. 7000 existing languages in the world
- gestured: sign languages (http://en.wikipedia.org/wiki/Sign_language)
- symbol-based: Chinese characters (http://en.wikipedia.org/wiki/Chinese_characters), Blissymbolics (http://en.wikipedia.org/wiki/Blissymbols)
- it can even be constructed: Esperanto, Interlingua, Blissymbolics (http://en.wikipedia.org/wiki/Constructed_language); Klingon, Na'vi, Sindarin (http://en.wikipedia.org/wiki/Artistic_language)

There is no good distinction between "language" and "dialect": spoken Norwegian, Swedish and Danish form a continuum of mutually intelligible dialects and sociolects (http://en.wikipedia.org/wiki/Scandinavian_languages#Mutual_intelligibility). "A language is a dialect with an army and navy" (http://en.wikipedia.org/wiki/A_language_is_a_dialect_with_an_army_and_navy).

Natural language processing

NLP has several names: natural language processing (NLP), natural language engineering, human language technologies, language technology, computational linguistics, etcetera. Natural language processing…
- "…is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty." (http://www.coli.uni-saarland.de/~hansu/what_is_cl.html)
- "…are information technologies that are specialized for dealing with the most complex information medium in our world: human language." (http://www.dfki.de/~hansu/LT.pdf)
- "…to get computers to perform useful tasks involving human language, tasks like enabling human-machine communication, improving human-human communication, or simply doing useful processing of text or speech." (http://www.cs.colorado.edu/~martin/slp2.html)
- "…is a field of computer science and linguistics concerned with the interactions between computers and human languages." (http://en.wikipedia.org/wiki/Natural_language_processing)
- "…is an interdisciplinary field dealing with the statistical and/or rule-based modeling of natural language from a computational perspective." (http://en.wikipedia.org/wiki/Computational_linguistics)
- "…is the scientific study of language from a computational perspective." (http://www.aclweb.org/archive/misc/what.html)
- "…is the computerized approach to analyzing text that is based on both a set of theories and a set of technologies." (http://www.cnlp.org/publications/03nlp.lis.encyclopedia.pdf)
- "…covers a broad range of activities with the eventual goal of enabling people to communicate with machines using natural communication skills." (http://www.cslu.ogi.edu/HLTsurvey)
Reading: any of the links above, or http://www.cl.cam.ac.uk/teaching/1011/L100/introling.pdf

Main NLP applications

Automatic translation
- text-to-text translation (web, email)
- speech-to-speech translation (telephone, phrasebook)
- assistive technologies: speech-to-subtitles, speech-to-sign-language
Reading: http://www.cairn.info/revue-francaise-de-linguistique-appliquee-2003-2-page-99.htm

Human-computer dialogue
- text dialogue systems (SHRDLU, Eliza, chatbots, web helper agents)
- spoken dialogue systems (call centres, in-car systems, Apple SIRI)
- multi-modal systems (smartphones, information desks, avatars/talking heads)
Reading and links: http://en.wikipedia.org/wiki/Dialog_system http://www.ling.gu.se/~sl/dialogue_links.html http://www-2.cs.cmu.edu/~dbohus/SDS http://en.wikipedia.org/wiki/Siri_(software)

Question answering
- given a human-language question, determine its answer
- the IBM Watson system won Jeopardy in February 2011
Reading: http://en.wikipedia.org/wiki/Question_answering http://en.wikipedia.org/wiki/Watson_(artificial_intelligence_software) http://www.nytimes.com/2010/06/20/magazine/20Computer-t.html

Text mining
- web search
- summarisation
- categorisation
- entity/relation recognition
- sentiment analysis
Reading: http://en.wikipedia.org/wiki/Text_mining http://en.wikipedia.org/wiki/Text_analytics

Accessibility

Visually impaired:
- speech synthesis: screen readers, VoiceXML
- speech recognition: dictation, dialogue systems
- automatic Braille terminals
Hearing impaired:
- speech recognition and synthesis
- sign language recognition and synthesis
- real-time sign language translation of TV programs
Elderly:
- can have problems with seeing, hearing, short-term memory, fine motor skills, loneliness
- possible NLP technologies: speech recognition and synthesis, automatic summarisation, dialogue systems, chatbots
Communicative disorders:
- augmentative and alternative communication (AAC)
- speech and dialogue technologies can help communicating with society
Reading: http://en.wikipedia.org/wiki/Augmentative_and_alternative_communication http://slpat.org

Difficult problems in NLP: Ambiguity

Ambiguity is one of the most difficult NLP problems. And it is everywhere!

Newspaper headlines

Newspaper headlines are extra prone to ambiguities, since they often lack function words.
- Infant abducted from hospital safe --- lexical ambiguity (safe)
- British left waffles on Falklands --- lexical amb. (left, waffles)
- Jails for women in need of a facelift --- structural amb. (in need)
- Enraged cow injures farmer with axe --- structural (with axe)
- Stolen painting found by tree --- word sense (by)
- Miners refuse to work after death --- reference (after death)
- Jail releases upset judges --- lexical (releases, upset)
- Drunk gets nine months in violin case --- word sense (case)
- Teacher strikes idle kids --- lexical (strikes)
- Squad helps dog bite victim --- lexical (bite)
- Prostate cancer more common in men --- reference (more common)
- Smithsonian may cancel bombing of Japan exhibits --- structural (exhibits)
- Juvenile court to try shooting defendant --- lexical (try)
- Two sisters reunited after 18 years in checkout counter --- structural (in counter)
- Two Soviet ships collide, one dies --- reference (one)
- Taxiförare dödade man med bil ("Taxi driver killed man with car") --- structural (med bil)
- Förbud mot droger utan verkan ("Ban on drugs without effect") --- structural (utan verkan)

Phonological ambiguity

"Eye halve a spelling checker
It came with my pea sea
It plainly marks four my revue
Miss steaks eye kin knot sea."

Lexical ambiguity

One word -- several meanings (= word senses):
- "by" is a preposition with 8 senses (New Oxford American Dictionary)
- "case" is a noun with 4 senses
Different words -- same spelling (or pronunciation):
- "safe" is a noun and an adjective
- "left" is a noun, an adjective, and the past tense of the verb "leave"
There is no general consensus on when we have one word with several senses, and when we have different words. Most lexical ambiguities automatically lead to structural differences:
- ((jail) releases (upset judges)) vs. ((jail releases) upset (judges))
- ((time) flies (like an arrow)) vs. ((fruit flies) like (a banana))

Structural ambiguity

Attachment ambiguity:
- adjectives: "Tibetan history teacher"; "old men and women"
- prepositions: "I once shot an elephant in my pajamas. How he got into my pajamas, I'll never know." (Groucho Marx); "I saw the man with the telescope" / "I saw the man with the dog"
Garden path sentences:
- "the horse raced past the barn fell"
- "the old man the boat"
- "the complex houses married and single soldiers and their families"

Semantic ambiguity

Quantifier scope:
- "every man loves a woman" / "some woman admires every man"
- "no news is good news" / "no war is a good war"
- "too many cooks spoil the soup" / "too many parents spoil their children"
- "in New York City, a pedestrian is hit by a car every ten minutes."
Pronoun scope:
- "Mary told her mother that she was pregnant."
Ellipsis:
- "Kim noticed two typos before Lee did." --- did Lee notice the same typos?
- "Eva worked hard and passed the exam. Adam too." --- what did Adam do?
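The two scope readings of "every man loves a woman" can be made concrete in a toy model; the individuals and the "loves" relation below are invented purely for illustration:

```python
# A toy model: two men, two women, and a "loves" relation.
men = {"adam", "bert"}
women = {"carla", "dora"}
loves = {("adam", "carla"), ("bert", "dora")}

# Reading 1 (forall-exists): every man loves some, possibly different, woman.
reading1 = all(any((m, w) in loves for w in women) for m in men)

# Reading 2 (exists-forall): one particular woman is loved by every man.
reading2 = any(all((m, w) in loves for m in men) for w in women)

# In this model the readings come apart: reading1 is True, reading2 is False.
```

Evaluating both formulas in the same model shows they are genuinely different propositions, which is exactly why the English sentence is ambiguous.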

Pragmatic ambiguity

Speech-act ambiguity:
- "Do you know the time?" --- "yes"
- "Can you close the window?" --- "sure I can, I'm already five years old"
Contextual ambiguity: "you have a green light"
- if you are in a car, then perhaps the traffic light has changed
- if you are talking to your boss at work, then perhaps you can go ahead with your project
- or, there could be a green lamp somewhere in your room

Difficult problems in NLP: Sparse data

The second very big problem is lack of data.

Hapax legomena

Hapaxes are words that occur only once within a corpus.
- about 44% of the word types (not tokens) in the novel Moby Dick are hapaxes
- about 55% of the word types in the Swedish Parole corpus (28 million words) are hapaxes
Reading: http://en.wikipedia.org/wiki/Hapax_legomenon
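Counting hapaxes in a tokenised corpus is a one-liner over a frequency table; a minimal sketch:

```python
from collections import Counter

def hapaxes(tokens):
    """Return the set of word types that occur exactly once."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n == 1}

tokens = "the cat sat on the mat while a dog slept".split()
once = hapaxes(tokens)                 # "the" occurs twice, so it is excluded
ratio = len(once) / len(set(tokens))   # hapax ratio over word types
```

Running this over a real corpus (instead of the toy sentence) reproduces the kind of percentages quoted above.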

Sparse data and ambiguity

Hapaxes are just one aspect of the general problem of sparse data:
- "bank" has 5 noun meanings and 4 verb meanings, according to the New Oxford Dictionary
- how many occurrences of each sense do we need to get reliable statistics?
Furthermore, the senses are very context-dependent: e.g., in newspaper text, river banks are much less common than financial banks.

Hapax n-grams

About 75% of the bigrams in Swedish Parole occur only once.

Hapax phrase structures

About 50% of the syntactical constructions in the Penn Treebank occur only once.

Solution

To solve the sparse data problem we need statistical smoothing techniques: Laplace smoothing, Witten-Bell, Good-Turing, Kneser-Ney, etc. But this is not enough – we also need more data.
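The simplest of these techniques, add-k (Laplace) smoothing, just pretends every event was seen k extra times; a sketch for unigram probabilities:

```python
from collections import Counter

def smoothed_prob(word, counts, vocab_size, total, k=1.0):
    """Add-k (Laplace) smoothed unigram probability: unseen words get
    a small non-zero probability instead of zero."""
    return (counts.get(word, 0) + k) / (total + k * vocab_size)

tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
vocab = len(counts) + 1    # reserve one extra slot for unseen words
total = len(tokens)

p_seen = smoothed_prob("the", counts, vocab, total)   # (2+1)/(6+6) = 0.25
p_oov = smoothed_prob("dog", counts, vocab, total)    # (0+1)/(6+6) ≈ 0.083
```

The design choice is the trade-off the section describes: probability mass is shifted from seen to unseen events, which is why smoothing alone cannot replace more data.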

Main levels of abstraction

Roughly, NLP tasks can be categorised into the following abstraction levels:
- Phonetics/phonology: speech sounds, acoustics
- Morphology: parts of words; suffix, infix, affix
- Lexical: words, parts-of-speech, inflection
- Syntax: grammatical structure, parsing
- Semantics: in-sentence meaning, first-order logic
- Discourse: anaphora resolution, text structure
- Pragmatics: context-dependence, presupposition
The lower in the list, the more "AI"-like the problems are. In general.

Main NLP approaches

Symbolic / rule-based approaches
- use hand-crafted linguistic knowledge: formal linguistics, logics, formal systems; grammars, rules, theorem provers
- often bad coverage; on the other hand, deep analysis
- work well for limited domains: time-table information, weather forecasts, MP3 players, etc.

Data-driven / statistical approaches
- use lots of linguistic data (lexica, corpora): statistics, databases, evaluation metrics; statistical models, machine learning
- better coverage; on the other hand, shallow analysis
- better suited for unrestricted domains: information retrieval, web searches, chatbots, etc.

Hybrid approaches

Combining the best of both worlds, or at least trying to. This is a very hot current research trend.

Main NLP tasks: Audiovisual

Speech synthesis / Text-to-speech (TTS)

There are two main techniques. Formant synthesis:
- based on mathematical models
- often sounds "artificial"
- easy to modify, e.g., to make it sound like a female/male/child
- cheap: needs no recorded data
Concatenative synthesis:
- concatenation of segments of recorded speech
- sounds much more "natural" than formant TTS
- difficult to modify, since it's based on a real human
- expensive: requires lots of manually annotated recordings
- variants: diphone (smaller) and unit-selection (better)
There are still lots of interesting problems:
- multilingual utterances: "multilingual is called flerspråkig in Swedish"; "the president of Georgia is Mikheil Saakashvili"
- ambiguity (homographs): record: /ˈrekərd/ (noun) or /rɪˈkɔːrd/ (verb); entrance: /ˈentrəns/ (noun) or /ɪnˈtrɑːns/ (verb); learned: /ˈlɜːrnɪd/ (adjective) or /lɜːrnd/ (verb); produce: /prəˈd(j)uːs/ (verb) or /ˈprɑːd(j)uːs/ (noun)
- prosody: intonation, emotion, dialect, gender, age
Reading: http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst http://en.wikipedia.org/wiki/Speech_synthesis

Automatic speech recognition (ASR)

All successful ASR systems are statistical, based on Hidden Markov Models (HMM):
- they use an acoustic model and a language model
- they require huge amounts of annotated recordings
ASR is a very difficult problem:
- coarticulation: the sounds representing successive letters blend into each other
- there are often no pauses between successive words
- word error rates on some tasks are less than 1%; on others they can be as high as 50%
Interesting research problems:
- speaker adaptation
- multilingual utterances
- dialects
- recognising prosodic information: intonation, emotion, dialect, gender, age
Reading: http://research.microsoft.com/pubs/80528/SPM-MINDS-I.pdf
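The word error rates mentioned above are computed as the word-level Levenshtein distance between the recogniser's hypothesis and a reference transcript, divided by the reference length. A minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / len(reference),
    computed with dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# A classic coarticulation example: WER can exceed 1.0 when the
# hypothesis has more words than the reference.
wer = word_error_rate("recognise speech", "wreck a nice beach")   # 4/2 = 2.0
```

Note that WER is not a percentage bounded by 100%: an over-segmented hypothesis, as in the "wreck a nice beach" example, can push it above 1.0.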

Other related areas

Recognition/generation of gestures and facial expressions:
- useful for sign languages, or for augmenting speech synthesis with an animated mouth, facial emotions or body posture
Optical character recognition (OCR):
- character error rate ca. 1% for type-written text in Latin script
- much worse for hand-written texts, non-Latin scripts, or historical texts

Main NLP tasks: Segmentation

Token segmentation / Word tokenisation

In English and Swedish, words are usually separated by whitespace, but there are still problems:
- multi-word units: "inter alia", "Kalle Anka"
- abbreviations: "e. g." (English), "t. ex." or "t ex" (Swedish)
- compounds: "flaggstångsknoppspolerare", "Kalle Ankatidning"
- clitics: "doesn't", "you're"
- words split by line-breaking: "news- / paper" (dash should be removed), "co- / education" (dash should be kept)
- split compounds: "bil- och båtägare" (Swedish: "car and boat owners")
- special tokens: currency, phone numbers, URLs, email addresses, etc.
Some languages (e.g., Chinese, Japanese, Thai) do not mark word boundaries in text, which makes the tokenisation problem much harder.
Reading: http://morphadorner.northwestern.edu/morphadorner/wordtokenizer/tokenizationproblems
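A first approximation for English can be written as a single regular expression; the patterns below (for URLs, abbreviations and clitics) are deliberately naive sketches of the problems listed above, not a complete solution:

```python
import re

TOKEN_RE = re.compile(r"""
      https?://[^\s,]+        # URLs (crudely delimited)
    | \w+(?:\.\w+)+\.?        # abbreviations: e.g., U.S.A.
    | \w+'\w+                 # clitics kept as one token: doesn't, you're
    | \w+                     # ordinary words
    | [^\w\s]                 # any other single non-space character
""", re.VERBOSE)

def tokenise(text):
    return TOKEN_RE.findall(text)

tokens = tokenise("E.g. she doesn't read http://example.com, does she?")
# → ['E.g.', 'she', "doesn't", 'read', 'http://example.com', ',', 'does', 'she', '?']
```

Each multi-word unit, compound and split-compound case in the list above would need its own extra machinery, which is why real tokenisers are considerably larger than this.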

Morphological segmentation

Separate words into individual morphemes and identify the class of each morpheme.
- this is not a serious problem in English
- slightly more problematic in Swedish: especially compound words are difficult
- very problematic in agglutinative languages, such as Turkish: uygar-laş-tır-ama-dık-lar-ımız-dan-mış-sınız-casına, "as if you are among those whom we were not able to civilize"
- …or Finnish: lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas, "technical warrant officer trainee specialized in aircraft jet engines"
Reading: http://en.wikipedia.org/wiki/Morphology_(linguistics)
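For compound-rich languages like Swedish, a common baseline is lexicon-driven splitting. A toy recursive sketch (the mini-lexicon is invented, and a real splitter must also handle linking morphemes such as the Swedish "-s-"):

```python
def split_compound(word, lexicon):
    """Split a compound into known lexicon words, preferring the
    analysis with the fewest parts. Returns None if no analysis exists."""
    if word in lexicon:
        return [word]
    best = None
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head in lexicon:
            tail = split_compound(rest, lexicon)
            if tail and (best is None or 1 + len(tail) < len(best)):
                best = [head] + tail
    return best

lexicon = {"bil", "båt", "ägare", "flagga", "stång"}
split_compound("båtägare", lexicon)   # → ['båt', 'ägare']
```

Preferring the fewest parts is a crude heuristic; statistical splitters instead score candidate analyses with corpus frequencies.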

Sentence breaking / splitting

This is not always trivial, especially not in unrestricted text:
- sentences can have recursive structure: "I say 'Hi there!' to her."
- abbreviations (mr., mrs., e.g.) can interfere, and they can share punctuation with the end-of-sentence
- sentences can continue across line breaks, e.g., in bullet lists
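The abbreviation problem can be illustrated with a naive splitter that uses a stop-list; the tiny list here is a stand-in for the much larger, language-specific lists real systems use:

```python
# Abbreviations that should not end a sentence (deliberately tiny).
ABBREVS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}

def split_sentences(text):
    """Break on ./!/? at token ends, unless the token is a known
    abbreviation. Whitespace tokenisation is assumed."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

split_sentences("Mr. Smith arrived. He was late.")
# → ['Mr. Smith arrived.', 'He was late.']
```

Without the stop-list, "Mr." would incorrectly end the first sentence; and the recursive-structure and line-break cases above would still defeat this sketch.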

Main NLP tasks: Tagging and chunking

Part-of-speech tagging

Assigning one (or more) part-of-speech tags to each word in a text.
- many words are ambiguous: "book" can be a noun or a verb; "set" can be a noun, verb or adjective; "out" can be at least five different parts of speech
- unknown words: we need heuristics to decide their part of speech
Approaches:
- rule-based tagging: hand-crafted rewrite rules
- transformation-based tagging: automatically learned rewrite rules
- statistical tagging: HMM-based n-gram tagging
- other machine learning approaches: memory-based learning, decision trees, support vector machines (SVM), maximum entropy Markov models (MEMM), conditional random fields (CRF), perceptron, etc.
The part-of-speech tagset:
- depends on the language, and on your theory of grammar
- English: between 20 and 500 different tags; other languages can have many more POS tags
Current state-of-the-art for English:
- baseline: ~90% (if we always assign each word its most probable tag)
- best: ~97% (with a tagset of ~40 tags)
Reading: http://wiki.apertium.org/wiki/Part-of-speech_tagging
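The ~90% most-probable-tag baseline is easy to reproduce in miniature: train a lookup table of each word's most frequent tag. The three-sentence training corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """For each word, remember its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default="NN"):
    # Unknown words fall back to one default tag; real taggers use
    # suffix and capitalisation heuristics instead.
    return [(w, model.get(w, default)) for w in words]

train = [[("book", "NN"), ("the", "DT"), ("flight", "NN")],
         [("I", "PRP"), ("book", "VB"), ("flights", "NNS")],
         [("the", "DT"), ("book", "NN")]]
model = train_baseline(train)
tag(["the", "book"], model)   # "book" gets NN (2 of its 3 occurrences)
```

The baseline ignores context entirely, which is exactly what the statistical approaches listed above (HMMs, CRFs, etc.) add.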

Text chunking

Dividing a text into syntactically correlated groups of words.
- most common is noun phrase (NP) chunking
- the chunks should be non-recursive; i.e., not full NPs, but NPs without prepositional phrases
Approaches:
- machine learning methods trained on a manually annotated corpus
- hand-crafted grammars or rewrite systems
Reading: http://www.cnts.ua.ac.be/conll2000/chunking
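A hand-crafted NP chunker can be as simple as a pattern over the tag sequence. This sketch accepts an optional determiner, any number of adjectives, then one or more nouns (a Penn-style tagset is assumed):

```python
def np_chunks(tagged):
    """Greedy non-recursive NP chunking over (word, tag) pairs:
    (DT)? (JJ)* (NN|NNS)+ — no prepositional phrases, as required."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in ("NN", "NNS"):
            k += 1
        if k > j:   # at least one noun: emit a chunk
            chunks.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

np_chunks([("the", "DT"), ("old", "JJ"), ("man", "NN"),
           ("sails", "VBZ"), ("the", "DT"), ("boat", "NN")])
# → ['the old man', 'the boat']
```

Note that the chunker's output depends entirely on the tagger's decisions: tag "man" as a verb (as in the garden-path reading) and the first chunk disappears.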

Named entity recognition (NER)

Determine which items in a text map to proper names (e.g., people or places), and what the type of each such name is (e.g., person, location, organization).
- names are often capitalised in English (but not always), and most capitalised words are names (but not always)
- in German, all nouns are capitalised
- in French and Spanish, names serving as adjectives are not capitalised
- many non-Latin scripts don't have capitalisation at all
Reading: http://en.wikipedia.org/wiki/Named_entity_recognition
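The capitalisation cue can be turned into a deliberately naive baseline: collect maximal runs of capitalised tokens, skipping sentence-initial position (where capitalisation is uninformative). Real systems use gazetteers and sequence models instead; the example sentence is invented:

```python
def candidate_names(tokens):
    """Naive English NER heuristic: maximal runs of capitalised tokens,
    ignoring the first token of the sentence."""
    names, run = [], []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper() and i != 0:
            run.append(tok)
        else:
            if run:
                names.append(" ".join(run))
            run = []
    if run:
        names.append(" ".join(run))
    return names

candidate_names("Yesterday Anna Lindh met IBM representatives".split())
# → ['Anna Lindh', 'IBM']
```

The "but not always" caveats above are exactly where this heuristic fails: lowercase names are missed, and capitalised non-names (or sentence-initial names) are mishandled.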

Main NLP tasks: Syntax

Grammars

Grammars specify the grammatical structure of the sentences in a language. Sometimes we use context-free grammars (also known as BNF grammars), but more often higher-order grammar formalisms such as:
- GF: Grammatical Framework
- HPSG: Head-Driven Phrase-Structure Grammar
- TAG: Tree-Adjoining Grammar
- LFG: Lexical-Functional Grammar
- DG: Dependency Grammar
Reading: http://www.grammaticalframework.org/ http://www.cl.cam.ac.uk/teaching/1011/L100/syn-sem-disc.pdf http://www.sfs.uni-tuebingen.de/~fr/current/textbook.html http://en.wikipedia.org/wiki/Head-driven_phrase_structure_grammar http://en.wikipedia.org/wiki/Tree-adjoining_grammar http://en.wikipedia.org/wiki/Lexical_functional_grammar

Parsing

Parsing = determining the parse tree of a given sentence. NLP parsing is very different from the parsing of programming languages:
- most sentences are ambiguous, even massively ambiguous: a typical sentence may have thousands of potential parses, most of which seem completely nonsensical to a human
- programming languages, on the other hand, are never ambiguous
- this calls for totally different types of parsing algorithms: we need an algorithm that returns the most probable tree
Approaches:
- hand-written grammars: sometimes context-free, but more often in higher-order grammar formalisms like the ones above
- grammars trained from annotated corpora (treebanks), automatically or semi-automatically; sometimes the grammar is skipped and the parser is trained directly from the corpus
- hybrid methods: a hand-written backbone in some grammar formalism, with probabilities and parsing heuristics trained from a corpus
Reading: http://www.cse.chalmers.se/~peb/pubs/LjunglofWiren-2009a.pdf http://stp.ling.uu.se/~nivre/docs/05133.pdf
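The ambiguity is easy to demonstrate with a CKY-style chart that counts parse trees for a grammar in Chomsky normal form. The toy grammar below makes the PP-attachment ambiguity in the Groucho Marx example visible:

```python
from collections import defaultdict

def count_parses(words, grammar, lexicon, start="S"):
    """CKY-style chart that counts how many parse trees each
    nonterminal spans, for a binary (CNF) grammar."""
    n = len(words)
    chart = defaultdict(int)   # (i, j, category) -> number of trees
    for i, w in enumerate(words):
        for cat in lexicon.get(w, []):
            chart[i, i + 1, cat] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs, (b, c) in grammar:
                    chart[i, j, lhs] += chart[i, k, b] * chart[k, j, c]
    return chart[0, n, start]

# Toy grammar: "with the telescope" can attach to the VP or the NP.
grammar = [("S", ("NP", "VP")), ("VP", ("V", "NP")), ("VP", ("VP", "PP")),
           ("NP", ("Det", "N")), ("NP", ("NP", "PP")), ("PP", ("P", "NP"))]
lexicon = {"I": ["NP"], "saw": ["V"], "the": ["Det"],
           "man": ["N"], "with": ["P"], "telescope": ["N"]}
count_parses("I saw the man with the telescope".split(), grammar, lexicon)
# → 2
```

Each additional PP multiplies the parse count (following the Catalan numbers), which is how "thousands of potential parses" arise for ordinary sentences; a probabilistic parser replaces the counts with max-probability scores.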

Text generation

The opposite of parsing: generating utterances from syntactic (and semantic) structure.
- sometimes we use higher-order grammar formalisms, but there are lots of other approaches
- we need heuristics for deciding which surface structure is the "best"
- classical AI techniques can be used, such as planning
- the "best" generated sentence depends on the application, the audience and the context
Reading: http://www.fb10.uni-bremen.de/anglistik/langpro/webspace/jb/info-pages/nlg/ATG01

Main NLP tasks: Meaning: semantics, pragmatics

Computational semantics

Specify the formal meaning of sentences/utterances, and of longer texts.
- the classic example (by Richard Montague, 1970) is first-order logic with lambda calculus
- type theory can be used as an alternative
- discourse representation theory (DRT) can handle meaning across sentence boundaries
- type theory with records is another alternative
- the latest trend is minimal recursion semantics (MRS), which uses underspecification to reduce the number of ambiguities
Reading: http://www.coli.uni-saarland.de/projects/milca/esslli http://www.cs.bham.ac.uk/~hxt/cw04/barker.pdf http://plato.stanford.edu/entries/discourse-representation-theory http://lingo.stanford.edu/sag/papers/copestake.pdf

Word sense disambiguation

To select the word meaning which makes the most sense in context, typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
Reading: http://en.wikipedia.org/wiki/Word-sense_disambiguation
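A classic dictionary-based baseline is the simplified Lesk algorithm: pick the sense whose definition shares the most words with the context. The two-sense inventory below is invented for illustration; a real system would use a dictionary or WordNet:

```python
def simplified_lesk(word, context, sense_definitions):
    """Pick the sense whose definition overlaps most with the context
    (Lesk's word-overlap heuristic, in its simplest form)."""
    context_words = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, definition in sense_definitions[word].items():
        overlap = len(context_words & set(definition.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

senses = {"bank": {
    "bank/finance": "an institution for money deposits and loans",
    "bank/river": "the sloping land beside a river or lake",
}}
simplified_lesk("bank", "she sat by the river on the grassy bank", senses)
# → 'bank/river'
```

The sparse-data problem discussed earlier shows up immediately here: definitions are short, so overlaps are tiny, and modern systems instead compare senses and contexts in a learned vector space.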

Coreference resolution

To determine which words refer to the same objects in a text. Anaphora resolution is the most common example: matching pronouns with the nouns or names that they refer to.
Reading: http://en.wikipedia.org/wiki/Coreference

Relationship extraction

To identify the relationships among named entities in a text.
Example: "Albert's niece Ann got engaged to John."
Inferred relations: DaughterOfSibling(Albert, Ann); Engaged(Ann, John)
Reading: http://en.wikipedia.org/wiki/Relationship_extraction

Speech act classification

To classify the speech act of utterances in a discourse. Possible speech acts: yes-no question, content question, statement, assertion, etc.
Reading: http://en.wikipedia.org/wiki/Speech_act

Main NLP tasks: Text-level analysis

Automatic summarisation

Produce a readable summary of a text, or of several texts, such as newspaper articles or patient journals.
Reading: http://en.wikipedia.org/wiki/Automatic_summarization

Document classification / text categorisation

To assign documents to one or more categories, based on their content.
Reading: http://en.wikipedia.org/wiki/Document_classification

Information retrieval (IR)

Storing, searching and retrieving information from texts or databases.
Reading: http://en.wikipedia.org/wiki/Information_retrieval http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Information extraction (IE)

Extracting semantic information from text; this covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.
Reading: http://en.wikipedia.org/wiki/Information_extraction

Page updated: 2012-03-13
