Identifying Linguistic Structure in a Quantitative Analysis of Bulgarian Dialect Pronunciation

Jelena Prokic

03.11.2006. Sofia

Outline

- The goal of the thesis
  - Aggregate analysis
  - Identification of linguistic structure in the aggregate analysis
- Previous work
- Aggregate analysis
  - New data set
  - L04
- Regular sound correspondences
  - Extraction
  - Quantification
  - Results

The Goal of the Thesis

- To do an aggregate analysis of the Bulgarian dialects using
  - a new data set
  - L04
- To identify the underlying linguistic structure in the aggregate analysis
  - regular sound correspondences were extracted from the aligned pairs of words
  - for the 10 most frequent sound correspondences, a separate analysis of each site was made

Previous Work

- Aggregate analysis of dialect divisions
  - successfully applied to various languages
  - applied to Bulgarian by Osenova et al. (2006)
- Identification of linguistic structure in the aggregate analysis
  - aggregating over a subset of data (Nerbonne, 2005)
  - factor analysis (Nerbonne, 2006)
- Extraction of sound correspondences
  - Kondrak (2002) applied it in the task of cognate identification

Osenova et al. 2006

- Aggregate analysis of dialect divisions in Bulgaria
  - data set: 36 words collected from 490 sites
  - suprasegmentals and diacritics were removed
  - L04 toolkit
- Cluster analysis
- Multidimensional scaling

Osenova et al. 2006 Cont.

[Figure: Map of Bulgarian dialect divisions taken from Stoykov (2002)]

Osenova et al. 2006 Cont.

[Figure: Classification map from Osenova et al. (2006)]

Osenova et al. 2006 Cont.

[Figure: Continuum map from Osenova et al. (2006)]

Osenova et al. 2006 Cont.

- Both maps give a reliable picture of the dialect divisions
  - the most important division is between East and West
  - the Rodopi area is the most incoherent
  - the area around Varna and Shumen is distinct from the neighbouring areas
  - the area around Teteven is also distinct
- Dialectometrical methods were successfully applied to a Slavic language for the first time

Extraction of Linguistic Structure

- Nerbonne (2005)
  - aggregates over a subset of the data, namely vowels
  - the differences between the sites are calculated using both complete phonetic transcriptions and vowels only
  - results: vowels are probably responsible for a great deal of the aggregate differences (r = 0.936)
- Nerbonne (2006)
  - applies factor analysis to the results of the dialectometrical analysis
  - only vowels are investigated
  - results: 3 factors are most important, explaining 35% of the total amount of variance

Sound Correspondences 

Kondrak (2002) extracts regular sound correspondences and uses them to identify cognates in a bilingual word list



Melamed’s parameter estimation models were adopted and used to determine sound correspondences



The more regular sound correspondences two words contain, the more likely it is that they are cognates and not borrowings



This method has outperformed other methods for cognate identification


New Data Set 

Data from the project Buldialect – Measuring linguistic unity and diversity in Europe



117 words collected from 84 sites



Words include nouns, verbs, pronouns, and prepositions in different word forms



All phonetic transcriptions were in X-SAMPA format


Distribution of 84 Sites

[Figure: Distribution of the 84 sites from the new data set]

Part I: Aggregate Analysis

- L04 toolkit
  - alignment of word transcriptions
  - Levenshtein algorithm
  - cluster analysis
  - multidimensional scaling
- Preprocessing of the data
  - suprasegmentals and diacritics were removed
    - s' s\ "s *s *"s "s\ all represented as s
  - palatalized/non-palatalized opposition preserved

Aggregate Analysis Cont.

- Alignments were based on the following principles:
  - a vowel can match only with a vowel
  - a consonant can match only with a consonant
  - [i] and [u] can match both with vowels and sonorants
  - [j] can match both with vowels and consonants

Example 1:

[ 4] zelenigrad   b    e  l  i
[24] merichleri   b_j  a  l  i
------------------------------
                  1    1
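The matching principles above can be sketched as a small predicate. This is only an illustration: the segment inventories `VOWELS` and `SONORANTS` and the helper names `base` and `can_match` are assumptions made for the example, not the segment classes used by L04 or in the thesis. Anything not listed as a vowel is treated as a consonant.

```python
# Hypothetical, deliberately small X-SAMPA inventories (assumptions for illustration).
VOWELS = {"a", "e", "i", "o", "u", "@"}
SONORANTS = {"m", "n", "l", "r", "j"}


def base(segment: str) -> str:
    """Drop an X-SAMPA diacritic suffix such as the "_j" in "b_j"."""
    return segment.split("_")[0]


def can_match(seg1: str, seg2: str) -> bool:
    """Return True if the two segments are allowed to be aligned with each other."""
    a, b = base(seg1), base(seg2)
    a_is_vowel, b_is_vowel = a in VOWELS, b in VOWELS
    # Vowel with vowel, or consonant with consonant: always allowed.
    if a_is_vowel == b_is_vowel:
        return True
    vowel, consonant = (a, b) if a_is_vowel else (b, a)
    # [j] may match vowels as well as consonants.
    if consonant == "j":
        return True
    # [i] and [u] may also match sonorants.
    return vowel in {"i", "u"} and consonant in SONORANTS


print(can_match("b", "b_j"))  # consonant with consonant
print(can_match("e", "a"))    # vowel with vowel
print(can_match("i", "l"))    # [i] with a sonorant
print(can_match("e", "l"))    # other vowel with a consonant
```

An aligner could call such a predicate before allowing a substitution, so that forbidden pairs are handled as an insertion plus a deletion instead.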

Aggregate Analysis Cont. 

Insertions, deletions, and substitutions have the same cost – 1



The distance between two strings was normalized by the length of the longest alignment that gives the minimal cost



The distance between two aligned strings in Example 1 would be 0.5



Distances between the aligned pairs of transcriptions are used to calculate the distance between each pair of sites



The results were analyzed using cluster and multidimensional scaling (MDS) analyses
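A minimal sketch of the distance computation described above, assuming unit costs and normalization by the length of the longest minimal-cost alignment. The function names are hypothetical, the vowel/consonant matching constraints are left out for brevity, and averaging word distances into a site distance is an assumption (the slides only say that word distances are used to calculate site distances). This is not the L04 implementation.

```python
from statistics import mean


def normalized_levenshtein(a, b):
    """Unit-cost edit distance between two segment lists, divided by the
    length of the longest alignment that achieves the minimal cost."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = length[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = length[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # (cost, alignment length) for substitution/match, deletion, insertion.
            options = [
                (cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]), length[i - 1][j - 1] + 1),
                (cost[i - 1][j] + 1, length[i - 1][j] + 1),
                (cost[i][j - 1] + 1, length[i][j - 1] + 1),
            ]
            best = min(c for c, _ in options)
            cost[i][j] = best
            # Among the cheapest options, keep the longest alignment.
            length[i][j] = max(l for c, l in options if c == best)
    return cost[n][m] / length[n][m] if length[n][m] else 0.0


def site_distance(words_site1, words_site2):
    """Mean word distance over paired transcriptions (assumed aggregation)."""
    return mean(normalized_levenshtein(w1, w2)
                for w1, w2 in zip(words_site1, words_site2))


# The pair from Example 1: two substitutions over an alignment of length 4.
print(normalized_levenshtein(["b", "e", "l", "i"], ["b_j", "a", "l", "i"]))  # 0.5
```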


Dendrograms

[Figure: two dendrograms with the site names as leaves, one for the old data set (Osenova et al., 2006) and one for the new data set]

Cluster Maps

[Figure: cluster maps for the old data set and the new data set]

MDS Maps

[Figure: MDS maps for the old data set and the new data set]

Results



Clear division between East and West (‘yat’ realization border)



Rodopi area is the most incoherent



Both the cluster map and the MDS map conform to the maps presented in Osenova et al. (2006) and to the map presented in Stoykov (2002)



The new data set gave a faithful picture of the dialect divisions in Bulgaria

Part II: Regular Sound Correspondences 

Problem: How to extract linguistic structure from aggregate comparison?



Suprasegmentals and diacritics were removed



Word pronunciation transcriptions were aligned using L04



For each pair of sites one best alignment for every word is taken into account (1.18 alignments per word pronunciation pair)

Example 2:

f  n  u   t  r  e
v  ø  dz  t  r  e
-----------------
1  1  1

f  n  u   t  r  e
ø  v  dz  t  r  e
-----------------
1  1  1

Regular Sound Correspondences Cont. 

The phonetic distance between two segments is not taken into account: segments are either identical or not



Segments that do not match were extracted from all aligned pairs and sorted according to their frequency


Regular Sound Correspondences Cont.

Example 3:

Babjak   j  a  ø
Golica   ø  ǡ  s
----------------
         1  1  1

Beglezh  ø  a  s
S. Dol   j  a  ø
----------------
         1     1

phon1  phon2  No.
j      ø      2
a      ǡ      1
s      ø      2

Table 1: Sound correspondences extracted from the alignments in Example 3
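The extraction step can be sketched as a frequency count over aligned segment pairs. The sketch assumes that alignments are given as two equal-length rows with "ø" marking a gap, and that a correspondence is counted regardless of which site shows which sound; both are assumptions for the example, and `count_mismatches` is a hypothetical name.

```python
from collections import Counter

GAP = "ø"  # gap symbol used in the aligned rows (assumed notation)


def count_mismatches(alignments):
    """Count non-matching segment pairs over a list of aligned row pairs."""
    counts = Counter()
    for row1, row2 in alignments:
        for s1, s2 in zip(row1, row2):
            if s1 != s2:
                # Sort the pair so that (j, ø) and (ø, j) are counted together.
                counts[tuple(sorted((s1, s2)))] += 1
    return counts


# The two alignments of Example 3.
alignments = [
    (["j", "a", GAP], [GAP, "ǡ", "s"]),  # Babjak vs. Golica
    ([GAP, "a", "s"], ["j", "a", GAP]),  # Beglezh vs. S. Dol
]

# Most frequent correspondences first.
for pair, number in count_mismatches(alignments).most_common():
    print(pair, number)
```

Run over all site pairs and all words, the same counter would give a ranking of correspondences by frequency.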

Regular Correspondences Cont.

For each pair of sites and every word, the correspondences were summed

Results:

phon1  phon2  No.
e      i      52246
o      u      40981
dz     ø      39414
ǡ      e      33391
ǡ      dz     33184
ə      dz     32753
e      dz     32177
ǡ      ə      28976
v      ø      22462
j      ø      21475

Table 2: 10 most frequent correspondences from the whole data set

Eight out of the ten most frequent correspondences involve substitution or insertion/deletion of vowels

Correspondence Index

- The correspondence index is obtained by comparing every site to all other sites with respect to the first ten correspondences
- Goal:
  - to see whether the site belongs to the group where one or the other sound is present
  - to see whether there is geographical cohesion among the sites that use one or the other sound in the correspondence
- Method:
  - only one best alignment for each word pronunciation pair was taken into account
  - all sound correspondences were extracted, both matching and non-matching

phon1  phon2  No.
r      r      35
ǡ      ǡ      35
e      i      29
o      u      27
e      e      27
s      s      26
k      k      25
d      d      24
l      l      24
v      v      24

Table 3: 10 most frequent correspondences for the pair Aldomirovci-Borisovo

Correspondence Index Cont.

For each pair of the most frequent correspondences (Table 2), a correspondence index is calculated for each site using the following formula:

\[ \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} S_i \rightarrow S_j , \qquad i = 1, \dots, n \]

$n$ - number of sites

$S_i \rightarrow S_j$ - comparison of each 2 sites with respect to a certain sound correspondence

Correspondence Index Cont.

$S_i \rightarrow S_j$ is calculated by applying the following formula:

\[ S_i \rightarrow S_j = \frac{|s,s'|}{|s,s'| + |s,s|} \]

$|s,s'|$ - the number of times sound s, seen in the word pronunciation collected at site1, was aligned with s' in the word pronunciation collected at site2

$|s,s|$ - the number of times sound s, seen in the word pronunciation collected at site1, stayed unchanged

Correspondence Index Cont.

Correspondence index for the pair [e]-[i] for Aldomirovci and Borisovo:

s    s'   No.
e    i    29
i    e    0
e    e    27

Table 4: Number of times [e] corresponds to [e] and [i] for the site pair Aldomirovci-Borisovo

Index for site1 (Aldomirovci):

\[ \frac{|e,i|}{|e,i| + |e,e|} = \frac{29}{29 + 27} = 0.5178 \]

Index for site2 (Borisovo):

\[ \frac{|e,i|}{|e,i| + |e,e|} = \frac{0}{0 + 27} = 0.0 \]
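The pairwise index for the Aldomirovci-Borisovo example can be reproduced with a few lines. This is a sketch under assumptions: `correspondence_index` is a hypothetical name, the aligned segments are assumed to be available as (site1 segment, site2 segment) tuples, and returning 0.0 when sound s never occurs is a choice made here, not something the slides specify.

```python
def correspondence_index(aligned_pairs, s, s_prime):
    """|s,s'| / (|s,s'| + |s,s|) for one ordered pair of sites."""
    changed = sum(1 for x, y in aligned_pairs if x == s and y == s_prime)
    unchanged = sum(1 for x, y in aligned_pairs if x == s and y == s)
    total = changed + unchanged
    # Assumption: the index is 0.0 when s never occurs at the first site.
    return changed / total if total else 0.0


# Counts for Aldomirovci (site1) vs. Borisovo (site2):
# [e] aligned with [i] 29 times, [e] aligned with [e] 27 times.
pairs = [("e", "i")] * 29 + [("e", "e")] * 27

index_site1 = correspondence_index(pairs, "e", "i")
# Seen from site2, every aligned pair is reversed.
index_site2 = correspondence_index([(y, x) for x, y in pairs], "e", "i")

print(index_site1, index_site2)
```

Swapping the order of each aligned pair is what makes the index asymmetric: the same alignments give a high value for the site that has [e] where the other has [i], and zero in the opposite direction.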

Correspondence Index Cont.

- Every site was compared to all other sites, resulting in 83 indexes per site
- The general correspondence index for each site is the mean of all 83 indexes
  - Aldomirovci 0.2328
  - Borisovo 0.1538
- Sites with higher values of the general index are sites where the sound [e] tends to be present
- Sites with lower values of the general index are sites where the sound [i] tends to be present
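Averaging the pairwise indexes into a general index per site can be sketched as follows. The three-site matrix is invented toy data used only to show the computation (the real analysis has 84 sites and therefore 83 values per site), and `general_indexes` is a hypothetical name.

```python
from statistics import mean


def general_indexes(pairwise):
    """pairwise[i][j] holds the index of site i compared to site j.
    The diagonal is ignored; each site gets the mean over all other sites."""
    n = len(pairwise)
    return [mean(pairwise[i][j] for j in range(n) if j != i) for i in range(n)]


# Hypothetical pairwise indexes for three sites (toy values, not thesis data).
toy = [
    [0.0, 0.5, 0.1],
    [0.0, 0.0, 0.2],
    [0.3, 0.4, 0.0],
]

for site, value in enumerate(general_indexes(toy)):
    print(site, round(value, 4))
```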

Correspondence Index Cont.

- The general correspondence index was calculated for every site with respect to the 10 most frequent correspondences found in the data set
- General indexes were analyzed using composite clustering and the MDS-cophenetic method, resulting in 2 types of maps:
  - composite cluster maps
  - MDS-cophenetic maps

[Figure: [e]-[i] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [o]-[u] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [dz]-[ø] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [ǡ]-[e] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [ǡ]-[dz] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [ə]-[dz] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [e]-[dz] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [ǡ]-[ə] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [v]-[ø] correspondence, composite cluster map and MDS-cophenetic map]

[Figure: [j]-[ø] correspondence, composite cluster map and MDS-cophenetic map]

Results

- The maps show that there is geographical cohesion in the distribution of sites
- The maps show similarity with the traditional maps
- The West-East division is based on the following correspondences: [e]-[i], [o]-[u], [ǡ]-[e], [ǡ]-[dz], [e]-[dz], [ǡ]-[ə], [v]-[ø]
- The area around Kozichino and Golica is characterized by the presence of the [e], [ǡ], and [v] sounds

Drawbacks of the Method



Analyzes only one sound alternation at a time



In the analysis of the sound alternations no context is taken into account


Future Work 

More sites should be included



Instead of a simple phone representation of segments, feature representation of segments should be used



Stress should be included



MDS-cophenetic maps should include scale


References 

[Kondrak 2002] G. Kondrak. Algorithms for Language Reconstruction. PhD Thesis, University of Toronto.



[Nerbonne 2005] John Nerbonne. Various Variation Aggregates in the LAMSAS South. Accepted to appear in (10/2005) Catherine Davis and Michael Picone (eds.) Language Variety in the South III. Tuscaloosa: University of Alabama Press.



[Nerbonne 2006] John Nerbonne. Identifying Linguistic Structure in Aggregate Comparison. Accepted (5/2006) to appear in Literary and Linguistic Computing 21(4), 2006. (J.Nerbonne & W.Kretzschmar, Jr. (eds.) Progress in Dialectometry: Toward Explanation)



[Osenova et al. 2006] Petya Osenova, Wilbert Heeringa and John Nerbonne. A Quantitative Analysis of Bulgarian Dialect Pronunciation. To appear.



[Stoykov 2002] S. Stoykov. Bulgarska dialektologiya. Sofia, 4th ed.
