
Chapter 8  Hashing
C-C Tsai

Outline:
- Concept of Hashing
- Static Hashing
- Dynamic Hashing

Concept of Hashing

In computer science, a hash table (or hash map) is a data structure that associates keys (names) with values (attributes). Related notions include:
- Look-up table
- Dictionary
- Cache
- Extended array

Origins of the Term „

The term "hash" comes by way of analogy with its standard meaning in the physical world, to "chop and mix.” D. Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term hash came into use some ten years later.

C-C Tsai

Example

A small phone book as a hash table. (Figure is from Wikipedia)

2

Dictionaries

- A dictionary is a collection of (key, value) pairs; each pair has a unique key.
- Operations:
  - Get(theKey)
  - Delete(theKey)
  - Insert(theKey, theValue)
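
A minimal C interface sketch of such a dictionary ADT (the type and function names are illustrative, not from the slides):

typedef struct {
    int         key;       /* unique key */
    const char *value;     /* associated value/attribute */
} pair;

typedef struct dictionary dictionary;   /* opaque dictionary handle (hypothetical) */

pair *dict_get(dictionary *d, int theKey);                           /* Get(theKey)              */
int   dict_delete(dictionary *d, int theKey);                        /* Delete(theKey)           */
int   dict_insert(dictionary *d, int theKey, const char *theValue);  /* Insert(theKey, theValue) */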

Just An Idea

- A hash table consists of:
  - a collection of pairs, and
  - a lookup function (the hash function).
- Hash tables are often used to implement associative arrays.
- Worst-case time for Get, Insert, and Delete is O(size); expected time is O(1).

Search vs. Hashing

- Search tree methods are based on key comparisons.
  - Time complexity: O(size) or O(log n).
- Hashing methods are based on hash functions.
  - Expected time: O(1).
- Two types of hashing:
  - Static hashing
  - Dynamic hashing

Static Hashing

- Key-value pairs are stored in a fixed-size table called a hash table.
  - A hash table is partitioned into b buckets.
  - Each bucket has s slots; each slot holds one record.
  - A hash function f(x) transforms the identifier (key) into an address (a bucket index in the range 0 to b-1) in the hash table.

(Figure: a hash table laid out as b buckets, numbered 0 to b-1, each with s slots, numbered 0 to s-1.)

Example of Hash table

A table with 26 buckets (0 to 25, one per letter) and 2 slots per bucket, hashing identifiers by their first character:

Bucket    Slot 0    Slot 1
0 (a)     acos      atan
2 (c)     char      ceil
3 (d)     define
4 (e)     exp
5 (f)     float     floor

Data Structure for Hash Table

#define MAX_CHAR   10
#define TABLE_SIZE 13

typedef struct {
    char key[MAX_CHAR];
    /* other fields */
} element;

element hash_table[TABLE_SIZE];

Other Extensions

- Hash list and hash tree. (Figure is from Wikipedia.)

Formal Definition

- A hash function h maps the key space into the bucket addresses of the table: h: K -> {0, 1, ..., b-1}.
- In addition, it may ideally be one-to-one / onto.

The Scheme

(Figure is from Data Structures and Program Design in C++, Prentice-Hall, 1999.)

Ideal Hashing

- Uses an array table[0:b-1].
  - Each position of this array is a bucket.
  - A bucket can normally hold only one dictionary pair.
- Uses a hash function f that converts each key k into an index in the range [0, b-1].
- Every dictionary pair (key, element) is stored in its home bucket table[f(key)].

Example

- Pairs are: (22,a), (33,c), (3,d), (73,e), (85,f).
- Hash table is table[0:7], b = 8.
- Hash function is f(key) = key / 11 (integer division). For example: 22 / 11 = 2, 33 / 11 = 3, 3 / 11 = 0, ...
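
A small C sketch of this example: it places each pair in its home bucket using f(key) = key / 11 (the helper name home_bucket is illustrative):

#include <stdio.h>

#define B 8                       /* number of buckets */

int home_bucket(int key) { return key / 11; }    /* integer division */

int main(void) {
    int  keys[]   = {22, 33, 3, 73, 85};
    char values[] = {'a', 'c', 'd', 'e', 'f'};
    char table[B] = {0};          /* 0 means "empty bucket" */

    for (int i = 0; i < 5; i++)
        table[home_bucket(keys[i])] = values[i];

    for (int b = 0; b < B; b++)
        printf("bucket %d: %c\n", b, table[b] ? table[b] : '-');
    return 0;
}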

What Can Go Wrong?

- Where does (26,g) go?
- Keys that have the same home bucket are synonyms.
  - 22 and 26 are synonyms with respect to the hash function in use: 22 / 11 = 2 and 26 / 11 = 2.
- The bucket for (26,g) is already occupied.

Some Issues

- Choice of hash function.
  - Really tricky!
  - Goal: avoid collisions (two different pairs in the same bucket).
- Size (number of buckets) of the hash table.
- Overflow handling method.
  - Overflow: there is no space in the bucket for the new pair.

Example

- Synonyms: char, ceil, and clock all hash to bucket 2.
- With only two slots per bucket, bucket 2 is filled by char and ceil, so inserting clock causes an overflow.
- Likewise, float and floor are synonyms in bucket 5.

(Figure: the 26-bucket, 2-slot table from the previous example, with the synonyms in bucket 2 and the overflow marked.)

Choice of Hash Function

- Requirements:
  - easy to compute
  - minimal number of collisions
- If a hashing function groups key values together, this is called clustering of the keys.
- A good hashing function distributes the key values uniformly throughout the range.

Some hash functions

- Division:
  - H(x) := return x % k
- Middle of square:
  - H(x) := return the middle digits of x^2
- Multiplicative:
  - H(x) := return the first few digits of the fractional part of x*k, where k is a fraction (advocated by D. Knuth in TAOCP vol. III).
- Digit analysis:
  - If all the keys are known in advance, we can delete the digits of the keys that have the most skewed distributions and use the remaining digits as the hash address.
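
A minimal C sketch of the division, middle-of-square, and multiplicative methods listed above (the table size and the choice of "middle" bits are illustrative assumptions):

#include <stdint.h>

#define NBUCKETS 13u                 /* illustrative table size */

/* Division: H(x) = x mod k */
unsigned hash_division(uint32_t x) {
    return x % NBUCKETS;
}

/* Middle of square: take bits from the middle of x*x */
unsigned hash_mid_square(uint32_t x) {
    uint64_t sq = (uint64_t)x * x;
    return (unsigned)((sq >> 16) & 0xFFFF) % NBUCKETS;   /* middle 16 bits */
}

/* Multiplicative (Knuth): use the fractional part of x*A, A a fraction */
unsigned hash_multiplicative(uint32_t x) {
    const double A = 0.6180339887;            /* (sqrt(5)-1)/2, a common choice */
    double frac = x * A - (uint64_t)(x * A);  /* fractional part of x*A */
    return (unsigned)(frac * NBUCKETS);       /* scale it to a bucket index */
}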

Some hash functions (Cont'd)

- Folding: partition the identifier x into several parts and add the parts together to obtain the hash address.
  - Example: if x = 12320324111220 and we partition x into p1=123, p2=203, p3=241, p4=112, p5=20, then the address is the sum of the pi = 123+203+241+112+20 = 699.
  - Shift folding vs. folding at the boundaries: sum of the pi = 123+302+241+211+20 = 897 (p2 and p4 are reversed).
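
A small C sketch reproducing the folding example above (the parts of x are hard-coded for clarity):

#include <stdio.h>

int main(void) {
    /* Parts of x = 12320324111220, partitioned from the left as on the slide. */
    int parts[] = {123, 203, 241, 112, 20};
    int n = sizeof(parts) / sizeof(parts[0]);

    /* Shift folding: simply add the parts. */
    int shift = 0;
    for (int i = 0; i < n; i++)
        shift += parts[i];

    /* Folding at the boundaries: reverse the digits of every other part
       (here p2 = 203 -> 302 and p4 = 112 -> 211, as on the slide). */
    int boundary = 123 + 302 + 241 + 211 + 20;

    printf("shift folding: %d\n", shift);          /* 699 */
    printf("boundary folding: %d\n", boundary);    /* 897 */
    return 0;
}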

Hashing By Division

- The domain is all integers. For a hash table of size b, the number of integers that get hashed into bucket i is approximately 2^32 / b.
- Over the whole domain, the division method therefore results in a uniform hash function that maps approximately the same number of keys into each bucket.

Hashing By Division II

- In practice, keys tend to be correlated.
- If the divisor is an even number, odd integers hash into odd home buckets and even integers into even home buckets.
  - Example: 20%14 = 6, 30%14 = 2, 8%14 = 8; 15%14 = 1, 3%14 = 3, 23%14 = 9.
- If the divisor is an odd number, odd (even) integers may hash into any home bucket.
  - Example: 20%15 = 5, 30%15 = 0, 8%15 = 8; 15%15 = 0, 3%15 = 3, 23%15 = 8.

Hashing By Division III

- A similarly biased distribution of home buckets is seen in practice when the divisor is a multiple of small primes such as 3, 5, 7, ...
- The effect of each prime divisor p of b decreases as p gets larger.
- Ideally, choose b to be a large prime number. Alternatively, choose b so that it has no prime factor smaller than 20.

Hash Algorithm via Division

void init_table(element ht[])
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        ht[i].key[0] = '\0';    /* every bucket starts empty */
}

Probing example for key 14 (table size 13, probe sequence h(14,i) = (1 + 4*i) mod 13):
- h(14,0) = 14 mod 13 = 1 -> slot 1 already holds 79
- h(14,1) = (1 + 1*4) mod 13 = 5 -> slot 5 already holds 98
- h(14,2) = (1 + 2*4) mod 13 = 9 -> 14 is stored in slot 9
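
The table in this chapter stores string identifiers (element.key), so a key must first be converted to an integer before the division method applies. One common textbook choice, shown here as an assumption rather than the slides' own transform, is to sum the characters:

#include <stddef.h>

#define TABLE_SIZE 13            /* same value as in the earlier data structure */

/* Convert a string key to an integer by summing its characters,
   then apply the division method. */
unsigned hash_string(const char *key) {
    unsigned sum = 0;
    for (size_t i = 0; key[i] != '\0'; i++)
        sum += (unsigned char)key[i];
    return sum % TABLE_SIZE;
}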

Random Probing

- Random probing works by incorporating random numbers:
  - H(x, i) := (H'(x) + S[i]) % b
  - S is a table of size b-1.
  - S is a random permutation of the integers [1, b-1].
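
A hedged C sketch of random probing: S is filled once with a random permutation of 1..b-1 (the Fisher-Yates shuffle is an implementation choice, not something the slide specifies):

#include <stdlib.h>

#define B 13                      /* illustrative table size */

static int S[B - 1];              /* S: random permutation of 1..B-1 */

void build_probe_table(void) {
    for (int i = 0; i < B - 1; i++)
        S[i] = i + 1;
    for (int i = B - 2; i > 0; i--) {      /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = S[i]; S[i] = S[j]; S[j] = t;
    }
}

/* i-th probe position for a key whose base hash is h (i = 1..B-1). */
int random_probe(int h, int i) {
    return (h + S[i - 1]) % B;
}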

Rehashing

- Rehashing: if a collision occurs, try H1, H2, ..., Hm in sequence, where each Hi is a hash function.
- Double hashing is one of the best methods for dealing with collisions.
  - If the slot is full, a second hash function is calculated and combined with the first one:
  - H(k, i) = (H1(k) + i * H2(k)) % m
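
A short C sketch of the double-hashing probe sequence H(k, i) = (H1(k) + i*H2(k)) % m; the particular H1 and H2 below are illustrative choices:

#define M 13                      /* table size, ideally prime */

int h1(int k) { return k % M; }
int h2(int k) { return 1 + (k % (M - 1)); }   /* never 0, so every slot is reachable */

/* i-th probe position for key k, i = 0, 1, 2, ... */
int double_hash(int k, int i) {
    return (h1(k) + i * h2(k)) % M;
}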

Summary: Hash Table Design

- Performance requirements are given; determine the maximum permissible loading density alpha.
- Hash functions must usually be custom-designed for the kind of keys used for accessing the hash table.
- Suppose we want a successful search to make no more than 10 comparisons (expected):
  - Sn ~ (1/2)(1 + 1/(1 - alpha)) <= 10 gives alpha <= 18/19.
- Suppose we want an unsuccessful search to make no more than 13 comparisons (expected):
  - Un ~ (1/2)(1 + 1/(1 - alpha)^2) <= 13 gives alpha <= 4/5.
- So alpha <= min{18/19, 4/5} = 4/5.
- With chaining, the expected chain length is alpha, and Sn ~ 1 + alpha/2, Un ~ alpha.
- Refer to D. Knuth's TAOCP vol. III for the proofs.

Comparison

- Compare methods by the average number of bucket accesses per identifier retrieved.
- If open addressing is used, each table slot holds at most one element; therefore the loading factor alpha can never be greater than 1.
- If external chaining is used, each table slot can hold many elements; therefore the loading factor may be greater than 1.

Conclusion

- The main tradeoffs among these methods: linear probing has the best cache performance but is the most sensitive to clustering, while double hashing has poorer cache performance but exhibits virtually no clustering; quadratic probing falls in between.

Dynamic Hashing

- In this hashing scheme the set of keys can vary, and the address space is allocated dynamically.
  - File F: a collection of records.
  - Record R: a key plus data, stored in pages (buckets).
  - Space utilization = NumberOfRecords / (NumberOfPages * PageCapacity).

Trie

- Trie: a binary tree in which an identifier is located by its bit sequence.
- Key lookup is fast: looking up a key of length m takes O(m) time in the worst case.
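
A minimal C sketch of looking up an identifier in such a binary trie by its bit sequence (the node layout is an assumption for illustration):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct trie_node {
    struct trie_node *child[2];   /* child[0]: branch on bit 0, child[1]: branch on bit 1 */
    const char *key;              /* non-NULL only at a leaf that stores an identifier */
} trie_node;

/* Follow the bits of `bits` (a string of '0'/'1' characters) from the root;
   return true if we reach a leaf storing exactly `key`.  O(m) for m bits. */
bool trie_lookup(const trie_node *root, const char *bits, const char *key) {
    const trie_node *n = root;
    for (size_t i = 0; n != NULL && n->key == NULL && bits[i] != '\0'; i++)
        n = n->child[bits[i] - '0'];
    return n != NULL && n->key != NULL && strcmp(n->key, key) == 0;
}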

Dynamic Hashing Using Directories

- Identifiers are encoded with 3 bits per character.
- Example: M (# of pages) = 4, P (page capacity) = 2; allocation uses the two lower-order bits.

  Identifier   Binary representation
  a0           100 000
  a1           100 001
  b0           101 000
  b1           101 001
  c0           110 000
  c1           110 001
  c2           110 010
  c3           110 011

- h(a0,2) = 00, h(a1,2) = 01, h(b0,2) = 00, ...
- A trie holds the identifiers; read the bit string in reverse order (low-order bits first). For example, c5 = 110 101 and c1 = 110 001.

Dynamic Hashing Using Directories II

- Issues with the plain trie:
  - the tree can become skewed, and
  - access time increases with its depth.
- Fagin et al. proposed extendible hashing to solve these problems.
  - Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, and H. Raymond Strong, "Extendible Hashing - A Fast Access Method for Dynamic Files," ACM Transactions on Database Systems, 4(3):315-344, 1979.

Dynamic Hashing Using Directories III

- A directory is a table of pointers to pages.
- A directory indexed by k bits has 2^k entries.
- We use a hash function to compute the directory entry for a key, then follow that entry's pointer to the page holding the record.
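
A hedged C sketch of the directory lookup just described: hash the key, take the k low-order bits as the directory index, and follow the pointer (the structure layout is illustrative, not Fagin et al.'s exact scheme):

#include <stdint.h>

#define PAGE_CAPACITY 2

typedef struct {
    int      count;
    uint32_t keys[PAGE_CAPACITY];
    /* data fields omitted */
} page;

typedef struct {
    int    k;            /* directory depth: 2^k entries */
    page **entry;        /* entry[i] points to a page; several entries may share one page */
} directory;

/* Find the page that may contain `key`: keep the k low-order bits of the
   hashed key and follow the directory pointer. */
page *find_page(const directory *d, uint32_t key) {
    uint32_t h = key;                          /* a uniform hash of the key would go here */
    uint32_t index = h & ((1u << d->k) - 1u);  /* k low-order bits */
    return d->entry[index];
}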

The directory of the three tries

  Identifier   Binary
  a0           100 000
  a1           100 001
  b0           101 000
  b1           101 001
  c0           110 000
  c1           110 001
  c2           110 010
  c3           110 011
  c5           110 101

- 2-bit directory: h(a0,2) = 00, h(a1,2) = 01, ...
- 3-bit directory: h(a0,3) = 000, h(a1,3) = 001, h(c5,3) = 101, ...
- 4-bit directory: h(c1,4) = 0001, h(b1,4) = 1001, h(c5,4) = 0101, ...

Dynamic Hashing Using Directories IV

- Clearly the directory will grow very large if the hash function clusters keys.
- Therefore we need a uniform hash function that translates the bit sequences of keys into random-looking bit sequences.
- Moreover, we need a family of uniform hash functions, since the directory will grow.

Dynamic Hashing Using Directories IV (cont'd)

- A family of uniform hash functions: h(key, i) = the i low-order bits of a uniformly hashed key (as in h(a0,2), h(a0,3), ... above).
- If a page overflows, we use hash_i to rehash the original page into two pages; in the reverse case we coalesce two pages into one.
- Thus we want the family to hold a hierarchy-like property, so that moving between levels i and i+1 only splits or merges pages.

Analysis

1. Retrieval takes only two disk accesses: one for the directory and one for the page.
2. Space utilization is about 69% (roughly ln 2).
   - If there are k records and the page size p is smaller than k, the k records must be distributed over a left page and a right page; the split follows a symmetric binomial distribution.

Analysis II

- If there are j records in the left page, then there are k - j records in the right page. The probability of this split is C(k, j) / 2^k.
- Thus the expected space utilization works out to about ln 2, i.e. roughly 69%.

Overflow pages

- To avoid doubling the size of the directory, we introduce overflow pages:
  - If an overflow occurs, we allocate a new (overflow) page instead of doubling the directory.
  - The new record goes into the overflow page, and a pointer to the overflow page is stored in the original page (like chaining).

Overflow pages II

- Obviously this improves storage utilization, but it increases retrieval time.
- Larson et al. concluded that an overflow page size between p/2 and p suffices if 80% utilization is enough (p is the page size).
- For better space utilization we can monitor:
  - access time
  - insert time
  - total space utilization
- Fagin et al. concluded, by simulation, that extendible hashing performs at least as well as, or better than, B-trees.

Extendible Hashing: References

- Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. "Extendible Hashing - A Fast Access Method for Dynamic Files." ACM Transactions on Database Systems 4, 3 (Sept. 1979), 315-344.
- Tamminen, M. "Extendible Hashing with Overflow." Information Processing Letters 15, 5 (Dec. 1982), 227-232.
- Mendelson, H. "Analysis of Extendible Hashing." IEEE Transactions on Software Engineering SE-8, 6 (Nov. 1982), 611-619.
- Yao, A. C. "A Note on the Analysis of Extendible Hashing." Information Processing Letters 11, 2 (1980), 84-86.

Directoryless Dynamic Hashing

- If we have a contiguous address space large enough to hold all the records, we can eliminate the directory and leave the memory-management mechanism to the OS, e.g., paging.
- Ref: W. Litwin, "Linear Hashing: A New Tool for File and Table Addressing," VLDB 1980.
- Ref: P.-Å. Larson, "Dynamic Hash Tables," Communications of the ACM, 31(4):446-457, April 1988.
- Idea: map a trie to the contiguous space without a directory.

Linear Hashing

- Drawback of the previous mapping: it wastes space, since the contiguous space must be doubled whenever a page overflows.
- How to improve: intuitively, add only one page at a time and rehash only that part of the space.
- New pages are added one by one; eventually the space has doubled, and a new phase begins.
- Ref: Larson, "Dynamic Hash Tables," Communications of the ACM, 31(4):446-457, April 1988.
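
A hedged C sketch of the usual linear-hashing address computation (in the spirit of Litwin and Larson; the concrete hash family and variable names are assumptions): buckets below the split pointer have already been split in the current phase and are addressed with the next-level hash function.

#include <stdint.h>

#define N 4                      /* minimum (initial) number of buckets */

static int level = 0;            /* current phase: table grows from N*2^level buckets */
static int split = 0;            /* next bucket to be split in this phase */

/* h_i(k) = k mod (N * 2^i) */
static uint32_t h(uint32_t key, int i) {
    return key % (uint32_t)(N << i);
}

/* Address of the bucket that currently holds `key`. */
uint32_t linear_hash_address(uint32_t key) {
    uint32_t addr = h(key, level);
    if (addr < (uint32_t)split)          /* this bucket was already split this phase */
        addr = h(key, level + 1);
    return addr;
}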

Linear Hashing II

- A suitable family of hash functions is h_i(key) = (c * key mod M) mod (N * 2^i), where N is the minimum size of the hash table, c is a constant, and M is a large prime.
- This family of hash functions is given by Larson, "Dynamic Hash Tables," Communications of the ACM, 31(4):446-457, April 1988.

Example of Splitting a Bucket

- The case in which some keys are rehashed into the new page.
- Ref: Larson, "Dynamic Hash Tables," Communications of the ACM, 31(4):446-457, April 1988.

- Recall overflow pages: if an overflow occurs, we allocate a new (overflow) page instead of doubling the directory.
- In that case no new keys are rehashed into new pages.
- The family of hash functions: hash(key, r) := key mod 2^(r-1).

Analysis

- Space utilization is not good: about 60% [Litwin].
- Litwin suggested keeping overflow pages until the space utilization exceeds a predefined threshold.
- The problem can also be addressed by open addressing, etc.
