Search Engines and Google - UiO [PDF]

The Anatomy of a Large-Scale Hypertextual Web Search Engine. • http://infolab.stanford.edu/~backrub/google.html. • D

0 downloads 9 Views 754KB Size

Recommend Stories


Search engines guide
If you want to go quickly, go alone. If you want to go far, go together. African proverb

Search Engines Exercise 1
You often feel tired, not because you've done too much, but because you've done too little of what sparks

Google Search Rescue For Dummies
Don't fear change. The surprise is the only way to new discoveries. Be playful! Gordana Biernat

Photonic Crystals & MEMS - UiO [PDF]
Acoustic wave moves diaphragm (150 - 300 µm diameter). ▫ Photonic-crystal mirror and fiber-end mirror forms a Fabry-Perot interferometer. ▫ Reflected power in fiber changes with a change in the diaphragm-fiber gap. ▫ Photonic-crystal mirror pr

Untitled - UiO
Everything in the universe is within you. Ask all from yourself. Rumi

Untitled - UiO
When you talk, you are only repeating what you already know. But if you listen, you may learn something

Untitled - UiO
And you? When will you begin that long journey into yourself? Rumi

Untitled - UiO
Your big opportunity may be right where you are now. Napoleon Hill

Untitled - UiO
Learn to light a candle in the darkest moments of someone’s life. Be the light that helps others see; i

Untitled - UiO
Life is not meant to be easy, my child; but take courage: it can be delightful. George Bernard Shaw

Idea Transcript


Google Technology Vera Goebel, Ifi/UiO, 2011

Search Engine: Crawler, PageRank, indexes MapReduce GFS – Google File System Data Center Google Big Table (slides + video)

1

Search Engine

2

Crawler • •

Process that downloads web pages to a Page Repository. Examine pages for links to other pages and insert the ones that are not in the Page Repository in the set for pages to be crawled. http://goo.gl/gG3s

3

Crawler Challenge Terminating search

Description Dynamically generated pages could create a forever loop

Solution Limit number of pages to crawl with a “depth” limit per site

Managing the repository

1. Duplication of URL to be 1. An efficient index for checking crawled stored pages 2. Duplicated pages due to mirror 2. Minhash and locality-sensitive sites, different routes, hashing signatures plagiarism, etc.

Selecting the next page

How to prioritise next page to be Give priority to “important” pages crawled?

Speeding up the crawl

1. Scale to several machines 1. How many processes should 2. Assign processes to entire be simultaneously run? hosts or sites 2. How to synchronise them to 3. Do not issue frequent requests avoid they crawl the same site. to a single site. Several 3. Avoid DoS attack processes in a single machine due to idle states.

4

Query Processing in Search Engines

• Search engine queries are not like SQL queries

• Require inverted indices • Disk access is very expensive to offer the user acceptable response time

• Matched records are ranked before showing to the user 5

Recursive Formulation of Page Rank

Algorithm for identifying “important” pages: Web page is important if many important pages link to it.

The Matrix M, the transition matrix of the Web has element rank r, mij in row i and column j, where Yahoo!

1.mij = 1/r if page j has a link to page i, and there are a total of r≥1 pages that j links to 2.mij = 0 otherwise Transition Matrix 1/2 1/2 0 0 1/2 0

Microsoft

The Web in 1839 6

Microsoft

Amazon

Amazon

Microsoft

1

= 1/2 0 Yahoo!

Amazon

M

Yahoo!

Spider Traps and Dead Ends Yahoo!

Amazon

0

Yahoo!

0

Amazon

1

Microsoft

Yahoo!

Microsoft

Amazon

Microsoft becomes a spider trap

0

Yahoo!

0

Amazon

0

Microsoft

Microsoft

Microsoft becomes a dead end

7

Link • •

Spam

Spam farming in order to accumulate and concentrate PageRank on a few pages

S

… …

Links to the spam farm from pulicly accessible blogs, with messages like “I agree with you. See x1234.mySpam.Farm.c om”

Links from outside

8

Inverted Indices •



Essential for Web Queries

Uses indirect buckets for space efficiency

... the cat is fat ... cat ... was raining cats and dogs ... dog

... Fido the dog ...

Inverted Index

Documents

Buckets

9

Cat

Type

Position

title

5

header

10

anchor

3

text

57

Document

Dogs compared with cats Doc 1

Doc 2

Dog

title

100

title

12 Doc 3

Sorting more information in the inverted index 10

Map-Reduce Parallelism Framework • •

Large-scale parallel machines share high load operations such as joins Distributed architectures

• •

Map

Reduce

Input Key-Value Pairs

Grid, networks and corporate DBs

Output Lists

Sort Intermediate Key-Value Pairs by Keys

MRP paradigm expresses large-scale computations

Execution of map and reduce functions

11

Google Search • Uses: links, PageRank, anchors, proximity and visual presentation (e.g. bold text is weighted higher) in search logic. 1. Search the index 2. Analyze the web pages for relevance 3. Evaluate the site’s reputation

4. Rank the web pages 12

Google’s System Anatomy

http://goo.gl/yYbb 13

Google File System - Motivation

14

Google Data Centers

15

Chunks & Chunk Servers

16

Master Servers

17

Master Server – Chunk Server Communication

18

GFS - Architecture

19

Read Operation

20

Write Operation

21

Write Operation (cont.)

22

References • • • • •



The Anatomy of a Large-Scale Hypertextual Web Search Engine http://infolab.stanford.edu/~backrub/google.html

Database Systems. The Complete Book. Second Edition. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, ACM Symposium on Operating Systems Principles, 2003 Jonathan Strickland, How the Google File System Works, HowStuffWorks.com, 2010

Wikipedia Contributors, Google File System, Wikipedia -The Free Encyclopedia, 2010

23

Smile Life

When life gives you a hundred reasons to cry, show life that you have a thousand reasons to smile

Get in touch

© Copyright 2015 - 2024 PDFFOX.COM - All rights reserved.