
Big Data Analytics
Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data
"The world is crazy. But at least it's getting regular analysis."

Big Data Explosion
1.8 ZB of information world-wide (2011)

Big Data becomes Real Data
• Big Data became real in 2013
• Obama, "the Big Data President"
• Oscar prediction
• Finding and telling data-driven stories in billions of Tweets

The 5 V's
• Volume → large amounts of data generated every second (emails, Twitter messages, videos, sensor data...)
• Velocity → the speed at which data moves in and out of data management systems (videos going viral...) → processed "on-the-fly"
• Variety → different data formats, structured or unstructured (80%)
• Value → the insights we can reveal within the data
• Veracity → the trustworthiness of the data

Big Data

Extremely large datasets that are hard to handle with traditional (relational) databases:
• Storage/Cost
• Search/Performance
• Analytics and Visualization

Big Data
Different types of data and their typical operations:
→ Structured data (relational tables): data warehouses, OLAP, aggregation & statistics
→ Semi-structured data (XML, JSON, log files): indexing, searching, querying
→ Graph data (social networks, Semantic Web): pattern matching, keyword-based search
→ Unstructured data (free text, webpages): knowledge discovery, data mining, statistical modeling
→ Streaming data

Big Data
Old model, "Query the world": data acquisition is coupled to a specific hypothesis
New model, "Download the world": data acquisition supports many hypotheses
• Examples:
  – E-commerce: transactions, customer tracking, social graph
  – Security: log files, anomaly detection
  – Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
  – Biology: lab automation, high-throughput sequencing
  – Oceanography: high-resolution models, cheap sensors, satellites


MapReduce



Parallel and distributed programming paradigms
• Partitioning
  1. Computation
  2. Data
• Mapping → assign computation and data parts to resources
• Synchronisation
• Communication → send intermediate data between workers
• Scheduling

A paradigm is an abstraction that hides the implementation of these issues from the user.

MapReduce

• An abstraction for performing computations on data:
  ✓ automatic parallelization of computations
  ✓ large-scale data distribution
  ✓ simple, yet powerful interface
  ✓ user-transparent fault tolerance
  ✓ commodity hardware
• Introduced by Google in 2004: both a paradigm and an implementation


Motivation: Common operations on data
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

MapReduce provides a functional abstraction for two of these operations: extracting (map) and aggregating (reduce).


MapReduce Programming Model
• Input & Output: each a set of key/value pairs
• The programmer specifies two functions:

map(in_key, in_value) → list(out_key, intermediate_value)
• Processes input key/value pairs
• Produces a set of intermediate pairs

reduce(out_key, list(intermediate_value)) → list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)

Inspired by primitives of functional programming languages such as Lisp, Scheme, and Haskell.
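The functional inspiration can be illustrated with Python's built-in map and functools.reduce, a minimal sketch in which the data values are purely illustrative:

```python
from functools import reduce

# map: apply a function independently to every element
words = ["big", "data", "analytics"]
lengths = list(map(len, words))                    # [3, 4, 9]

# reduce: combine all intermediate values into a single result
total = reduce(lambda acc, x: acc + x, lengths, 0)  # 16
```

MapReduce generalizes these primitives to key/value pairs and distributes both steps across machines.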

What is MapReduce used for?
• At Google:
  – Index construction for Google Search
  – Article clustering for Google News
  – Statistical machine translation
• At Yahoo!:
  – "Web map" powering Yahoo! Search
  – Spam detection for Yahoo! Mail
• At Facebook:
  – Data mining
  – Ad optimization
  – Spam detection

Example: Word Count

The Reduce phase is optional: jobs can be Map-only.


How does it work?


Key Characteristics

• Parallelism
  – map() and reduce() functions run in parallel, each working on different data
  – the reduce phase cannot start until the map phase has completely finished
• Locality
  – the master program assigns tasks based on the location of the data: it tries to run map() tasks on the same machine as the physical file data, or at least within the same rack


The Hadoop Project


Hadoop Tool Suite


HDFS
• Distributed File System → an open-source implementation of the Google File System
• Data is split into chunks of fixed (configurable) size, 64 MB by default
• Two server types:
  1. Namenode: keeps the metadata
  2. Datanode: stores the data
• Failures are handled through chunk-level replication → 3 replicas: local, same rack, different rack
• Write-once-read-many pattern → files are append-only
• Optimized for large files and sequential reads
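The chunking and replication rules above can be sketched in a few lines of Python (a toy model, not HDFS code; node and rack names are placeholders):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB default chunk size

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    # Number of fixed-size chunks needed to cover the file
    return (file_size + chunk_size - 1) // chunk_size

def place_replicas(writer_node, writer_rack, other_rack):
    # Replica 1 on the writer's own node, replica 2 on another
    # node in the same rack, replica 3 in a different rack
    return [
        (writer_node, writer_rack),
        ("node-b", writer_rack),
        ("node-c", other_rack),
    ]

print(split_into_chunks(200 * 1024 * 1024))  # a 200 MB file needs 4 chunks
```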

HDFS - Hadoop Distributed File System


HDFS Design
• The Namenode manages the filesystem namespace:
  – the filesystem tree and metadata for all files and directories in the tree → stored persistently on local disk
  – chunk placement on Datanodes → reconstructed from the Datanodes when the system starts
• Single point of failure:
  – if the Namenode fails, the filesystem is no longer usable
  – the metadata can be stored on a remote disk so that the namespace can be reconstructed if the Namenode fails
• Datanodes periodically report the list of chunks they store
• The Namenode front page is at http://namenode-name:50070/
  – basic statistics of the cluster
  – browse the file system

HDFS File-based data structures
• SequenceFiles – a data structure for binary key/value pairs

HDFS configuration
• fs.default.name → set to hdfs://localhost/ (the default HDFS port is 8020); HDFS clients use this property to work out where the Namenode is running so they can connect to it
• dfs.replication → chunk replication factor, by default set to 3
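These properties would typically appear in Hadoop's XML configuration files; a minimal sketch for a pseudo-distributed setup, with the values from the slide:

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```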

HDFS command line interface

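Typical file operations through the command-line interface (paths and file names below are placeholders):

```shell
hadoop fs -ls /                            # list the root directory
hadoop fs -mkdir /user/alice               # create a directory
hadoop fs -put localfile.txt /user/alice/  # copy a local file into HDFS
hadoop fs -cat /user/alice/localfile.txt   # print a file's contents
hadoop fs -rm /user/alice/localfile.txt    # delete a file
```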

The Hadoop MapReduce framework


Some Hadoop Terminology
• Job: a program that executes map and reduce processing across a data set
• Task: an execution of a Mapper or a Reducer on a slice of data; also called a Task-In-Progress (TIP)
• Task Attempt: a particular instance of an attempt to execute a task on a machine

Hadoop Internals
• Hadoop uses its own RPC protocol
• Communications are initiated by TaskTracker nodes → heartbeat mechanism
• A single JobTracker per cluster:
  – accepts Job requests from clients
  – a Job is a Java "jar" file + an XML file containing program configuration options
  – the Job client places these files into HDFS and notifies the TaskTrackers where to retrieve the relevant program code
• A single TaskTracker instance runs on each slave node:
  – the TaskTracker forks a separate Java process for each task instance
