Big Data Analytics
Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data

"The world is crazy. But at least it’s getting regular analysis."
Big Data Explosion

1.8 ZB of information world-wide (2011)
Big Data becomes Real Data
• Big Data became real in 2013
• Obama, "the Big Data President"
• Oscar prediction
• Finding and telling data-driven stories in billions of Tweets
The 5 V’s
• Volume → large amounts of data generated every second (emails, Twitter messages, videos, sensor data...)
• Velocity → the speed of data moving in and out of data management systems (videos going viral...) → "on-the-fly"
• Variety → different data formats, structured or unstructured (80%)
• Value → the insights we can reveal within the data
• Veracity → the trustworthiness of the data
Big Data

Extremely large datasets that are hard to deal with using traditional (relational) databases:
• Storage/Cost
• Search/Performance
• Analytics and Visualization
Big Data

Different types of data:
→ Structured data (relational, tables)
→ Semi-structured data (XML, JSON, logfiles)
→ Graph data (social networks, Semantic Web)
→ Unstructured data (free text, webpages)
→ Streaming data

Typical operations:
– Data warehouse, OLAP
– Index, searching, querying
– Aggregation & statistics
– Keyword-based search, pattern matching
– Knowledge discovery
– Data mining, statistical modeling
Big Data

Old model, "Query the world": data acquisition coupled to a specific hypothesis.
New model, "Download the world": data acquisition supports many hypotheses.
• Examples:
– E-commerce: transactions, customer tracking, social graph
– Security: logfiles, anomaly detection
– Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
– Biology: lab automation, high-throughput sequencing
– Oceanography: high-resolution models, cheap sensors, satellites
MapReduce
Parallel and distributed programming paradigms
• Partitioning: 1. computation, 2. data
• Mapping → assign computation and data parts to resources
• Synchronisation
• Communication → send intermediate data between workers
• Scheduling

A paradigm is an abstraction that hides the implementation of these issues from the users.
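The issues above can be sketched in plain Python (an illustrative, in-memory simulation; names like `partition` and `worker` are invented here, not part of any framework):

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    # Partitioning: split the input data into roughly equal parts.
    size = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def worker(part):
    # The computation each worker performs on its part of the data.
    return sum(part)

def run(data, n_workers=4):
    parts = partition(data, n_workers)                 # partitioning
    with ThreadPoolExecutor(n_workers) as pool:        # mapping + scheduling
        intermediates = list(pool.map(worker, parts))  # communication of results
    return sum(intermediates)                          # synchronisation: combine

print(run(list(range(100))))  # 4950
```

A paradigm like MapReduce takes exactly these steps out of the user's hands.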
MapReduce
• An abstraction for performing computations on data:
✓ automatic parallelization of computations
✓ large-scale data distribution
✓ simple, yet powerful interface
✓ user-transparent fault tolerance
✓ commodity hardware
• Introduced by Google in 2004: paradigm and implementation
Motivation: Common operations on data
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

MapReduce provides a functional abstraction for two of these operations: extracting (map) and aggregating (reduce).
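These five steps can be walked through on a toy dataset (hypothetical log records, invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input records: (timestamp, user, action) tuples.
records = [
    ("2024-01-01", "alice", "login"),
    ("2024-01-01", "bob", "login"),
    ("2024-01-02", "alice", "upload"),
]

# 1. Iterate over the records; 2. extract something of interest: (user, 1).
pairs = [(user, 1) for _, user, _ in records]

# 3. Shuffle and sort the intermediate results by key.
pairs.sort(key=itemgetter(0))

# 4. Aggregate the intermediate results per key; 5. generate the final output.
counts = {user: sum(c for _, c in group)
          for user, group in groupby(pairs, key=itemgetter(0))}
print(counts)  # {'alice': 2, 'bob': 1}
```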
MapReduce Programming Model
• Input & output: each a set of key/value pairs
• The programmer specifies two functions:

map(in_key, in_value) → list(out_key, intermediate_value)
– processes input key/value pairs
– produces a set of intermediate pairs

reduce(out_key, list(intermediate_value)) → list(out_value)
– combines all intermediate values for a particular key
– produces a set of merged output values (usually just one)

Inspired by primitives of functional programming languages such as Lisp, Scheme and Haskell.
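A minimal in-memory sketch of these two signatures, using word count as the job (this simulates the model in plain Python; it is not Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # map(in_key, in_value) -> list(out_key, intermediate_value)
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, values):
    # reduce(out_key, list(intermediate_value)) -> list(out_value)
    return [sum(values)]

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:                    # map phase
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)        # group values by key
    return {key: reduce_fn(key, values)                # reduce phase
            for key, values in intermediate.items()}

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(map_reduce(docs, map_fn, reduce_fn)["the"])  # [2]
```

Swapping in different `map_fn`/`reduce_fn` pairs yields other jobs without touching the framework code, which is the point of the abstraction.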
What is MapReduce used for?
• At Google:
– index construction for Google Search
– article clustering for Google News
– statistical machine translation
• At Yahoo!:
– "Web map" powering Yahoo! Search
– spam detection for Yahoo! Mail
• At Facebook:
– data mining
– ad optimization
– spam detection
Example: Word Count
Reduce phase is optional: jobs can be Map-only.
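A map-only job can be sketched as a pure filter, e.g. a distributed grep, where each record is processed independently and nothing needs to be aggregated (toy data, plain Python, not the Hadoop API):

```python
# Hypothetical log lines as (key, value) input pairs.
lines = [("log1", "ok"), ("log2", "error: disk full"), ("log3", "error: timeout")]

def map_fn(key, value):
    # Emit only the records matching a pattern; drop the rest.
    return [(key, value)] if value.startswith("error") else []

output = [pair for key, value in lines for pair in map_fn(key, value)]
print(output)  # [('log2', 'error: disk full'), ('log3', 'error: timeout')]
```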
How does it work?
Key Characteristics
• Parallelism
– map() and reduce() functions run in parallel, each working on different data
– the reduce phase cannot start until the map phase has completely finished
• Locality
– the master assigns tasks based on the location of the data: it tries to run map() tasks on the same machine as the physical file data, or at least in the same rack
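The barrier between the two phases can be simulated in Python: map tasks run in a thread pool, and aggregation begins only once every map result has been collected (an illustrative sketch, not Hadoop code):

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

chunks = ["to be or", "not to be", "that is the question"]

def map_task(chunk):
    return [(word, 1) for word in chunk.split()]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map tasks run in parallel, each on a different chunk. Collecting every
    # result here is the barrier: any map task may still emit pairs for any
    # key, so reduce work cannot safely start earlier.
    intermediate = [pair for result in pool.map(map_task, chunks)
                    for pair in result]

counts = Counter()                 # reduce step: aggregate per key
for word, one in intermediate:
    counts[word] += one
print(counts["to"], counts["be"])  # 2 2
```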
The Hadoop Project
Hadoop Tool Suite
HDFS
• Distributed file system → an open-source implementation of the Google File System
• Data split into chunks of fixed (configurable) size, 64 MB by default
• Two server types:
1. Namenode: keeps the metadata
2. Datanode: stores the data
• Failures handled through chunk-level replication → 3 replicas: local, same rack, different rack
• Write-once-read-many pattern → files are append-only
• Optimized for large files and sequential reads
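Chunking and the default replica placement can be sketched as follows (a simplified illustration; the node and rack names are invented, and real HDFS chooses racks dynamically):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # HDFS default chunk size: 64 MB

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    # Return (offset, length) pairs covering a file of file_size bytes.
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

def place_replicas(local_node, same_rack_node, other_rack_node):
    # Default policy: 1st replica local, 2nd on the same rack, 3rd on another rack.
    return [local_node, same_rack_node, other_rack_node]

chunks = split_into_chunks(200 * 1024 * 1024)    # a 200 MB file
print(len(chunks))                               # 4 chunks: 64 + 64 + 64 + 8 MB
print(chunks[-1][1] // (1024 * 1024))            # last chunk holds the 8 MB tail
print(place_replicas("rack1/node1", "rack1/node2", "rack2/node7"))
```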
HDFS: the Hadoop Distributed File System
HDFS Design
• The Namenode manages the filesystem namespace:
– the filesystem tree and metadata for all files and directories in the tree → stored persistently on the local disk
– chunk placement on Datanodes → reconstructed from the Datanodes when the system starts
• Single point of failure:
– if the Namenode fails, the filesystem is no longer usable
– the metadata can be stored on a remote disk so that the namespace can be reconstructed if the Namenode fails
• Datanodes periodically report the list of chunks they store
• The Namenode front page is at http://namenode-name:50070/
– basic statistics of the cluster
– browse the file system
HDFS file-based data structures
• SequenceFile: a data structure for binary key-value pairs
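The idea of a binary key-value container can be illustrated with length-prefixed records (a simplified stand-in, NOT Hadoop's actual on-disk SequenceFile layout, which also carries headers, sync markers and compression):

```python
import struct
import io

def write_record(buf, key: bytes, value: bytes):
    # Each record: 4-byte key length, key bytes, 4-byte value length, value bytes.
    buf.write(struct.pack(">I", len(key)))
    buf.write(key)
    buf.write(struct.pack(">I", len(value)))
    buf.write(value)

def read_records(buf):
    while True:
        header = buf.read(4)
        if not header:          # end of stream
            return
        key = buf.read(struct.unpack(">I", header)[0])
        vlen = struct.unpack(">I", buf.read(4))[0]
        yield key, buf.read(vlen)

buf = io.BytesIO()
write_record(buf, b"doc1", b"the quick brown fox")
write_record(buf, b"doc2", b"lazy dog")
buf.seek(0)
print(list(read_records(buf)))
# [(b'doc1', b'the quick brown fox'), (b'doc2', b'lazy dog')]
```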
HDFS configuration
• fs.default.name → set to hdfs://localhost/ (the default HDFS port is 8020); HDFS clients use this property to work out where the Namenode is running so they can connect to it
• dfs.replication → chunk replication factor, by default set to 3
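In a Hadoop 1-era pseudo-distributed setup, these properties would typically be set in core-site.xml and hdfs-site.xml, along these lines:

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```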
HDFS command line interface
The Hadoop MapReduce framework
Some Hadoop Terminology
• Job: a program that executes map and reduce processing across a data set
• Task: an execution of a Mapper or a Reducer on a slice of data; also called a Task-In-Progress (TIP)
• Task Attempt: a particular instance of an attempt to execute a task on a machine
Hadoop Internals
• Hadoop uses its own RPC protocol
• Communications are initiated by TaskTracker nodes → heartbeat mechanism
• A single JobTracker per cluster:
– accepts job requests from clients
– a job is a Java "jar" file plus an XML file containing program configuration options
– the job client places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
• A single TaskTracker instance runs on each slave node:
– the TaskTracker forks a separate Java process for each task instance
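The heartbeat handshake can be sketched as follows (class and task names are invented for illustration; the real JobTracker/TaskTracker protocol carries much more state, such as task progress and failure reports):

```python
class JobTracker:
    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)

    def heartbeat(self, tracker_id, free_slots):
        # Called by a TaskTracker: the reply piggybacks up to free_slots
        # new task assignments (tracker_id is unused in this sketch).
        assigned, self.pending = self.pending[:free_slots], self.pending[free_slots:]
        return assigned

class TaskTracker:
    def __init__(self, tracker_id, slots=2):
        self.tracker_id, self.slots, self.running = tracker_id, slots, []

    def send_heartbeat(self, job_tracker):
        # Communication is initiated by the TaskTracker, not the JobTracker.
        free = self.slots - len(self.running)
        self.running += job_tracker.heartbeat(self.tracker_id, free)

jt = JobTracker(pending_tasks=["map-0", "map-1", "map-2"])
tt = TaskTracker("tracker-A", slots=2)
tt.send_heartbeat(jt)
print(tt.running)   # ['map-0', 'map-1']
print(jt.pending)   # ['map-2']
```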