Idea Transcript
Realtime Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing
Izabela Moise, Evangelos Pournaras, Dirk Helbing
1
60 seconds
Izabela Moise, Evangelos Pournaras, Dirk Helbing
2
Big Data
Izabela Moise, Evangelos Pournaras, Dirk Helbing
3
Big Streaming Data
Izabela Moise, Evangelos Pournaras, Dirk Helbing
3
Real-time Data Analytics
Izabela Moise, Evangelos Pournaras, Dirk Helbing
4
Motivation
• Many important applications must process large streams of live data and provide results in near-real-time → Social network trends → Website statistics → Intrusion detection systems
Izabela Moise, Evangelos Pournaras, Dirk Helbing
5
Patterns Driving Most Streaming Use Cases
Izabela Moise, Evangelos Pournaras, Dirk Helbing
6
Challenges
• arbitrary and interactive exploration • recency matter: alerts on recent changes • availability
Izabela Moise, Evangelos Pournaras, Dirk Helbing
7
Is Hadoop a Solution?
• Upload data to Hadoop • Query it → done!
Izabela Moise, Evangelos Pournaras, Dirk Helbing
8
Is Hadoop a Solution? No.
Hadoop was designed for batch processing Input all data at once, process it and write a large output.
Izabela Moise, Evangelos Pournaras, Dirk Helbing
9
Batch vs. Real-time Processing Batch processing: Collect data over a period of time, input the batch into the system, process it and write the output. Examples: billing systems.
Real-time processing: Continually input, process and output data. Data must be processed in a small time period. Examples: radar systems, customer services, ATMs.
Izabela Moise, Evangelos Pournaras, Dirk Helbing
10
Big Streaming Data Processing Fraud detection in bank transactions
Anomalies in sensor data
Cat videos in tweets
Izabela Moise, Evangelos Pournaras, Dirk Helbing
11
How to Process Big Streaming Data Distributed Processing System Raw Data Streams
Processed Data
Scales to hundreds of nodes Achieves low latency Efficiently recover from failures Integrates with batch and interactive processing
Izabela Moise, Evangelos Pournaras, Dirk Helbing
12
Storm - distributed and fault-tolerant realtime computation
Izabela Moise, Evangelos Pournaras, Dirk Helbing
13
Twitter Storm
• free and open source distributed realtime computation system • reliably process unbounded streams of data • analysis on streams of data as they come in, so you can react
to data as it happens • can be easily integrated with any programming language • developed at Backtype and open sourced by Twitter
→ realtime analytics, online machine learning, continuous computation, distributed RPC, ETL
Izabela Moise, Evangelos Pournaras, Dirk Helbing
14
Izabela Moise, Evangelos Pournaras, Dirk Helbing
15
Storm architecture • Nimbus X similar to job tracker X distributes code across cluster X assigns tasks X handles failures • Supervisor (Worker nodes) X similar to task tracker X runs bolts and spouts as tasks • Zookeeper X cluster coordination Izabela Moise, Evangelos Pournaras, Dirk Helbing
16
Main concepts - Streams
• an unbounded sequence of tuples • core abstraction in Storm • tuple = list of values (any datatype) • value must be serializable • stream = group of tuples
Izabela Moise, Evangelos Pournaras, Dirk Helbing
17
Spouts
• sources of streams • read tuples from an external source • emit them into Storm • spouts can emit more than one stream • example: read from Twitter streaming API Izabela Moise, Evangelos Pournaras, Dirk Helbing
18
Bolts
• process input streams and produce new streams • filtering, functions, aggregation, joins.. • complex transformations require multiple bolts
Izabela Moise, Evangelos Pournaras, Dirk Helbing
19
Topology
• network of spouts and bolts • graph: node = spout or bolt, edge = which bolt subscribes to
which stream
Izabela Moise, Evangelos Pournaras, Dirk Helbing
20
Tasks
spouts and bolts execute as many tasks across the cluster Izabela Moise, Evangelos Pournaras, Dirk Helbing
21
Stream grouping
When a tuple is emitted, which task does it go to? • shuffle groupings: pick a random task • fields groupings: consistent hashing on a subset of tuple fields • all groupings: send to all tasks • global groupings: pick task with lowest id • direct groupings: the source decides which component will
receive the tuple
Izabela Moise, Evangelos Pournaras, Dirk Helbing
22
Topologies run forever
• Starting a topology
• Killing a topology
Izabela Moise, Evangelos Pournaras, Dirk Helbing
23
Examples
• number of occurrences of each hashtag in an input stream of
tweets
Izabela Moise, Evangelos Pournaras, Dirk Helbing
24
Reach Topology in Twitter • Reach: number of unique people exposed to a tweet on Twitter
Izabela Moise, Evangelos Pournaras, Dirk Helbing
25
Reach Topology
Izabela Moise, Evangelos Pournaras, Dirk Helbing
26
Hadoop vs. Storm Hadoop • batch processing • runs jobs to completion • stateful nodes • scalable • guarantees no data loss • open source • big batch processing
Storm • real-time processing • topologies run forever • stateless nodes • scalable • guarantees no data loss • open source • fast, reactive, real-time
processing
Izabela Moise, Evangelos Pournaras, Dirk Helbing
27
Hadoop and Storm
Izabela Moise, Evangelos Pournaras, Dirk Helbing
28