
Realtime Data Analytics

Izabela Moise, Evangelos Pournaras, Dirk Helbing


60 seconds


Big Data


Big Streaming Data


Real-time Data Analytics


Motivation

• Many important applications must process large streams of live data and provide results in near-real-time
  → Social network trends
  → Website statistics
  → Intrusion detection systems


Patterns Driving Most Streaming Use Cases


Challenges

• arbitrary and interactive exploration
• recency matters: alerts on recent changes
• availability


Is Hadoop a Solution?

• Upload data to Hadoop
• Query it → done!


Is Hadoop a Solution? No.

Hadoop was designed for batch processing: input all data at once, process it, and write a large output.


Batch vs. Real-time Processing

Batch processing: Collect data over a period of time, input the batch into the system, process it and write the output. Examples: billing systems.

Real-time processing: Continually input, process and output data. Data must be processed in a small time period. Examples: radar systems, customer services, ATMs.
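
The contrast between the two modes can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop or Storm code: the function names and the sample payments are assumptions made up for the example. The batch function needs the whole input before it produces its single result, while the streaming function emits an updated result after every record.

```python
from typing import Iterable, Iterator, List


def batch_total(amounts: List[float]) -> float:
    """Batch style: the complete input is collected first, then processed once."""
    return sum(amounts)


def running_totals(amounts: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated total after each incoming record."""
    total = 0.0
    for amount in amounts:
        total += amount
        yield total


# Hypothetical sample data standing in for a batch of billing records.
payments = [10.0, 5.0, 2.5]

print(batch_total(payments))            # 17.5
print(list(running_totals(payments)))   # [10.0, 15.0, 17.5]
```

In the streaming version the input could just as well be an endless generator, since a result is available after each record.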


Big Streaming Data Processing

• Fraud detection in bank transactions
• Anomalies in sensor data
• Cat videos in tweets


How to Process Big Streaming Data

[Diagram: Raw Data Streams → Distributed Processing System → Processed Data]

• Scales to hundreds of nodes
• Achieves low latency
• Efficiently recovers from failures
• Integrates with batch and interactive processing


Storm - distributed and fault-tolerant realtime computation


Twitter Storm

• free and open source distributed realtime computation system
• reliably processes unbounded streams of data
• analysis on streams of data as they come in, so you can react to data as it happens
• can be easily integrated with any programming language
• developed at BackType and open sourced by Twitter

→ realtime analytics, online machine learning, continuous computation, distributed RPC, ETL


Storm architecture

• Nimbus
  → similar to job tracker
  → distributes code across cluster
  → assigns tasks
  → handles failures
• Supervisor (Worker nodes)
  → similar to task tracker
  → runs bolts and spouts as tasks
• Zookeeper
  → cluster coordination


Main concepts - Streams

• an unbounded sequence of tuples
• core abstraction in Storm
• tuple = list of values (any datatype)
• value must be serializable
• stream = group of tuples
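
As a rough illustration of these ideas, the sketch below models a stream as an endless Python generator of tuples and checks whether a value can be serialized. It is not Storm code: the names `sensor_stream` and `is_serializable` are made up for the example, and Python's `pickle` merely stands in for whatever serializer a real cluster would use.

```python
import pickle
from itertools import count, islice
from typing import Any, Iterator, Tuple


def sensor_stream() -> Iterator[Tuple[str, float]]:
    """An unbounded stream: yields (sensor_id, reading) tuples forever."""
    for i in count():
        yield ("sensor-1", i * 0.5)


def is_serializable(value: Any) -> bool:
    """Return True if the value can be pickled (stand-in for a real serializer)."""
    try:
        pickle.dumps(value)
        return True
    except Exception:
        return False


# The stream never ends, so only a finite prefix can be materialized.
first_three = list(islice(sensor_stream(), 3))
print(first_three)  # [('sensor-1', 0.0), ('sensor-1', 0.5), ('sensor-1', 1.0)]

print(is_serializable(("sensor-1", 0.5)))  # True
print(is_serializable(lambda x: x))        # False: lambdas cannot be pickled
```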


Spouts

• sources of streams
• read tuples from an external source
• emit them into Storm
• spouts can emit more than one stream
• example: read from Twitter streaming API
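
A spout can be pictured as a generator that reads from some external source and emits tuples tagged with a stream name. The following is a plain-Python sketch under that assumption, not the Storm spout API: `tweet_spout`, the stream names "tweets" and "links", and the in-memory list that stands in for the Twitter streaming API are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def tweet_spout(source: Iterable[str]) -> Iterator[Tuple[str, tuple]]:
    """Read raw tweets from `source` and emit (stream_name, tuple) pairs.

    Every tweet goes to the "tweets" stream; tweets that contain a link
    are additionally emitted on a second stream called "links".
    """
    for text in source:
        yield ("tweets", (text,))
        if "http" in text:
            yield ("links", (text,))


# An in-memory list stands in for the external source.
emitted = list(tweet_spout(["hello #storm", "see http://example.com"]))
for stream_name, tup in emitted:
    print(stream_name, tup)
```

The second tweet appears on both streams, which is how one spout can feed bolts with different interests.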


Bolts

• process input streams and produce new streams
• filtering, functions, aggregation, joins, ...
• complex transformations require multiple bolts
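
To make the idea of chaining bolts concrete, here is a small plain-Python sketch in which each bolt is a generator that consumes one stream and produces another. The names `split_bolt` and `filter_bolt` are invented for the illustration and do not correspond to Storm classes.

```python
from typing import Iterable, Iterator


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Function bolt: one input tuple (a sentence) yields several words."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def filter_bolt(words: Iterable[str], min_length: int) -> Iterator[str]:
    """Filtering bolt: pass on only the words that are long enough."""
    for word in words:
        if len(word) >= min_length:
            yield word


# A two-step transformation is expressed as two bolts wired in sequence.
result = list(filter_bolt(split_bolt(["storm is fast"]), min_length=4))
print(result)  # ['storm', 'fast']
```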


Topology

• network of spouts and bolts
• graph: node = spout or bolt, edge = which bolt subscribes to which stream
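
One way to picture such a graph is a dictionary that maps each component to the components it subscribes to. The sketch below is plain Python, not Storm's topology builder; the component names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used only to list the nodes in data-flow order.

```python
from graphlib import TopologicalSorter

# node -> set of components whose output streams it subscribes to
subscriptions = {
    "tweet_spout": set(),                    # a spout subscribes to nothing
    "extract_hashtags": {"tweet_spout"},     # bolt fed by the spout
    "count_hashtags": {"extract_hashtags"},  # bolt fed by another bolt
}

# Spouts are the nodes without incoming edges.
spouts = [name for name, inputs in subscriptions.items() if not inputs]

# Order the components so that every producer comes before its consumers.
order = list(TopologicalSorter(subscriptions).static_order())

print(spouts)  # ['tweet_spout']
print(order)   # ['tweet_spout', 'extract_hashtags', 'count_hashtags']
```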


Tasks

• spouts and bolts execute as many tasks across the cluster


Stream grouping

When a tuple is emitted, which task does it go to?

• shuffle groupings: pick a random task
• fields groupings: consistent hashing on a subset of tuple fields
• all groupings: send to all tasks
• global groupings: pick task with lowest id
• direct groupings: the source decides which component will receive the tuple
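
The routing rules above can be sketched as small functions that return the indexes of the receiving tasks. This is an illustrative plain-Python model, not Storm's implementation: all function names are made up, tasks are assumed to be numbered from 0, and the fields grouping uses a simple hash modulo the task count as a stand-in for the hashing scheme a real system would use.

```python
import random
import zlib
from typing import List, Sequence


def shuffle_grouping(num_tasks: int, rng: random.Random) -> List[int]:
    """Pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup: Sequence, field_indexes: Sequence[int], num_tasks: int) -> List[int]:
    """Route by the selected fields: equal field values always hit the same task."""
    key = repr(tuple(tup[i] for i in field_indexes)).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(num_tasks: int) -> List[int]:
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(num_tasks: int) -> List[int]:
    """Send everything to the task with the lowest id."""
    return [0]


def direct_grouping(chosen_task: int, num_tasks: int) -> List[int]:
    """The emitter names the receiving task explicitly."""
    if not 0 <= chosen_task < num_tasks:
        raise ValueError("chosen_task out of range")
    return [chosen_task]


# Two tuples with the same first field land on the same task.
a = fields_grouping(("#storm", 1), [0], num_tasks=4)
b = fields_grouping(("#storm", 99), [0], num_tasks=4)
print(a == b)  # True
```

Fields grouping is what makes per-key aggregation possible: all tuples for a given key are seen by a single task, so that task can keep the running state for the key locally.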


Topologies run forever

• Starting a topology

• Killing a topology


Examples

• number of occurrences of each hashtag in an input stream of tweets
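
A minimal plain-Python sketch of this example follows. It is not the topology from the slides: the function name, the hashtag pattern, the case-insensitive matching, and the sample tweets are assumptions chosen for illustration. The counter is updated tweet by tweet, so a current count exists at any point in the stream.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Assumed pattern: "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#\w+")


def count_hashtags(tweets: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (hashtag, running_count) each time a hashtag is seen."""
    counts: Counter = Counter()
    for text in tweets:
        for tag in HASHTAG.findall(text.lower()):
            counts[tag] += 1
            yield tag, counts[tag]


# Hypothetical sample tweets.
tweets = ["Learning #Storm", "#storm and #hadoop", "no tags here"]
updates = list(count_hashtags(tweets))
print(updates)  # [('#storm', 1), ('#storm', 2), ('#hadoop', 1)]
```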


Reach Topology in Twitter

• Reach: number of unique people exposed to a tweet on Twitter
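
The definition can be turned into a small computation. The sketch below is a plain-Python illustration under stated assumptions: a person counts as "exposed" if they follow someone who posted or shared the tweet, and the two lookup tables (`tweeters_of`, `followers_of`) are hypothetical in-memory stand-ins for real data sources.

```python
from typing import Dict, List


def reach(tweet_id: str,
          tweeters_of: Dict[str, List[str]],
          followers_of: Dict[str, List[str]]) -> int:
    """Count the distinct followers of everyone who posted or shared the tweet."""
    exposed = set()
    for user in tweeters_of.get(tweet_id, []):
        exposed.update(followers_of.get(user, []))
    return len(exposed)


# Hypothetical data: two people shared tweet "t1".
tweeters_of = {"t1": ["alice", "bob"]}
followers_of = {
    "alice": ["carol", "dave"],
    "bob": ["dave", "erin"],   # "dave" follows both and is counted once
}

print(reach("t1", tweeters_of, followers_of))  # 3
```

The set removes duplicates, which matters because popular accounts tend to share many followers.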


Reach Topology


Hadoop vs. Storm

Hadoop
• batch processing
• runs jobs to completion
• stateful nodes
• scalable
• guarantees no data loss
• open source
• big batch processing

Storm
• real-time processing
• topologies run forever
• stateless nodes
• scalable
• guarantees no data loss
• open source
• fast, reactive, real-time processing


Hadoop and Storm
