High Performance Analytics for IoT environments - MityLytics


High Performance Analytics for IoT environments
Manish Singh, CTO, MityLytics Inc

IoT applications
1. Real-time
2. Low latency
3. High reliability
4. High availability

Diversity of platforms
1. Private cloud
2. Public cloud
3. On-premise
4. Hybrid cloud
5. Bare-metal cloud
6. Others and managed service providers

Platforms: understand
● Tradeoffs
● Cost
● Management
● Control
● Scale
● Dynamic scaling

Sample Big Data Pipeline

Raw Data
● Logs
● Sensors
● Records
● Databases
● Streams

Data Access Connectors
● Pub/Sub (Kafka, Amazon Kinesis)
● Source-Sink (Flume)
● SQL (Sqoop)
● Queues (RabbitMQ, ZeroMQ, REST MQ, Amazon SQS)
● Custom Connector (REST, WebSocket, AWS IoT, Azure IoT Hub)

Data Storage
● Distributed File System (HDFS)
● NoSQL (HBase, Cassandra, ScyllaDB)

Batch Analysis
● MapReduce (Hadoop)
● DAG (Spark)
● Script (Pig)
● Workflow Scheduling (Oozie)
● Machine Learning (Spark MLlib, H2O)
● Search (Solr)

Real-time Analysis
● Stream Processing (Storm)
● In-Memory (Spark Streaming)

Interactive Querying
● Analytic SQL (Hive, BigQuery, Spark SQL, Redshift)

Connectors

Serving Databases, Web Frameworks, Visualization
● NoSQL (HBase, Cassandra, DynamoDB, MongoDB)
● SQL (MySQL)
● Web Frameworks (Django)
● Visualization Frameworks (Lightning, pygal, Seaborn)

Web/App Servers

Infrastructure Element   Type 1                                         Type 2
Compute                  4 physical cores @ 3.4 GHz (1 × E3-1240 v3)    24 physical cores @ 2.2 GHz
Memory                   32 GB of DDR3 RAM                              256 GB of DDR4 ECC RAM
Storage                  120 GB of SSD (2 × 120 GB in RAID 1)           2.8 TB of SSD (6 × 480 GB SSD)
Network                  2 Gbps bonded network                          20 Gbps bonded network
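To make the gap between the two node types concrete, the sketch below divides each Type 2 figure by its Type 1 counterpart. The dictionary keys, the helper names, and the "cores × clock" compute proxy are assumptions introduced for illustration; only the numbers come from the table above.

```python
# Resource figures copied from the Type 1 / Type 2 comparison above.
# Storage uses the capacity quoted on the slide (2.8 TB taken as 2800 GB).
TYPE_1 = {"cores": 4, "clock_ghz": 3.4, "memory_gb": 32,
          "storage_gb": 120, "network_gbps": 2}
TYPE_2 = {"cores": 24, "clock_ghz": 2.2, "memory_gb": 256,
          "storage_gb": 2800, "network_gbps": 20}


def resource_ratios(small: dict, large: dict) -> dict:
    """Return large/small for every resource present in both configurations."""
    return {key: large[key] / small[key] for key in small if key in large}


def aggregate_ghz(node: dict) -> float:
    """Crude compute proxy: physical cores times clock speed.

    This is an assumption, not a benchmark; it ignores IPC, turbo
    behaviour, and memory bandwidth.
    """
    return node["cores"] * node["clock_ghz"]


ratios = resource_ratios(TYPE_1, TYPE_2)
for name, ratio in ratios.items():
    print(f"{name:>13}: {ratio:.2f}x")
print(f"aggregate GHz: {aggregate_ghz(TYPE_1):.1f} vs {aggregate_ghz(TYPE_2):.1f}")
```

The ratios are uneven (cores grow 6x, memory 8x, network 10x), so a workload bound by one resource will not necessarily scale the same way as one bound by another.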

Streaming ingestion performance

Produce rate (records/sec)   Records sent   Throughput           Median latency   95th %tile latency
100 K                        50 million     99.99K records/sec   7.79 ms          1 ms
1 million                    50 million     977K records/sec     154.95 ms        586 ms
10 million                   50 million     1.3M records/sec     160.42 ms        514 ms

Peak network throughput: 123.31 MB/sec
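One way to read the ingestion figures above is to compare the measured throughput with the requested produce rate. The sketch below does that arithmetic; the `RUNS` list and the `summarize` helper are names made up for this example, and the values are transcribed from the table.

```python
# (target produce rate, records sent, measured throughput), in records or
# records/sec, transcribed from the streaming ingestion table above.
RUNS = [
    (100_000, 50_000_000, 99_990),
    (1_000_000, 50_000_000, 977_000),
    (10_000_000, 50_000_000, 1_300_000),
]


def summarize(target_rate: int, records: int, measured_rate: int) -> tuple:
    """Return (fraction of target achieved, seconds to send all records)."""
    return measured_rate / target_rate, records / measured_rate


for target, records, measured in RUNS:
    achieved, seconds = summarize(target, records, measured)
    print(f"target {target:>10,}/s: {achieved:6.1%} achieved, "
          f"{seconds:6.1f} s to send")
```

The first two runs land close to their targets, while the 10 million records/sec run reaches only a fraction of its target, which suggests (but does not prove) that this setup saturates somewhere near 1.3M records/sec.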

SPARK PERFORMANCE

Scale factor   Tasks/stage   Shuffle read (MB/stage)   Max shuffle read time (secs)   Shuffle write (MB/stage)   Max shuffle write time (secs)   Total time (secs)
0.1            40            0.8                       0.097                          0.8                        2                               66
1              400           89.8                      3                              89.8                       16                              366
2              800           367.3                     57                             367.4                      438                             5040
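A quick check on the Spark results above is whether total time grows in step with the scale factor. The sketch below compares the two growth rates between consecutive runs; `RUNS` and `growth` are illustrative names, and the pairs are the scale factor and total time values from the table.

```python
# (scale factor, total time in seconds) taken from the Spark table above.
RUNS = [(0.1, 66), (1, 366), (2, 5040)]


def growth(runs):
    """For each consecutive pair of runs, return (data growth, time growth)."""
    pairs = zip(runs, runs[1:])
    return [(s2 / s1, t2 / t1) for (s1, t1), (s2, t2) in pairs]


for data_x, time_x in growth(RUNS):
    trend = "superlinear" if time_x > data_x else "linear or sublinear"
    print(f"data x{data_x:.1f} -> time x{time_x:.2f} ({trend})")
```

Between scale factors 1 and 2 the total time grows far faster than the data does. The maximum shuffle write time in the same table rises from 16 to 438 seconds over that step, so the shuffle stage is a reasonable first place to look.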

Software stack transactions
● Hadoop - Maps, reduces, shuffles
● Kafka - Message rate, partitions, replication
● Spark - RDD, list operations
● Hive - Maps, reduces
● Query - Query latency, queries/sec
● Cassandra - Gossip, read/write

Sample Application transactions
● Data ingestion - Kafka
● Data stream processing - Spark
● Data batch processing - Hadoop
● Data storage - Cassandra, HDFS
● Legacy data interfaces - Hive
● Data indexing and search - Elastic, Solr
● ...

Methodology
● Run benchmarks in isolation on the cluster
● Run the benchmarks together on the same cluster
● Compare performance in the two cases to decide whether the clusters for the software stacks should be isolated or remain co-located, and to understand the cost/performance tradeoffs
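The comparison step in the methodology can be sketched as a simple slowdown calculation per software stack. Everything here is hypothetical: the function names, the 10% tolerance, and the throughput numbers are placeholders for illustration and do not come from the talk.

```python
def slowdown(isolated: float, colocated: float) -> float:
    """Fractional throughput loss when co-located (0.25 means 25% slower)."""
    return (isolated - colocated) / isolated


def recommend(results: dict, tolerance: float = 0.10) -> dict:
    """Flag stacks whose co-located loss exceeds the tolerance.

    results maps stack name -> (isolated throughput, co-located throughput).
    The 10% default tolerance is an arbitrary placeholder.
    """
    return {
        stack: "isolate" if slowdown(iso, co) > tolerance else "co-locate"
        for stack, (iso, co) in results.items()
    }


# Hypothetical throughputs (records/sec), invented purely for illustration.
example = {"kafka": (1_000_000, 950_000), "spark": (400_000, 280_000)}
print(recommend(example))
```

In practice the tolerance would be weighed against the cost of running separate clusters, which is the cost/performance tradeoff the methodology refers to.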

Lifecycle
1. Plan
2. Design
3. Deploy
4. Operate
5. Repeat

Summary

Deployment stage: Planning
Questions/Challenges
● What hardware/cloud providers should I use?
● Given existing resources, will I be able to meet SLAs?
● Can I reuse existing hardware?
Best practices
● Compare how various hardware configurations stack up

Deployment stage: Production
Questions/Challenges
● What's my utilization and performance like?
● How do I troubleshoot infrastructure problems?
● Am I meeting SLAs?
Best practices
● Clearly determine utilization, performance, and the tradeoff between them

Deployment stage: Scaling
Questions/Challenges
● What resources will I need in the future?
● Difficult to extrapolate
Best practices
● Plan ahead
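The "Am I meeting SLAs?" question above can be checked mechanically once latency samples are collected. The following is a minimal sketch using Python's standard `statistics` module; the helper names, the choice of the 95th percentile as the SLA metric, and the sample data are assumptions for illustration only.

```python
import statistics


def latency_summary(samples_ms):
    """Return (median, 95th percentile) of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles;
    # index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return statistics.median(samples_ms), cuts[94]


def meets_sla(samples_ms, p95_limit_ms):
    """True when the 95th percentile latency is within the limit."""
    _, p95 = latency_summary(samples_ms)
    return p95 <= p95_limit_ms


# Hypothetical samples: one latency measurement for each of 1..100 ms.
samples = list(range(1, 101))
median, p95 = latency_summary(samples)
print(median, p95, meets_sla(samples, p95_limit_ms=100))
```

Tracking the same two statistics across planning, production, and scaling makes it easier to see when a configuration change starts to threaten the SLA.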

Q&A
