Idea Transcript
High Performance Analytics for IoT Environments
Manish Singh, CTO, MityLytics Inc.
IoT applications
1. Real-time
2. Low latency
3. High reliability
4. High availability
Diversity of platforms
1. Private cloud
2. Public cloud
3. On-premise
4. Hybrid cloud
5. Bare-metal cloud
6. Others and managed service providers

Understand the tradeoffs:
● Cost
● Management
● Control
● Scale
● Dynamic scaling
Sample Big Data Pipeline

Data Access Connectors
● Raw data: logs, sensors, records, streams
● Pub/Sub (Kafka, Amazon Kinesis)
● Source-Sink (Flume)
● SQL (Sqoop)
● Queues (RabbitMQ, ZeroMQ, RestMQ, Amazon SQS)
● Custom connectors (REST, WebSocket, AWS IoT, Azure IoT Hub)

Data Storage
● Distributed file system (HDFS)
● NoSQL (HBase, Cassandra, ScyllaDB)

Batch Analysis
● MapReduce (Hadoop)
● DAG (Spark)
● Script (Pig)
● Workflow scheduling (Oozie)
● Machine learning (Spark MLlib, H2O)

Real-time Analysis
● Stream processing (Storm)
● In-memory (Spark Streaming)

Interactive Querying
● Analytic SQL (Hive, BigQuery, Spark SQL, Redshift)

Serving
● Databases: NoSQL (HBase, Cassandra, DynamoDB, MongoDB), SQL (MySQL)
● Web frameworks (Django)
● Visualization frameworks (Lightning, pygal, Seaborn)
● Search (Solr)
● Web/app servers
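The pipeline stages above can be sketched end to end in miniature. This is a hypothetical stand-in, not MityLytics code: a `deque` plays the pub/sub queue (Kafka/Kinesis in production), a consumer loop plays the stream processor (Storm/Spark Streaming), and a dict plays the serving store (HBase/Cassandra).

```python
from collections import deque

def ingest(records, queue):
    """Data access connector: push raw sensor/log records onto a pub/sub queue."""
    for record in records:
        queue.append(record)

def stream_process(queue, store):
    """Real-time analysis: drain the queue and keep the latest value per sensor."""
    while queue:
        record = queue.popleft()
        store[record["sensor"]] = record["value"]

queue, store = deque(), {}
ingest([{"sensor": "s1", "value": 21.5}, {"sensor": "s2", "value": 19.0}], queue)
stream_process(queue, store)
print(store)  # → {'s1': 21.5, 's2': 19.0}
```

The same shape holds at scale: connectors only enqueue, processors only consume, and storage only serves, so each stage can be scaled or swapped independently.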
Infrastructure Element | Type 1 | Type 2
Compute | 4 physical cores @ 3.4 GHz (1 × E3-1240 v3) | 24 physical cores @ 2.2 GHz
Memory | 32 GB of DDR3 RAM | 256 GB of DDR4 ECC RAM
Storage | 120 GB of SSD (2 × 120 GB in RAID 1) | 2.8 TB of SSD (6 × 480 GB SSD)
Network | 2 Gbps bonded network | 20 Gbps bonded network
Streaming Ingestion Performance

Produce rate (records/sec) | Records sent | Throughput | Latency (median) | Latency (95th %tile)
100 K | 50 million | 99.99K records/sec | 7.79 ms | 1 ms
1 million | 50 million | 977K records/sec | 154.95 ms | 586 ms
10 million | 50 million | 1.3M records/sec | 160.42 ms | 514 ms

Peak network throughput: 123.31 MB/sec
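The peak network figure can be sanity-checked against the record rate. Assuming roughly 100-byte records (an assumption; the slide does not state the record size), the arithmetic lines up with the observed peak:

```python
def throughput_mb_per_sec(records_per_sec, record_bytes):
    """Convert a record rate into MB/sec, using 1 MB = 2**20 bytes."""
    return records_per_sec * record_bytes / 2**20

# At the peak rate of 1.3M records/sec, 100-byte records give ~124 MB/sec,
# consistent with the 123.31 MB/sec peak network throughput above.
peak = throughput_mb_per_sec(1.3e6, 100)
```

Working backwards the same way (123.31 MB/sec ÷ 1.3M records/sec ≈ 95–100 bytes) is a quick way to confirm whether ingestion is record-rate-bound or bandwidth-bound.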
Spark Performance

Scale factor | Tasks/stage | Shuffle Read (MB/stage) | Max Shuffle Read time (secs) | Shuffle Write (MB/stage) | Max Shuffle Write time (secs) | Total time (secs)
0.1 | 40 | 0.8 | 0.097 | 0.8 | 2 | 66
1 | 400 | 89.8 | 3 | 89.8 | 16 | 366
2 | 800 | 367.3 | 57 | 367.4 | 438 | 5040
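The total-time column of the table above shows why these measurements matter for capacity planning: runtime grows super-linearly once shuffle dominates. A quick calculation of the growth ratios, using only the numbers from the table:

```python
# Total job times (secs) from the Spark table above, keyed by scale factor.
total_secs = {0.1: 66, 1: 366, 2: 5040}

# 10x more data (SF 0.1 -> 1) costs only ~5.5x the time, but 2x more data
# (SF 1 -> 2) costs ~13.8x: shuffle read/write times blow up, so linear
# extrapolation from small scale factors badly underestimates large runs.
ratio_small_to_mid = total_secs[1] / total_secs[0.1]   # ~5.5
ratio_mid_to_large = total_secs[2] / total_secs[1]     # ~13.8
```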
Software stack transactions
● Hadoop - maps, reduces, shuffles
● Kafka - message rate, partitions, replication
● Spark - RDDs, list operations
● Hive - maps, reduces
● Query - query latency, queries/sec
● Cassandra - gossip, read/write
Sample application transactions
● Data ingestion - Kafka
● Data stream processing - Spark
● Data batch processing - Hadoop
● Data storage - Cassandra, HDFS
● Legacy data interfaces - Hive
● Data indexing and search - Elastic, Solr
...
Methodology
● Run benchmarks in isolation on the cluster
● Run benchmarks together with each other
● Compare performance in the two cases to decide whether the software stacks should run on isolated clusters or remain co-located, and to understand the cost/performance tradeoffs
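The comparison in the last step reduces to a slowdown ratio per benchmark. A minimal sketch, with hypothetical timings for illustration only:

```python
def interference_ratio(isolated_secs, colocated_secs):
    """Slowdown of a benchmark when co-located with others vs run alone.
    A ratio near 1.0 suggests co-location is essentially free; a ratio well
    above 1.0 argues for an isolated cluster despite the extra cost."""
    return colocated_secs / isolated_secs

# Hypothetical numbers: a 366 s isolated run that takes 549 s co-located.
slowdown = interference_ratio(isolated_secs=366, colocated_secs=549)  # 1.5
```

The cost side of the tradeoff is then the price of the extra cluster versus the price of the slowdown (missed SLAs, larger shared cluster).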
Lifecycle
1. Plan
2. Design
3. Deploy
4. Operate
5. Repeat
Summary

Deployment Stage | Questions/Challenges | Best practices
Planning | What hardware/cloud providers should I use? Given existing resources, will I be able to meet SLAs? Can I reuse existing hardware? | Compare how various hardware configurations stack up
Production | What's my utilization and performance like? How do I troubleshoot infrastructure problems? Am I meeting SLAs? | Clearly determine utilization and performance, and the tradeoff between them
Scaling | What resources will I need in the future? Extrapolation is difficult | Plan ahead
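The "plan ahead" advice for the Scaling stage usually starts with a trend fit on observed usage. A minimal least-squares sketch, with hypothetical ingest volumes (and bearing in mind, per the Spark numbers earlier, that runtime often grows faster than data volume):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical monthly ingest volumes (TB) for months 1-4:
months, volume_tb = [1, 2, 3, 4], [10, 14, 18, 22]
m, b = fit_line(months, volume_tb)
forecast_month_6 = m * 6 + b   # 30.0 TB if growth stays linear
```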
Q&A