Challenges and Opportunities - Frankfurt Big Data Lab [PDF]



Principles of E-Commerce I: Business and Technology (PoE1). Focus: Big Data Platforms. Prof. Roberto V. Zicari, with support of Todor Ivanov, Marten Rosselli and Dr. Karsten Tolle. SS 2015

Principles of E-Commerce I (PoE1) Focus: Big Data Platforms Responsible: Prof. Roberto V. Zicari with support of Todor Ivanov, Marten Rosselli and Dr. Karsten Tolle Time and location: Tuesday 10:15 - 11:45, SR 11 (Informatikgebäude) Wednesday 10:15 - 11:45, SR 307 (Informatikgebäude)

Goethe University Frankfurt – Institute for Computer Science – DBIS

Basic information
Webpage Frankfurt Big Data Lab: http://www.bigdata.uni-frankfurt.de/principles-e-commerce-ss2015/

Homepage DBIS: http://www.dbis.cs.uni-frankfurt.de E-Mail: [email protected]

Attention: Changes and news will be announced on the webpage; please have a look before each lecture.


Hands-on
This is a hands-on course; the exercise and lecture parts are mixed as needed. Please make sure that you bring at least one notebook per two participants to each course slot. (If this is a problem, send an email to [email protected].)


How to get the CPs (6) and the final score
Each participant needs to complete four practical assignments and present the results. Details will follow.

Registration: with the first assignment, we will collect your data (name, Matrikelnummer, and degree program) to register you for this course.


Schedule (preliminary, please check the website!)
Lecturers: Prof. Dott. Ing. Roberto V. Zicari (Intro), Todor Ivanov (Hadoop), Dr. Karsten Tolle (GraphDBs), Marten Rosselli (NoSQL), Students (Presentations)

Tuesday sessions:
1. 14 April 2015
2. 21 April 2015 – Intro to Hadoop 1: HDFS & MapReduce
3. 28 April 2015 – Hadoop Ecosystem 1
4. 5 May 2015 – Data Acquisition 1
5. 12 May 2015 – Graphs, GraphDBs
6. 19 May 2015 – Intro to Pig 1
7. 26 May 2015 – Student Presentations 1
8. 2 June 2015 – Advanced Pig 1
9. 9 June 2015 – Intro to Hive 2
10. 16 June 2015 – NoSQL
11. 23 June 2015 – NoSQL Exercise
12. 30 June 2015 – Impala
13. 7 July 2015 – NoSQL
14. 14 July 2015 – Student Presentations 3

Wednesday sessions:
1. 15 April 2015 – Intro to Big Data
2. 22 April 2015 – Intro to Hadoop 2: HDFS & MapReduce
3. 29 April 2015 – Hadoop Ecosystem 2
4. 6 May 2015 – Data Acquisition 2
5. 13 May 2015 – Semantic Web, LOD
6. 20 May 2015 – Intro to Pig 2
7. 27 May 2015 – Student Presentations 2
8. 3 June 2015 – Intro to Hive 1
9. 10 June 2015 – Advanced Hive 1
10. 17 June 2015 – NoSQL
11. 24 June 2015 – NoSQL Exercise
12. 1 July 2015 – New Big Data Technologies: Spark …
13. 8 July 2015 – NoSQL
14. 15 July 2015 – Student Presentations 4


Ringvorlesung: Focus Big Data, Internet of Things and Data Science
A series of 10 guest lectures, open to all. Course runs from Thursday, 23.04.2015 to Thursday, 16.07.2015.

Time: every Thursday, 14:15 – 15:45
Location: Robert-Mayer-Straße 11-15, Room SR 307 (Informatikgebäude)
Webpage: http://www.bigdata.uni-frankfurt.de/soft-skills-entrepreneurship-m-ssk-bsos/

Frankfurt Big Data Lab: http://www.bigdata.uni-frankfurt.de


Ringvorlesung Schedule (Date – Speaker – Title)

30.04.2015 – Prof. Roberto V. Zicari, Frankfurt Big Data Lab, Goethe University Frankfurt – Big Data: A Data Driven Society?

07.05.2015 – Prof. Nikos Korfiatis, Assistant Professor of Business Analytics, Norwich Business School, University of East Anglia, UK – Big Data and Regulation

21.05.2015 – Dr. Alexander Zeier, Managing Director, Globally for In-Memory Solutions at Accenture – In-Memory Technologies and Applications: S/4 HANA

28.05.2015 – Jörg Besier, Managing Director at Accenture, Digital Delivery Lead ASG – Towards a data-driven economy: How Big Data fuels the digital economy

11.06.2015 – Klaas Wilhelm Bollhoefer, Chief Data Scientist, The unbelievable Machine Company – Introduction to Data Science

18.06.2015 – Prof. Dr. Katharina Morik, TU Dortmund University – Big Data Analytics in Astrophysics

25.06.2015 – Prof. Hans Uszkoreit, Scientific Director, German Research Center for Artificial Intelligence (DFKI) – Smart Data Web: Value chains for industrial applications

02.07.2015 – Matthew Eric Bassett, Director and Co-Founder of Gower Street Analytics; former Director, Data Science at NBCUniversal International, UK – Data Science and the future of the movie business

09.07.2015 – Prof. Dr. Christoph Schommer, University of Luxembourg – Algorithms for Data Privacy

16.07.2015 – Thomas Jarzombek, Member of the German Bundestag – Big Data and its challenges for today's politics


Big Data slogans
“Big Data: The next frontier for innovation, competition, and productivity” (McKinsey Global Institute)

“Data is the new gold” (Open Data Initiative, European Commission, which aims at opening up Public Sector Information)

What Data?
The term “Big Data” refers to large amounts of different types of data produced with high velocity from a large number of various types of sources. Handling today's highly variable and real-time datasets requires new tools and methods, such as powerful processors, software and algorithms.

“The term “Open Data” refers to a subset of data, namely data made freely available for re-use to everyone, for both commercial and non-commercial purposes.”

Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.

This is Big Data. Every day, 2.5 quintillion bytes (= 2.5 exabytes) of data are created.

This data comes from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, and cell phone GPS signals, to name a few. In 2013, estimates reached 4 zettabytes of data generated worldwide (*).
(*) Mary Meeker and Liang Yu, Internet Trends, Kleiner Perkins Caufield & Byers, 2013, http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013

Source: http://aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic/

How Big is Big Data?
1 megabyte = 1,000,000 bytes = 10^6 bytes
1 gigabyte = 10^9 bytes
1 terabyte = 1,000,000,000,000 bytes = 10^12 bytes
------------------------------------------------------------------
1 petabyte = 1,000 terabytes (TB) = 10^15 bytes
1 exabyte = 10^18 bytes
1 zettabyte = 1,000,000,000,000,000,000,000 bytes = 10^21 bytes

“Imagine that every person (320,590,000) in the United States took a digital photo every second of every day for over a month. All of those photos put together would equal about one zettabyte.” (*)
(*) Big Data: Seizing Opportunities, Preserving Values. Executive Office of the President, May 2014, The White House, Washington.
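The White House illustration above can be sanity-checked with a few lines of arithmetic. The 1 MB-per-photo figure below is an assumption of this sketch, not a number from the source:

```python
# Sanity-check the zettabyte illustration above.
# Assumption (not from the source): an average digital photo is ~1 MB.
US_POPULATION = 320_590_000
PHOTO_SIZE_BYTES = 1_000_000        # assumed 1 MB per photo
ZETTABYTE = 10**21                  # bytes

photos_per_person = ZETTABYTE / (US_POPULATION * PHOTO_SIZE_BYTES)
days = photos_per_person / 86_400   # one photo per second, 86,400 s per day
print(f"{days:.0f} days of one photo per second per person")  # ~36 days
```

At roughly 36 days per person, the claim "every second of every day for over a month" checks out under this photo-size assumption.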

Another definition of Big Data
“Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” (McKinsey Global Institute)
– Note that this definition is intentionally not tied to a fixed data size (data sets will keep growing).
– It also varies by sector (ranging from a few dozen terabytes to multiple petabytes).

Examples of gigabyte-sized storage (source: Wikipedia)
• One hour of standard-definition television (SDTV) video at 2.2 Mbit/s is approximately 1 GB.
• Seven minutes of high-definition television (HDTV) video at 19.39 Mbit/s is approximately 1 GB.
• 114 minutes of uncompressed CD-quality audio at 1.4 Mbit/s is approximately 1 GB.
• A single-layer DVD-R (digital optical disc storage) can hold about 4.7 GB.
• A dual-layer Blu-ray Disc can hold about 50 GB.

Examples of the use of the terabyte (source: Wikipedia)
• Audio: one terabyte of audio recorded at CD quality contains approx. 2000 hours of audio.
• Climate science: in 2010, the German Climate Computing Centre (DKRZ) was generating 10,000 TB of data per year.
• Video: released in 2009, the 3D animated film Monsters vs. Aliens used 100 TB of storage during development.
• The Hubble Space Telescope collected more than 45 terabytes of data in its first 20 years of observations.
• Historical Internet traffic: in 1993, total Internet traffic amounted to approximately 100 TB for the year. As of June 2008, Cisco Systems estimated Internet traffic at 160 TB/s (which, assuming it to be constant, comes to about 5 zettabytes for the year). In other words, the amount of Internet traffic per second in 2008 exceeded all of the Internet traffic in 1993.
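The Cisco comparison in the last bullet is easy to verify; the numbers below are taken directly from the slide:

```python
# Check the Cisco comparison: one *second* of 2008 traffic exceeded all of
# 1993's traffic, and a constant 160 TB/s comes to roughly 5 ZB per year.
TB = 10**12
traffic_1993_year = 100 * TB         # total Internet traffic in 1993
traffic_2008_per_sec = 160 * TB      # Cisco estimate, June 2008

seconds_per_year = 365 * 24 * 3600
yearly_2008 = traffic_2008_per_sec * seconds_per_year
print(yearly_2008 / 10**21, "zettabytes per year")  # ~5.0 ZB
print(traffic_2008_per_sec > traffic_1993_year)     # True
```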

Examples of the use of the petabyte (source: Wikipedia)
• Databases: Teradata Database 12 has a capacity of 50 petabytes of compressed data.
• Data mining: in August 2012, Facebook's Hadoop clusters included the largest single HDFS cluster known, with more than 100 PB of physical disk space in a single HDFS filesystem. Yahoo stores 2 petabytes of behavioral data.
• Telecommunications (usage): AT&T transfers about 30 petabytes of data through its networks each day.
• Internet: Google processed about 24 petabytes of data per day in 2009.
• Data storage systems: in August 2011, IBM was reported to have built the largest storage array ever, with a capacity of 120 petabytes.

Examples of the use of the petabyte, continued (source: Wikipedia)
• Photos: as of January 2013, Facebook users had uploaded over 240 billion photos, with 350 million new photos every day. For each uploaded photo, Facebook generates and stores four images of different sizes, which translates to a total of 960 billion images and an estimated 357 petabytes of storage.
• Music: one petabyte of average MP3-encoded songs (for mobile, roughly one megabyte per minute) would require 2000 years to play.
• Games: World of Warcraft uses 1.3 petabytes of storage to maintain its game.
• Physics: experiments at the Large Hadron Collider produce about 15 petabytes of data per year.
• Climate science: the German Climate Computing Centre (DKRZ) has a storage capacity of 60 petabytes of climate data.

The “Internet of Things”
• “The ‘Internet of Things’ is a term used to describe the ability of devices to communicate with each other using embedded sensors that are linked through wired and wireless networks.
• These devices could include your thermostat, your car, or a pill you swallow so the doctor can monitor the health of your digestive tract.
• These connected devices use the Internet to transmit, compile, and analyze data.” (*)

(*) Big Data: Seizing Opportunities, Preserving Values. Executive Office of the President, May 2014, The White House, Washington.

What is Big Data supposed to create? “Value” (McKinsey Global Institute):
– Creating transparency
– Discovering needs, exposing variability, improving performance
– Segmenting customers
– Replacing/supporting human decision making with automated algorithms
– Innovating new business models, products, and services

Source: http://jtonedm.com/2013/06/05/business-value-of-big-data-and-analytics-bda13/

The value of big data
Source: http://www.cmswire.com/cms/information-management/big-data-smart-data-and-the-fallacy-that-lies-between-017956.php#null

What is Data Science? (source: http://datascience.nyu.edu/what-is-data-science/)

Data science involves using automated methods to analyze massive amounts of data and to extract knowledge from them.

One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modeling, statistics, analytics, and mathematics.

Big Data Analytics (source: http://community.lithium.com/t5/Science-of-Social-blog/Big-Data-Reduction-2-Understanding-Predictive-Analytics/ba-p/79616)

1. Descriptive Analytics
The purpose of descriptive analytics is simply to summarize and tell you what happened. It is the simplest class of analytics, used to reduce big data into much smaller but consumable bites of information. It computes descriptive statistics (i.e. counts, sums, averages, percentages, min, max and simple arithmetic: + − × ÷) that summarize certain groupings or filtered versions of the data, typically simple counts of some events. These are mostly based on standard aggregate functions in databases.
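The kind of aggregation described above can be sketched in a few lines. This is a minimal illustration using invented event records; in practice the same reduction would be a SQL GROUP BY over the full dataset:

```python
# Descriptive analytics in miniature: reduce raw event records to counts,
# sums, and averages per group, as a SQL GROUP BY would. Data is invented.
from collections import defaultdict
from statistics import mean

events = [
    {"user": "a", "action": "view", "duration": 12},
    {"user": "a", "action": "buy",  "duration": 45},
    {"user": "b", "action": "view", "duration": 8},
    {"user": "b", "action": "view", "duration": 30},
]

by_action = defaultdict(list)
for e in events:
    by_action[e["action"]].append(e["duration"])

summary = {a: {"count": len(d), "sum": sum(d), "avg": mean(d)}
           for a, d in by_action.items()}
print(summary)
```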

Big Data Analytics (source: http://community.lithium.com/t5/Science-of-Social-blog/Big-Data-Reduction-2-Understanding-Predictive-Analytics/ba-p/79616)

2. Predictive Analytics
• The purpose of predictive analytics is NOT to tell you what will happen in the future. It cannot do that; in fact, no analytics can. Predictive analytics can only forecast what might happen, because all predictive analytics are probabilistic in nature.
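The probabilistic character of forecasting can be made concrete with a toy example. The data and the trend-line model below are invented for illustration; the point is that the output is an estimate with an uncertainty band, not a certainty:

```python
# A toy temporal forecast: fit a straight trend line by ordinary least
# squares and report the next value with a rough +/- 2-sigma band.
from statistics import mean, stdev

sales = [10, 12, 13, 15, 16, 18]   # hypothetical weekly sales
xs = list(range(len(sales)))

x_bar, y_bar = mean(xs), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales)) \
        / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(xs, sales)]
forecast = intercept + slope * len(sales)   # next week's point estimate
print(f"next week: {forecast:.1f} +/- {2 * stdev(residuals):.1f}")
```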

Examples of Non-Temporal Predictive Analytics
• An example of non-temporal predictive analytics is a model that uses someone's existing social media activity data (data we have) to predict his/her potential to influence (data we don't have).
• Another well-known example of non-temporal predictive analytics in social analytics is sentiment analysis.

(source: http://community.lithium.com/t5/Science-of-Social-blog/Big-Data-Reduction-2-Understanding-Predictive-Analytics/ba-p/79616)

Big Data Analytics (source: http://community.lithium.com/t5/Science-of-Social-blog/Big-Data-Reduction-3-From-Descriptive-to-Prescriptive/ba-p/81556)

3. Prescriptive Analytics
Prescriptive analytics not only predicts a possible future, it predicts multiple futures based on the decision maker's actions. A prescriptive model can be viewed as a combination of multiple predictive models running in parallel, one for each possible input action.

A predictive model must have two additional components in order to be prescriptive (source: http://community.lithium.com/t5/Science-of-Social-blog/Big-Data-Reduction-3-From-Descriptive-to-Prescriptive/ba-p/81556):
• Actionable: the data consumers must be able to take actions based on the predicted outcome of the model.
• Feedback system: the model must track the adjusted outcome based on the action taken. This means the predictive model must be smart enough to learn the complex relationship between the user's action and the adjusted outcome through the feedback data.

How will Big Data be used? Combining data together is where the real value for corporations lies:
• 90% corporate data
• 10% social media data
• Sensor data has only just begun (e.g. smart meters)
It is a key basis of competition and growth for individual firms (McKinsey Global Institute).

Source: http://cdn.ttgtmedia.com/rms/onlineImages/BI_0814_page7_graphic1.png

Examples of Big Data use cases:
• Log Analytics
• Fraud Detection
• Social Media and Sentiment Analysis
• Risk Modeling and Management

Source: http://www.loadedtech.com.au/blog/bid/156700/Big-Data-Survey-Brings-New-Insights

Big Data can generate financial value (*) across sectors, e.g.:
• Health care
• Public sector administration
• Global personal location data
• Retail
• Manufacturing
(McKinsey Global Institute)
(*) Note: but it could be more than that!

Limitations
• Shortage of the talent organizations need to take advantage of big data.
• Very few PhDs with knowledge of statistics, machine learning, and data mining.
• Too few managers and analysts who can make decisions using insights from big data.
Source: McKinsey Global Institute

Smart Data?
• Big data provides the infrastructure for economically storing and processing unprecedented amounts of data.
• But undigested big data (e.g. terabytes of raw logs) and the technology required for it (e.g. Hadoop, Cassandra, etc.) are pretty much inaccessible to the average business person.
• There is a huge disconnect between what big data provides and what businesses need. Smart data is how you fill that gap.
Source: http://www.cmswire.com/cms/information-management/big-data-smart-data-and-the-fallacy-that-lies-between-017956.php#null

Source: http://tarrysingh.com/2014/07/fog-computing-happens-when-big-data-analytics-marries-internet-of-things/

Big Data: What are the consequences?
“Any technological or social force that reaches down to affect the majority of society's members is bound to produce a number of controversial topics.” (John Bittner, 1977)

But what are the “true” consequences of a society being reshaped by “systematically building on data analytics”?

Big Data: Challenges
1. Data
2. Process
3. Management

Data Challenges
• Volume: dealing with the sheer size of it.
In the year 2000, 800,000 petabytes (PB) of data were stored in the world (source: IBM); this is expected to reach 35 zettabytes (ZB) by 2020. Twitter generates 7+ terabytes (TB) of data every day, Facebook 10 TB.

Scale and performance requirements strain conventional databases. Scalability has three aspects: data volume, hardware size, and concurrency.

Analytics Data Platforms for Big Data – Mike Carey (EDBT Keynote 2012):
Big Data in the database world (early 1980s till now):
– Parallel databases: shared-nothing architecture, the declarative set-oriented nature of relational queries, divide-and-conquer parallelism (e.g. Teradata).
– Re-implementations of relational databases (e.g. HP/Vertica, IBM/Netezza, Teradata/Aster Data, EMC/Greenplum).
Big Data in the systems world (late 1990s):
– Apache Hadoop (inspired by Google GFS and MapReduce; contributed to by large Web companies, e.g. Yahoo!, Facebook)
– Google BigTable
– Amazon Dynamo

Data Challenges
• Variety: handling a multiplicity of types, sources and formats.
Sensors, smart devices, social collaboration technologies. Data is not only structured, but also raw, semi-structured, and unstructured: web pages, web log files (clickstream data), search indexes, e-mails, documents, sensor data, etc.

Structured Data

Employee table:
EmpNo  Ename  DeptNo  DeptName
100    Bob    10      Marketing
200    Bob    20      Purchasing
150    Peter  10      Marketing
170    Doug   20      Purchasing
105    John   10      Marketing

Clickstream Data
Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website logs (sources: http://www.jafsoft.com/searchengines/log_sample.html and http://hortonworks.com):

fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

Potential Uses of Clickstream Data (source: http://hortonworks.com)
One of the original uses of Hadoop at Yahoo was to store and process its massive volume of clickstream data. Now enterprises can use the Hortonworks Data Platform (HDP) to refine and analyze clickstream data, and then answer business questions such as:
• What is the most efficient path for a site visitor to research a product, and then buy it?
• What products do visitors tend to buy together, and what are they most likely to buy in the future?
• Where should I spend resources on fixing or enhancing the user experience on my website?
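Before any of these questions can be answered, each raw log line has to be turned into structured fields. A minimal sketch, using one of the log lines shown earlier (the regex covers the Combined Log Format of those examples, not every possible server configuration):

```python
# Parse a web-server log line (Combined Log Format) into named fields -
# the first step of any clickstream analysis.
import re

LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] '
        '"GET /pics/wpaper.gif HTTP/1.0" 200 6248 '
        '"http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"')

m = LOG_RE.match(line)
print(m.group("host"), m.group("path"), m.group("status"))
# 123.123.123.123 /pics/wpaper.gif 200
```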

Variety (cont.)
• A/B testing: two versions (A and B) are compared, identical except for one variation that might affect a user's behavior.
• Sessionization: behavioral analytics that focuses on how and why users behave by grouping events into sessions; a session ID is created and stored every time a user visits your web page or mobile application.
• Bot detection: identifying automated traffic, e.g. from infected computers.
• Pathing analysis: statistics on directed dependencies among a set of variables (multiple regression analysis).
All of these require powerful analytics on many petabytes of semi-structured Web data.
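Sessionization, mentioned above, can be sketched in a few lines: a common heuristic (assumed here, not prescribed by the source) is to close a session after 30 minutes of inactivity. The timestamps are invented:

```python
# Minimal sessionization: split a user's sorted event timestamps into
# sessions wherever the gap between clicks exceeds 30 minutes.
GAP = 30 * 60  # seconds of inactivity that closes a session

timestamps = [0, 120, 400, 5000, 5100, 12000]  # seconds, sorted

sessions, current = [], [timestamps[0]]
for prev, t in zip(timestamps, timestamps[1:]):
    if t - prev > GAP:          # gap too large: start a new session
        sessions.append(current)
        current = []
    current.append(t)
sessions.append(current)

print(len(sessions), sessions)
# 3 [[0, 120, 400], [5000, 5100], [12000]]
```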

Twitter (source: What Twitter's Made Of, by Paul Ford, Bloomberg Businessweek, November 11-17, 2013)

A tweet is short: 140 characters. But if you open up one tweet and look inside (via an Application Programming Interface, API), you find, e.g.:
• the identity of the creator (bot or human)
• the location from which it originated
• the date and time it went out
• the number of people who read the tweet, “fav'd” it, the number of retweets, etc.
… we call this “metadata”.
You can access this information by requesting an API key from Twitter (a fast, automated procedure); you get a Web address and can access it as raw data for computers to read.

Twitter: example of metadata

“coordinates” part of the tweet: contains geographical information, latitude and longitude (in a format called GeoJSON, a public open standard).
“place” part of the tweet: specific, named locations (multiple coordinates: polygons over the surface of the earth).

Twitter
With this metadata (place and time), by applying some math one can reveal, for example:
• how far one tweeter is from another
• when people are active in “social media engagement”
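The "some math" for the distance question is typically the haversine formula over two coordinate pairs. A sketch with invented coordinates (roughly Frankfurt and Berlin):

```python
# Great-circle distance between two tweet coordinates (degrees lat/lon),
# using the haversine formula. Coordinates below are invented examples.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in kilometres between two points on the Earth's surface."""
    R = 6371.0  # mean Earth radius in km
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * R * asin(sqrt(a))

# roughly Frankfurt (50.11 N, 8.68 E) and Berlin (52.52 N, 13.40 E)
print(round(haversine_km(50.11, 8.68, 52.52, 13.40)), "km")
```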

Twitter
More metadata:
• “withheld_copyright”: if set to true, there is trouble over copyright.
• “withheld_in_countries”: the list of countries in which the tweet is banned.
• “possibly_sensitive”: if set to true, the tweet links to potentially offensive things: nudity, violence, or medical procedures (a user can check a box in his profile so this is flagged automatically).

Example of search indexes (source: https://cloudant.com/)

• Search indexes are defined by a JavaScript function. This is run over all of your documents, in a similar manner to a view's map function, and defines the fields that your search can query.
• A simple search function:

function(doc) {
  index("name", doc.name);
}

• Defining an analyzer:

"indexes": {
  "mysearch": {
    "analyzer": "whitespace",
    "index": "function(doc){ ... }"
  }
}

Analyze Machine and Sensor Data (source: http://hortonworks.com/hadoop-tutorial/how-to-analyze-machine-and-sensor-data/)

Sensor Data
• A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at low cost and powering the “Internet of Things.” A central task with sensor data is separating signal from noise.
Potential uses of sensor data. Sensors can collect data from many sources, for example:
• To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. This data can be used for predictive analytics, to repair or replace these items before they break.
• To monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital statistics during recovery from a medical procedure.
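The "separating signal from noise" task above can be illustrated with the simplest possible filter: a sliding median knocks out isolated spikes in a sensor stream. The readings are invented:

```python
# Separating signal from noise: a sliding median filter removes isolated
# spikes from a noisy sensor stream. Readings below are invented.
from statistics import median

def median_filter(readings, window=3):
    """Return the median of each sliding window over the readings."""
    return [median(readings[i:i + window])
            for i in range(len(readings) - window + 1)]

noisy = [20.1, 20.4, 35.0, 20.2, 19.9, 20.3]  # one spike of sensor noise
print(median_filter(noisy))
# [20.4, 20.4, 20.2, 20.2] - the 35.0 spike is gone
```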

Raw data (source: http://www.wisegeek.org/what-is-raw-data.htm)
Raw data, also known as source data or atomic data, is information that has not been processed in order to be displayed in any sort of presentable form. The raw form may look unrecognizable and be nearly meaningless without processing, but it may also be in a form that some can interpret, depending on the situation. This data can be processed manually or by a machine. In some cases, raw data may be nothing more than a series of numbers. The way those numbers are sequenced, however, and sometimes even the way they are spaced, can be very important information. A computer may interpret this information and give a readout that then makes sense to the reader. Binary code is a good example of raw data. Taken by itself as a printout, a binary code does very little for the computer user — at least the vast majority of users. When it is processed through a computer, on the other hand, it provides more understandable information. In fact, binary code is typically the source code for everything a computer user sees.

Sensor data logged to a text file and imported into Excel (source: Memos From the Cube)

Data Challenges (cont.)
• Velocity: reacting to the flood of information in the time required by the application.
Stream computing: e.g. “Show me all people who are currently living in the Bay Area flood zone”, continuously updated by GPS data in real time. (IBM)
Challenge: “the change of the data structure; the consumer no longer has control over the source of data creation; this requires the concept of late binding; it also poses a major challenge in regards to governance and data quality; with the shift of the transformation of data from ETL to at-time-of-consumption, the ‘ETL knowledge’ must be given to every consumer; tools will have to help with that.” (Thomas Fastner, eBay)
• Combining multiple data sets.

Data Challenges (cont.)
• Personally Identifiable Information: much of this information is about people. Can we extract enough information to help people without extracting so much as to compromise their privacy? Partly this calls for effective industrial practices; partly it calls for effective oversight by government; partly, perhaps mostly, it requires a realistic reconsideration of what privacy really means. (Paul Miller)
The “right to be forgotten”: 1,000 people a day ask Google to remove search links (145,000 requests have been made in the European Union, covering 497,000+ web links).

Data Challenges (cont.)
• Data dogmatism: analysis of big data can offer quite remarkable insights, but we must be wary of becoming too beholden to the numbers. Domain experts, and common sense, must continue to play a role. For example, it would be worrying if the healthcare sector only responded to flu outbreaks when Google Flu Trends told them to. (Paul Miller)

Process Challenges
The challenges of deriving insight include:
- capturing data,
- aligning data from different sources (e.g., resolving when two objects are the same),
- transforming the data into a form suitable for analysis,
- modeling it, whether mathematically or through some form of simulation,
- understanding the output: visualizing and sharing the results.
(Laura Haas, IBM Research)

Management Challenges
Data privacy, security, and governance:
- ensuring that data is used correctly (abiding by its intended uses and relevant laws),
- tracking how the data is used, transformed, derived, etc.,
- and managing its lifecycle.
“Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access-controlled, as well as logged for audits.” (Michael Blaha)

Big Data: Data Platforms
“In the Big Data era the old paradigm of shipping data to the application isn't working any more. Rather, the application logic must ‘come’ to the data, or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack.”
 Hadoop: processing moves to where the data is!

Data management
“With terabytes, things are actually pretty simple – most conventional databases scale to terabytes these days. However, try to scale to petabytes and it's a whole different ball game.” (Florian Waas, previously at Pivotal)
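The "processing moves to where the data is" idea is what the MapReduce model makes possible: the small mapper and reducer functions are shipped to the nodes holding the HDFS blocks, instead of the blocks being shipped to the application. The sketch below simulates the classic word-count dataflow locally; in Hadoop Streaming the same two functions would run as separate mapper and reducer scripts:

```python
# Word count in the MapReduce style: map each line to (word, 1) pairs,
# shuffle/sort by key, then reduce each key group to a total count.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    return (word, sum(counts))

lines = ["Big Data big data", "big platforms"]
pairs = sorted(kv for line in lines for kv in mapper(line))  # shuffle & sort
result = [reducer(w, [c for _, c in g])
          for w, g in groupby(pairs, key=itemgetter(0))]
print(result)  # [('big', 3), ('data', 2), ('platforms', 1)]
```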

Big Data Analytics
• In order to analyze Big Data, the current state of the art is a parallel database or a NoSQL data store with a Hadoop connector.
– There are concerns about performance issues arising from the transfer of large amounts of data between the two systems. The use of connectors can introduce delays, create data silos, and increase TCO.
– And what about existing data warehouses?

Which Analytics Platform for Big Data?
• NoSQL (document store, key-value store, …)
• NewSQL
• In-memory DBs
• Hadoop
• Data warehouses
• Plus scripts, workflows, and ETL-like data transformations

… Are we going back to “federated” databases? This just seems like too many “moving parts”.

Source: http://blogs.teradata.com/data-points/tag/hadoop/page/2/

Source: http://vision.cloudera.com/cloudera-connect-the-blueprint-to-an-information-driven-enterprise/

High Performance, High Functionality Big Data Software Stack – Geoffrey Fox, Judy Qiu, Shantenu Jha, Indiana and Rutgers University
http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/fox.pdf

Source: http://www.analytics-tools.com/p/home.html

Build your own database…

Spanner: Google's Globally-Distributed Database
Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions.
Published in the Proceedings of OSDI '12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October 2012. Recipient of the Jay Lepreau Best Paper Award.

Google AdWords Ecosystem
One shared database backing Google's core AdWords business.
Legacy DB: sharded MySQL.
Critical applications driving Google's core ad business:
• 24/7 availability, even with data center outages
• Consistency required:
  ○ Can't afford to process inconsistent data
  ○ Eventual consistency too complex and painful
• Scale: 10s of TB, replicated to 1000s of machines

F1: a new database, built from scratch, designed to operate at Google scale, without compromising on RDBMS features. Co-developed with a new lower-level storage system, Spanner.
• Better scalability
• Better availability
• Equivalent consistency guarantees
• Equally powerful SQL query

www.stanford.edu/class/cs347/slides/f1.pdf

Google F1 - A Hybrid Database, combining the
• scalability of Bigtable
• usability and functionality of SQL databases
Key ideas:
• Scalability: auto-sharded storage
• Availability & consistency: synchronous replication
• High commit latency: can be hidden
  ○ Hierarchical schema
  ○ Protocol buffer column types
  ○ Efficient client code
A scalable database without going NoSQL.
F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business. Jeff Shute, Mircea Oancea, Stephan Ellner, Ben Handy, Eric Rollins, Bart Samwel, Radek Vingralek, Chad Whipkey, Xin Chen, Beat Jegerlehner, Kyle Littlefield, Phoenix Tong. SIGMOD, May 22, 2012.

Hadoop Limitations
Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm. The missing piece of the Hadoop puzzle is accounting for real-time changes. Apache™ Hadoop® YARN (MapReduce 2.0 (MRv2)) is a sub-project of Hadoop at the Apache Software Foundation that takes Hadoop beyond batch to enable broader data processing.

69
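To make the "batch-oriented paradigm" concrete, here is a minimal single-process sketch of the MapReduce model that Hadoop implements (pure Python, no Hadoop involved): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates. The entire input is processed in one pass, which is why incorporating real-time changes is the missing piece.

```python
# Minimal single-process sketch of the MapReduce batch model
# (illustrative only -- real Hadoop distributes these phases).
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

corpus = ["big data platforms", "big data benchmarks"]
word_counts = reduce_phase(shuffle(map_phase(corpus)))
# word_counts == {"big": 2, "data": 2, "platforms": 1, "benchmarks": 1}
```

Note that nothing here can react to a new line arriving mid-job: the batch must be rerun, which is the limitation YARN and the streaming engines below address.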

Replacing Hadoop

Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley (https://spark.apache.org)

Databricks was founded out of the UC Berkeley AMPLab by the creators of Apache Spark. A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products.

The Stratosphere project (TU Berlin, Humboldt University, Hasso Plattner Institute, www.stratosphere.eu) contributed to Apache Flink, a platform for efficient, distributed, general-purpose data processing (flink.incubator.apache.org).

The ASTERIX project (UC Irvine, started 2009, http://asterix.ics.uci.edu): four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code co-developed by project staff and students at UCI and UCR; open source under an Apache-style licence.

70

Which Language for Analytics?
• There is a trend toward using SQL for analytics and for the integration of data stores (e.g. SQL-H, Teradata QueryGrid). Is this good?

71
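As an illustration of that trend: an aggregation that would otherwise need a hand-written map-reduce job is a single declarative SQL statement. The sketch below uses Python's built-in sqlite3 (not SQL-H or QueryGrid) purely to show the style of query involved; the table and data are made up.

```python
# SQL-style analytics over a toy table, using the stdlib sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (campaign TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("search", 0.50), ("search", 0.75), ("display", 0.20)],
)

# One declarative statement replaces a custom aggregation job.
rows = conn.execute(
    "SELECT campaign, COUNT(*) AS n, SUM(cost) AS total "
    "FROM clicks GROUP BY campaign ORDER BY campaign"
).fetchall()
# rows == [("display", 1, 0.2), ("search", 2, 1.25)]
```

The appeal is clear: the optimizer, not the programmer, decides how to execute — which is exactly the argument the pro-SQL camp makes.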

Graphs and Big Data (sources: http://www.graphanalysis.org/SC12/02_Feo.pdf) (http://neo4j.com/developer/graph-database/)



The breadth of problems requiring graph analytics is growing rapidly

• Large Network Systems
• Social Networks
• Packet Inspection
• Natural Language Understanding
• Semantic Search and Knowledge Discovery
• CyberSecurity

72

NoSQL graph database examples: Neo4j, InfiniteGraph, AllegroGraph
Data Model: Nodes and Relationships

(http://neo4j.com/developer/graph-database/)
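The node-and-relationship model can be sketched in a few lines of plain Python (an illustrative toy, not Neo4j's actual API): nodes carry properties, and relationships are typed, directed edges between them.

```python
# Toy property-graph model illustrating the Nodes-and-Relationships
# data model of graph databases (not Neo4j's real API).
class Node:
    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties
        self.outgoing = []  # list of (relationship_type, target_node)

    def relate(self, rel_type, target):
        """Create a typed, directed relationship to another node."""
        self.outgoing.append((rel_type, target))

    def neighbors(self, rel_type):
        """Traverse: follow all relationships of the given type."""
        return [t for r, t in self.outgoing if r == rel_type]

alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
neo4j = Node("Database", name="Neo4j")
alice.relate("KNOWS", bob)
alice.relate("LIKES", neo4j)

# Traversal replaces a join-heavy SQL query: who does Alice know?
known = [n.properties["name"] for n in alice.neighbors("KNOWS")]
# known == ["Bob"]
```

In a graph database this traversal follows pointers directly, which is why multi-hop queries (friends of friends) stay cheap where relational self-joins become expensive.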

Hadoop Benchmarks Quantitatively evaluate and characterize the Hadoop deployment through benchmarking

HiBench: A Representative and Comprehensive Hadoop Benchmark Suite Intel Asia-Pacific Research and Development Ltd

THE HIBENCH SUITE
HiBench -- a benchmark suite for Hadoop, consisting of a set of Hadoop programs including both synthetic micro-benchmarks and real-world applications.
Micro Benchmarks: Sort, WordCount, TeraSort, EnhancedDFSIO
Web Search: Nutch Indexing, Page Rank
Machine Learning: Bayesian Classification, K-means Clustering
Analytical Query: Hive Join, Hive Aggregation

74
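The shape of a micro-benchmark like Sort can be illustrated with a few lines of timing code — a toy, single-machine analogue; HiBench itself generates large inputs and drives the real workload as a distributed Hadoop job over HDFS:

```python
# Toy single-machine analogue of a Sort micro-benchmark
# (HiBench runs the real thing as a distributed Hadoop job).
import random
import time

def sort_benchmark(n_records: int) -> float:
    """Generate n random records, sort them, return elapsed seconds."""
    records = [random.random() for _ in range(n_records)]
    start = time.perf_counter()
    records.sort()
    elapsed = time.perf_counter() - start
    assert records == sorted(records)  # sanity check: output is sorted
    return elapsed

elapsed = sort_benchmark(100_000)
print("sorted 100000 records in %.4fs" % elapsed)
```

The principle carries over: a micro-benchmark isolates one primitive (sorting, I/O, counting) so that the measured time characterizes the deployment rather than the application.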

Big Data Benchmarks
TPC launched TPCx-HS: “industry’s first standard for benchmarking big data systems, is designed to provide metrics and methodologies to enable fair comparisons of systems from various vendors”

-- Raghunath Nambiar (Cisco), chairman of the TPC big data committee, August 18, 2014.

75

Big Data and the Cloud
– What about traditional enterprises?
– Very early adoption for analytics.
In general, people are concerned with the protection and security of their data.
Hadoop in the cloud: Amazon has a significant web-services business around Hadoop.

76

Big Data for the Common Good
• Very few people seem to look at how Big Data can be used to solve social problems; most of the work, in fact, is not in this direction.

Why is this? A lack of obvious economic and personal incentives… What can be done in the international research and development communities to make sure that some of the most brilliant ideas also have an impact on social issues?

77

Big Data for the Common Good
“As more data become less costly and technology breaks barriers to acquisition and analysis, the opportunity to deliver actionable information for civic purposes grows. This might be termed the ‘common good’ challenge for Big Data.” (Jake Porway, DataKind)

78
