Important Dates

Workshop: June 16-17, 2015

Paper submission: May 8, 2015

Paper acceptance: May 18, 2015

Camera-ready deadline for selected full papers: October 1, 2015

 


Program

Workshop Program

The main workshop will be held on June 16 and 17 at the Bahen Centre for Information Technology at the University of Toronto, Canada. The WBDB program is subject to change. A PDF version can be downloaded here.

June 15

Workshop reception at the University of Toronto Faculty Club, 6:30 pm - 8:30 pm.

June 16

Workshop Day 1 @ Bahen Centre (Room BA 1180)

Start time | Title | Speaker
8:30 am | Breakfast |
9:00 am | Opening & Introduction Round | Tilmann Rabl
9:45 am | Academic Keynote: Waterloo Benchmarks for Graph Data Management | Tamer Özsu
10:45 am | What is BigBench | Michael Frank
11:30 am | Tuning and Optimizing an end to end benchmark - Our Experience tuning Big data Benchmark for BigBench Specification | Yi Zhou
12:00 pm | Lunch |
1:00 pm | Invited Presentation: PerfKit - Benchmarking the Cloud | Ivan Filho
1:45 pm | Lessons Learned: Developing a Standard | Raghunath Nambiar
2:10 pm | BigBench Evolution: Observations and Recommendations | John Poelman
2:35 pm | Performance Evaluation of Spark SQL using BigBench | Todor Ivanov
3:00 pm | Coffee break |
3:30 pm | Introduction to Discussion | Tilmann Rabl
3:45 pm | Breakout discussion |
5:15 pm | Consolidation |
6:30 pm | Dinner |

The dinner will be held at the Hart House Gallery Grill.

June 17

Workshop Day 2 @ Bahen Centre (Room BA 1180)

Start time | Title | Speaker
8:30 am | Breakfast |
9:00 am | Recap | Tilmann Rabl
9:15 am | Industrial Keynote: SAP HANA Platform Evolution from In Memory RDBMS to Enterprise Big Data Infrastructure | Anil Goel
10:15 am | Announcement of the First Big Data Benchmarking Challenge | Tilmann Rabl
10:30 am | Using BigBench to Evaluate an Automated Physical Design of Materialized Views | Jiang Du
10:45 am | Set of Metrics to Evaluate HDFS and S3 Performance on Amazon EMR with Avro and Parquet Formats | Zeev Lieber
11:10 am | Benchmarking the Availability and Fault Tolerance of Cassandra | Marten Rosselli
11:35 am | Big Data Benchmarking needs Big Metadata Generation | Boris Glavic
12:00 pm | Lunch |
1:00 pm | Invited Presentation: BigDataBench: An open-source Big Data Benchmark suite | Jianfeng Zhan
1:45 pm | From performance profiling to Predictive Analytics while evaluating Hadoop cost-effectiveness using ALOJA | Nicolas Poggi
2:10 pm | A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML | Josep Berral
2:35 pm | ALOJA-HDI: A characterization of cost-effectiveness of PaaS Hadoop on the Azure Cloud | Aaron Call
3:00 pm | Coffee break |
3:30 pm | SPEC Research Group on Big Data Meeting |
5:30 pm | Closing Remarks and Announcement of WBDB2015.in |

Detailed Program

Keynote Speakers

Tamer Özsu (University of Waterloo)

Title: Waterloo Benchmarks for Graph Data Management

Abstract: Recent advances in Information Extraction, Linked Data Management and the Semantic Web have led to a rapid increase in both the volume and the variety of graph-structured data that are publicly available. As businesses start to capitalize on graph-structured data, graph data management systems are being exposed to workloads that are far more diverse and dynamic than what they were designed to handle. Two general graph types can be identified, property graphs and RDF graphs, with different systems being proposed for the management and analysis of each. In addition to differences in graph structure, there are differences in the languages used to work with them: RDF has a well-defined standard language (SPARQL), while no commonly accepted language exists for property graphs. Although benchmarks have been proposed for graph processing systems, it is not clear whether they are suitable for realistic workloads. In this talk, we present two Waterloo benchmarks: WatDiv (Waterloo SPARQL Diversity Test Suite) for RDF/SPARQL processing and WGB (Waterloo Graph Benchmark) for property graph analytics. We discuss the shortcomings of existing benchmarks that motivated our designs, outline the specifics of these benchmarks, and demonstrate how they reveal issues that are not well exposed by existing benchmarks.

Bio: M. Tamer Özsu is Professor of Computer Science at the David R. Cheriton School of Computer Science, and Associate Dean (Research) of the Faculty of Mathematics at the University of Waterloo. His research is in data management, focusing on large-scale data distribution and management of non-traditional data. He is a Fellow of the Association for Computing Machinery (ACM) and of the Institute of Electrical and Electronics Engineers (IEEE), an elected member of the Science Academy of Turkey, and a member of Sigma Xi and the American Association for the Advancement of Science (AAAS). He currently holds a Cheriton Faculty Fellowship at the University of Waterloo.

Anil Goel (SAP)

Title: SAP HANA Platform Evolution from In Memory RDBMS to Enterprise Big Data Infrastructure [PDF]

Bio: Anil K Goel is a Chief Architect at SAP where he works with the globally distributed HANA Platform and Database engineering team to drive forward-looking architectures, vision, strategy and execution for all SAP data management products and technologies. He oversees data platform related co-innovation projects with hardware and software partners as well as collaborative research and internship programs with many universities in North America and Europe. His interests include database system architecture, in-memory and large-scale distributed computing, self-management of software systems and cost modelling.

Anil earned a PhD in Computer Science from the University of Waterloo. He also holds an M.Tech. in Computer Science from the Indian Institute of Technology, Delhi, and a B.E. in Electronics and Communications Engineering from the University of Delhi.

Invited Presentations

Ivan Filho (Google)

Title: PerfKit - Benchmarking the Cloud [PDF]

Abstract: The Google Cloud Performance team is responsible for the competitive analysis of Google Cloud products. The experiments the team runs generate PBs of data and tens of billions of data points that are analyzed in close to real time, both programmatically and through ad hoc queries. In this talk we will cover the problems the team faces benchmarking Google Cloud Platform, some of the solutions we adopted, and two of our tools, PerfKit Benchmarker and PerfKit Explorer, both recently open-sourced. We will also cover how the team uses BigQuery to handle massive amounts of data.

Bio: Ivan Santa Maria Filho is the technical lead of the Google Cloud Performance team. He is responsible for product profiling and improvements, and for handling high-profile customer engagements. Ivan works on the design and development of PerfKit Benchmarker, an industry effort to establish a cloud benchmarking suite, and PerfKit Explorer, an interactive dashboarding solution. In the past, Ivan did capacity planning for the Google fleet and shipped products like SQL Azure and Bing. He holds a BSc and an MSc in Computer Science.

Jianfeng Zhan (Chinese Academy of Sciences)

Title: BigDataBench: An open-source Big Data Benchmark suite [PDF]

Abstract: BigDataBench is an open-source big data benchmark suite resulting from a multi-discipline research and engineering effort spanning systems, architecture, and data management, from both industry and academia; it is publicly available from http://prof.ict.ac.cn/BigDataBench. This talk presents the big data benchmarking methodology, big data dwarf workloads, data sets, scalable data generation tools, workload subsetting, and multi-tenancy in BigDataBench. The talk will also introduce China’s first industry-standard big data benchmark suite, BigDataBench-DCA, released by our group together with major industry partners, including Telecom Research Institute Technology, Huawei, Intel (China), Microsoft (China), IBM CDL, Baidu, Sina, INSPUR, ZTE, and others. BigDataBench-DCA is a subset of BigDataBench and is available from http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/

Bio: Dr. Jianfeng Zhan is Full Professor and Deputy Director at the Computer Systems Research Center, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). His research work is driven by interesting problems. He enjoys building new systems and collaborating with researchers from different backgrounds. He founded the BPOE workshop, focusing on big data benchmarks, performance optimization and emerging hardware.

Regular Presentations

Tilmann Rabl (University of Toronto)

Title: Opening and Introduction [PDF]

Title: 1st WBDB Big Data Benchmarking Challenge [PDF]

Michael Frank (bankmark)

Title: What is BigBench [PDF]

Yi Zhou (Intel)

Title: Tuning and Optimizing an end to end benchmark - Our Experience tuning Big data Benchmark for BigBench Specification [PDF]

Abstract: Benchmarking Big Data systems is an open and challenging problem. Existing micro-benchmarks (e.g., TeraSort) do not represent end-to-end workload behavior under real-world conditions. To address this issue, a new end-to-end benchmark suite called BigBench, aimed at becoming an industry standard, has been created. In this presentation, we will share our contributions to BigBench, our tuning and optimization experience, and benchmark results based on Spark SQL.

 

Raghunath Nambiar (Cisco)

Title: Lessons Learned from Developing a Standard [PDF]

Abstract: This session will take a close look at the history of industry-standard benchmarks – successes, failures, those that came way too early or too late – and their relevance (and irrelevance) in today’s rapidly changing industry landscape driven by new generations of applications, services and systems.

John Poelman (IBM)

Title: BigBench Evolution: Observations and Recommendations [PDF]

Abstract: BigBench is a leading candidate to become an industry standard benchmark for Big Data analytics. BigBench is designed to represent a diverse range of queries based on realistic Big Data use cases. However, the measure of a good Big Data benchmark stretches beyond how comprehensive and representative its queries are. Important additional metrics include how easy the benchmark kit is to work with and how adaptable it is to work on specific Hadoop stacks with vendor specific extensions. In this paper, we will share our experience and lessons learned from adapting the in-development BigBench kit to run on the IBM Hadoop stack as both the kit and the stack have evolved. This includes an evaluation of the out-of-box readiness of the kit and sharing the quantitative metrics on the changes necessary to get the benchmark working, as well as observations and experience while tuning the performance. We will also describe the work needed to run BigBench with Spark and with Big SQL, IBM's SQL over Hadoop implementation.

Todor Ivanov (Goethe University Frankfurt)

Title: Performance Evaluation of Spark SQL using BigBench [PDF]

Abstract: In this paper we present the initial results of our work to run BigBench on Spark. First, we evaluated the data scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive results. Our experiments show that: (1) for both MapReduce and Spark SQL, BigBench queries scale on average better than linearly as the data size increases, and (2) pure HiveQL queries run faster on Spark SQL than on Hive.
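
For readers unfamiliar with the kind of setup being compared here, the sketch below shows, using the current PySpark API, how a HiveQL query can be submitted to Spark SQL with Hive support enabled. The table and query are hypothetical placeholders; this is not the BigBench kit or the paper's actual harness.

    # Minimal sketch: running a HiveQL-style query through Spark SQL with Hive
    # support enabled. Table name and query are hypothetical placeholders.
    import time
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hiveql-on-sparksql-sketch")
             .enableHiveSupport()          # read tables registered in the Hive metastore
             .getOrCreate())

    query = """
        SELECT category, COUNT(*) AS num_sales
        FROM store_sales
        GROUP BY category
        ORDER BY num_sales DESC
    """

    start = time.time()
    result = spark.sql(query)              # the same HiveQL text could be run through Hive for comparison
    result.show(10)
    print(f"Elapsed: {time.time() - start:.1f} s")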

Jiang Du (University of Toronto)

Title: Using BigBench to Evaluate an Automated Physical Design of Materialized Views [PDF]

Abstract: A major drawback of current scalable data analytics systems such as Hadoop and Spark is that they do not exploit the potential of reusing intermediate results among queries, which could significantly improve performance. Current materialization approaches ignore the physical data layout (e.g., partitioning) of materialized views, which is critical for the performance of such systems. DeepSea integrates automatic physical design (horizontal partitioning) into view maintenance. In particular, if deemed beneficial, it horizontally partitions a materialized view based on workload characteristics, i.e., parts of the view that are frequently accessed by queries are partitioned more fine-grained than parts that are accessed infrequently. In this work, we evaluate our workload-aware view creation and partitioning approach using BigBench, a widely accepted big data benchmark. BigBench is better suited for our evaluation than traditional benchmarks such as TPC-H, because we need queries with user-defined functions, which are popular in big data analytics, and fine-grained control over skew in the data.
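
As a toy illustration of the partitioning idea (not DeepSea's actual algorithm), the following sketch splits frequently accessed key ranges of a view into finer partitions while leaving rarely accessed ranges coarse; all names and thresholds are invented.

    # Toy illustration of workload-aware partitioning granularity (not DeepSea's
    # actual algorithm): key ranges of a materialized view that are accessed more
    # often are split into finer partitions. All thresholds are made up.
    from typing import List, Tuple

    Range = Tuple[int, int]   # half-open key range [lo, hi)

    def choose_partitions(key_range: Range, access_count: int,
                          hot_threshold: int = 1000, max_depth: int = 3) -> List[Range]:
        """Recursively split hot ranges; leave cold ranges coarse."""
        lo, hi = key_range
        if access_count < hot_threshold or max_depth == 0 or hi - lo <= 1:
            return [key_range]
        mid = (lo + hi) // 2
        # Assume accesses split evenly between the halves, purely for illustration.
        half = access_count // 2
        return (choose_partitions((lo, mid), half, hot_threshold, max_depth - 1) +
                choose_partitions((mid, hi), half, hot_threshold, max_depth - 1))

    # A frequently queried range ends up with 8 fine partitions,
    # while a rarely queried one stays as a single coarse partition.
    print(choose_partitions((0, 800), access_count=5000))
    print(choose_partitions((800, 1600), access_count=50))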

Zeev Lieber (Amazon)

Title: Set of Metrics to Evaluate HDFS and S3 Performance on Amazon EMR with Avro and Parquet Formats [PDF]

Abstract: We present a set of performance metrics that are easily obtainable by analyzing Hadoop log files, under the assumption that all nodes have their clocks synchronized using NTP or a comparable protocol. We further present a set of derived metrics, based on these raw metrics, that make it possible to judge different aspects of a system's performance on a scale of 0-100, where higher is better. These derived metrics detach the analysis from the particulars of specific hardware and networks and focus on the system's configuration, such as encoding, compression, scheduling, etc. At the same time, we use the original raw metrics to judge overall system performance, including that of the particular hardware and network.

We then apply this method to analyze the performance differences between HDFS and S3, using Avro and Parquet serialization on an Amazon EMR cluster. We present the insights obtained by using this methodology and conclude that while HDFS does provide performance benefits due to data locality, they may not be significant enough to justify the management challenges it presents.
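
As a hedged illustration of how a raw, log-derived timing can be mapped onto a 0-100 scale where higher is better, the snippet below scores the fraction of task time not spent waiting on I/O. The specific score definition and numbers are hypothetical, not the metrics defined in this talk.

    # Hedged illustration of turning raw timings extracted from Hadoop logs into
    # a derived 0-100 score (higher is better). This particular score is a
    # hypothetical example, not one of the talk's metrics.
    def io_efficiency_score(task_runtime_s: float, io_wait_s: float) -> float:
        """Score the fraction of task time NOT spent waiting on I/O, on a 0-100 scale."""
        if task_runtime_s <= 0:
            raise ValueError("task runtime must be positive")
        io_wait_s = min(max(io_wait_s, 0.0), task_runtime_s)   # clamp to a sane range
        return 100.0 * (1.0 - io_wait_s / task_runtime_s)

    # Example: a map task that ran for 120 s and spent 30 s waiting on S3 reads.
    print(io_efficiency_score(task_runtime_s=120.0, io_wait_s=30.0))   # 75.0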

Marten Rosselli (Goethe University Frankfurt)

Title: Benchmarking the Availability and Fault Tolerance of Cassandra [PDF]

Abstract: To be able to handle big data workloads, modern NoSQL database management systems like Cassandra are designed to scale well over multiple machines. However, with each additional machine in a cluster, the likelihood of hardware failure increases. In order to still achieve the high availability and fault tolerance required by many applications, the data needs to be replicated within the cluster.

In this paper, we evaluate the availability and fault tolerance of Cassandra. We analyze the impact of a node outage within a Cassandra cluster on the throughput and latency of different workloads. Our results show that update operations are more negatively affected by a node outage than read operations. For a workload consisting of concurrent read and update operations, the update operations are slowed down by the concurrent reads, which becomes especially evident after a node outage. Cassandra proved to be well suited to achieving high availability and fault tolerance. Especially for read-intensive applications that require high availability, Cassandra is a good choice.
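
For context, such an evaluation revolves around two settings: the keyspace replication factor and the per-query consistency level. The sketch below shows both using the DataStax Python driver; the contact point, keyspace, table, and three-node setup are placeholder assumptions, not the paper's actual test configuration.

    # Minimal sketch of the replication and consistency settings that govern how a
    # Cassandra cluster tolerates a node outage. Uses the DataStax Python driver;
    # keyspace/table names and the 3-node setup are placeholder assumptions.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement
    from cassandra import ConsistencyLevel

    cluster = Cluster(["127.0.0.1"])          # contact point of the test cluster
    session = cluster.connect()

    # Replicate every row to 3 nodes so a single node outage leaves 2 live replicas.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS bench
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # With QUORUM (2 of 3 replicas), reads and writes can still succeed after one
    # node goes down, at the price of higher latency.
    read = SimpleStatement("SELECT * FROM bench.usertable WHERE key = %s",
                           consistency_level=ConsistencyLevel.QUORUM)
    row = session.execute(read, ("user1",)).one()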

Boris Glavic (IIT Chicago)

Title: Big Data Benchmarking needs Big Metadata Generation [PDF]

Abstract: One of the characteristics typically attributed to Big Data is variety. While there exist scalable techniques for generating data at Big Data dimensions, which is essential for creating data to be used in benchmarking Big Data platforms, the metadata (structure) of this data is part of the input to such data generators. Thus, the variety of generated data is limited, because significant manual work has to be spent on designing this metadata. In this talk, I argue that a scalable metadata generator is needed to produce Big Data with a realistic amount of variety. Furthermore, it should be possible to automatically generate workloads based on the generated metadata. I will introduce iBench, a scalable metadata generator that was originally developed to support the evaluation of data integration systems. I identify three major extensions that will enable Big Data benchmarking with iBench, namely: 1) integration with a scalable data generator such as PDGF or Myriad, 2) automatic generation of workloads to match the generated metadata, and 3) support for semi-structured data.
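
As a toy sketch of what generating metadata at scale can look like (this is not iBench), the snippet below randomly produces table schemas that a data generator could be pointed at; making such metadata realistic, rather than merely plentiful, is the hard part the talk addresses.

    # Toy sketch of scalable metadata (schema) generation -- not iBench itself.
    # It randomly produces table definitions that a data generator could consume.
    import random

    COLUMN_TYPES = ["INT", "BIGINT", "DOUBLE", "VARCHAR(64)", "DATE"]

    def generate_schema(num_tables: int, max_columns: int = 12, seed: int = 42):
        rng = random.Random(seed)            # deterministic output for repeatable benchmarks
        tables = {}
        for t in range(num_tables):
            num_cols = rng.randint(2, max_columns)
            tables[f"table_{t}"] = [
                (f"col_{c}", rng.choice(COLUMN_TYPES)) for c in range(num_cols)
            ]
        return tables

    # 10,000 table definitions are produced in well under a second;
    # the challenge is making such metadata exhibit realistic variety.
    schemas = generate_schema(num_tables=10_000)
    print(len(schemas), schemas["table_0"][:3])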

Nicolas Poggi (Barcelona Supercomputer Center)

Title: From performance profiling to Predictive Analytics while evaluating Hadoop cost-effectiveness using ALOJA [PDF]

Abstract: The main goals of the ALOJA research project from BSC-MSR are to explore and automate the characterization of the cost-effectiveness of Big Data deployments. The development of the project over its first year has resulted in an open-source benchmarking platform, an online public repository of results with over 42,000 Hadoop job executions, and web-based analytic tools to gather insights about system cost-performance. This article describes the evolution of the project's focus and research lines over a year of continuously benchmarking Hadoop under different configuration and deployment options, and discusses the technical and market-based motivation for these changes. During this time, ALOJA's target has evolved from low-level profiling of the Hadoop runtime, through extensive benchmarking and aggregation-based evaluation of a large body of results, to leveraging Predictive Analytics (PA) techniques. Our ongoing efforts in PA show promising results for automatically modeling system behavior for knowledge discovery and for forecasting the cost-effectiveness of newly defined systems, saving benchmarking time and costs.

Josep Berral (Barcelona Supercomputer Center)

Title: A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML [PDF]

Abstract: In this article we present a case study on the capabilities of automatic modeling of Hadoop benchmarks using ALOJA-ML, the machine learning tool from the ALOJA project. The ALOJA project is an open, vendor-neutral repository of Hadoop executions, currently featuring more than 40,000 executions of the Intel HiBench suite. Modeling benchmark executions allows us to estimate the results of new or untested configurations or hardware set-ups, but doing this manually can be difficult due to the complexity of Hadoop run-time environments. For this reason we apply machine learning techniques to obtain models from past executions, predict new executions, and discover knowledge from the created models. Here we present first experiences in modeling benchmarks both with a general model and individually, and we also present an algorithm that shows how software and hardware variables affect Hadoop executions according to the learned model.
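
As a rough sketch of this kind of modeling, the snippet below fits a regression model on past executions and predicts an untested configuration. The feature names and values are hypothetical toy data, not ALOJA's actual schema or results, and the model choice is only one of many that could be used.

    # Hedged sketch of the kind of model ALOJA-ML builds: learn execution time from
    # configuration/hardware features of past runs, then predict untested set-ups.
    # Features and values here are hypothetical, not ALOJA's actual data.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    # Each row describes one past Hadoop execution (toy data).
    runs = pd.DataFrame({
        "num_nodes":   [4, 4, 8, 8, 16, 16, 8, 4],
        "disk_ssd":    [0, 1, 0, 1, 0, 1, 1, 0],   # 1 = SSD, 0 = rotational
        "net_10gbit":  [0, 0, 1, 1, 1, 1, 0, 1],
        "compression": [0, 1, 1, 0, 1, 0, 1, 0],
        "exec_time_s": [940, 610, 480, 420, 300, 260, 520, 870],
    })

    X = runs.drop(columns="exec_time_s")
    y = runs["exec_time_s"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("MAE on held-out runs:", mean_absolute_error(y_test, model.predict(X_test)))

    # Predict an untested configuration (16 SSD nodes, 10 Gbit, no compression).
    print(model.predict(pd.DataFrame([{"num_nodes": 16, "disk_ssd": 1,
                                       "net_10gbit": 1, "compression": 0}])))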

 

Aaron Call (Barcelona Supercomputer Center)

Title: ALOJA-HDI: A characterization of cost-effectiveness of PaaS Hadoop on the Azure Cloud [PDF]

Abstract: HDInsight is a Big Data Platform-as-a-Service (PaaS) solution from Microsoft Azure. In this work we present an extension of the ALOJA benchmarking project to characterize cost-effectiveness and performance, in this case of the HDInsight Hadoop service.

We analyse the behaviour of HDInsight Hadoop through benchmarking, measure the efficiency of the service, and compare execution times and costs for different cluster sizes. Finally, we define a metric for how cost-effective the different deployed clusters are and present the results.
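
One common way to formulate such a metric, not necessarily the exact definition used in this work, is the cost of a single benchmark run, i.e. execution time multiplied by the cluster's hourly price, as sketched below with hypothetical numbers.

    # One common way to express the cost-effectiveness of a cluster configuration
    # (not necessarily this work's exact metric): the cost of one benchmark run,
    # i.e. execution time multiplied by the cluster's hourly price.
    def cost_per_run(exec_time_s: float, hourly_price_usd: float) -> float:
        return (exec_time_s / 3600.0) * hourly_price_usd

    # Hypothetical example: a cluster that is twice as fast but costs 1.8x as
    # much per hour is still the more cost-effective choice for this workload.
    print(cost_per_run(exec_time_s=1200, hourly_price_usd=8.0))    # ~2.67 USD
    print(cost_per_run(exec_time_s=600,  hourly_price_usd=14.4))   # ~2.40 USD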