Big Data: Reference Datasets


The rapid growth in the velocity, volume, and variety of data poses an increasingly daunting challenge to big data analytic tools. According to the IDC report "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," only 0.5% of the 643 exabytes of useful data have been analyzed. As the volume grows to 40,000 exabytes by 2020, about a third of it, or 13,000 exabytes, would be of value for analytics. Our vision is to help analytic tool vendors, R&D establishments, data scientists, and end users scale up to meet this big data challenge by:

  1. Providing a platform with a variety of big datasets, both structured and unstructured.
  2. Providing standardized performance metrics for both hardware and software.
  3. Providing a combination of real and synthetic datasets.
  4. Providing datasets classified by type or genre, following a taxonomy designed to serve specialized algorithms.
  5. Enabling users to run their analytic tools on the datasets and measure those tools against the standard performance metrics.
  6. Providing tools that present analytic results visually so they are easier to understand.