
General Questions

What is Vertascale?

Data stored in Hadoop is difficult to access, discover and analyze, especially for non-technical users. Vertascale solves this problem with an intuitive desktop application that allows users to access, discover and analyze data stored in a number of systems, including Hadoop and Amazon S3. Vertascale provides a highly tactile query and analysis application that empowers users to rapidly identify, segment and analyze large datasets, and then easily dashboard, sample and export the results. Vertascale dramatically expedites access, visibility and time-to-answer on big data.

Who should use Vertascale?

Vertascale is designed to provide real-time insights for data professionals, including business analysts, data scientists and engineers. Vertascale also supports detail-level queries suitable for QA and support personnel. In short, Vertascale suits any user within an organization who wants to better understand data stored in S3, HDFS or other supported file systems.

How do I install the Vertascale application?

Any user can install and run the Vertascale application directly on a Windows or Mac desktop computer. Users will immediately be able to explore data on their local machine and connect to remote systems, such as Amazon S3 or Hadoop (HDFS), to explore data that resides in those systems.

What file formats does Vertascale support?

The Vertascale application allows users to view and interact with data stored in any of the following formats:

  • Text (CSV, TSV, custom delimiters, etc.)
  • JSON
  • ZIP
  • GZIP
  • XML (coming soon)
  • Hadoop Binary format (coming soon)

Do I need to be a programmer to use the Vertascale application?

No. The Vertascale application installs on a user's desktop machine and allows you to interact with data files across systems using a familiar "explorer" type interface. Most functions are accessible directly from the UI, with no programming knowledge required. For more advanced users, the application provides full support for writing more complex functions in Map Reduce directly from the UI.

How does Vertascale work with Excel?

Excel is a great application for analyzing smaller datasets, typically less than 100MB. Vertascale allows users to connect to much larger datasets that may be stored on remote systems and in formats not natively supported by Excel (for example, JSON). Using Vertascale, data professionals can discover, query, analyze and segment data in place, and then export subsets for further analysis using Excel, R or other familiar tools.

Do I need to be running Hadoop to use Vertascale?

No. The Vertascale application connects to a number of different data sources, including Hadoop (HDFS), Amazon S3 and FTP, right from your desktop. Users can browse, query, analyze and export data in any supported format, even across systems.

What does it mean when you say "Vertascale operates on data in place"?

With Vertascale, there are no expensive "add-on" systems to install and no complex ETL processes to manage between systems. Vertascale allows users to preview, query and analyze data stored in HDFS, Amazon S3 and other supported systems in place, directly from their desktop computer.

How is Vertascale priced?

The Vertascale application is FREE to use in Beta!

Big Data Concepts and Challenges

What is "Top N" Summary Analysis?

Summary Analysis is the ability to identify and summarize millions of scattered records, instantly and concisely. Vertascale accomplishes this by boiling down huge datasets into powerful summaries. Vertascale's query capability is unique in its ability to provide a real-time Summary Analysis that updates in response to any user query. By doing so, Vertascale greatly reduces "time-to-insight" on large datasets and eliminates the batch-processing cycle that is typical in the Hadoop environment. Vertascale also provides more traditional "record" retrieval for detailed Boolean queries (similar to a SQL query).

Isn't Map Reduce the Silver Bullet?

Map Reduce is a powerful batch data processing technique, popularized by Hadoop, but it suffers from a few drawbacks. Due to its "brute force" batch processing nature, Map Reduce jobs must churn through all of the data in order to perform analysis. As data volumes grow, analysis times increase. In general, Map Reduce does not support real-time interaction for data analysis. Rather, the analyst typically works with a programmer to create and run a job, reviews the results, and then modifies the job as needed. This process can take hours to days. While frameworks like Hive make it easier for analysts to create Hadoop jobs, they do nothing to reduce the execution time it takes to find answers.
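
To make the batch nature concrete, here is a minimal single-machine sketch of the map and reduce pattern in Python. It is only an illustration of the general technique, not Hadoop's API or anything Vertascale ships; the function names (`map_phase`, `reduce_phase`, `run_job`) and the sample records are hypothetical. Note that each new question is a new job that visits every record again.

```python
from collections import defaultdict

def map_phase(records, key_field):
    """Emit a (key, 1) pair for every record; every record is visited."""
    for record in records:
        yield record[key_field], 1

def reduce_phase(pairs):
    """Sum the emitted counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def run_job(records, key_field):
    """One 'job': a full pass over the data for a single question."""
    return reduce_phase(map_phase(records, key_field))

# Hypothetical sample data standing in for a large log file.
logs = [
    {"city": "Boston", "os": "iOS"},
    {"city": "Austin", "os": "Android"},
    {"city": "Boston", "os": "Android"},
]

city_counts = run_job(logs, "city")  # first full scan
os_counts = run_job(logs, "os")      # a second question means a second full scan
```

On three records the cost is invisible, but because every job rescans the whole input, the time per question grows with the size of the dataset.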

What are "Haystack-in-Haystack" queries and does Vertascale support them?

Traditionally, database searches have focused on finding a needle in a haystack, i.e., a few records matching a specific set of criteria. However, data volumes have rapidly increased, so that today, analysts, engineers and data scientists often need to determine and quantify larger segments of data within a corpus, i.e., the haystack-in-haystack.

For instance, given a huge volume of mobile application activity logs, an analyst might want to know:

  • What are the top 10 cities my users reside in?
  • Which operating systems are most used by my clients?
  • Which activities are my users performing most?

Each of the questions above is a request not for individual records, but for Summary Analysis over potentially huge portions of the entire dataset. In the traditional database world, answering such questions would typically involve complex SQL queries or layering a Business Intelligence tool on top of a relational database. However, these methods have failed to scale up to today's data volumes and storage needs, hence ushering in the era of Hadoop and batch data processing with Map Reduce. While Map Reduce can answer these types of queries, its batch mode of operation is time consuming, with each query churning through all of the available data to generate an answer.

Vertascale's unique query and Summary Analysis capability supports Haystack-in-Haystack queries on very large datasets.
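
As a rough illustration of what a "Top N" summary computes, the sketch below counts the most frequent values of one field, optionally restricted by a filter so the summary changes with the query. It is a plain Python sketch under stated assumptions, not Vertascale's implementation; `top_n`, its `where` parameter and the sample `activity_log` are hypothetical names introduced here.

```python
from collections import Counter

def top_n(records, field, n=10, where=None):
    """Return the n most frequent values of `field` as (value, count) pairs.

    `where` is an optional predicate; when given, only matching records
    are counted, so the summary reflects the current query.
    """
    counts = Counter(
        r[field]
        for r in records
        if field in r and (where is None or where(r))
    )
    return counts.most_common(n)

# Hypothetical activity log records.
activity_log = [
    {"city": "Boston", "os": "iOS"},
    {"city": "Boston", "os": "Android"},
    {"city": "Boston", "os": "iOS"},
    {"city": "Austin", "os": "Android"},
    {"city": "Austin", "os": "Android"},
    {"city": "Denver", "os": "iOS"},
]

top_cities = top_n(activity_log, "city", n=2)
android_cities = top_n(
    activity_log, "city", n=2, where=lambda r: r["os"] == "Android"
)
```

Here `top_cities` answers "which cities do my users reside in most?", and `android_cities` answers the same question for Android users only.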

Does Vertascale support "Needle-in-Haystack" queries?

Yes. Vertascale supports queries that drill down to any level of precision, including specification of a single record. This is accomplished by providing the analyst with a Boolean query language (familiar to SQL users). Vertascale can also export the results of queries, even if the results contain millions or billions of records, by writing them to HDFS or S3.

Precise queries are required for a number of use cases, including:

  • Audit
  • Support inquiry
  • Bill generation
  • Debugging
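
The idea of a precise Boolean query followed by an export can be sketched in Python as follows. Vertascale's actual query syntax is not shown in this FAQ, so the predicate here is ordinary Python, and `select`, `export_csv`, `is_match` and the sample `events` are hypothetical. Results are streamed rather than collected in memory, and an in-memory buffer stands in for a destination such as HDFS or S3.

```python
import csv
import io

def select(records, predicate):
    """Lazily yield only the records for which the Boolean predicate holds."""
    return (r for r in records if predicate(r))

def export_csv(records, fields, out):
    """Stream records to a file-like object as CSV and return the row count."""
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count

# Hypothetical event records.
events = [
    {"user_id": "u1", "activity": "login", "os": "iOS"},
    {"user_id": "u42", "activity": "purchase", "os": "Android"},
    {"user_id": "u42", "activity": "login", "os": "Android"},
]

def is_match(r):
    # Roughly: user_id = 'u42' AND activity = 'purchase' AND NOT os = 'iOS'
    return (
        r["user_id"] == "u42"
        and r["activity"] == "purchase"
        and not r["os"] == "iOS"
    )

buffer = io.StringIO()  # stand-in for a remote export target
exported = export_csv(select(events, is_match), ["user_id", "activity", "os"], buffer)
```

Because `select` returns a generator, the export touches one record at a time, which is the property that matters when a result set is very large.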

What is the "I don't know what I don't know" problem?

In a traditional RDBMS with a well-defined schema, limited data volumes and consistent ETL, users have a good sense of what data is available and where to find it. However, with large data volumes stored in mixed-structure and schema-less designs, users often can't identify what's in their corpus or even where answers to specific questions reside. This is the "I don't know what I don't know" problem.

For example, if an analyst needed to determine how many users are in New England, he would immediately be confronted with several preliminary questions that would need to be answered before he could proceed:

  • Does the data include a column for the US State? If so, in what format?
  • Or, is the data collected by Latitude and Longitude?
  • Perhaps the data includes only a city name...not sure...

Vertascale's Summary Analysis feature solves the "I don't know what I don't know" problem by illuminating each column of the data, before the analyst even has to formulate a first query. 
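
One way to picture "illuminating each column" is a per-column profile computed before any query is written. The sketch below is an assumption about what such a profile might contain (how many records carry the column, how many distinct values it has, and its most common values); it is not Vertascale's implementation, and `profile_columns` and the sample `users` records are hypothetical.

```python
from collections import Counter, defaultdict

def profile_columns(records, top=3):
    """Summarize every column seen in a list of schema-less records."""
    values = defaultdict(Counter)
    for record in records:
        for column, value in record.items():
            values[column][value] += 1
    return {
        column: {
            "present_in": sum(counter.values()),  # records that have the column
            "distinct": len(counter),             # number of distinct values
            "top_values": counter.most_common(top),
        }
        for column, counter in values.items()
    }

# Hypothetical mixed-structure user records.
users = [
    {"user": "a", "state": "MA"},
    {"user": "b", "state": "MA"},
    {"user": "c", "lat": 43.66, "lon": -70.26},
    {"user": "d", "city": "Hartford"},
]

profile = profile_columns(users)
```

In this toy data the profile shows that location is recorded three different ways (`state`, `lat`/`lon` and `city`), which is exactly what the analyst in the New England example needs to know before writing a first query.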
