Apache Spark Integration

Apache Spark is a fast, general-purpose cluster computing system. MashZone NextGen Explorer version 10.0 ships with Apache Spark and integrates it to process large CSV and JSON source files and to execute queries on those data sources. XML source files are currently not supported. For larger source files, query execution time is lower than with the default MashZone NextGen Explorer engine, so overall performance benefits in those cases. As a guideline, data sets with 500,000 lines or more should be processed using the integrated Spark engine.
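
To check a source file against the 500,000-line guideline before choosing an engine, a small script along the following lines may help. This is only a sketch: the function names and the constant are hypothetical and are not part of MashZone NextGen Explorer. It counts physical lines, so a CSV file with line breaks inside quoted fields will report slightly more lines than it has records.

```python
# Hypothetical helper, not part of MashZone NextGen Explorer.
# The threshold is the guideline value quoted in the text.
SPARK_THRESHOLD_LINES = 500_000


def count_lines(path):
    """Count physical lines in a file without loading it into memory."""
    with open(path, "rb") as source:
        # Iterating a binary file yields one item per line, including a
        # final line that has no trailing newline.
        return sum(1 for _ in source)


def should_use_spark(path, threshold=SPARK_THRESHOLD_LINES):
    """Return True when the file meets or exceeds the line guideline."""
    return count_lines(path) >= threshold
```

For example, should_use_spark("sales.csv") returns True only if the (hypothetical) file sales.csv has at least 500,000 lines.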

By default, the MashZone NextGen Explorer installation runs the Visual Analytics Server without Apache Spark integration. To enable Apache Spark, perform the following steps.

3. To run MashZone NextGen Explorer with Spark, open a command-line program and run <MashZone NextGen Explorer installation>\bin\vaserver --embeddedSpark
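
If the server is started from a script rather than by hand, the launch command can be assembled as in the sketch below. Only the bin\vaserver path and the --embeddedSpark option come from the step above; the function names and the example installation directory are hypothetical, and the sketch assumes a Windows host where cmd.exe can resolve vaserver.

```python
import subprocess
from pathlib import PureWindowsPath


def build_vaserver_command(install_dir):
    """Build the launch command for a given installation directory."""
    executable = PureWindowsPath(install_dir) / "bin" / "vaserver"
    # --embeddedSpark is the option named in the step above.
    return [str(executable), "--embeddedSpark"]


def start_with_spark(install_dir):
    """Start the server; assumes Windows, where cmd.exe resolves vaserver."""
    # shell=True lets cmd.exe locate a script such as a .bat or .cmd file.
    return subprocess.Popen(build_vaserver_command(install_dir), shell=True)
```

Calling start_with_spark with the actual installation directory starts the server as a child process; build_vaserver_command can be used on its own to log or inspect the command first.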