Xonai Accelerator

This page introduces the Xonai Accelerator, a next-generation solution to deliver hardware acceleration for Apache Spark jobs with zero application code changes.


Redefining Big Data: Up to 80% Reduced Execution Time

Xonai built a technology that taps into the full potential of data processing hardware and integrated it with the industry-leading Big Data analytics solution: Apache Spark.

The Xonai Accelerator redefines the approach to optimizing petabyte-scale ETL. It is not an engine that simply processes individual operations faster with optimized implementations of each one, but a solution designed from the ground up to eliminate the many layers of indirection within Big Data pipelines by using a custom compiler to combine operations efficiently into a single unit. This exposes opportunities to run petabyte-scale ETL with far fewer data infrastructure resources.

This section describes the cornerstone features behind reducing the resource utilization of petabyte-scale ETL by up to 80%.

Generalized Operation Fusion

Operations (e.g. Filters, Aggregates, Joins) are represented in a canonical loop format within our DSL to facilitate combining them into a single, efficient kernel. This removes inefficiencies inherent to running multiple individual operations in sequence, which typically move data and cannot be co-optimized with other operations as a whole.

The benefit of this core optimization process is particularly pronounced in complex stages with many operations, as these are transformed into very few vectorized loops.
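As a rough illustration of the idea, and not of the accelerator's actual DSL or compiler output, the sketch below contrasts a filter, a projection, and an aggregate run as three separate passes with the same logic fused into one loop. The `Row` type and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # hypothetical two-column record: (key, value)


def unfused(rows: Iterable[Row]) -> Dict[str, int]:
    """Filter, project, and aggregate as three separate passes."""
    # Pass 1: the filter materializes an intermediate list.
    filtered: List[Row] = [r for r in rows if r[1] > 0]
    # Pass 2: the projection materializes a second intermediate list.
    projected: List[Row] = [(k, v * 2) for k, v in filtered]
    # Pass 3: the aggregate reads the projected rows.
    totals: Dict[str, int] = {}
    for k, v in projected:
        totals[k] = totals.get(k, 0) + v
    return totals


def fused(rows: Iterable[Row]) -> Dict[str, int]:
    """The same three operations combined into one loop, with no intermediates."""
    totals: Dict[str, int] = {}
    for k, v in rows:
        if v > 0:                                 # filter
            totals[k] = totals.get(k, 0) + v * 2  # projection + aggregate
    return totals
```

Both functions return the same result, but the fused version touches each row once and never moves data between operations, which is the inefficiency operation fusion is meant to remove.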

Real-time Resource Calibration

The Xonai Accelerator learns from resource allocation patterns while the application is executing to predict the right amount of resources that should be preallocated for several running operations. This results in faster execution as reallocations are minimized or even eliminated in subsequent partitions.
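A minimal sketch of this kind of calibration follows, assuming a simple "remember the largest capacity seen so far" heuristic. The accelerator's actual prediction logic is not described here, and `BufferCalibrator` and `process_partition` are hypothetical names used only for illustration.

```python
class BufferCalibrator:
    """Remembers the largest buffer capacity needed so far (hypothetical helper)."""

    def __init__(self, initial_capacity: int = 16) -> None:
        self.predicted_capacity = initial_capacity

    def observe(self, capacity: int) -> None:
        # Assumed heuristic: keep the maximum capacity any partition required.
        self.predicted_capacity = max(self.predicted_capacity, capacity)


def process_partition(rows, calibrator: BufferCalibrator) -> int:
    """Copy rows into a growable buffer and return how many times it had to grow."""
    capacity = calibrator.predicted_capacity
    buffer = [None] * capacity  # preallocate using the current prediction
    reallocations = 0
    size = 0
    for row in rows:
        if size == capacity:
            # Buffer is full: double it, which counts as one reallocation.
            capacity *= 2
            buffer.extend([None] * (capacity - len(buffer)))
            reallocations += 1
        buffer[size] = row
        size += 1
    calibrator.observe(capacity)  # feed the outcome back for later partitions
    return reallocations
```

With an initial capacity of 16, a first partition of 100 rows grows the buffer three times, while a second partition of the same size starts at the learned capacity and needs no reallocation.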

Fast Cache Serializer

The Xonai Accelerator enables by default a fast cache serializer measured to be up to 6x faster than the default Spark cache mechanism. As in Spark, the lz4 compression scheme is used by default, but new zstd and uncompressed schemes are also available.

The Xonai uncompressed cache mechanism delivers the best performance at the cost of more memory. However, the memory optimizations described in this article may reduce memory usage enough for the uncompressed scheme to be enabled without increasing the total executor memory.

High-Performance Aggregations

The Xonai Accelerator is equipped with optimizations to deal with very resource-intensive applications aggregating hundreds of billions of rows. The primary mechanisms that make this possible are summarized in this subsection.

Guaranteed Vectorization

The aggregation is designed to always process data in a vectorized loop generated by the compiler. The unified program representation sets the aggregate apart, as other operations in the same Spark stage can be pipelined into the aggregate at instruction granularity. This significantly accelerates aggregations compared to the original application.

Faster Shuffle Exchange

Problem: The process of exchanging data between partitions is costly both in execution time and memory resources, often representing a significant fraction of the total execution time.

The Xonai Accelerator incorporates a fast, cache-efficient process for repartitioning and sorting data at the machine instruction level while aggregating. This results in a significantly faster shuffle exchange operation as all CPU-bound parts of the process are removed. The JVM memory allocation for the shuffle exchange becomes negligible as data coming out of the aggregate is ready to be serialized directly to the network.

In customer applications processing nearly a petabyte of data a day, this process alone was measured to cut execution time by nearly 20% on top of already accelerated Spark jobs.
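The general idea of repartitioning while aggregating can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the accelerator's machine-level implementation: `aggregate_and_partition` is a hypothetical name, and `key % num_partitions` stands in for a real hash partitioner.

```python
from typing import Dict, Iterable, List, Tuple


def aggregate_and_partition(
    rows: Iterable[Tuple[int, int]], num_partitions: int
) -> List[List[Tuple[int, int]]]:
    """Sum values per key, placing each key directly in its target partition."""
    buckets: List[Dict[int, int]] = [{} for _ in range(num_partitions)]
    for key, value in rows:
        # Assumption: modulo on an integer key stands in for a hash partitioner.
        bucket = buckets[key % num_partitions]
        bucket[key] = bucket.get(key, 0) + value
    # Each bucket is emitted sorted by key, so no separate repartition or sort
    # pass is needed before the data is serialized for its target partition.
    return [sorted(bucket.items()) for bucket in buckets]
```

Because every aggregated row already sits in the bucket of its destination partition, the output of the aggregate can be handed to serialization without an intermediate copy.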

Reduced Spilling

Problem: Data is spilled to disk when it cannot fit in memory while aggregating (or during other operations), resulting in a costly increase in the total execution time and storage requirements.

The Xonai Accelerator can aggregate in chunks that dynamically fit the executor memory capacity, resulting in zero or negligible residual spilling for most aggregations. The process of aggregating in chunks activates on demand and has no additional computational overhead, but can potentially shuffle a few extra rows. However, the increase in shuffled rows has never been measured above 10% in real applications, which is several orders of magnitude better than spilling data to disk.

In complex and memory-intensive customer applications, execution time was measured to drop by up to 50% as a result of preventing terabytes of spilled data.
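The trade-off between spilling and shuffling a few extra rows can be illustrated with a small sketch. It assumes a fixed group limit (`max_groups`) in place of the dynamic memory-based sizing described above, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def aggregate_in_chunks(
    rows: Iterable[Tuple[str, int]], max_groups: int
) -> Iterator[Tuple[str, int]]:
    """Yield partial sums, flushing whenever a new key would exceed max_groups."""
    table: Dict[str, int] = {}
    for key, value in rows:
        if key not in table and len(table) == max_groups:
            # Table is full: emit partial aggregates downstream instead of
            # spilling them to disk, then start a fresh chunk.
            yield from table.items()
            table.clear()
        table[key] = table.get(key, 0) + value
    yield from table.items()


def merge_partials(partials: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Final aggregation on the receiving side of the shuffle."""
    totals: Dict[str, int] = {}
    for key, value in partials:
        totals[key] = totals.get(key, 0) + value
    return totals
```

In this sketch a key that appears in more than one chunk is shuffled more than once as a partial result, which is the source of the extra shuffled rows; the final merge still produces the same totals as an unbounded aggregation.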


Compatibility Guarantees

The Xonai Accelerator achieves bit-by-bit compatibility with data transformation results produced by any supported Spark runtime, and its entire test suite is designed to enforce this by comparing results for every supported operation. Any operation that is not bit-by-bit compatible is disabled by default and falls back to the default Spark execution engine unless explicitly enabled via configuration properties. This principle extends to all Spark decision-making processes behind generating query plans, which are not affected by the accelerator as the default execution model is to simply execute the generated plans faster.
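To show why bit-by-bit comparison is stricter than ordinary equality, here is a small sketch that compares floating-point columns by their IEEE 754 encoding. It is an illustration only and not taken from the Xonai test suite; `bits` and `bitwise_equal` are hypothetical helpers.

```python
import struct
from typing import Sequence


def bits(value: float) -> bytes:
    """Big-endian IEEE 754 binary64 encoding of a float."""
    return struct.pack(">d", value)


def bitwise_equal(expected: Sequence[float], actual: Sequence[float]) -> bool:
    """True only if both columns have the same length and identical bit patterns."""
    return len(expected) == len(actual) and all(
        bits(e) == bits(a) for e, a in zip(expected, actual)
    )
```

Under this check, a sum evaluated in a different order, such as `(0.1 + 0.2) + 0.3` versus `0.1 + (0.2 + 0.3)`, counts as a mismatch, as does `0.0` versus `-0.0` even though the two compare equal with `==`.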

Same Environment

Additionally, the Xonai Accelerator does not require changes to application code, hardware, or the operating environment. The accelerator has no direct interaction with the underlying cluster manager (e.g. YARN or Kubernetes), as from the cluster manager's perspective Spark is only completing tasks faster. Any dynamic resource management feature, such as dynamic allocation, simply allocates fewer resources as a result of faster execution.


Deployment

The Xonai Accelerator can be deployed to multiple Spark runtimes or data platforms without requiring application code or environment changes. Find out more in our deployment guides.