Snowflake announces the public preview of Snowpark Connect for Spark. The new architecture enables Apache Spark code to run directly in Snowflake warehouses without maintaining separate Spark clusters.
Until now, many Snowflake customers have used the Spark Connector to process Snowflake data with Spark code. However, that approach moves data out of Snowflake, adding cost, latency, and governance complexity.
Snowpark Connect removes these issues by executing the processing directly in Snowflake. This makes data movement unnecessary and reduces latency, while governance stays unified on a single platform.
The solution works with Apache Iceberg tables, including externally managed Iceberg tables and catalog-linked databases. Organizations can leverage the power of the Snowflake platform without moving data or rewriting Spark code.
Tip: Snowflake moves further into open data via Apache Iceberg updates
Spark Connect as a foundation
Spark Connect, introduced with Apache Spark 3.4, is a client-server architecture that decouples user code from the Spark cluster that executes it. This separation forms the foundation of Snowpark Connect.
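To illustrate the pattern: under Spark Connect, a thin client builds a query plan locally and ships it to a remote server for execution. A minimal PySpark sketch follows; the endpoint `sc://localhost:15002` is Spark Connect's default local port and is used here purely as a placeholder (Snowpark Connect points the same client at a Snowflake-hosted endpoint instead of a self-managed cluster).

```python
# Placeholder endpoint: Spark Connect's default local port. With Snowpark
# Connect, a Snowflake-hosted endpoint takes this role.
CONNECT_URL = "sc://localhost:15002"


def run_remote_count() -> int:
    """Connect to a Spark Connect server and run a trivial job.

    Requires pyspark (3.4+) on the client and a reachable server; the
    import is kept local so merely defining this sketch needs neither.
    """
    from pyspark.sql import SparkSession

    # The client builds the DataFrame plan locally and sends it over gRPC;
    # execution happens entirely on the server side.
    spark = SparkSession.builder.remote(CONNECT_URL).getOrCreate()
    return spark.range(100).count()
```

Because the client holds no JVM and no cluster state, swapping the execution backend only means changing the endpoint, which is exactly what Snowpark Connect exploits.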
The new solution eliminates the complexity of managing separate Spark environments. Organizations no longer have to struggle with dependencies, version compatibility, and Spark infrastructure upgrades.
Performance and cost benefits
Snowflake claims significant benefits for customers using Snowpark: on average, 5.6 times faster performance than managed Spark solutions, along with 41 percent cost savings.
With Snowpark Connect, organizations get these benefits without rewriting their existing Spark code. The solution supports the modern Spark DataFrame API, Spark SQL, and user-defined functions (UDFs). Snowflake's elastic compute runtime with virtual warehouses provides automatic performance tuning and scaling.
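As an illustration of those supported surfaces, the sketch below exercises the DataFrame API, Spark SQL, and a Python UDF. It assumes an already-created session `spark`, obtained exactly as in any standard PySpark 3.5 program; the function and its sample data are illustrative, not part of any Snowflake API.

```python
def supported_surfaces_demo(spark):
    """Exercise the three surfaces the article names as supported.

    `spark` is an active SparkSession; imports are kept local so this
    sketch can be defined without pyspark installed.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # DataFrame API: construct and filter a small frame.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])
    filtered = df.filter(df.id > 1)

    # Spark SQL: query the same data through a temporary view.
    df.createOrReplaceTempView("t")
    counted = spark.sql("SELECT COUNT(*) AS n FROM t")

    # Python UDF: applied row-wise wherever the engine executes it.
    shout = udf(lambda s: s.upper(), StringType())
    tagged = df.withColumn("tag_upper", shout(df.tag))
    return filtered, counted, tagged
```

The point of "no rewrite" is that code like this runs unchanged; only the session's endpoint differs.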
Current limitations
Snowpark Connect currently supports only Spark 3.5.x versions and is limited to Python environments. Java and Scala support is in development.
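A client can guard against unsupported versions up front. The helper below is a hypothetical illustration (not part of Snowpark Connect) of checking a Spark version string against the supported 3.5.x range:

```python
def is_supported_spark_version(version: str) -> bool:
    """Return True for Spark 3.5.x, the only line the preview supports."""
    parts = version.split(".")
    return len(parts) >= 2 and parts[:2] == ["3", "5"]


print(is_supported_spark_version("3.5.1"))  # True
print(is_supported_spark_version("3.4.0"))  # False
```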
Key Spark functionality such as the RDD, Spark ML, MLlib, Streaming, and Delta APIs is not yet part of Snowpark Connect. Semantic differences may also exist between the supported APIs and standard Spark implementations.
The solution is available through various clients, including Snowflake Notebooks, Jupyter notebooks, Snowflake stored procedures, VSCode, Airflow, and Snowpark Submit.