<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Practical Software]]></title><description><![CDATA[Tech articles focused on improving software quality and helping engineers through best practices, real-world examples, and insights for writing cleaner, more reliable, and maintainable code.]]></description><link>https://practical-software.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1747489025820/20a64b81-0983-4e4b-841a-56274d2a9ea7.png</url><title>Practical Software</title><link>https://practical-software.com</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 13 Apr 2026 22:26:47 GMT</lastBuildDate><atom:link href="https://practical-software.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Discover the Latest Features in Apache Spark 4.0]]></title><description><![CDATA[Apache Spark 4.0 is a major update, bringing new APIs, performance improvements, and a more modular design. Here's a summary of what's new and why it matters. 
In this post, we’ll look at the most exciting new features in Apache Spark 4.0, what they m...]]></description><link>https://practical-software.com/discover-the-latest-features-in-apache-spark-40</link><guid isPermaLink="true">https://practical-software.com/discover-the-latest-features-in-apache-spark-40</guid><category><![CDATA[#apache-spark]]></category><category><![CDATA[Apache Spark Features]]></category><category><![CDATA[Apache Spark Associate Developer]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Fri, 27 Jun 2025 18:55:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750370234475/a1e8762b-7274-42df-badd-f1ea3e47f10f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark 4.0 is a major update, bringing new APIs, performance improvements, and a more modular design. Here's a summary of what's new and why it matters. In this post, we’ll look at the most exciting new features in Apache Spark 4.0, what they mean for developers and data engineers, and how they prepare Spark for the future.</p>
<h2 id="heading-1-spark-connect-a-new-client-server-protocol">1. Spark Connect: A New Client-Server Protocol</h2>
<p><strong>Spark Connect</strong>, introduced in version 3.4, offers a client-server architecture that enables remote connectivity to Spark clusters using the DataFrame API, allowing it to be embedded in modern data applications, IDEs, notebooks, and programming languages. Check out more details at <a target="_blank" href="https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity">https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity</a></p>
<p>Spark Connect has seen significant advancements in Spark 4.0, aiming to achieve near parity with "Spark Classic" and enhance its capabilities as a decoupled client-server architecture. Here's a breakdown of what's new for Spark Connect in Spark 4.0:</p>
<ul>
<li><p><strong>Enhanced API Coverage and Compatibility:</strong> A major focus has been on expanding the API coverage for Spark Connect to bring it very close to the full functionality of traditional Spark applications, making it much smoother to migrate existing applications to Spark Connect. Switching between using Spark Classic and Spark Connect is now more seamless due to improved compatibility between their Python and Scala APIs. Spark ML functionalities are now supported over Spark Connect, allowing users to leverage Spark's machine learning capabilities remotely.</p>
</li>
<li><p><strong>Multi-Language Support:</strong> Beyond the existing Python and Scala clients, Spark 4.0 introduces new, community-supported Spark Connect clients for <strong>Go, Swift, and Rust</strong>, significantly broadening the range of languages developers can use to interact with Spark clusters. This expanded language support allows developers to utilize Spark in their preferred language, even outside the JVM ecosystem, via the Connect API.</p>
</li>
</ul>
<p>The <code>spark.api.mode</code> configuration in Apache Spark determines whether an application runs in Spark Classic or Spark Connect mode. Setting it to <code>connect</code> enables Spark Connect, which allows client applications to interact with a remote Spark server. This example demonstrates how to configure <code>spark.api.mode</code> in a PySpark application: </p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

spark = SparkSession.builder \
    .appName(<span class="hljs-string">"SparkConnectExample"</span>) \
    .config(<span class="hljs-string">"spark.api.mode"</span>, <span class="hljs-string">"connect"</span>) \
    .master(<span class="hljs-string">"spark://your_spark_master_url"</span>) \
    .getOrCreate()

<span class="hljs-comment"># Your Spark code here</span>
data = spark.read.csv(<span class="hljs-string">"your_data.csv"</span>)
data.show()

spark.stop()
</code></pre>
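<p>For a standalone setup, the Spark Connect server ships with the Spark distribution. A minimal sketch of starting it and attaching a client (port 15002 is the default; depending on your distribution you may need to add the Spark Connect package via <code>--packages</code>, and there is no automated test here since the commands require a local Spark installation):</p>

```shell
# Start the Spark Connect server from the Spark installation directory
./sbin/start-connect-server.sh

# Attach an interactive PySpark client to the running server
pyspark --remote "sc://localhost:15002"
```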
<h2 id="heading-2-performance-and-catalyst-improvements">2. Performance and Catalyst Improvements</h2>
<p>Spark 4.0 continues to push boundaries in <strong>query optimization and execution</strong>, reducing the chances of out-of-memory (OOM) errors by providing:</p>
<ul>
<li><p>Faster joins and shuffle operations.</p>
</li>
<li><p>Improved <strong>adaptive query execution (AQE)</strong>.</p>
</li>
<li><p>Better <strong>codegen for complex queries</strong>, reducing JVM overhead.</p>
</li>
</ul>
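<p>These optimizations are applied automatically, but the adaptive query execution switches can be pinned in <code>spark-defaults.conf</code> (these are standard Spark SQL configuration keys, and AQE is already enabled by default in recent releases; shown explicitly here for visibility):</p>

```properties
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
```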
<h2 id="heading-3-python-udf-performance-improvements">3. Python UDF Performance Improvements</h2>
<p>Python is the most popular language for Spark, but Python UDFs have been a performance bottleneck for a while. In Spark 4.0, there were major improvements that resulted in a significant speedup for PySpark workloads using UDFs in large pipelines.</p>
<ul>
<li><p>Broader support for <strong>vectorized UDFs</strong> using Apache Arrow:</p>
<ul>
<li><p>Traditional UDFs in Spark process data row-by-row, requiring each row to be serialized and sent from the JVM (Spark engine) to the Python process. The Python UDF then executes on one row, and the results are sent back to the JVM. This process incurs significant overhead due to per-row communication and serialization.</p>
</li>
<li><p>With Apache Arrow, Spark can batch rows into a columnar format (Arrow Tables), send entire batches between the JVM and Python at once, and process them using Pandas UDFs (also known as vectorized UDFs). Instead of processing row-by-row, you handle entire Pandas Series or DataFrames in your UDF, which is much faster!</p>
</li>
</ul>
</li>
<li><p>Better <strong>Python-JVM serialization</strong>.</p>
</li>
<li><p>Enhanced <strong>error reporting</strong> for PySpark.</p>
</li>
</ul>
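<p>The row-at-a-time vs. vectorized difference can be sketched without Spark at all. The toy "serialization" below stands in for the JVM↔Python boundary and is not Spark's actual wire format; in real PySpark the batch version corresponds to a <code>pandas_udf</code> operating on whole <code>pandas.Series</code> backed by Arrow record batches:</p>

```python
import json

def add_one_row(row: str) -> str:
    # Row-at-a-time UDF: every value crosses the "boundary" individually,
    # paying (de)serialization overhead once per row.
    return json.dumps(json.loads(row) + 1)

def add_one_batch(batch: str) -> str:
    # Vectorized UDF: a whole batch is (de)serialized once and processed
    # as a single columnar chunk, like an Arrow record batch.
    return json.dumps([v + 1 for v in json.loads(batch)])

values = [1, 2, 3]
row_result = [json.loads(add_one_row(json.dumps(v))) for v in values]  # 3 round-trips
batch_result = json.loads(add_one_batch(json.dumps(values)))           # 1 round-trip
assert row_result == batch_result == [2, 3, 4]
```

<p>Same result either way; the batch version simply amortizes the boundary-crossing cost over the whole chunk, which is where the PySpark speedup comes from.</p>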
<h2 id="heading-4-ecosystem-modernization-and-cleanup">4. Ecosystem Modernization and Cleanup</h2>
<ul>
<li><p>Dropped support for <strong>legacy Hive features</strong> (e.g. HiveContext, Hive Metastore dialects, and Hive SerDe support).</p>
</li>
<li><p>Streamlined <strong>dependency management</strong> replaces the monolithic JARs with a more modular packaging system, allowing you to include <strong>only the components you need</strong>.</p>
</li>
</ul>
<p>This means cleaner code bases, smaller builds, and less dependency hell.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Apache Spark 4.0 is not just focused on performance; it’s about <strong>adapting Spark for today's cloud-based, data-driven world</strong>. Whether you're creating a streaming ETL pipeline, an ML workflow, or a large analytics dashboard, these updates aim to make it <strong>faster, more adaptable, and ready for the future</strong>.</p>
<p>Already using Spark 4.0? Share your thoughts and benchmarks in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Monolith or Microservices: How to Choose the Best Software Architecture for your case]]></title><description><![CDATA[In software engineering, choosing a system architecture is an essential decision that can affect the entire life-cycle of an application, from development and testing to deployment, maintenance, and debugging. There are two main architectures: Monoli...]]></description><link>https://practical-software.com/monolith-or-microservices-how-to-choose-the-best-software-architecture-for-your-case</link><guid isPermaLink="true">https://practical-software.com/monolith-or-microservices-how-to-choose-the-best-software-architecture-for-your-case</guid><category><![CDATA[monolithic architecture]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[System Design]]></category><category><![CDATA[architecture]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Fri, 30 May 2025 11:13:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748332321513/e6d93f80-cf60-4be6-ba80-a8a80674751d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In software engineering, choosing a system architecture is an essential decision that can affect the entire life-cycle of an application, from development and testing to deployment, maintenance, and debugging. There are two main architectures: <strong>Monolithic</strong> and <strong>Microservices</strong>. Each has its own strengths and trade-offs, and understanding each design is essential to building the right one for your needs.</p>
<h1 id="heading-what-is-monolithic-architecture">What is Monolithic Architecture?</h1>
<p>A monolithic architecture is a traditional model for designing software applications. In this approach, all components (user interface, business logic, and data access) are implemented in a single code-base and deployed as one unit.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748603417469/896d8280-2d66-4f51-90c8-e0e6ab5a7708.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pros">Pros</h2>
<ul>
<li><p><strong>Simplicity</strong>: Since everything is in one place, it is much easier to develop, test, and deploy, especially for small teams.</p>
</li>
<li><p><strong>Performance</strong>: Since all communications happen within the same process or between multiple processes on the same machine, completing a request or an operation is much faster. This is because it avoids the need to communicate between multiple services, which might involve costly or slow networks.</p>
</li>
<li><p><strong>Ease of debugging</strong>: Since logs and traces are from one running application, it is much easier to trace and debug a request.</p>
</li>
<li><p><strong>Simple infrastructure</strong>: No need to set up complex infrastructure since we only require one set of configurations to deploy a single application.</p>
</li>
<li><p><strong>Less technologies:</strong> Since it is a single system, you can use the same technology everywhere, which can be easier for smaller teams that specialize in a limited set of technologies.</p>
</li>
</ul>
<h2 id="heading-cons">Cons</h2>
<ul>
<li><p><strong>Scalability limitations</strong>: Since the workload varies across different parts of the system, scaling can be very costly. It involves replicating the entire application, which might require expensive servers, even though most of the system's functionality doesn't need scaling.</p>
</li>
<li><p><strong>Tight coupling</strong>: Since the codebase is all in one place, changes in one part may require redeploying the entire system. This carries the risk of deploying unfinished code in other areas and requires more testing to ensure no other parts are affected. Additionally, since all code is accessible from anywhere (even if visibility is controlled), it can lead to more dependencies between modules. To change one part, you might need to modify other parts that rely on that code.</p>
</li>
<li><p><strong>Slower development over time</strong>: As the codebase grows, onboarding and maintenance become harder, since a developer needs to understand most of the system to change one piece of code.</p>
</li>
<li><p><strong>Technology lock-in</strong>: Difficult to use or experiment with different technologies within one application, even if using a different type of technology would be more suitable for some parts of the system.</p>
</li>
</ul>
<h1 id="heading-what-are-microservices">What are Microservices?</h1>
<p>Microservices divide an application into loosely connected, independently deployed services. Each service handles a specific business function and communicates with other services through predefined APIs.</p>
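<p>The "predefined APIs" point can be made concrete with a toy sketch using only the Python standard library. The <em>inventory</em> and <em>orders</em> services, the <code>/stock</code> endpoint, and the payload below are all hypothetical; the point is that one service reaches another over the network through its API rather than via an in-process function call:</p>

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical "inventory" service: one business function behind a small HTTP API.
class InventoryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"sku": "abc-123", "in_stock": 7}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def start_inventory_service() -> HTTPServer:
    # Port 0 lets the OS pick a free port; a real deployment would rely on
    # service discovery instead of hard-coded addresses.
    server = HTTPServer(("127.0.0.1", 0), InventoryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def check_stock(inventory_url: str) -> dict:
    # The "orders" service calls inventory over the network via its API,
    # rather than calling a function in the same process as a monolith would.
    with urlopen(f"{inventory_url}/stock/abc-123") as resp:
        return json.load(resp)

server = start_inventory_service()
stock = check_stock(f"http://127.0.0.1:{server.server_address[1]}")
server.shutdown()
```

<p>Everything that makes this trivial sketch production-ready (discovery, retries, auth, tracing) is exactly the operational ecosystem discussed later in this article.</p>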
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748603467299/251fb5b0-2606-43a7-97fe-1fb4ad9d3ec2.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pros-1">Pros</h2>
<ul>
<li><p><strong>Independent scalability</strong>: Based on the workload of each part of the system, you can scale only the services that need it. This can be done automatically to allocate resources to the right places, improving utilization.</p>
</li>
<li><p><strong>Resilience</strong>: Failure in one service doesn’t necessarily bring down the entire system.</p>
</li>
<li><p><strong>Technology flexibility</strong>: Each service can use a different tech stack depending on the use case and the most efficient way to implement it.</p>
</li>
<li><p><strong>Faster deployments</strong>: Teams can develop and release services independently and more quickly because they only need to test the components that depend on their service, rather than the entire system. This is especially true if there are no changes to the API.</p>
</li>
<li><p><strong>Faster on-boarding:</strong> New developers don't need to understand the entire system to start contributing to a service, which saves time when on-boarding a new member.</p>
</li>
</ul>
<h2 id="heading-cons-1">Cons</h2>
<ul>
<li><p><strong>Increased complexity</strong>: Managing multiple services increases operational and architectural overhead.</p>
</li>
<li><p><strong>Distributed systems challenges</strong>: Issues like network latency, fault tolerance, and eventual consistency must be handled.</p>
</li>
<li><p><strong>Deployment complexity</strong>: Requires complex CI/CD pipelines and different configurations for each service, which adds an overhead from the operational side.</p>
</li>
<li><p><strong>Data management</strong>: Each service might require its own type of database, which adds complexity to coordination. While this is technically beneficial, supporting different types of storage can be challenging. It involves managing fault tolerance, backups, and maintaining availability.</p>
</li>
<li><p><strong>Complex debugging and monitoring:</strong> Tracing a single request through multiple services is very challenging and requires creating a custom infrastructure to make logging, debugging, monitoring and alerting easier.</p>
</li>
</ul>
<h2 id="heading-the-hidden-cost-of-microservices-operational-overhead"><strong>The Hidden Cost of Microservices: Operational Overhead</strong></h2>
<p>While Microservices offer flexibility and scalability, they demand a robust <strong>supportive ecosystem</strong>. Without this, the architecture can quickly become a maintenance nightmare. Here are the key systems around Microservices:</p>
<ul>
<li><p><strong>Service Discovery:</strong> Helps services dynamically locate and communicate with each other in a distributed environment. e.g. <strong>Kubernetes Service Discovery</strong>.</p>
</li>
<li><p><strong>API Gateway and Load Balancing:</strong> Acts as a single entry point for clients, managing routing, load balancing, authentication, and request aggregation. This is important because each service might be deployed on different servers and zones due to auto-scaling. It distributes traffic across multiple service instances to enhance performance and reliability, ensuring that each service or client can identify which server to communicate with in a balanced manner. e.g. <strong>AWS API Gateway</strong>, <strong>Spring Cloud Gateway, gRPC-LB</strong>, <strong>Nginx and Envoy.</strong></p>
</li>
<li><p><strong>Security (Authentication &amp; Authorization):</strong> Ensures secure communication between services and controls access efficiently, so it doesn't impact the overall request latency. e.g. <strong>OAuth2</strong> and <strong>OpenID.</strong></p>
</li>
<li><p><strong>Configuration Management</strong>: Centralizes and dynamically manages configurations across services. It can be connected with a dynamic auto-scaling system, which also monitors and provides fault tolerance to maintain the system's resilience. e.g. <strong>Kubernetes, AWS Systems Manager, Resilience4j</strong> and <strong>Spring Cloud Config.</strong></p>
</li>
<li><p><strong>Centralized Logging &amp; Monitoring:</strong> Aggregates logs and metrics and tracks requests across services to help with debugging, analyzing latency, and monitoring performance. e.g. <strong>ELK Stack (Elasticsearch, Logstash, Kibana)</strong>, <strong>Prometheus + Grafana</strong> and <strong>OpenTelemetry.</strong></p>
</li>
<li><p><strong>Deployment and CI/CD Pipelines:</strong> Automates the testing, building, and deploying of Microservices. It allows for gradual deployment, which helps catch issues by deploying to a small percentage of traffic first. This approach supports a safer continuous delivery process. e.g. <strong>Jenkins</strong> and <strong>GitLab CI/CD.</strong></p>
</li>
<li><p><strong>Messaging queues:</strong> While it is an optional component, it enables decoupled communication between services via events and improves resilience of the system. e.g. <strong>Apache Kafka</strong> and <strong>RabbitMQ</strong>.</p>
</li>
</ul>
<p>Without these systems, developers may struggle to debug issues, track down failures, or even understand how components interact. Microservices don't just decentralize code, but also decentralize responsibility, which can result in chaos if not carefully orchestrated.</p>
<h1 id="heading-when-to-choose-what"><strong>When to Choose What?</strong></h1>
<ul>
<li><p><strong>Go Monolith If</strong>: You are a small team or an early-stage startup, the application is simple and it is not expected to grow quickly, and you need to launch rapidly with minimal infrastructure.</p>
</li>
<li><p><strong>Go Microservices If</strong>: Your application has a clear domain and context, you have the resources to work on different parts of the application at the same time, which need to use different technologies and be scaled independently, and you have the infrastructure and experience to support it.</p>
</li>
</ul>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>Microservices are powerful but not a one-size-fits-all solution. While they offer scalability and flexibility, they also bring significant complexity that requires proper tools and practices to manage. If your team isn't prepared to invest in logging, monitoring, tracing, and service orchestration, starting with a monolith might be the wiser option.</p>
<p>Choose your architecture based on the specific needs and maturity of your product and the resources you have, not just on current trends. The choice of tools should consider factors like cloud vs on-site, team expertise, and specific use cases.</p>
]]></content:encoded></item><item><title><![CDATA[Selecting the Best File Formats for Apache Spark: Parquet, ORC, CSV and more]]></title><description><![CDATA[One of the most important decisions in your Apache Spark pipeline is how you store your data. The data format you choose can dramatically affect performance, storage costs, and query speed. Let’s explore the most common file formats supported by Apac...]]></description><link>https://practical-software.com/selecting-the-best-file-formats-for-apache-spark-parquet-orc-csv-and-more</link><guid isPermaLink="true">https://practical-software.com/selecting-the-best-file-formats-for-apache-spark-parquet-orc-csv-and-more</guid><category><![CDATA[#apache-spark]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Parquet]]></category><category><![CDATA[orc]]></category><category><![CDATA[json]]></category><category><![CDATA[csv]]></category><category><![CDATA[Apache Avro]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sat, 24 May 2025 13:36:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748093233044/6156c810-a35a-4abb-b03a-29db3e97bd61.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the most important decisions in your Apache Spark pipeline is <strong>how you store your data</strong>. The data format you choose can dramatically affect performance, storage costs, and query speed. Let’s explore the most common file formats supported by Apache Spark, and in which cases they can fit the most.</p>
<h1 id="heading-different-file-formats">Different file formats</h1>
<p>There are different types of data formats commonly used in data processing, especially with tools like <strong>Apache Spark</strong>, broken into <strong>categories</strong> based on their structure and use case:</p>
<h2 id="heading-row-based-file-formats">Row-Based File Formats</h2>
<p>The data is stored <strong>row by row</strong>, which makes it easy to write and process sequentially, but <strong>less efficient</strong> for analytical queries where only a few columns are needed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748093737370/fc8d76b6-f66a-428e-afdb-9ceb952ba353.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-csv-comma-separated-values"><strong>CSV</strong> (Comma-Separated Values)</h3>
<p><strong>CSV</strong> is a plain text, row-based format where columns are separated by commas. It is easy to work with but not efficient for big data.</p>
<p><strong>Pros</strong>: CSV is human-readable, simple to write and read, and is used globally.</p>
<p><strong>Cons</strong>: CSV lacks data types, requiring Spark to infer column types from a sample of the CSV file, which adds extra work and may not be accurate. Additionally, CSV has poor compression and struggles with encoding complex data.</p>
<p><strong>Use cases</strong>: Legacy systems, small data exports, debugging, and working with spreadsheets.</p>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html"><strong>CSV file</strong></a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.options(delimiter=<span class="hljs-string">","</span>, header=<span class="hljs-type">True</span>).csv(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.option(<span class="hljs-string">"delimiter"</span>, <span class="hljs-string">","</span>).option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>).csv(path)
</code></pre>
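<p>The typing problem is easy to see with Python's standard <code>csv</code> module: every field comes back as a string, so either the reader infers types by sampling (as Spark does) or you cast explicitly, which is what supplying a schema to Spark's reader (e.g. via <code>.schema(...)</code>) does for you. A minimal sketch with hypothetical data and an illustrative <code>apply_schema</code> helper:</p>

```python
import csv
import io

raw = "id,price,active\n1,9.99,true\n2,12.50,false\n"

# csv returns strings only, just as a schema-less CSV gives Spark no types
rows = list(csv.DictReader(io.StringIO(raw)))
assert rows[0] == {"id": "1", "price": "9.99", "active": "true"}

# a hand-rolled "schema": cast each column to the type you know it should have
def apply_schema(row: dict) -> dict:
    return {"id": int(row["id"]),
            "price": float(row["price"]),
            "active": row["active"] == "true"}

typed = [apply_schema(r) for r in rows]
assert typed[0] == {"id": 1, "price": 9.99, "active": True}
```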
<h3 id="heading-json-javascript-object-notation">JSON (JavaScript Object Notation)</h3>
<p><strong>JSON</strong> is a lightweight, text-based format for exchanging data. It uses <strong>human-readable</strong> text to store and send information, but it can be slow and doesn't enforce a schema.</p>
<p><strong>Pros</strong>: <strong>JSON</strong> is readable and widely supported by many systems, and can store <strong>semi-structural</strong> data.</p>
<p><strong>Cons</strong>: <strong>JSON</strong> is slow to parse, and each row must be a valid JSON for Spark to parse. Additionally, from a storage perspective, JSON produces large files because many boilerplate tokens and key names are repeated in each row, and it lacks schema enforcement.</p>
<p><strong>Use case</strong>: Mainly use <strong>JSON</strong> for debugging or exploring data. It can also be used to integrate with external systems that provide <strong>JSON</strong>, which you can't control, but don’t depend on it as the final storage data format.</p>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-json.html"><strong>JSON file</strong></a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.json(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.json(path)
</code></pre>
<h3 id="heading-apache-avro">Apache Avro</h3>
<p><strong>Apache Avro</strong> is a row-based format often used with <strong>Kafka</strong> pipelines and <strong>data exchange</strong> scenarios. It supports <strong>descriptive extendable schema</strong> and is compact for serialization.</p>
<p><strong>Pros</strong>: <strong>Avro</strong> is efficient in storage, since it is in binary format, and has a great schema evolution feature.</p>
<p><strong>Cons</strong>: While <strong>Avro</strong> is efficient in storage, it is not optimized for columnar queries, since you need to scan the whole file to read specific columns.</p>
<p><strong>Use case</strong>: <strong>Avro</strong> is mainly used with real-time streaming systems like <strong>Kafka</strong> because it is easy to serialize and transmit. It also allows for easy schema evolution through a schema registry.</p>
<p>The spark-avro module is external and not included in <code>spark-submit</code> or <code>spark-shell</code> by default, but <code>spark-avro_VERSION</code> and its dependencies can be added directly to <code>spark-submit</code> using <code>--packages</code>:</p>
<pre><code class="lang-bash">./bin/spark-submit --packages org.apache.spark:spark-avro_VERSION

./bin/spark-shell --packages org.apache.spark:spark-avro_VERSION
</code></pre>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-avro.html"><strong>Avro file</strong></a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.format(<span class="hljs-string">"avro"</span>).load(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.format(<span class="hljs-string">"avro"</span>).load(path)
</code></pre>
<h2 id="heading-columnar-file-formats">Columnar File Formats</h2>
<p>The data is stored <strong>column by column</strong>, making them ideal for <strong>analytics and interactive dashboards</strong> where only a subset of columns is queried.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748093760181/2b68942e-c607-4aa3-9283-8edf71903fae.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-parquet-the-gold-standard-for-analytics">Parquet (The Gold Standard for Analytics)</h3>
<p><strong>Parquet</strong> is a columnar binary format optimized for analytical queries. It’s the most popular format for Spark workloads.</p>
<p><strong>Pros:</strong> Parquet is built for efficient reads, with compression and predicate push-down, which makes it fast, compact, and ideal for Spark, Hive, and Presto.</p>
<p><strong>Cons:</strong> Parquet is slightly slower to write than row-based formats.</p>
<p><strong>Use case:</strong> Parquet is the first choice for Spark analytical workloads, data lakes, and cloud storage.</p>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html">Parquet file</a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.parquet(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.parquet(path)
</code></pre>
<h3 id="heading-apache-orc-optimized-row-columnar">Apache ORC (Optimized Row Columnar)</h3>
<p><strong>ORC</strong> is another columnar format, optimized for the Hadoop ecosystem, especially Hive.</p>
<p><strong>Pros:</strong> <strong>ORC</strong> has a high compression ratio, is optimized for scan-heavy queries, and supports predicate push-down similar to Parquet.</p>
<p><strong>Cons:</strong> <strong>ORC</strong> has less support outside the Hadoop ecosystem, which makes it harder to integrate with other tools.</p>
<p><strong>Use case:</strong> Hive-based data warehouses, HDFS-based systems.</p>
<p><strong>Reading</strong> <strong>ORC file</strong> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.format(<span class="hljs-string">"orc"</span>).load(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.format(<span class="hljs-string">"orc"</span>).load(path)
</code></pre>
<h2 id="heading-summary-table">Summary table</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Format</strong></td><td><strong>Type</strong></td><td><strong>Compression</strong></td><td><strong>Predicate Push-down</strong></td><td><strong>Best Use Case</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Parquet</strong></td><td>Columnar</td><td>Excellent</td><td>✅ Yes</td><td>Big data, analytics, selective queries</td></tr>
<tr>
<td><strong>ORC</strong></td><td>Columnar</td><td>Excellent</td><td>✅ Yes</td><td>Hive-based data lakes</td></tr>
<tr>
<td><strong>Avro</strong></td><td>Row-based</td><td>Good</td><td>❌ No (limited)</td><td>Kafka pipelines, schema evolution</td></tr>
<tr>
<td><strong>JSON</strong></td><td>Row-based</td><td>None</td><td>❌ No</td><td>Debugging, integration</td></tr>
<tr>
<td><strong>CSV</strong></td><td>Row-based</td><td>None</td><td>❌ No</td><td>Legacy formats, ingestion, exploration</td></tr>
</tbody>
</table>
</div><h1 id="heading-conclusion">Conclusion</h1>
<p>Choosing the right file format in Spark is <strong>not just a technical decision</strong>; it's a <strong>strategic one</strong>. Parquet and ORC are solid choices for most modern workloads, but your use case, tools, and ecosystem should guide your choice.</p>
]]></content:encoded></item><item><title><![CDATA[How Spark Connect Enhances the Future of Apache Spark Connectivity]]></title><description><![CDATA[Apache Spark has been a popular choice for large-scale distributed data processing. However, as data teams move to cloud architectures and separate computes from client interfaces, the traditional tightly coupled Spark driver model has begun to revea...]]></description><link>https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity</link><guid isPermaLink="true">https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity</guid><category><![CDATA[Spark Job Server]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[Python]]></category><category><![CDATA[REST API]]></category><category><![CDATA[Apache livy]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sun, 18 May 2025 15:59:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747566344475/17ffe084-04bd-462e-93a2-ee8dbbe4cfed.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark has been a popular choice for large-scale distributed data processing. However, as data teams move to cloud architectures and separate compute from client interfaces, the traditional tightly coupled Spark driver model has begun to reveal its limitations. In this article, we will explore the new Spark Connect feature and the future of remote execution.</p>
<h2 id="heading-what-is-spark-connect">What is Spark Connect?</h2>
<p><a target="_blank" href="https://spark.apache.org/spark-connect/"><strong>Spark Connect</strong></a> is a <strong>decoupled client-server protocol</strong> that lets Spark clients, like Python or Java applications, interact with a Spark driver process over the network. Unlike traditional Spark applications where the client starts and controls the driver, Spark Connect uses a <strong>gRPC-based protocol</strong> to communicate with a <strong>running Spark Connect server</strong>. Think of it as <strong>Spark as a Service</strong> for your data apps and notebooks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747828041834/ba1b4d46-c514-420b-97ca-faf3726a4bb8.png" alt class="image--center mx-auto" /></p>
<p><strong>Spark Connect</strong> was introduced in <strong>Spark 3.4</strong> and further improved in <strong>3.5</strong>. It changes how clients connect to and interact with a Spark cluster, providing more flexibility, scalability, and language support.</p>
<ul>
<li><p><strong>Spark Connect</strong> is <strong>not a cluster manager</strong>. It's a <em>protocol</em> that allows clients to communicate with a Spark driver <em>remotely</em>, while still using traditional cluster modes underneath (like <strong>YARN</strong> or <strong>Kubernetes</strong>).</p>
</li>
<li><p><strong>Spark Connect</strong> makes <strong>client-side development easier</strong> and is ideal for integrating Spark into tools like <strong>VSCode</strong>, <strong>Jupyter</strong>, or <strong>web apps</strong>.</p>
</li>
<li><p>Decoupling the client from the <strong>Spark cluster</strong> makes it easier to upgrade and scale the cluster separately from the client. This approach removes <strong>dependency</strong> <strong>conflicts</strong> and offers greater flexibility in language support.</p>
</li>
</ul>
<h2 id="heading-why-spark-connect">Why Spark Connect?</h2>
<p>Before Spark Connect, running a Spark application meant <strong>bundling the Spark driver with your client logic.</strong> This led to <strong>long startup times,</strong> <strong>dependency conflicts,</strong> and <strong>poor IDE integration.</strong> It was also difficult to use <strong>interactive notebooks or mobile/web-based interfaces</strong> with a Spark backend.</p>
<p>With <strong>Spark Connect</strong>, clients are <strong>lightweight</strong> and only need a compatible client library. You can embed Spark inside <strong>VSCode, Jupyter notebooks, web apps, and mobile apps</strong>. This setup allows for easier scaling and faster iteration.</p>
<h2 id="heading-how-does-spark-connect-work">How does Spark Connect Work?</h2>
<ol>
<li><p>A connection is established between the client and the Spark server.</p>
</li>
<li><p>The client converts a <strong>DataFrame</strong> query into an <strong>unresolved logical plan</strong>, which describes what the operation should do, not how it should be executed.</p>
</li>
<li><p>The <strong>unresolved logical plan</strong> is <strong>encoded</strong> and sent to the Spark server.</p>
</li>
<li><p>The Spark server <strong>optimizes</strong> and <strong>executes</strong> the query.</p>
</li>
<li><p>The Spark server sends the <strong>results</strong> back to the client.</p>
</li>
</ol>
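<p>The five steps above can be sketched as a toy client/server exchange in plain Python. This is an illustrative simulation only — the real protocol encodes unresolved plans as protobuf messages over gRPC — and every name in it is hypothetical:</p>

```python
import json

# Hypothetical "server": decodes a logical plan, "optimizes" and executes it.
def spark_server_execute(encoded_plan: str) -> list:
    plan = json.loads(encoded_plan)
    tables = {"sales": [{"category": "a", "n": 1}, {"category": "a", "n": 2}]}
    rows = tables[plan["source"]]
    if plan["op"] == "group_count":
        counts = {}
        for row in rows:
            key = row[plan["key"]]
            counts[key] = counts.get(key, 0) + 1
        return [{plan["key"]: k, "count": v} for k, v in counts.items()]
    raise ValueError("unsupported op")

# Hypothetical client: describes *what* to compute (an unresolved plan),
# encodes it, and ships it to the server instead of running a driver locally.
unresolved_plan = {"source": "sales", "op": "group_count", "key": "category"}
result = spark_server_execute(json.dumps(unresolved_plan))
print(result)  # [{'category': 'a', 'count': 2}]
```

<p>The client never sees how the query is executed; it only describes the result it wants, which is what makes the client side so lightweight.</p>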
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747570282005/14c43ceb-2089-4953-9bad-c11013b545f9.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-practical-example-using-spark-connect-with-pyspark">Practical example: Using Spark Connect with PySpark</h2>
<h4 id="heading-step-1-start-the-spark-connect-server">Step 1: Start the Spark Connect Server</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># This launches the Spark Connect endpoint (the script ships under sbin/)</span>
$ ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
</code></pre>
<h4 id="heading-step-2-connect-from-a-python-client">Step 2: Connect from a Python Client</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

<span class="hljs-comment"># sc:// is the special URI scheme used for Spark Connect</span>
spark = SparkSession.builder.remote(<span class="hljs-string">"sc://localhost:&lt;PORT&gt;"</span>).getOrCreate()

df = spark.read.csv(<span class="hljs-string">"example.csv"</span>, header=<span class="hljs-literal">True</span>)
df.groupBy(<span class="hljs-string">"category"</span>).count().show()
</code></pre>
<h2 id="heading-best-for-the-following-use-cases">Best for the following use cases</h2>
<ul>
<li><p><strong>Interactive Data Science:</strong> Use Jupyter or VSCode to run Spark jobs remotely</p>
</li>
<li><p><strong>CI/CD Pipelines:</strong> Validate jobs in GitHub Actions or GitLab CI</p>
</li>
<li><p><strong>Remote Data Apps:</strong> Build APIs and dashboards powered by Spark</p>
</li>
<li><p><strong>Multi-Tenant Platforms:</strong> Serve multiple users via a single Spark backend</p>
</li>
</ul>
<h2 id="heading-limitations">Limitations</h2>
<ul>
<li><p><strong>Spark Connect</strong> is still in its early stages, so some features like complex <strong>UDFs</strong> or <strong>Streaming</strong> might have limited support.</p>
</li>
<li><p>You need Spark 3.5 or later for a more stable experience.</p>
</li>
<li><p>Monitoring and debugging are still developing for Spark Connect.</p>
</li>
</ul>
<h2 id="heading-spark-connect-alternatives">Spark Connect alternatives</h2>
<p><a target="_blank" href="https://github.com/spark-jobserver/spark-jobserver">Spark Job Server</a> and <a target="_blank" href="https://livy.apache.org/">Apache Livy</a> are similar projects that expose Spark jobs through <strong>REST APIs</strong>. They are typically used to manage job submissions from external apps like dashboards and notebooks, enabling <strong>remote</strong> interaction with Spark. However, they differ fundamentally from Spark Connect in design, use cases, and maturity.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td><strong>Spark Connect</strong></td><td><strong>Spark Job Server</strong></td><td><strong>Apache Livy</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Type</strong></td><td>Built-in gRPC client-server protocol</td><td>External REST API server</td><td>REST-based Spark session manager</td></tr>
<tr>
<td><strong>Official Status</strong></td><td>✅ Native to Apache Spark (3.4+)</td><td>❌ Community project (not officially maintained)</td><td>🟡 Incubating under Apache (inactive since 2021)</td></tr>
<tr>
<td><strong>Client Language Support</strong></td><td>Python, Scala, Java, Go, Rust, Dotnet</td><td>REST only, language-agnostic</td><td>REST + limited Scala/Python clients</td></tr>
<tr>
<td><strong>Architecture</strong></td><td>Lightweight clients + Spark driver over gRPC</td><td>External server + job runners</td><td>External service managing Spark sessions</td></tr>
<tr>
<td><strong>Latency / Interactivity</strong></td><td>⚡ Very low latency, interactive (DataFrame API)</td><td>High (submit job, poll status)</td><td>Medium-high</td></tr>
<tr>
<td><strong>Streaming Support</strong></td><td>❌ Limited (in progress)</td><td>❌ No</td><td>🟡 Partial (limited with batch-like APIs)</td></tr>
<tr>
<td><strong>Stateful Sessions</strong></td><td>✅ Persistent client-side SparkSession</td><td>✅ Yes (Job Server Contexts)</td><td>✅ Yes (Livy Sessions)</td></tr>
<tr>
<td><strong>Authentication / Security</strong></td><td>SSL/gRPC auth (evolving)</td><td>Manual or custom</td><td>Kerberos, Hadoop-compatible</td></tr>
<tr>
<td><strong>Ease of Deployment</strong></td><td>✅ Easy with Spark 3.5+</td><td>❌ Complex, often fragile</td><td>❌ Tricky to deploy &amp; scale</td></tr>
<tr>
<td><strong>Use Case Fit</strong></td><td>Interactive apps, notebooks, CI/CD</td><td>Ad hoc job submission, dashboards</td><td>Multi-user notebooks, REST access</td></tr>
<tr>
<td><strong>Extensibility / Maintenance</strong></td><td>✅ Actively developed</td><td>❌ Unmaintained / legacy</td><td>🟡 Outdated, low activity</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">Conclusion</h2>
<ul>
<li><p><strong>Spark Connect</strong> is the future of native remote interaction with Spark. It's fast and a great fit for developers, notebooks, and microservices.</p>
</li>
<li><p><strong>Livy</strong> and <strong>Spark Job Server</strong> were temporary solutions before Spark had native client-server support. They work well for some REST API-based job orchestration scenarios but are now considered outdated and are not maintained.</p>
</li>
<li><p>If you're starting a new project, go with <strong>Spark Connect</strong>. If you're maintaining an older system, <strong>Livy or Spark Job Server</strong> might still be useful for now.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Mastering Apache Spark SQL: Create Complex Queries with Common Table Expressions CTE (WITH Clause)]]></title><description><![CDATA[Apache Spark SQL uses SQL capabilities to process large-scale structured data. One powerful feature in modern SQL is the WITH clause, supported in Spark SQL as Common Table Expressions (CTE). CTE offer a more organized, readable, and often more effic...]]></description><link>https://practical-software.com/mastering-apache-spark-sql-create-complex-queries-with-common-table-expressions-cte-with-clause</link><guid isPermaLink="true">https://practical-software.com/mastering-apache-spark-sql-create-complex-queries-with-common-table-expressions-cte-with-clause</guid><category><![CDATA[sparksql]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[with-statement]]></category><category><![CDATA[SQL]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sat, 17 May 2025 12:00:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747483097673/70a714b3-35be-434c-9f98-b6f1595bf76a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark SQL uses SQL capabilities to process large-scale structured data. One powerful feature in modern SQL is the <code>WITH</code> clause, supported in Spark SQL as Common Table Expressions (<code>CTE</code>). CTE offer a more organized, readable, and often more efficient way to build complex queries. This article will explain what CTE is, why it is valuable in Spark SQL, and explore its syntax with practical examples.</p>
<h2 id="heading-what-is-a-common-table-expression-cte">What is a Common Table Expression (CTE)?</h2>
<p>A <a target="_blank" href="https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-cte.html">Common Table Expression</a>, or CTE, is a named, temporary result set that you define within a single SQL statement. It's like a temporary, virtual table that only exists while the query is running. A CTE starts with the <code>WITH</code> clause, followed by one or more named sub-queries.</p>
<p><strong>The basic syntax example is:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> expression_name [ ( column_name [ , ... ] ) ] [ <span class="hljs-keyword">AS</span> ] ( <span class="hljs-keyword">query</span> ) [ , ... ]
</code></pre>
<ul>
<li><p><code>expression_name</code>: A unique name you assign to your temporary result set.</p>
</li>
<li><p><code>(column_name, ...)</code>: An optional list of column aliases for the CTE's output. If not provided, Spark SQL will infer column names from the <code>SELECT</code> statement within the CTE.</p>
</li>
<li><p><code>AS (query)</code>: The <code>SELECT</code> statement that defines the logic for your CTE.</p>
</li>
</ul>
<h2 id="heading-why-use-cte-in-spark-sql">Why Use CTE in Spark SQL?</h2>
<p>While you can often achieve similar results using nested sub-queries, CTEs bring several significant advantages to Spark SQL development:</p>
<ol>
<li><p><strong>Improves readability:</strong> Complex queries can quickly become difficult to follow and modify due to nested sub-queries. CTEs let you break down common logic into smaller, named, and more manageable parts. Each CTE acts as a logical unit of work, making the entire query easier to understand, debug, and maintain.</p>
</li>
<li><p><strong>Enhances usability:</strong> A key benefit of CTEs is that you can reference them multiple times within the same <code>WITH</code> clause or the final <code>SELECT</code> statement. This helps avoid code duplication and ensures consistency in your intermediate calculations.</p>
</li>
<li><p><strong>Simplifies debugging:</strong> By breaking down the logic into separate blocks, you can easily debug each part of the CTE independently. This helps you find issues much faster than trying to debug a single, complex query.</p>
</li>
<li><p><strong>Potential for Optimization:</strong> While CTEs are defined as temporary result sets, Spark often treats them like logical views. This allows Spark's Catalyst Optimizer to apply optimizations, such as pushing down predicates, across CTE boundaries. This can result in more efficient execution plans, particularly when a CTE is used multiple times. Spark might materialize the result or optimize its execution just once.</p>
</li>
</ol>
<h2 id="heading-practical-example">Practical example</h2>
<p>Suppose we have a <code>sales</code> table and want to find the total sales for each product category.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Sample Data Setup (for demonstration purposes)</span>
<span class="hljs-comment">-- This would typically be a pre-existing table or DataFrame</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">TEMPORARY</span> <span class="hljs-keyword">VIEW</span> sales <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> <span class="hljs-keyword">VALUES</span>
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Laptop'</span>, <span class="hljs-number">1200.00</span>, <span class="hljs-string">'2024-01-15'</span>),
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Mouse'</span>, <span class="hljs-number">25.00</span>, <span class="hljs-string">'2024-01-15'</span>),
    (<span class="hljs-string">'Clothing'</span>, <span class="hljs-string">'T-Shirt'</span>, <span class="hljs-number">20.00</span>, <span class="hljs-string">'2024-01-16'</span>),
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Keyboard'</span>, <span class="hljs-number">75.00</span>, <span class="hljs-string">'2024-01-16'</span>),
    (<span class="hljs-string">'Clothing'</span>, <span class="hljs-string">'Jeans'</span>, <span class="hljs-number">50.00</span>, <span class="hljs-string">'2024-01-17'</span>),
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Monitor'</span>, <span class="hljs-number">300.00</span>, <span class="hljs-string">'2024-01-17'</span>)
<span class="hljs-keyword">AS</span> sales_data(<span class="hljs-keyword">category</span>, product, amount, sale_date);

<span class="hljs-comment">-- Using a CTE to calculate total sales per category</span>
<span class="hljs-keyword">WITH</span> CategorySales <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">category</span>, <span class="hljs-keyword">SUM</span>(amount) <span class="hljs-keyword">AS</span> total_category_sales
    <span class="hljs-keyword">FROM</span> sales
    <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">category</span>
)
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">category</span>, total_category_sales
<span class="hljs-keyword">FROM</span> CategorySales
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> total_category_sales <span class="hljs-keyword">DESC</span>;

Electronics    1600.00
Clothing    70.00
Time taken: 0.157 seconds, Fetched 2 row(s)
</code></pre>
<p>In this example, <code>CategorySales</code> is our CTE. It calculates the sum of the <code>amount</code> grouped by <code>category</code>. The final <code>SELECT</code> statement then simply queries this temporary <code>CategorySales</code> result set.</p>
<h2 id="heading-chaining-ctes">Chaining CTEs</h2>
<p>One of the most powerful features of CTEs is the ability to chain them. This means a later CTE can refer to an earlier CTE within the same <code>WITH</code> clause. This approach lets you build complex logic step by step.</p>
<p>Consider extending the previous example to find the average sales across all categories and then identify categories whose sales are above this average.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> CategorySales <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">category</span>, <span class="hljs-keyword">SUM</span>(amount) <span class="hljs-keyword">AS</span> total_category_sales
    <span class="hljs-keyword">FROM</span> sales
    <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">category</span>
), AverageOverallSales <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">AVG</span>(total_category_sales) <span class="hljs-keyword">AS</span> overall_avg_sales
    <span class="hljs-keyword">FROM</span> CategorySales <span class="hljs-comment">-- Referencing the first CTE</span>
)
<span class="hljs-keyword">SELECT</span>
    cs.category,
    cs.total_category_sales,
    aos.overall_avg_sales
<span class="hljs-keyword">FROM</span> CategorySales cs
<span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> AverageOverallSales aos
<span class="hljs-keyword">WHERE</span> cs.total_category_sales &gt; aos.overall_avg_sales
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> cs.total_category_sales <span class="hljs-keyword">DESC</span>;

Electronics    1600.00    835.000000
Time taken: 0.321 seconds, Fetched 1 row(s)
</code></pre>
<p>Here, <code>CategorySales</code> calculates the total sales for each category. Then, <code>AverageOverallSales</code> uses <code>CategorySales</code> to find the overall average. Finally, the main query joins these two CTEs to filter out categories with sales above the average.</p>
<h2 id="heading-best-fit-use-cases">Best fit use cases</h2>
<p>CTEs are highly beneficial in various real-world scenarios:</p>
<ul>
<li><p><strong>Step-by-Step data transformation:</strong> When you need to apply a series of transformations like filtering, aggregation, and joining to your data, CTEs let you define each step clearly.</p>
</li>
<li><p><strong>Complex aggregations and analytics:</strong> For multi-level aggregations or calculations involving window functions where intermediate results are needed, CTEs offer a clear structure.</p>
</li>
<li><p><strong>Sub-query factorization:</strong> If you find yourself writing the same sub-query multiple times, extract it into a CTE for reusability.</p>
</li>
<li><p><strong>Anomaly detection and quality checks:</strong> You can define CTEs to spot anomalies or specific data patterns and then use these CTEs in your main query to flag or exclude problematic records.</p>
</li>
<li><p><strong>Improving Performance for Repeated Computations:</strong> If a complex sub-query is calculated multiple times in a large query, turning it into a CTE can sometimes help Spark optimize its execution, potentially avoiding repeated calculations.</p>
</li>
</ul>
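<p>Because the <code>WITH</code> clause is standard SQL, the sub-query factorization pattern above can be sketched without a Spark cluster, using Python's built-in <code>sqlite3</code> module. The schema and numbers below are illustrative; the key point is that <code>CategorySales</code> is defined once and referenced twice:</p>

```python
import sqlite3

# CTEs are standard SQL, so the built-in sqlite3 module can demonstrate
# sub-query factorization: CategorySales is defined once and referenced twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Electronics", 1200.0), ("Electronics", 400.0), ("Clothing", 70.0)],
)

rows = conn.execute("""
    WITH CategorySales AS (
        SELECT category, SUM(amount) AS total FROM sales GROUP BY category
    )
    SELECT cs.category, cs.total
    FROM CategorySales cs
    WHERE cs.total > (SELECT AVG(total) FROM CategorySales)  -- CTE reused here
""").fetchall()

print(rows)  # [('Electronics', 1600.0)]
```

<p>Without the CTE, the aggregation would have to be written twice — once in the <code>FROM</code> clause and once in the scalar sub-query — which is exactly the duplication CTEs eliminate.</p>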
<h2 id="heading-conclusion">Conclusion</h2>
<p>Common Table Expressions are a key feature in modern SQL that greatly improve the developer experience. By allowing modularity, enhancing readability, and promoting re-usability, CTEs help data professionals write cleaner, more maintainable, and often more efficient Spark SQL queries. They turn complex data challenges into clear, manageable steps.</p>
]]></content:encoded></item><item><title><![CDATA[How to Fix Data Skew in Apache Spark with the Salting Technique]]></title><description><![CDATA[When working with large datasets in Apache Spark, a common performance issue is data skew. This occurs when a few keys dominate the data distribution, leading to uneven partitions and slow queries. It mainly happens during operations that require shu...]]></description><link>https://practical-software.com/apache-spark-fix-data-skew-issue-using-salting-technique</link><guid isPermaLink="true">https://practical-software.com/apache-spark-fix-data-skew-issue-using-salting-technique</guid><category><![CDATA[#apache-spark]]></category><category><![CDATA[Scala]]></category><category><![CDATA[big data]]></category><category><![CDATA[joins]]></category><category><![CDATA[Salting]]></category><category><![CDATA[PySpark]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sat, 10 May 2025 23:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746712436985/07c0aac4-ceb3-4158-951c-c14e2586b177.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When working with large datasets in <strong>Apache Spark</strong>, a common performance issue is <strong>data skew</strong>. This occurs when a few keys <strong>dominate</strong> the data distribution, leading to <strong>uneven</strong> partitions and slow queries. It mainly happens during operations that require <strong>shuffling</strong>, like <strong>joins</strong> or even regular <strong>aggregations</strong>.</p>
<p>A practical way to reduce skew is <strong>salting</strong>, which involves artificially spreading out heavy keys across multiple partitions. In this post, I’ll guide you through this with a practical example.</p>
<h2 id="heading-how-salting-resolves-data-skew-issues">How Salting Resolves Data Skew Issues</h2>
<p>By adding a <strong>randomly</strong> generated number to the join key and then joining over this combined key, we can distribute large keys more evenly. This makes the data distribution more uniform and spreads the load across more workers, instead of sending most of the data to one worker and leaving the others idle.</p>
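<p>The mechanics can be illustrated without Spark. The snippet below simulates hash partitioning in plain Python and shows how adding a random salt breaks up one hot key; the numbers are made up, and Spark's real partitioner differs in detail:</p>

```python
import random
from collections import Counter

random.seed(42)
num_partitions = 4

# Skewed input: key 1 dominates (think: one very popular customer_id).
keys = [1] * 1000 + [2] * 10 + [3] * 10

def partition(key, n=num_partitions):
    # Deterministic stand-in for Spark's hash partitioner.
    return key % n

# Without salting, every record for key 1 lands in the same partition.
plain = Counter(partition(k) for k in keys)

# With salting, each record's key is shifted by a random salt before
# partitioning, so the hot key's records spread across all partitions.
salted = Counter(partition(k + random.randrange(num_partitions)) for k in keys)

print(max(plain.values()))   # 1000: the hot key overloads one partition
print(max(salted.values()))  # roughly len(keys) / num_partitions
```

<p>The largest partition shrinks from holding essentially all the data to holding about a quarter of it, which is the whole point of salting.</p>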
<h3 id="heading-benefits-of-salting">Benefits of Salting</h3>
<ul>
<li><p><strong>Reduced Skew:</strong> Spreads data evenly across partitions, preventing a few workers from being overloaded and improving utilization.</p>
</li>
<li><p><strong>Improved Performance:</strong> Speeds up joins and aggregations by balancing the workload.</p>
</li>
<li><p><strong>Avoids Resource Contention:</strong> Reduces the risk of out-of-memory errors caused by large, uneven partitions.</p>
</li>
</ul>
<h2 id="heading-when-to-use-salting">When to Use Salting</h2>
<p>During joins or aggregations with skewed keys, use salting when you notice long shuffle times or executor failures due to data skew. It's also helpful in real-time streaming applications where partitioning affects data processing efficiency, or when most workers are idle while a few are stuck in a running state.</p>
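<p>A quick way to confirm skew before reaching for salting is to inspect the per-key counts — in PySpark, typically via <code>df.groupBy("customer_id").count()</code>. Here is a plain-Python stand-in with hypothetical data and an illustrative threshold:</p>

```python
from collections import Counter

# Hypothetical records; in PySpark you would inspect
# df.groupBy("customer_id").count().orderBy("count", ascending=False) instead.
records = [("c1", t) for t in range(980)] + [("c2", 1), ("c3", 2)]

counts = Counter(key for key, _ in records)
total = sum(counts.values())
top_key, top_count = counts.most_common(1)[0]
skew_ratio = top_count / total

print(f"{top_key} holds {skew_ratio:.0%} of the rows")
if skew_ratio > 0.5:  # illustrative threshold, tune for your workload
    print("heavily skewed -> consider salting this key")
```
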
<h2 id="heading-salting-example-in-scala">Salting Example in Scala</h2>
<p>Let's generate some data with an <strong>unbalanced</strong> number of rows. We can assume there are two datasets we need to join: one is a large dataset, and the other is a small dataset.</p>
<pre><code class="lang-scala"><span class="hljs-keyword">import</span> org.apache.spark.sql.<span class="hljs-type">SparkSession</span>
<span class="hljs-keyword">import</span> org.apache.spark.sql.functions._

<span class="hljs-comment">// Simulated large dataset with skew</span>
<span class="hljs-keyword">val</span> largeDF = <span class="hljs-type">Seq</span>(
  (<span class="hljs-number">1</span>, <span class="hljs-string">"txn1"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn2"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn3"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"txn4"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"txn5"</span>)
).toDF(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>)

<span class="hljs-comment">// Small dataset</span>
<span class="hljs-keyword">val</span> smallDF = <span class="hljs-type">Seq</span>(
  (<span class="hljs-number">1</span>, <span class="hljs-string">"Ahmed"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"Ali"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"Hassan"</span>)
).toDF(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"name"</span>)
</code></pre>
<p>Let’s add the salting column to the large dataset, using <strong>randomization</strong> to spread the values of heavy keys across smaller partitions.</p>
<pre><code class="lang-scala">
<span class="hljs-comment">// Step 1: create a salting key in the large dataset</span>
<span class="hljs-keyword">val</span> numBuckets = <span class="hljs-number">3</span>
<span class="hljs-keyword">val</span> saltedLargeDF = largeDF.
    withColumn(<span class="hljs-string">"salt"</span>, (rand() * numBuckets).cast(<span class="hljs-string">"int"</span>)).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>))

saltedLargeDF.show()
+-----------+-----------+----+------------------+
|customer_id|transaction|salt|salted_customer_id|
+-----------+-----------+----+------------------+
|          <span class="hljs-number">1</span>|       txn1|   <span class="hljs-number">1</span>|               <span class="hljs-number">1</span>_1|
|          <span class="hljs-number">1</span>|       txn2|   <span class="hljs-number">1</span>|               <span class="hljs-number">1</span>_1|
|          <span class="hljs-number">1</span>|       txn3|   <span class="hljs-number">2</span>|               <span class="hljs-number">1</span>_2|
|          <span class="hljs-number">2</span>|       txn4|   <span class="hljs-number">2</span>|               <span class="hljs-number">2</span>_2|
|          <span class="hljs-number">3</span>|       txn5|   <span class="hljs-number">0</span>|               <span class="hljs-number">3</span>_0|
+-----------+-----------+----+------------------+
</code></pre>
<p>To make sure we cover all possible randomized salted keys in the large dataset, we need to <strong>explode</strong> the small dataset with all possible salted values.</p>
<pre><code class="lang-scala">
<span class="hljs-comment">// Step 2: Explode rows in smallDF for possible salted keys</span>
<span class="hljs-keyword">val</span> saltedSmallDF = (<span class="hljs-number">0</span> until numBuckets).toDF(<span class="hljs-string">"salt"</span>).
    crossJoin(smallDF).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>)) 

saltedSmallDF.show()
+----+-----------+------+------------------+
|salt|customer_id|  name|salted_customer_id|
+----+-----------+------+------------------+
|   <span class="hljs-number">0</span>|          <span class="hljs-number">1</span>| <span class="hljs-type">Ahmed</span>|               <span class="hljs-number">1</span>_0|
|   <span class="hljs-number">1</span>|          <span class="hljs-number">1</span>| <span class="hljs-type">Ahmed</span>|               <span class="hljs-number">1</span>_1|
|   <span class="hljs-number">2</span>|          <span class="hljs-number">1</span>| <span class="hljs-type">Ahmed</span>|               <span class="hljs-number">1</span>_2|
|   <span class="hljs-number">0</span>|          <span class="hljs-number">2</span>|   <span class="hljs-type">Ali</span>|               <span class="hljs-number">2</span>_0|
|   <span class="hljs-number">1</span>|          <span class="hljs-number">2</span>|   <span class="hljs-type">Ali</span>|               <span class="hljs-number">2</span>_1|
|   <span class="hljs-number">2</span>|          <span class="hljs-number">2</span>|   <span class="hljs-type">Ali</span>|               <span class="hljs-number">2</span>_2|
|   <span class="hljs-number">0</span>|          <span class="hljs-number">3</span>|<span class="hljs-type">Hassan</span>|               <span class="hljs-number">3</span>_0|
|   <span class="hljs-number">1</span>|          <span class="hljs-number">3</span>|<span class="hljs-type">Hassan</span>|               <span class="hljs-number">3</span>_1|
|   <span class="hljs-number">2</span>|          <span class="hljs-number">3</span>|<span class="hljs-type">Hassan</span>|               <span class="hljs-number">3</span>_2|
+----+-----------+------+------------------+
</code></pre>
<p>Now we can easily join the two datasets.</p>
<pre><code class="lang-scala"><span class="hljs-comment">// Step 3: Perform salted join</span>
<span class="hljs-keyword">val</span> joinedDF = saltedLargeDF.
    join(saltedSmallDF, <span class="hljs-type">Seq</span>(<span class="hljs-string">"salted_customer_id"</span>, <span class="hljs-string">"customer_id"</span>), <span class="hljs-string">"inner"</span>).
    select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>, <span class="hljs-string">"name"</span>)

joinedDF.show()
+-----------+-----------+------+
|customer_id|transaction|  name|
+-----------+-----------+------+
|          <span class="hljs-number">1</span>|       txn2| <span class="hljs-type">Ahmed</span>|
|          <span class="hljs-number">1</span>|       txn1| <span class="hljs-type">Ahmed</span>|
|          <span class="hljs-number">1</span>|       txn3| <span class="hljs-type">Ahmed</span>|
|          <span class="hljs-number">2</span>|       txn4|   <span class="hljs-type">Ali</span>|
|          <span class="hljs-number">3</span>|       txn5|<span class="hljs-type">Hassan</span>|
+-----------+-----------+------+
</code></pre>
<h2 id="heading-salting-example-in-python">Salting Example in Python</h2>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, rand, lit, concat
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> IntegerType

<span class="hljs-comment"># Create (or reuse) the session the snippets below rely on</span>
spark = SparkSession.builder.getOrCreate()

<span class="hljs-comment"># Simulated large dataset with skew</span>
largeDF = spark.createDataFrame([
    (<span class="hljs-number">1</span>, <span class="hljs-string">"txn1"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn2"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn3"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"txn4"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"txn5"</span>)
], [<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>])

<span class="hljs-comment"># Small dataset</span>
smallDF = spark.createDataFrame([
    (<span class="hljs-number">1</span>, <span class="hljs-string">"Ahmed"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"Ali"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"Hassan"</span>)
], [<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"name"</span>])

<span class="hljs-comment"># Step 1: create a salting key in the large dataset</span>
numBuckets = <span class="hljs-number">3</span>
saltedLargeDF = largeDF.withColumn(<span class="hljs-string">"salt"</span>, (rand() * numBuckets).cast(IntegerType())) \
    .withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat(col(<span class="hljs-string">"customer_id"</span>), lit(<span class="hljs-string">"_"</span>), col(<span class="hljs-string">"salt"</span>)))

<span class="hljs-comment"># Step 2: Explode rows in smallDF for possible salted keys</span>
salt_range = spark.range(<span class="hljs-number">0</span>, numBuckets).withColumnRenamed(<span class="hljs-string">"id"</span>, <span class="hljs-string">"salt"</span>)
saltedSmallDF = salt_range.crossJoin(smallDF) \
    .withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat(col(<span class="hljs-string">"customer_id"</span>), lit(<span class="hljs-string">"_"</span>), col(<span class="hljs-string">"salt"</span>)))

<span class="hljs-comment"># Step 3: Perform salted join</span>
joinedDF = saltedLargeDF.join(
    saltedSmallDF,
    on=[<span class="hljs-string">"salted_customer_id"</span>, <span class="hljs-string">"customer_id"</span>],
    how=<span class="hljs-string">"inner"</span>
).select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>, <span class="hljs-string">"name"</span>)
</code></pre>
<h3 id="heading-notes">Notes</h3>
<ul>
<li><p>This code uses <code>spark.range(...)</code> to mimic Scala’s <code>(0 until numBuckets).toDF("salt")</code>.</p>
</li>
<li><p>Column expressions are handled using <code>col(...)</code>, <code>lit(...)</code>, and <code>concat(...)</code>.</p>
</li>
<li><p>The cast to integer uses <code>.cast(IntegerType())</code>.</p>
</li>
</ul>
<h2 id="heading-tuning-tip-choosing-numbuckets">Tuning Tip: Choosing <code>numBuckets</code></h2>
<ul>
<li><p>If you set <code>numBuckets = 100</code>, each skewed key is split into up to 100 sub-partitions. Be cautious, though: too many buckets can hurt performance, especially for keys with little data, since the small side must be replicated once per bucket. Always test different values against the skew profile of your dataset.</p>
</li>
<li><p>If you know how to identify the skewed keys, you can apply salting to those keys only and set the salt for all other keys to a literal <code>0</code>, e.g.:</p>
<ul>
<li><pre><code class="lang-scala"><span class="hljs-comment">// Step 1: create a salting key in the large dataset</span>
<span class="hljs-keyword">val</span> numBuckets = <span class="hljs-number">3</span>
<span class="hljs-keyword">val</span> saltedLargeDF = largeDF.
    withColumn(<span class="hljs-string">"salt"</span>, when($<span class="hljs-string">"customer_id"</span> === <span class="hljs-number">1</span>, (rand() * numBuckets).cast(<span class="hljs-string">"int"</span>)).otherwise(lit(<span class="hljs-number">0</span>))).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>))

<span class="hljs-comment">// Step 2: Explode rows in smallDF for possible salted keys</span>
<span class="hljs-keyword">val</span> saltedSmallDF = (<span class="hljs-number">0</span> until numBuckets).toDF(<span class="hljs-string">"salt"</span>).
    crossJoin(smallDF.filter($<span class="hljs-string">"customer_id"</span> === <span class="hljs-number">1</span>)).
    select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"salt"</span>, <span class="hljs-string">"name"</span>).
    union(smallDF.filter($<span class="hljs-string">"customer_id"</span> =!= <span class="hljs-number">1</span>).withColumn(<span class="hljs-string">"salt"</span>, lit(<span class="hljs-number">0</span>)).select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"salt"</span>, <span class="hljs-string">"name"</span>)).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>))
</code></pre>
</li>
</ul>
</li>
</ul>
<p><strong>Rule of Thumb</strong><br />Start small (e.g., 10-20) and increase gradually based on observed shuffle sizes and task runtime.</p>
<hr />
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Salting is a simple, effective way to manage skew in Apache Spark when repartitioning or built-in skew handling (such as skew-join hints) is insufficient. With the right tuning and monitoring, this technique can significantly reduce job execution times on highly skewed datasets.</p>
]]></content:encoded></item></channel></rss>