<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Practical Software]]></title><description><![CDATA[Tech articles focused on improving software quality and helping engineers through best practices, real-world examples, and insights for writing cleaner, more reliable, and maintainable code.]]></description><link>https://practical-software.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1747489025820/20a64b81-0983-4e4b-841a-56274d2a9ea7.png</url><title>Practical Software</title><link>https://practical-software.com</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 13 Apr 2026 22:26:47 GMT</lastBuildDate><atom:link href="https://practical-software.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Discover the Latest Features in Apache Spark 4.0]]></title><description><![CDATA[Apache Spark 4.0 is a major update, bringing new APIs, performance improvements, and a more modular design. Here's a summary of what's new and why it matters. 
In this post, we’ll look at the most exciting new features in Apache Spark 4.0, what they m...]]></description><link>https://practical-software.com/discover-the-latest-features-in-apache-spark-40</link><guid isPermaLink="true">https://practical-software.com/discover-the-latest-features-in-apache-spark-40</guid><category><![CDATA[#apache-spark]]></category><category><![CDATA[Apache Spark Features]]></category><category><![CDATA[Apache Spark Associate Developer]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Fri, 27 Jun 2025 18:55:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750370234475/a1e8762b-7274-42df-badd-f1ea3e47f10f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark 4.0 is a major update, bringing new APIs, performance improvements, and a more modular design. Here's a summary of what's new and why it matters. In this post, we’ll look at the most exciting new features in Apache Spark 4.0, what they mean for developers and data engineers, and how they prepare Spark for the future.</p>
<h2 id="heading-1-spark-connect-a-new-client-server-protocol">1. Spark Connect: A New Client-Server Protocol</h2>
<p><strong>Spark Connect</strong>, introduced in version 3.4, offers a client-server architecture that enables remote connectivity to Spark clusters using the DataFrame API, allowing it to be embedded in modern data applications, IDEs, notebooks, and programming languages. Check out more details at <a target="_blank" href="https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity">https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity</a></p>
<p>Spark Connect has seen significant advancements in Spark 4.0, aiming to achieve near parity with "Spark Classic" and enhance its capabilities as a decoupled client-server architecture. Here's a breakdown of what's new for Spark Connect in Spark 4.0:</p>
<ul>
<li><p><strong>Enhanced API Coverage and Compatibility:</strong> A major focus has been on expanding the API coverage for Spark Connect to bring it very close to the full functionality of traditional Spark applications, making it much smoother to migrate existing applications to Spark Connect. Switching between using Spark Classic and Spark Connect is now more seamless due to improved compatibility between their Python and Scala APIs. Spark ML functionalities are now supported over Spark Connect, allowing users to leverage Spark's machine learning capabilities remotely.</p>
</li>
<li><p><strong>Multi-Language Support:</strong> Beyond the existing Python and Scala clients, Spark 4.0 introduces new, community-supported Spark Connect clients for <strong>Go, Swift, and Rust</strong>, significantly broadening the range of languages developers can use to interact with Spark clusters. This expanded language support allows developers to utilize Spark in their preferred language, even outside the JVM ecosystem, via the Connect API.</p>
</li>
</ul>
<p>The <code>spark.api.mode</code> configuration in Apache Spark determines whether an application runs in Spark Classic or Spark Connect mode. Setting it to <code>connect</code> enables Spark Connect, which allows client applications to interact with a remote Spark server. This example demonstrates how to configure <code>spark.api.mode</code> in a PySpark application: </p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

spark = SparkSession.builder \
    .appName(<span class="hljs-string">"SparkConnectExample"</span>) \
    .config(<span class="hljs-string">"spark.api.mode"</span>, <span class="hljs-string">"connect"</span>) \
    .master(<span class="hljs-string">"spark://your_spark_master_url"</span>) \
    .getOrCreate()

<span class="hljs-comment"># Your Spark code here</span>
data = spark.read.csv(<span class="hljs-string">"your_data.csv"</span>)
data.show()

spark.stop()
</code></pre>
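<p>For a standalone setup, the Spark Connect server ships with the Spark distribution. A minimal sketch of starting it and attaching a client (port 15002 is the default; depending on your distribution you may need to add the Spark Connect package via <code>--packages</code>, and there is no automated test here since the commands require a local Spark installation):</p>

```shell
# Start the Spark Connect server from the Spark installation directory
./sbin/start-connect-server.sh

# Attach an interactive PySpark client to the running server
pyspark --remote "sc://localhost:15002"
```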
<h2 id="heading-2-performance-and-catalyst-improvements">2. Performance and Catalyst Improvements</h2>
<p>Spark 4.0 continues to push boundaries in <strong>query optimization and execution</strong>, reducing the chances of out-of-memory (OOM) errors by providing:</p>
<ul>
<li><p>Faster joins and shuffle operations.</p>
</li>
<li><p>Improved <strong>adaptive query execution (AQE)</strong>.</p>
</li>
<li><p>Better <strong>codegen for complex queries</strong>, reducing JVM overhead.</p>
</li>
</ul>
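<p>These optimizations are applied automatically, but the adaptive query execution switches can be pinned in <code>spark-defaults.conf</code> (these are standard Spark SQL configuration keys, and AQE is already enabled by default in recent releases; shown explicitly here for visibility):</p>

```properties
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
```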
<h2 id="heading-3-python-udf-performance-improvements">3. Python UDF Performance Improvements</h2>
<p>Python is the most popular language for Spark, but Python UDFs have been a performance bottleneck for a while. In Spark 4.0, there were major improvements that resulted in a significant speedup for PySpark workloads using UDFs in large pipelines.</p>
<ul>
<li><p>Broader support for <strong>vectorized UDFs</strong> using Apache Arrow:</p>
<ul>
<li><p>Traditional UDFs in Spark process data row-by-row, requiring each row to be serialized and sent from the JVM (Spark engine) to the Python process. The Python UDF then executes on one row, and the results are sent back to the JVM. This process incurs significant overhead due to per-row communication and serialization.</p>
</li>
<li><p>With Apache Arrow, Spark can batch rows into a columnar format (Arrow Tables), send entire batches between the JVM and Python at once, and process them using Pandas UDFs (also known as vectorized UDFs). Instead of processing row-by-row, you handle entire Pandas Series or DataFrames in your UDF, which is much faster!</p>
</li>
</ul>
</li>
<li><p>Better <strong>Python-JVM serialization</strong>.</p>
</li>
<li><p>Enhanced <strong>error reporting</strong> for PySpark.</p>
</li>
</ul>
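<p>The row-at-a-time vs. vectorized difference can be sketched without Spark at all. The toy "serialization" below stands in for the JVM↔Python boundary and is not Spark's actual wire format; in real PySpark the batch version corresponds to a <code>pandas_udf</code> operating on whole <code>pandas.Series</code> backed by Arrow record batches:</p>

```python
import json

def add_one_row(row: str) -> str:
    # Row-at-a-time UDF: every value crosses the "boundary" individually,
    # paying (de)serialization overhead once per row.
    return json.dumps(json.loads(row) + 1)

def add_one_batch(batch: str) -> str:
    # Vectorized UDF: a whole batch is (de)serialized once and processed
    # as a single columnar chunk, like an Arrow record batch.
    return json.dumps([v + 1 for v in json.loads(batch)])

values = [1, 2, 3]
row_result = [json.loads(add_one_row(json.dumps(v))) for v in values]  # 3 round-trips
batch_result = json.loads(add_one_batch(json.dumps(values)))           # 1 round-trip
assert row_result == batch_result == [2, 3, 4]
```

<p>Same result either way; the batch version simply amortizes the boundary-crossing cost over the whole chunk, which is where the PySpark speedup comes from.</p>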
<h2 id="heading-4-ecosystem-modernization-and-cleanup">4. Ecosystem Modernization and Cleanup</h2>
<ul>
<li><p>Dropped support for <strong>legacy Hive features</strong> (e.g. HiveContext, Hive Metastore dialects, and Hive SerDe support).</p>
</li>
<li><p>Streamlined <strong>dependency management</strong> replaces the monolithic JARs with a more modular packaging system, allowing you to include <strong>only the components you need</strong>.</p>
</li>
</ul>
<p>This means cleaner code bases, smaller builds, and less dependency hell.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Apache Spark 4.0 is not just focused on performance; it’s about <strong>adapting Spark for today's cloud-based, data-driven world</strong>. Whether you're creating a streaming ETL pipeline, an ML workflow, or a large analytics dashboard, these updates aim to make it <strong>faster, more adaptable, and ready for the future</strong>.</p>
<p>Already using Spark 4.0? Share your thoughts and benchmarks in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Monolith or Microservices: How to Choose the Best Software Architecture for your case]]></title><description><![CDATA[In software engineering, choosing a system architecture is an essential decision that can affect the entire life-cycle of an application, from development and testing to deployment, maintenance, and debugging. There are two main architectures: Monoli...]]></description><link>https://practical-software.com/monolith-or-microservices-how-to-choose-the-best-software-architecture-for-your-case</link><guid isPermaLink="true">https://practical-software.com/monolith-or-microservices-how-to-choose-the-best-software-architecture-for-your-case</guid><category><![CDATA[monolithic architecture]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[System Design]]></category><category><![CDATA[architecture]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Fri, 30 May 2025 11:13:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748332321513/e6d93f80-cf60-4be6-ba80-a8a80674751d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In software engineering, choosing a system architecture is an essential decision that can affect the entire life-cycle of an application, from development and testing to deployment, maintenance, and debugging. There are two main architectures: <strong>Monolithic</strong> and <strong>Microservices</strong>. Each has its own strengths and trade-offs, and understanding each design is essential to building the right one for your needs.</p>
<h1 id="heading-what-is-monolithic-architecture">What is Monolithic Architecture?</h1>
<p>A monolithic architecture is a traditional model for designing software applications. In this approach, all components (user interface, business logic, and data access) are implemented in a single code-base and deployed as one unit.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748603417469/896d8280-2d66-4f51-90c8-e0e6ab5a7708.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pros">Pros</h2>
<ul>
<li><p><strong>Simplicity</strong>: Since everything is in one place, it is much easier to develop, test, and deploy, especially for small teams.</p>
</li>
<li><p><strong>Performance</strong>: Since all communications happen within the same process or between multiple processes on the same machine, completing a request or an operation is much faster. This is because it avoids the need to communicate between multiple services, which might involve costly or slow networks.</p>
</li>
<li><p><strong>Ease of debugging</strong>: Since logs and traces are from one running application, it is much easier to trace and debug a request.</p>
</li>
<li><p><strong>Simple infrastructure</strong>: No need to set up complex infrastructure since we only require one set of configurations to deploy a single application.</p>
</li>
<li><p><strong>Less technologies:</strong> Since it is a single system, you can use the same technology everywhere, which can be easier for smaller teams that specialize in a limited set of technologies.</p>
</li>
</ul>
<h2 id="heading-cons">Cons</h2>
<ul>
<li><p><strong>Scalability limitations</strong>: Since the workload varies across different parts of the system, scaling can be very costly. It involves replicating the entire application, which might require expensive servers, even though most of the system's functionality doesn't need scaling.</p>
</li>
<li><p><strong>Tight coupling</strong>: Since the codebase is all in one place, changes in one part may require redeploying the entire system. This carries the risk of deploying unfinished code in other areas and requires more testing to ensure no other parts are affected. Additionally, since all code is accessible from anywhere (even if visibility is controlled), it can lead to more dependencies between modules. To change one part, you might need to modify other parts that rely on that code.</p>
</li>
<li><p><strong>Slower development over time</strong>: As the codebase grows, onboarding and maintenance become harder, since a developer needs to understand most of the system to change one piece of code.</p>
</li>
<li><p><strong>Technology lock-in</strong>: Difficult to use or experiment with different technologies within one application, even if using a different type of technology would be more suitable for some parts of the system.</p>
</li>
</ul>
<h1 id="heading-what-are-microservices">What are Microservices?</h1>
<p>Microservices divide an application into loosely connected, independently deployed services. Each service handles a specific business function and communicates with other services through predefined APIs.</p>
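<p>The "predefined APIs" point can be made concrete with a toy sketch using only the Python standard library. The <em>inventory</em> and <em>orders</em> services, the <code>/stock</code> endpoint, and the payload below are all hypothetical; the point is that one service reaches another over the network through its API rather than via an in-process function call:</p>

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical "inventory" service: one business function behind a small HTTP API.
class InventoryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"sku": "abc-123", "in_stock": 7}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def start_inventory_service() -> HTTPServer:
    # Port 0 lets the OS pick a free port; a real deployment would rely on
    # service discovery instead of hard-coded addresses.
    server = HTTPServer(("127.0.0.1", 0), InventoryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def check_stock(inventory_url: str) -> dict:
    # The "orders" service calls inventory over the network via its API,
    # rather than calling a function in the same process as a monolith would.
    with urlopen(f"{inventory_url}/stock/abc-123") as resp:
        return json.load(resp)

server = start_inventory_service()
stock = check_stock(f"http://127.0.0.1:{server.server_address[1]}")
server.shutdown()
```

<p>Everything that makes this trivial sketch production-ready (discovery, retries, auth, tracing) is exactly the operational ecosystem discussed later in this article.</p>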
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748603467299/251fb5b0-2606-43a7-97fe-1fb4ad9d3ec2.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pros-1">Pros</h2>
<ul>
<li><p><strong>Independent scalability</strong>: Based on the workload of each part of the system, you can scale only the services that need it. This can be done automatically to allocate resources to the right places, improving utilization.</p>
</li>
<li><p><strong>Resilience</strong>: Failure in one service doesn’t necessarily bring down the entire system.</p>
</li>
<li><p><strong>Technology flexibility</strong>: Each service can use a different tech stack depending on the use case and the most efficient way to implement it.</p>
</li>
<li><p><strong>Faster deployments</strong>: Teams can develop and release services independently and more quickly because they only need to test the components that depend on their service, rather than the entire system. This is especially true if there are no changes to the API.</p>
</li>
<li><p><strong>Faster on-boarding:</strong> New developers don't need to understand the entire system to start contributing to a service, which saves time when on-boarding a new member.</p>
</li>
</ul>
<h2 id="heading-cons-1">Cons</h2>
<ul>
<li><p><strong>Increased complexity</strong>: Managing multiple services increases operational and architectural overhead.</p>
</li>
<li><p><strong>Distributed systems challenges</strong>: Issues like network latency, fault tolerance, and eventual consistency must be handled.</p>
</li>
<li><p><strong>Deployment complexity</strong>: Requires complex CI/CD pipelines and different configurations for each service, which adds an overhead from the operational side.</p>
</li>
<li><p><strong>Data management</strong>: Each service might require its own type of database, which adds complexity to coordination. While this is technically beneficial, supporting different types of storage can be challenging. It involves managing fault tolerance, backups, and maintaining availability.</p>
</li>
<li><p><strong>Complex debugging and monitoring:</strong> Tracing a single request through multiple services is very challenging and requires creating a custom infrastructure to make logging, debugging, monitoring and alerting easier.</p>
</li>
</ul>
<h2 id="heading-the-hidden-cost-of-microservices-operational-overhead"><strong>The Hidden Cost of Microservices: Operational Overhead</strong></h2>
<p>While Microservices offer flexibility and scalability, they demand a robust <strong>supportive ecosystem</strong>. Without this, the architecture can quickly become a maintenance nightmare. Here are the key systems around Microservices:</p>
<ul>
<li><p><strong>Service Discovery:</strong> Helps services dynamically locate and communicate with each other in a distributed environment. e.g. <strong>Kubernetes Service Discovery</strong>.</p>
</li>
<li><p><strong>API Gateway and Load Balancing:</strong> Acts as a single entry point for clients, managing routing, load balancing, authentication, and request aggregation. This is important because each service might be deployed on different servers and zones due to auto-scaling. It distributes traffic across multiple service instances to enhance performance and reliability, ensuring that each service or client can identify which server to communicate with in a balanced manner. e.g. <strong>AWS API Gateway</strong>, <strong>Spring Cloud Gateway, gRPC-LB</strong>, <strong>Nginx and Envoy.</strong></p>
</li>
<li><p><strong>Security (Authentication &amp; Authorization):</strong> Ensures secure communication between services and controls access efficiently, so it doesn't impact the overall request latency. e.g. <strong>OAuth2</strong> and <strong>OpenID.</strong></p>
</li>
<li><p><strong>Configuration Management</strong>: Centralizes and dynamically manages configurations across services. It can be connected with a dynamic auto-scaling system, which also monitors and provides fault tolerance to maintain the system's resilience. e.g. <strong>Kubernetes, AWS Systems Manager, Resilience4j</strong> and <strong>Spring Cloud Config.</strong></p>
</li>
<li><p><strong>Centralized Logging &amp; Monitoring:</strong> Aggregates logs and metrics and tracks requests across services to help with debugging, analyzing latency, and monitoring performance. e.g. <strong>ELK Stack (Elasticsearch, Logstash, Kibana)</strong>, <strong>Prometheus + Grafana</strong> and <strong>OpenTelemetry.</strong></p>
</li>
<li><p><strong>Deployment and CI/CD Pipelines:</strong> Automates the testing, building, and deploying of Microservices. It allows for gradual deployment, which helps catch issues by deploying to a small percentage of traffic first. This approach supports a safer continuous delivery process. e.g. <strong>Jenkins</strong> and <strong>GitLab CI/CD.</strong></p>
</li>
<li><p><strong>Messaging queues:</strong> While it is an optional component, it enables decoupled communication between services via events and improves resilience of the system. e.g. <strong>Apache Kafka</strong> and <strong>RabbitMQ</strong>.</p>
</li>
</ul>
<p>Without these systems, developers may struggle to debug issues, track down failures, or even understand how components interact. Microservices don't just decentralize code, but also decentralize responsibility, which can result in chaos if not carefully orchestrated.</p>
<h1 id="heading-when-to-choose-what"><strong>When to Choose What?</strong></h1>
<ul>
<li><p><strong>Go Monolith If</strong>: You are a small team or an early-stage startup, the application is simple and it is not expected to grow quickly, and you need to launch rapidly with minimal infrastructure.</p>
</li>
<li><p><strong>Go Microservices If</strong>: Your application has a clear domain and context, you have the resources to work on different parts of the application at the same time, which need to use different technologies and be scaled independently, and you have the infrastructure and experience to support it.</p>
</li>
</ul>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>Microservices are powerful but not a one-size-fits-all solution. While they offer scalability and flexibility, they also bring significant complexity that requires proper tools and practices to manage. If your team isn't prepared to invest in logging, monitoring, tracing, and service orchestration, starting with a monolith might be the wiser option.</p>
<p>Choose your architecture based on the specific needs and maturity of your product and the resources you have, not just on current trends. The choice of tools should consider factors like cloud vs on-site, team expertise, and specific use cases.</p>
]]></content:encoded></item><item><title><![CDATA[Selecting the Best File Formats for Apache Spark: Parquet, ORC, CSV and more]]></title><description><![CDATA[One of the most important decisions in your Apache Spark pipeline is how you store your data. The data format you choose can dramatically affect performance, storage costs, and query speed. Let’s explore the most common file formats supported by Apac...]]></description><link>https://practical-software.com/selecting-the-best-file-formats-for-apache-spark-parquet-orc-csv-and-more</link><guid isPermaLink="true">https://practical-software.com/selecting-the-best-file-formats-for-apache-spark-parquet-orc-csv-and-more</guid><category><![CDATA[#apache-spark]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Parquet]]></category><category><![CDATA[orc]]></category><category><![CDATA[json]]></category><category><![CDATA[csv]]></category><category><![CDATA[Apache Avro]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sat, 24 May 2025 13:36:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748093233044/6156c810-a35a-4abb-b03a-29db3e97bd61.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the most important decisions in your Apache Spark pipeline is <strong>how you store your data</strong>. The data format you choose can dramatically affect performance, storage costs, and query speed. Let’s explore the most common file formats supported by Apache Spark, and in which cases they can fit the most.</p>
<h1 id="heading-different-file-formats">Different file formats</h1>
<p>There are different types of data formats commonly used in data processing, especially with tools like <strong>Apache Spark</strong>, broken into <strong>categories</strong> based on their structure and use case:</p>
<h2 id="heading-row-based-file-formats">Row-Based File Formats</h2>
<p>The data is stored <strong>row by row</strong>, which makes it easy to write and process sequentially, but <strong>less efficient</strong> for analytical queries where only a few columns are needed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748093737370/fc8d76b6-f66a-428e-afdb-9ceb952ba353.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-csv-comma-separated-values"><strong>CSV</strong> (Comma-Separated Values)</h3>
<p><strong>CSV</strong> is a plain text, row-based format where columns are separated by commas. It is easy to work with but not efficient for big data.</p>
<p><strong>Pros</strong>: CSV is human-readable, simple to write and read, and is used globally.</p>
<p><strong>Cons</strong>: CSV lacks data types, requiring Spark to infer column types from a sample of the CSV file, which adds extra work and may not be accurate. Additionally, CSV has poor compression and struggles with encoding complex data.</p>
<p><strong>Use cases</strong>: Legacy systems, small data exports, debugging, and working with spreadsheets.</p>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html"><strong>CSV file</strong></a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.options(delimiter=<span class="hljs-string">","</span>, header=<span class="hljs-type">True</span>).csv(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.option(<span class="hljs-string">"delimiter"</span>, <span class="hljs-string">","</span>).option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>).csv(path)
</code></pre>
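<p>The typing problem is easy to see with Python's standard <code>csv</code> module: every field comes back as a string, so either the reader infers types by sampling (as Spark does) or you cast explicitly, which is what supplying a schema to Spark's reader (e.g. via <code>.schema(...)</code>) does for you. A minimal sketch with hypothetical data and an illustrative <code>apply_schema</code> helper:</p>

```python
import csv
import io

raw = "id,price,active\n1,9.99,true\n2,12.50,false\n"

# csv returns strings only, just as a schema-less CSV gives Spark no types
rows = list(csv.DictReader(io.StringIO(raw)))
assert rows[0] == {"id": "1", "price": "9.99", "active": "true"}

# a hand-rolled "schema": cast each column to the type you know it should have
def apply_schema(row: dict) -> dict:
    return {"id": int(row["id"]),
            "price": float(row["price"]),
            "active": row["active"] == "true"}

typed = [apply_schema(r) for r in rows]
assert typed[0] == {"id": 1, "price": 9.99, "active": True}
```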
<h3 id="heading-json-javascript-object-notation">JSON (JavaScript Object Notation)</h3>
<p><strong>JSON</strong> is a lightweight, text-based format for exchanging data. It uses <strong>human-readable</strong> text to store and send information, but it can be slow and doesn't enforce a schema.</p>
<p><strong>Pros</strong>: <strong>JSON</strong> is readable and widely supported by many systems, and can store <strong>semi-structural</strong> data.</p>
<p><strong>Cons</strong>: <strong>JSON</strong> is slow to parse, and each row must be a valid JSON for Spark to parse. Additionally, from a storage perspective, JSON produces large files because many boilerplate tokens and key names are repeated in each row, and it lacks schema enforcement.</p>
<p><strong>Use case</strong>: Mainly use <strong>JSON</strong> for debugging or exploring data. It can also be used to integrate with external systems that provide <strong>JSON</strong>, which you can't control, but don’t depend on it as the final storage data format.</p>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-json.html"><strong>JSON file</strong></a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.json(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.json(path)
</code></pre>
<h3 id="heading-apache-avro">Apache Avro</h3>
<p><strong>Apache Avro</strong> is a row-based format often used with <strong>Kafka</strong> pipelines and <strong>data exchange</strong> scenarios. It supports <strong>descriptive extendable schema</strong> and is compact for serialization.</p>
<p><strong>Pros</strong>: <strong>Avro</strong> is efficient in storage, since it is in binary format, and has a great schema evolution feature.</p>
<p><strong>Cons</strong>: While <strong>Avro</strong> is efficient in storage, it is not optimized for columnar queries, since you need to scan the whole file to read specific columns.</p>
<p><strong>Use case</strong>: <strong>Avro</strong> is mainly used with real-time streaming systems like <strong>Kafka</strong> because it is easy to serialize and transmit. It also allows for easy schema evolution through a schema registry.</p>
<p>The spark-avro module is external and not included in <code>spark-submit</code> or <code>spark-shell</code> by default, but <code>spark-avro_VERSION</code> and its dependencies can be added directly to <code>spark-submit</code> using <code>--packages</code>:</p>
<pre><code class="lang-bash">./bin/spark-submit --packages org.apache.spark:spark-avro_VERSION

./bin/spark-shell --packages org.apache.spark:spark-avro_VERSION
</code></pre>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-avro.html"><strong>Avro file</strong></a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.format(<span class="hljs-string">"avro"</span>).load(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.format(<span class="hljs-string">"avro"</span>).load(path)
</code></pre>
<h2 id="heading-columnar-file-formats">Columnar File Formats</h2>
<p>The data is stored <strong>column by column</strong>, making them ideal for <strong>analytics and interactive dashboards</strong> where only a subset of columns is queried.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748093760181/2b68942e-c607-4aa3-9283-8edf71903fae.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-parquet-the-gold-standard-for-analytics">Parquet (The Gold Standard for Analytics)</h3>
<p><strong>Parquet</strong> is a columnar binary format optimized for analytical queries. It’s the most popular format for Spark workloads.</p>
<p><strong>Pros:</strong> Parquet is built for efficient reads, with compression and predicate push-down, which makes it fast, compact, and ideal for Spark, Hive, and Presto.</p>
<p><strong>Cons:</strong> Parquet is slightly slower to write than row-based formats.</p>
<p><strong>Use case:</strong> Parquet is the first choice for Spark analytical workloads, data lakes, and cloud storage.</p>
<p><strong>Reading</strong> <a target="_blank" href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html">Parquet file</a> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.parquet(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.parquet(path)
</code></pre>
<h3 id="heading-apache-orc-optimized-row-columnar">Apache ORC (Optimized Row Columnar)</h3>
<p><strong>ORC</strong> is another columnar format, optimized for the Hadoop ecosystem, especially Hive.</p>
<p><strong>Pros:</strong> <strong>ORC</strong> has a high compression ratio, is optimized for scan-heavy queries, and supports predicate push-down similar to Parquet.</p>
<p><strong>Cons:</strong> <strong>ORC</strong> has less support outside the Hadoop ecosystem, which makes it harder to integrate with other tools.</p>
<p><strong>Use case:</strong> Hive-based data warehouses, HDFS-based systems.</p>
<p><strong>Reading</strong> <strong>ORC file</strong> <strong>in Apache Spark example:</strong></p>
<pre><code class="lang-scala"># <span class="hljs-type">Pyspark</span> example
df = spark.read.format(<span class="hljs-string">"orc"</span>).load(path)

# <span class="hljs-type">Scala</span> example
<span class="hljs-keyword">val</span> df = spark.read.format(<span class="hljs-string">"orc"</span>).load(path)
</code></pre>
<h2 id="heading-summary-table">Summary table</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Format</strong></td><td><strong>Type</strong></td><td><strong>Compression</strong></td><td><strong>Predicate Push-down</strong></td><td><strong>Best Use Case</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Parquet</strong></td><td>Columnar</td><td>Excellent</td><td>✅ Yes</td><td>Big data, analytics, selective queries</td></tr>
<tr>
<td><strong>ORC</strong></td><td>Columnar</td><td>Excellent</td><td>✅ Yes</td><td>Hive-based data lakes</td></tr>
<tr>
<td><strong>Avro</strong></td><td>Row-based</td><td>Good</td><td>❌ No (limited)</td><td>Kafka pipelines, schema evolution</td></tr>
<tr>
<td><strong>JSON</strong></td><td>Row-based</td><td>None</td><td>❌ No</td><td>Debugging, integration</td></tr>
<tr>
<td><strong>CSV</strong></td><td>Row-based</td><td>None</td><td>❌ No</td><td>Legacy formats, ingestion, exploration</td></tr>
</tbody>
</table>
</div><h1 id="heading-conclusion">Conclusion</h1>
<p>Choosing the right file format in Spark is <strong>not just a technical decision</strong>; it's a <strong>strategic one</strong>. Parquet and ORC are solid choices for most modern workloads, but your use case, tools, and ecosystem should guide your choice.</p>
]]></content:encoded></item><item><title><![CDATA[How Spark Connect Enhances the Future of Apache Spark Connectivity]]></title><description><![CDATA[Apache Spark has been a popular choice for large-scale distributed data processing. However, as data teams move to cloud architectures and separate computes from client interfaces, the traditional tightly coupled Spark driver model has begun to revea...]]></description><link>https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity</link><guid isPermaLink="true">https://practical-software.com/how-spark-connect-enhances-the-future-of-apache-spark-connectivity</guid><category><![CDATA[Spark Job Server]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[Python]]></category><category><![CDATA[REST API]]></category><category><![CDATA[Apache livy]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sun, 18 May 2025 15:59:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747566344475/17ffe084-04bd-462e-93a2-ee8dbbe4cfed.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark has been a popular choice for large-scale distributed data processing. However, as data teams move to cloud architectures and separate compute from client interfaces, the traditional tightly coupled Spark driver model has begun to reveal its limitations. In this article, we will explore the new Spark Connect feature and the future of remote execution.</p>
<h2 id="heading-what-is-spark-connect">What is Spark Connect?</h2>
<p><a target="_blank" href="https://spark.apache.org/spark-connect/"><strong>Spark Connect</strong></a> is a <strong>decoupled client-server protocol</strong> that lets Spark clients, like Python or Java applications, interact with a Spark driver process over the network. Unlike traditional Spark applications where the client starts and controls the driver, Spark Connect uses a <strong>gRPC-based protocol</strong> to communicate with a <strong>running Spark Connect server</strong>. Think of it as <strong>Spark as a Service</strong> for your data apps and notebooks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747828041834/ba1b4d46-c514-420b-97ca-faf3726a4bb8.png" alt class="image--center mx-auto" /></p>
<p><strong>Spark Connect</strong> was introduced in <strong>Spark 3.4</strong> and further improved in <strong>3.5</strong>. It changes how clients connect to and interact with a Spark cluster, providing more flexibility, scalability, and language support.</p>
<ul>
<li><p><strong>Spark Connect</strong> is <strong>not a cluster manager</strong>. It's a <em>protocol</em> that allows clients to communicate with a Spark driver <em>remotely</em>, while still using traditional cluster modes underneath (like <strong>YARN</strong> or <strong>Kubernetes</strong>).</p>
</li>
<li><p><strong>Spark Connect</strong> makes <strong>client-side development easier</strong> and is ideal for integrating Spark into tools like <strong>VSCode</strong>, <strong>Jupyter</strong>, or <strong>web apps</strong>.</p>
</li>
<li><p>Decoupling the client from the <strong>Spark cluster</strong> makes it easier to upgrade and scale the cluster separately from the client. This approach removes <strong>dependency</strong> <strong>conflicts</strong> and offers greater flexibility in language support.</p>
</li>
</ul>
<h2 id="heading-why-spark-connect">Why Spark Connect?</h2>
<p>Before Spark Connect, running a Spark application meant <strong>bundling the Spark driver with your client logic.</strong> This led to <strong>long startup times,</strong> <strong>dependency conflicts,</strong> and <strong>poor IDE integration.</strong> It was also difficult to use <strong>interactive notebooks or mobile/web-based interfaces</strong> with a Spark backend.</p>
<p>With <strong>Spark Connect</strong>, clients are <strong>lightweight</strong> and only need a compatible client library. You can embed Spark inside <strong>VSCode, Jupyter notebooks, web apps, and mobile apps</strong>. This setup allows for easier scaling and faster iteration.</p>
<h2 id="heading-how-does-spark-connect-work">How does Spark Connect Work?</h2>
<ol>
<li><p>A connection is established between the client and the Spark server.</p>
</li>
<li><p>The client converts a <strong>DataFrame</strong> query into an <strong>unresolved logical plan</strong>, which describes what the operation should do, not how it should be executed.</p>
</li>
<li><p>The <strong>unresolved logical plan</strong> is <strong>encoded</strong> and sent to the Spark server.</p>
</li>
<li><p>The Spark server <strong>optimizes</strong> and <strong>executes</strong> the query.</p>
</li>
<li><p>The Spark server sends the <strong>results</strong> back to the client.</p>
</li>
</ol>
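<p>The five steps above can be sketched as a toy client/server exchange in plain Python. This is an illustrative simulation only — the real protocol encodes unresolved plans as protobuf messages over gRPC — and every name in it is hypothetical:</p>

```python
import json

# Hypothetical "server": decodes a logical plan, "optimizes" and executes it.
def spark_server_execute(encoded_plan: str) -> list:
    plan = json.loads(encoded_plan)
    tables = {"sales": [{"category": "a", "n": 1}, {"category": "a", "n": 2}]}
    rows = tables[plan["source"]]
    if plan["op"] == "group_count":
        counts = {}
        for row in rows:
            key = row[plan["key"]]
            counts[key] = counts.get(key, 0) + 1
        return [{plan["key"]: k, "count": v} for k, v in counts.items()]
    raise ValueError("unsupported op")

# Hypothetical client: describes *what* to compute (an unresolved plan),
# encodes it, and ships it to the server instead of running a driver locally.
unresolved_plan = {"source": "sales", "op": "group_count", "key": "category"}
result = spark_server_execute(json.dumps(unresolved_plan))
print(result)  # [{'category': 'a', 'count': 2}]
```

<p>The client never sees how the query is executed; it only describes the result it wants, which is what makes the client side so lightweight.</p>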
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747570282005/14c43ceb-2089-4953-9bad-c11013b545f9.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-practical-example-using-spark-connect-with-pyspark">Practical example: Using Spark Connect with PySpark</h2>
<h4 id="heading-step-1-start-the-spark-connect-server">Step 1: Start the Spark Connect Server</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># This launches the Spark Connect endpoint (the script ships under sbin/)</span>
$ ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
</code></pre>
<h4 id="heading-step-2-connect-from-a-python-client">Step 2: Connect from a Python Client</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

<span class="hljs-comment"># sc:// is the special URI scheme used for Spark Connect</span>
spark = SparkSession.builder.remote(<span class="hljs-string">"sc://localhost:&lt;PORT&gt;"</span>).getOrCreate()

df = spark.read.csv(<span class="hljs-string">"example.csv"</span>, header=<span class="hljs-literal">True</span>)
df.groupBy(<span class="hljs-string">"category"</span>).count().show()
</code></pre>
<h2 id="heading-best-for-the-following-use-cases">Best for the following use cases</h2>
<ul>
<li><p><strong>Interactive Data Science:</strong> Use Jupyter or VSCode to run Spark jobs remotely</p>
</li>
<li><p><strong>CI/CD Pipelines:</strong> Validate jobs in GitHub Actions or GitLab CI</p>
</li>
<li><p><strong>Remote Data Apps:</strong> Build APIs and dashboards powered by Spark</p>
</li>
<li><p><strong>Multi-Tenant Platforms:</strong> Serve multiple users via a single Spark backend</p>
</li>
</ul>
<h2 id="heading-limitations">Limitations</h2>
<ul>
<li><p><strong>Spark Connect</strong> is still in its early stages, so some features like complex <strong>UDFs</strong> or <strong>Streaming</strong> might have limited support.</p>
</li>
<li><p>You need Spark 3.5 or later for a more stable experience.</p>
</li>
<li><p>Monitoring and debugging are still developing for Spark Connect.</p>
</li>
</ul>
<h2 id="heading-spark-connect-alternatives">Spark Connect alternatives</h2>
<p><a target="_blank" href="https://github.com/spark-jobserver/spark-jobserver">Spark Job Server</a> and <a target="_blank" href="https://livy.apache.org/">Apache Livy</a> are similar projects that expose Spark jobs through <strong>REST APIs</strong>. They are typically used to manage job submissions from external apps like dashboards and notebooks, enabling <strong>remote</strong> interaction with Spark. However, they differ fundamentally from Spark Connect in design, use cases, and maturity.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td><strong>Spark Connect</strong></td><td><strong>Spark Job Server</strong></td><td><strong>Apache Livy</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Type</strong></td><td>Built-in gRPC client-server protocol</td><td>External REST API server</td><td>REST-based Spark session manager</td></tr>
<tr>
<td><strong>Official Status</strong></td><td>✅ Native to Apache Spark (3.4+)</td><td>❌ Community project (not officially maintained)</td><td>🟡 Incubating under Apache (inactive since 2021)</td></tr>
<tr>
<td><strong>Client Language Support</strong></td><td>Python, Scala, Java, Go, Rust, Dotnet</td><td>REST only, language-agnostic</td><td>REST + limited Scala/Python clients</td></tr>
<tr>
<td><strong>Architecture</strong></td><td>Lightweight clients + Spark driver over gRPC</td><td>External server + job runners</td><td>External service managing Spark sessions</td></tr>
<tr>
<td><strong>Latency / Interactivity</strong></td><td>⚡ Very low latency, interactive (DataFrame API)</td><td>High (submit job, poll status)</td><td>Medium-high</td></tr>
<tr>
<td><strong>Streaming Support</strong></td><td>❌ Limited (in progress)</td><td>❌ No</td><td>🟡 Partial (limited with batch-like APIs)</td></tr>
<tr>
<td><strong>Stateful Sessions</strong></td><td>✅ Persistent client-side SparkSession</td><td>✅ Yes (Job Server Contexts)</td><td>✅ Yes (Livy Sessions)</td></tr>
<tr>
<td><strong>Authentication / Security</strong></td><td>SSL/gRPC auth (evolving)</td><td>Manual or custom</td><td>Kerberos, Hadoop-compatible</td></tr>
<tr>
<td><strong>Ease of Deployment</strong></td><td>✅ Easy with Spark 3.5+</td><td>❌ Complex, often fragile</td><td>❌ Tricky to deploy &amp; scale</td></tr>
<tr>
<td><strong>Use Case Fit</strong></td><td>Interactive apps, notebooks, CI/CD</td><td>Ad hoc job submission, dashboards</td><td>Multi-user notebooks, REST access</td></tr>
<tr>
<td><strong>Extensibility / Maintenance</strong></td><td>✅ Actively developed</td><td>❌ Unmaintained / legacy</td><td>🟡 Outdated, low activity</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">Conclusion</h2>
<ul>
<li><p><strong>Spark Connect</strong> is the future of native remote interaction with Spark. It's fast and a great fit for developers, notebooks, and microservices.</p>
</li>
<li><p><strong>Livy</strong> and <strong>Spark Job Server</strong> were temporary solutions before Spark had native client-server support. They work well for some REST API-based job orchestration scenarios but are now considered outdated and are not maintained.</p>
</li>
<li><p>If you're starting a new project, go with <strong>Spark Connect</strong>. If you're maintaining an older system, <strong>Livy or Spark Job Server</strong> might still be useful for now.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Mastering Apache Spark SQL: Create Complex Queries with Common Table Expressions CTE (WITH Clause)]]></title><description><![CDATA[Apache Spark SQL uses SQL capabilities to process large-scale structured data. One powerful feature in modern SQL is the WITH clause, supported in Spark SQL as Common Table Expressions (CTE). CTE offer a more organized, readable, and often more effic...]]></description><link>https://practical-software.com/mastering-apache-spark-sql-create-complex-queries-with-common-table-expressions-cte-with-clause</link><guid isPermaLink="true">https://practical-software.com/mastering-apache-spark-sql-create-complex-queries-with-common-table-expressions-cte-with-clause</guid><category><![CDATA[sparksql]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[with-statement]]></category><category><![CDATA[SQL]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sat, 17 May 2025 12:00:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747483097673/70a714b3-35be-434c-9f98-b6f1595bf76a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark SQL uses SQL capabilities to process large-scale structured data. One powerful feature in modern SQL is the <code>WITH</code> clause, supported in Spark SQL as Common Table Expressions (<code>CTE</code>). CTE offer a more organized, readable, and often more efficient way to build complex queries. This article will explain what CTE is, why it is valuable in Spark SQL, and explore its syntax with practical examples.</p>
<h2 id="heading-what-is-a-common-table-expression-cte">What is a Common Table Expression (CTE)?</h2>
<p>A <a target="_blank" href="https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-cte.html">Common Table Expression</a>, or CTE, is a named, temporary result set that you define within a single SQL statement. It's like a temporary, virtual table that only exists while the query is running. A CTE starts with the <code>WITH</code> clause, followed by one or more named sub-queries.</p>
<p><strong>The basic syntax example is:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> expression_name [ ( column_name [ , ... ] ) ] [ <span class="hljs-keyword">AS</span> ] ( <span class="hljs-keyword">query</span> ) [ , ... ]
</code></pre>
<ul>
<li><p><code>expression_name</code>: A unique name you assign to your temporary result set.</p>
</li>
<li><p><code>(column_name, ...)</code>: An optional list of column aliases for the CTE's output. If not provided, Spark SQL will infer column names from the <code>SELECT</code> statement within the CTE.</p>
</li>
<li><p><code>AS (query)</code>: The <code>SELECT</code> statement that defines the logic for your CTE.</p>
</li>
</ul>
<h2 id="heading-why-use-cte-in-spark-sql">Why Use CTE in Spark SQL?</h2>
<p>While you can often achieve similar results using nested sub-queries, CTEs bring several significant advantages to Spark SQL development:</p>
<ol>
<li><p><strong>Improves readability:</strong> Complex queries can quickly become difficult to follow and modify due to nested sub-queries. CTEs let you break down common logic into smaller, named, and more manageable parts. Each CTE acts as a logical unit of work, making the entire query easier to understand, debug, and maintain.</p>
</li>
<li><p><strong>Enhances usability:</strong> A key benefit of CTEs is that you can reference them multiple times within the same <code>WITH</code> clause or the final <code>SELECT</code> statement. This helps avoid code duplication and ensures consistency in your intermediate calculations.</p>
</li>
<li><p><strong>Simplifies debugging:</strong> By breaking down the logic into separate blocks, you can easily debug each part of the CTE independently. This helps you find issues much faster than trying to debug a single, complex query.</p>
</li>
<li><p><strong>Potential for Optimization:</strong> While CTEs are defined as temporary result sets, Spark often treats them like logical views. This allows Spark's Catalyst Optimizer to apply optimizations, such as pushing down predicates, across CTE boundaries. This can result in more efficient execution plans, particularly when a CTE is used multiple times. Spark might materialize the result or optimize its execution just once.</p>
</li>
</ol>
<h2 id="heading-practical-example">Practical example</h2>
<p>Suppose we have a <code>sales</code> table and want to find the total sales for each product category.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Sample Data Setup (for demonstration purposes)</span>
<span class="hljs-comment">-- This would typically be a pre-existing table or DataFrame</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">TEMPORARY</span> <span class="hljs-keyword">VIEW</span> sales <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> <span class="hljs-keyword">VALUES</span>
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Laptop'</span>, <span class="hljs-number">1200.00</span>, <span class="hljs-string">'2024-01-15'</span>),
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Mouse'</span>, <span class="hljs-number">25.00</span>, <span class="hljs-string">'2024-01-15'</span>),
    (<span class="hljs-string">'Clothing'</span>, <span class="hljs-string">'T-Shirt'</span>, <span class="hljs-number">20.00</span>, <span class="hljs-string">'2024-01-16'</span>),
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Keyboard'</span>, <span class="hljs-number">75.00</span>, <span class="hljs-string">'2024-01-16'</span>),
    (<span class="hljs-string">'Clothing'</span>, <span class="hljs-string">'Jeans'</span>, <span class="hljs-number">50.00</span>, <span class="hljs-string">'2024-01-17'</span>),
    (<span class="hljs-string">'Electronics'</span>, <span class="hljs-string">'Monitor'</span>, <span class="hljs-number">300.00</span>, <span class="hljs-string">'2024-01-17'</span>)
<span class="hljs-keyword">AS</span> sales_data(<span class="hljs-keyword">category</span>, product, amount, sale_date);

<span class="hljs-comment">-- Using a CTE to calculate total sales per category</span>
<span class="hljs-keyword">WITH</span> CategorySales <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">category</span>, <span class="hljs-keyword">SUM</span>(amount) <span class="hljs-keyword">AS</span> total_category_sales
    <span class="hljs-keyword">FROM</span> sales
    <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">category</span>
)
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">category</span>, total_category_sales
<span class="hljs-keyword">FROM</span> CategorySales
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> total_category_sales <span class="hljs-keyword">DESC</span>;

Electronics    1600.00
Clothing    70.00
Time taken: 0.157 seconds, Fetched 2 row(s)
</code></pre>
<p>In this example, <code>CategorySales</code> is our CTE. It calculates the sum of the <code>amount</code> grouped by <code>category</code>. The final <code>SELECT</code> statement then simply queries this temporary <code>CategorySales</code> result set.</p>
<h2 id="heading-chaining-ctes">Chaining CTEs</h2>
<p>One of the most powerful features of CTEs is the ability to chain them. This means a later CTE can refer to an earlier CTE within the same <code>WITH</code> clause. This approach lets you build complex logic step by step.</p>
<p>Consider extending the previous example to find the average sales across all categories and then identify categories whose sales are above this average.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> CategorySales <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">category</span>, <span class="hljs-keyword">SUM</span>(amount) <span class="hljs-keyword">AS</span> total_category_sales
    <span class="hljs-keyword">FROM</span> sales
    <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">category</span>
), AverageOverallSales <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">AVG</span>(total_category_sales) <span class="hljs-keyword">AS</span> overall_avg_sales
    <span class="hljs-keyword">FROM</span> CategorySales <span class="hljs-comment">-- Referencing the first CTE</span>
)
<span class="hljs-keyword">SELECT</span>
    cs.category,
    cs.total_category_sales,
    aos.overall_avg_sales
<span class="hljs-keyword">FROM</span> CategorySales cs
<span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> AverageOverallSales aos
<span class="hljs-keyword">WHERE</span> cs.total_category_sales &gt; aos.overall_avg_sales
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> cs.total_category_sales <span class="hljs-keyword">DESC</span>;

Electronics    1600.00    835.000000
Time taken: 0.321 seconds, Fetched 1 row(s)
</code></pre>
<p>Here, <code>CategorySales</code> calculates the total sales for each category. Then, <code>AverageOverallSales</code> uses <code>CategorySales</code> to find the overall average. Finally, the main query joins these two CTEs to filter out categories with sales above the average.</p>
<h2 id="heading-best-fit-use-cases">Best fit use cases</h2>
<p>CTEs are highly beneficial in various real-world scenarios:</p>
<ul>
<li><p><strong>Step-by-Step data transformation:</strong> When you need to apply a series of transformations like filtering, aggregation, and joining to your data, CTEs let you define each step clearly.</p>
</li>
<li><p><strong>Complex aggregations and analytics:</strong> For multi-level aggregations or calculations involving window functions where intermediate results are needed, CTEs offer a clear structure.</p>
</li>
<li><p><strong>Sub-query factorization:</strong> If you find yourself writing the same sub-query multiple times, extract it into a CTE for reusability.</p>
</li>
<li><p><strong>Anomaly detection and quality checks:</strong> You can define CTEs to spot anomalies or specific data patterns and then use these CTEs in your main query to flag or exclude problematic records.</p>
</li>
<li><p><strong>Improving Performance for Repeated Computations:</strong> If a complex sub-query is calculated multiple times in a large query, turning it into a CTE can sometimes help Spark optimize its execution, potentially avoiding repeated calculations.</p>
</li>
</ul>
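<p>Because the <code>WITH</code> clause is standard SQL, the sub-query factorization pattern above can be sketched without a Spark cluster, using Python's built-in <code>sqlite3</code> module. The schema and numbers below are illustrative; the key point is that <code>CategorySales</code> is defined once and referenced twice:</p>

```python
import sqlite3

# CTEs are standard SQL, so the built-in sqlite3 module can demonstrate
# sub-query factorization: CategorySales is defined once and referenced twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Electronics", 1200.0), ("Electronics", 400.0), ("Clothing", 70.0)],
)

rows = conn.execute("""
    WITH CategorySales AS (
        SELECT category, SUM(amount) AS total FROM sales GROUP BY category
    )
    SELECT cs.category, cs.total
    FROM CategorySales cs
    WHERE cs.total > (SELECT AVG(total) FROM CategorySales)  -- CTE reused here
""").fetchall()

print(rows)  # [('Electronics', 1600.0)]
```

<p>Without the CTE, the aggregation would have to be written twice — once in the <code>FROM</code> clause and once in the scalar sub-query — which is exactly the duplication CTEs eliminate.</p>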
<h2 id="heading-conclusion">Conclusion</h2>
<p>Common Table Expressions are a key feature in modern SQL that greatly improve the developer experience. By allowing modularity, enhancing readability, and promoting re-usability, CTEs help data professionals write cleaner, more maintainable, and often more efficient Spark SQL queries. They turn complex data challenges into clear, manageable steps.</p>
]]></content:encoded></item><item><title><![CDATA[How to Fix Data Skew in Apache Spark with the Salting Technique]]></title><description><![CDATA[When working with large datasets in Apache Spark, a common performance issue is data skew. This occurs when a few keys dominate the data distribution, leading to uneven partitions and slow queries. It mainly happens during operations that require shu...]]></description><link>https://practical-software.com/apache-spark-fix-data-skew-issue-using-salting-technique</link><guid isPermaLink="true">https://practical-software.com/apache-spark-fix-data-skew-issue-using-salting-technique</guid><category><![CDATA[#apache-spark]]></category><category><![CDATA[Scala]]></category><category><![CDATA[big data]]></category><category><![CDATA[joins]]></category><category><![CDATA[Salting]]></category><category><![CDATA[PySpark]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Islam Elbanna]]></dc:creator><pubDate>Sat, 10 May 2025 23:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746712436985/07c0aac4-ceb3-4158-951c-c14e2586b177.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When working with large datasets in <strong>Apache Spark</strong>, a common performance issue is <strong>data skew</strong>. This occurs when a few keys <strong>dominate</strong> the data distribution, leading to <strong>uneven</strong> partitions and slow queries. It mainly happens during operations that require <strong>shuffling</strong>, like <strong>joins</strong> or even regular <strong>aggregations</strong>.</p>
<p>A practical way to reduce skew is <strong>salting</strong>, which involves artificially spreading out heavy keys across multiple partitions. In this post, I’ll guide you through this with a practical example.</p>
<h2 id="heading-how-salting-resolves-data-skew-issues">How Salting Resolves Data Skew Issues</h2>
<p>By adding a <strong>randomly</strong> generated number to the join key and then joining over this combined key, we can distribute large keys more evenly. This makes the data distribution more uniform and spreads the load across more workers, instead of sending most of the data to one worker and leaving the others idle.</p>
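<p>The mechanics can be illustrated without Spark. The snippet below simulates hash partitioning in plain Python and shows how adding a random salt breaks up one hot key; the numbers are made up, and Spark's real partitioner differs in detail:</p>

```python
import random
from collections import Counter

random.seed(42)
num_partitions = 4

# Skewed input: key 1 dominates (think: one very popular customer_id).
keys = [1] * 1000 + [2] * 10 + [3] * 10

def partition(key, n=num_partitions):
    # Deterministic stand-in for Spark's hash partitioner.
    return key % n

# Without salting, every record for key 1 lands in the same partition.
plain = Counter(partition(k) for k in keys)

# With salting, each record's key is shifted by a random salt before
# partitioning, so the hot key's records spread across all partitions.
salted = Counter(partition(k + random.randrange(num_partitions)) for k in keys)

print(max(plain.values()))   # 1000: the hot key overloads one partition
print(max(salted.values()))  # roughly len(keys) / num_partitions
```

<p>The largest partition shrinks from holding essentially all the data to holding about a quarter of it, which is the whole point of salting.</p>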
<h3 id="heading-benefits-of-salting">Benefits of Salting</h3>
<ul>
<li><p><strong>Reduced Skew:</strong> Spreads data evenly across partitions, preventing a few workers from being overloaded and improving utilization.</p>
</li>
<li><p><strong>Improved Performance:</strong> Speeds up joins and aggregations by balancing the workload.</p>
</li>
<li><p><strong>Avoids Resource Contention:</strong> Reduces the risk of out-of-memory errors caused by large, uneven partitions.</p>
</li>
</ul>
<h2 id="heading-when-to-use-salting">When to Use Salting</h2>
<p>During joins or aggregations with skewed keys, use salting when you notice long shuffle times or executor failures due to data skew. It's also helpful in real-time streaming applications where partitioning affects data processing efficiency, or when most workers are idle while a few are stuck in a running state.</p>
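<p>A quick way to confirm skew before reaching for salting is to inspect the per-key counts — in PySpark, typically via <code>df.groupBy("customer_id").count()</code>. Here is a plain-Python stand-in with hypothetical data and an illustrative threshold:</p>

```python
from collections import Counter

# Hypothetical records; in PySpark you would inspect
# df.groupBy("customer_id").count().orderBy("count", ascending=False) instead.
records = [("c1", t) for t in range(980)] + [("c2", 1), ("c3", 2)]

counts = Counter(key for key, _ in records)
total = sum(counts.values())
top_key, top_count = counts.most_common(1)[0]
skew_ratio = top_count / total

print(f"{top_key} holds {skew_ratio:.0%} of the rows")
if skew_ratio > 0.5:  # illustrative threshold, tune for your workload
    print("heavily skewed -> consider salting this key")
```
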
<h2 id="heading-salting-example-in-scala">Salting Example in Scala</h2>
<p>Let's generate some data with an <strong>unbalanced</strong> number of rows. We can assume there are two datasets we need to join: one is a large dataset, and the other is a small dataset.</p>
<pre><code class="lang-scala"><span class="hljs-keyword">import</span> org.apache.spark.sql.<span class="hljs-type">SparkSession</span>
<span class="hljs-keyword">import</span> org.apache.spark.sql.functions._

<span class="hljs-comment">// Simulated large dataset with skew</span>
<span class="hljs-keyword">val</span> largeDF = <span class="hljs-type">Seq</span>(
  (<span class="hljs-number">1</span>, <span class="hljs-string">"txn1"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn2"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn3"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"txn4"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"txn5"</span>)
).toDF(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>)

<span class="hljs-comment">// Small dataset</span>
<span class="hljs-keyword">val</span> smallDF = <span class="hljs-type">Seq</span>(
  (<span class="hljs-number">1</span>, <span class="hljs-string">"Ahmed"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"Ali"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"Hassan"</span>)
).toDF(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"name"</span>)
</code></pre>
<p>Let’s add the salting column to the large dataset, using <strong>randomization</strong> to spread the values of heavy keys across smaller partitions.</p>
<pre><code class="lang-scala">
<span class="hljs-comment">// Step 1: create a salting key in the large dataset</span>
<span class="hljs-keyword">val</span> numBuckets = <span class="hljs-number">3</span>
<span class="hljs-keyword">val</span> saltedLargeDF = largeDF.
    withColumn(<span class="hljs-string">"salt"</span>, (rand() * numBuckets).cast(<span class="hljs-string">"int"</span>)).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>))

saltedLargeDF.show()
+-----------+-----------+----+------------------+
|customer_id|transaction|salt|salted_customer_id|
+-----------+-----------+----+------------------+
|          <span class="hljs-number">1</span>|       txn1|   <span class="hljs-number">1</span>|               <span class="hljs-number">1</span>_1|
|          <span class="hljs-number">1</span>|       txn2|   <span class="hljs-number">1</span>|               <span class="hljs-number">1</span>_1|
|          <span class="hljs-number">1</span>|       txn3|   <span class="hljs-number">2</span>|               <span class="hljs-number">1</span>_2|
|          <span class="hljs-number">2</span>|       txn4|   <span class="hljs-number">2</span>|               <span class="hljs-number">2</span>_2|
|          <span class="hljs-number">3</span>|       txn5|   <span class="hljs-number">0</span>|               <span class="hljs-number">3</span>_0|
+-----------+-----------+----+------------------+
</code></pre>
<p>To make sure we cover all possible randomized salted keys in the large dataset, we need to <strong>explode</strong> the small dataset with all possible salted values.</p>
<pre><code class="lang-scala">
<span class="hljs-comment">// Step 2: Explode rows in smallDF for possible salted keys</span>
<span class="hljs-keyword">val</span> saltedSmallDF = (<span class="hljs-number">0</span> until numBuckets).toDF(<span class="hljs-string">"salt"</span>).
    crossJoin(smallDF).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>)) 

saltedSmallDF.show()
+----+-----------+------+------------------+
|salt|customer_id|  name|salted_customer_id|
+----+-----------+------+------------------+
|   <span class="hljs-number">0</span>|          <span class="hljs-number">1</span>| <span class="hljs-type">Ahmed</span>|               <span class="hljs-number">1</span>_0|
|   <span class="hljs-number">1</span>|          <span class="hljs-number">1</span>| <span class="hljs-type">Ahmed</span>|               <span class="hljs-number">1</span>_1|
|   <span class="hljs-number">2</span>|          <span class="hljs-number">1</span>| <span class="hljs-type">Ahmed</span>|               <span class="hljs-number">1</span>_2|
|   <span class="hljs-number">0</span>|          <span class="hljs-number">2</span>|   <span class="hljs-type">Ali</span>|               <span class="hljs-number">2</span>_0|
|   <span class="hljs-number">1</span>|          <span class="hljs-number">2</span>|   <span class="hljs-type">Ali</span>|               <span class="hljs-number">2</span>_1|
|   <span class="hljs-number">2</span>|          <span class="hljs-number">2</span>|   <span class="hljs-type">Ali</span>|               <span class="hljs-number">2</span>_2|
|   <span class="hljs-number">0</span>|          <span class="hljs-number">3</span>|<span class="hljs-type">Hassan</span>|               <span class="hljs-number">3</span>_0|
|   <span class="hljs-number">1</span>|          <span class="hljs-number">3</span>|<span class="hljs-type">Hassan</span>|               <span class="hljs-number">3</span>_1|
|   <span class="hljs-number">2</span>|          <span class="hljs-number">3</span>|<span class="hljs-type">Hassan</span>|               <span class="hljs-number">3</span>_2|
+----+-----------+------+------------------+
</code></pre>
<p>Now we can easily join the two datasets.</p>
<pre><code class="lang-scala"><span class="hljs-comment">// Step 3: Perform salted join</span>
<span class="hljs-keyword">val</span> joinedDF = saltedLargeDF.
    join(saltedSmallDF, <span class="hljs-type">Seq</span>(<span class="hljs-string">"salted_customer_id"</span>, <span class="hljs-string">"customer_id"</span>), <span class="hljs-string">"inner"</span>).
    select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>, <span class="hljs-string">"name"</span>)

joinedDF.show()
+-----------+-----------+------+
|customer_id|transaction|  name|
+-----------+-----------+------+
|          <span class="hljs-number">1</span>|       txn2| <span class="hljs-type">Ahmed</span>|
|          <span class="hljs-number">1</span>|       txn1| <span class="hljs-type">Ahmed</span>|
|          <span class="hljs-number">1</span>|       txn3| <span class="hljs-type">Ahmed</span>|
|          <span class="hljs-number">2</span>|       txn4|   <span class="hljs-type">Ali</span>|
|          <span class="hljs-number">3</span>|       txn5|<span class="hljs-type">Hassan</span>|
+-----------+-----------+------+
</code></pre>
<h2 id="heading-salting-example-in-python">Salting Example in Python</h2>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, rand, lit, concat
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> IntegerType

<span class="hljs-comment"># Create (or reuse) the session the snippets below rely on</span>
spark = SparkSession.builder.getOrCreate()

<span class="hljs-comment"># Simulated large dataset with skew</span>
largeDF = spark.createDataFrame([
    (<span class="hljs-number">1</span>, <span class="hljs-string">"txn1"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn2"</span>), (<span class="hljs-number">1</span>, <span class="hljs-string">"txn3"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"txn4"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"txn5"</span>)
], [<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>])

<span class="hljs-comment"># Small dataset</span>
smallDF = spark.createDataFrame([
    (<span class="hljs-number">1</span>, <span class="hljs-string">"Ahmed"</span>), (<span class="hljs-number">2</span>, <span class="hljs-string">"Ali"</span>), (<span class="hljs-number">3</span>, <span class="hljs-string">"Hassan"</span>)
], [<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"name"</span>])

<span class="hljs-comment"># Step 1: create a salting key in the large dataset</span>
numBuckets = <span class="hljs-number">3</span>
saltedLargeDF = largeDF.withColumn(<span class="hljs-string">"salt"</span>, (rand() * numBuckets).cast(IntegerType())) \
    .withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat(col(<span class="hljs-string">"customer_id"</span>), lit(<span class="hljs-string">"_"</span>), col(<span class="hljs-string">"salt"</span>)))

<span class="hljs-comment"># Step 2: Explode rows in smallDF for possible salted keys</span>
salt_range = spark.range(<span class="hljs-number">0</span>, numBuckets).withColumnRenamed(<span class="hljs-string">"id"</span>, <span class="hljs-string">"salt"</span>)
saltedSmallDF = salt_range.crossJoin(smallDF) \
    .withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat(col(<span class="hljs-string">"customer_id"</span>), lit(<span class="hljs-string">"_"</span>), col(<span class="hljs-string">"salt"</span>)))

<span class="hljs-comment"># Step 3: Perform salted join</span>
joinedDF = saltedLargeDF.join(
    saltedSmallDF,
    on=[<span class="hljs-string">"salted_customer_id"</span>, <span class="hljs-string">"customer_id"</span>],
    how=<span class="hljs-string">"inner"</span>
).select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"transaction"</span>, <span class="hljs-string">"name"</span>)
</code></pre>
<h3 id="heading-notes">Notes</h3>
<ul>
<li><p>This code uses <code>spark.range(...)</code> to mimic Scala’s <code>(0 until numBuckets).toDF("salt")</code>.</p>
</li>
<li><p>Column expressions are handled using <code>col(...)</code>, <code>lit(...)</code>, and <code>concat(...)</code>.</p>
</li>
<li><p>The cast to integer uses <code>.cast(IntegerType())</code>.</p>
</li>
</ul>
<h2 id="heading-tuning-tip-choosing-numbuckets">Tuning Tip: Choosing <code>numBuckets</code></h2>
<ul>
<li><p>If you set <code>numBuckets = 100</code>, each skewed key is split into up to 100 sub-partitions. Be cautious, though: too many buckets can hurt performance, especially for keys with little data, since the small side must be replicated once per bucket. Always test different values against the skew profile of your dataset.</p>
</li>
<li><p>If you know how to identify the skewed keys, you can apply salting to those keys only and set the salt for all other keys to a literal <code>0</code>, e.g.:</p>
<ul>
<li><pre><code class="lang-scala"><span class="hljs-comment">// Step 1: create a salting key in the large dataset</span>
<span class="hljs-keyword">val</span> numBuckets = <span class="hljs-number">3</span>
<span class="hljs-keyword">val</span> saltedLargeDF = largeDF.
    withColumn(<span class="hljs-string">"salt"</span>, when($<span class="hljs-string">"customer_id"</span> === <span class="hljs-number">1</span>, (rand() * numBuckets).cast(<span class="hljs-string">"int"</span>)).otherwise(lit(<span class="hljs-number">0</span>))).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>))

<span class="hljs-comment">// Step 2: Explode rows in smallDF for possible salted keys</span>
<span class="hljs-keyword">val</span> saltedSmallDF = (<span class="hljs-number">0</span> until numBuckets).toDF(<span class="hljs-string">"salt"</span>).
    crossJoin(smallDF.filter($<span class="hljs-string">"customer_id"</span> === <span class="hljs-number">1</span>)).
    select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"salt"</span>, <span class="hljs-string">"name"</span>).
    union(smallDF.filter($<span class="hljs-string">"customer_id"</span> =!= <span class="hljs-number">1</span>).withColumn(<span class="hljs-string">"salt"</span>, lit(<span class="hljs-number">0</span>)).select(<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"salt"</span>, <span class="hljs-string">"name"</span>)).
    withColumn(<span class="hljs-string">"salted_customer_id"</span>, concat($<span class="hljs-string">"customer_id"</span>, lit(<span class="hljs-string">"_"</span>), $<span class="hljs-string">"salt"</span>))
</code></pre>
</li>
</ul>
</li>
</ul>
<p><strong>Rule of Thumb</strong><br />Start small (e.g., 10-20) and increase gradually based on observed shuffle sizes and task runtime.</p>
<hr />
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Salting is a simple, effective way to manage skew in Apache Spark when repartitioning or built-in skew handling (such as skew-join hints) is insufficient. With the right tuning and monitoring, this technique can significantly reduce job execution times on highly skewed datasets.</p>
]]></content:encoded></item></channel></rss>