Wrangling Geospatial Data with GeoPandas and Apache Sedona on AWS

Introduction:

This article will walk through two leading solutions for processing and analyzing geospatial data: GeoPandas, for efficient in-memory processing on a single machine, and Apache Sedona, which is built to run on distributed compute engines like Apache Spark, Apache Flink, and Databricks.

We will show coding examples of both and how to set up GeoPandas locally and Apache Sedona in an AWS Glue Notebook.

First, let’s talk through why these libraries are important. Modern geospatial datasets consist of intricate geometric vectors (points, lines, polygons) or dense, high-resolution raster grids. Working with these datasets used to require low-level libraries like Shapely, Fiona, and GEOS, which made the data hard to combine and analyze alongside non-spatial data science workflows.

GeoPandas wraps these complex geometric operations and file-format handling into familiar pandas DataFrames, allowing users to perform sophisticated spatial operations like spatial joins, buffering, and projections using simple Pythonic code. It also integrates seamlessly with other Python libraries like NumPy, Matplotlib, and scikit-learn.

Like GeoPandas, Apache Sedona makes distributed computing on spatial datasets easier. For example, if you had to write a spatial join (finding which points fall inside which polygons) with raw Spark on a dataset of billions of records, you would need to manually write complex code to partition the data, build spatial indexes, and manage the join logic across a cluster of machines. Sedona abstracts away this low-level plumbing, providing spatial DataFrames, spatial functions, spatial indexing, and partitioning, allowing users to focus on the query or question they’re solving rather than the distributed computing architecture.

How it works:

GeoPandas is built upon two core concepts: the GeoSeries and the GeoDataFrame.

The GeoSeries is responsible for managing geometric data such as points, lines, or polygons, leveraging the geometric functionality provided by the Shapely library. The GeoDataFrame, in turn, is an extension of the standard Pandas DataFrame that requires one designated column to be a GeoSeries, thereby enabling all spatial analysis operations directly on the DataFrame structure. This fusion of tabular data analysis and geometric capabilities makes GeoPandas essential for tasks such as complex geometric analysis, Coordinate Reference System (CRS) transformations, and rapid visualization where data volumes remain within the capacity of the local machine.
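As a minimal illustration of these two building blocks (the coordinates below are arbitrary):

import geopandas as gpd
from shapely.geometry import Point

# A GeoSeries is a column of Shapely geometries with an optional CRS
stations = gpd.GeoSeries([Point(-0.12, 51.50), Point(2.35, 48.86)], crs='EPSG:4326')

# A GeoDataFrame is a pandas DataFrame with one designated geometry column
gdf = gpd.GeoDataFrame({'name': ['London', 'Paris']}, geometry=stations)
print(gdf.crs)         # EPSG:4326, inherited from the GeoSeries
print(gdf.geometry.x)  # spatial accessors work directly on the frame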

Sedona’s core concept is performance at scale. It utilizes distributed spatial datasets and provides a rich set of Spatial SQL functions (ST_*). This allows data engineers to efficiently load, process, and analyze petabytes of spatial data across a cluster of machines using declarative SQL queries. Key operations, such as spatial joins, range queries, and K-Nearest Neighbor (KNN) searches, are optimized through the automatic application of distributed spatial indexing techniques (like R-tree and Quad-tree). This indexing allows the processing to be smartly partitioned and executed in parallel, dramatically accelerating operations that would otherwise be computationally prohibitive on a single machine.
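To make that concrete, here is a minimal sketch of the points-in-polygons join from the introduction expressed as Spatial SQL (the sedona session is the Sedona-enabled Spark session we create in the AWS Glue section later, and the points and zones tables are hypothetical temp views):

# Which points fall inside which polygons? Sedona plans this as a
# partitioned, indexed spatial join across the cluster.
matches = sedona.sql("""
    SELECT p.point_id, z.zone_id
    FROM points p
    JOIN zones z
      ON ST_Contains(z.geometry, p.geometry)
""")
matches.show()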

When to Use GeoPandas vs. Apache Sedona:

Single-Node vs. Distributed Computing

The core distinction between GeoPandas and Apache Sedona lies in their architectural model and ability to scale. GeoPandas is rooted in the Python scientific stack, designed for efficient, in-memory processing on a single machine, while Apache Sedona is built from the ground up to operate on distributed cluster computing systems like Apache Spark, Apache Flink, and Snowflake.

Therefore, GeoPandas is a great tool for interactive, small-scale work like initial data checks, visualization, and developing a proof of concept, and it works best on small to medium-sized datasets (typically less than a few gigabytes) that fit into the memory of a single computer. Apache Sedona is built for big data (up to petabytes), high-performance workloads that need to run on distributed clusters.

Setting Up GeoPandas and Code Walkthrough:

I put together this GitHub repo (https://github.com/salmanmian/geopandas_apacheSedona_examples) to walk through setting up GeoPandas locally and Apache Sedona on AWS Glue. In this article, I’ll walk through how to set up your environment and some code examples.

  1. Install uv: I used uv to install GeoPandas and all its dependencies. You can run “curl -LsSf https://astral.sh/uv/install.sh | sh” to install uv, or if you already have Python installed and want to use PyPI (this is what I did), you can simply run “pip install uv”.
  2. git clone the repository mentioned above. The pyproject.toml file has all the packages you need to run the scripts (for Python 3.12).
  3. uv sync and activate: You can use the command “uv sync” to create a .venv virtual environment with all the dependencies. Then use “source .venv/bin/activate” in your terminal to activate your virtual environment, and you should be set up to run the notebooks and explore GeoPandas locally!

Note: The README.md file in the repository has more details and may help with common issues while doing the setup.

The Code Setup:

In this code block, we create two GeoPandas datasets: cities_gdf, which is created from scratch and has a geometry type of POINT, and world, a real-world dataset downloaded from Natural Earth that has a geometry type of POLYGON.

import geopandas as gpd
from shapely.geometry import Point

# Create a simple GeoDataFrame from scratch
cities_data = {
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'population': [8336817, 3979576, 2693976, 2320268, 1680992],
    'latitude': [40.7128, 34.0522, 41.8781, 29.7604, 33.4484],
    'longitude': [-74.0060, -118.2437, -87.6298, -95.3698, -112.0740]
}

# Create point geometries from lat/lon
geometry = [Point(xy) for xy in zip(cities_data['longitude'], cities_data['latitude'])]

# Create GeoDataFrame
cities_gdf = gpd.GeoDataFrame(
    cities_data, 
    geometry=geometry,
    crs='EPSG:4326'  # WGS84 coordinate system
)

# Load world dataset directly from Natural Earth
world = gpd.read_file('https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_0_countries.zip')

GeoPandas lets you filter and transform your data using familiar pandas commands. For example, you may want to change the coordinate reference system, which is often necessary when combining different datasets:

# Filter to North America
north_america = world[world['CONTINENT'] == 'North America'].copy()

print(f"Original CRS: {north_america.crs}")
print(f"Original bounds: {north_america.total_bounds}")

# Transform to a projected CRS (Albers Equal Area for North America)
north_america_projected = north_america.to_crs('ESRI:102008')

print(f"\nProjected CRS: {north_america_projected.crs}")
print(f"Projected bounds: {north_america_projected.total_bounds}")

### OUTPUT ###
#Original CRS: EPSG:4326
#Original bounds: [-171.7911106     7.22054149  -12.20855      83.64513   ]

#Projected CRS: ESRI:102008
#Projected bounds: [-6040576.68931711 -3522576.13233246  3971637.22673786  5477009.86478234]

Next, we use the GeoPandas to_crs function to convert the latitudes and longitudes to a projected coordinate system (the Mollweide projection). We then calculate the area of each country using the ‘geometry’ column, the special GeoSeries column that holds the geometric shape for every row in the GeoDataFrame.

In the next part of the code, we first create a copy of the original DataFrame so we don’t modify it. We then use the ‘geometry.centroid’ property to calculate the geometric center (centroid) of each polygon, which returns a point geometry for each country. Finally, we filter the data to Africa and use matplotlib to plot the centroid of each African country.

### AREA Calculation ###
import matplotlib.pyplot as plt  # used for the centroid plot below

# Calculate area (need projected CRS for accurate results)
world_projected = world.to_crs('ESRI:54009')  # Mollweide projection

# Calculate area in square kilometers
world_projected['area_km2'] = world_projected.geometry.area / 1_000_000

# Top 10 countries by area
top_10_countries = world_projected.nlargest(10, 'area_km2')[['NAME', 'area_km2']]
print("Top 10 Countries by Area:")
print(top_10_countries.to_string(index=False))

### OUTPUT ###
# Top 10 Countries by Area:
#                     NAME     area_km2
#                   Russia 1.695332e+07
#               Antarctica 1.199036e+07
#                  Canada 9.996089e+06
# United States of America 9.526449e+06
#                   China 9.428688e+06
#                  Brazil 8.559708e+06
#               Australia 7.719210e+06
#                   India 3.157537e+06
#               Argentina 2.790478e+06
#              Kazakhstan 2.726129e+06

### CALCULATE CENTROIDS of COUNTRIES ###
world_with_centroids = world.copy()
world_with_centroids['centroid'] = world_with_centroids.geometry.centroid

# Plot Africa with country centroids
africa = world[world['CONTINENT'] == 'Africa'].copy()
africa_centroids = africa.geometry.centroid

fig, ax = plt.subplots(figsize=(12, 10))
africa.plot(ax=ax, color='lightgreen', edgecolor='black')
gpd.GeoSeries(africa_centroids).plot(ax=ax, color='red', markersize=30)

plt.title('African Countries with Centroids', fontsize=14)
plt.axis('off')
plt.tight_layout()
plt.show()
African Countries Plotted as Centroids

Reading and Writing Geospatial Data in GeoPandas:

Here we use the GeoPandas ‘to_file’ method to write different file formats. First, we write the data as GeoJSON, a human-readable, widely supported JSON-based format for storing geographic features. Second, we write it as a shapefile, a standard format created by Esri; this produces multiple files with extensions .shp, .shx, .dbf, .prj, and .cpg. We then read the data back from the GeoJSON file into a new GeoDataFrame, ‘cities_read’. GeoPandas supports a wide variety of other formats as well, like GeoParquet and GeoPackage (.gpkg).

# Save GeoDataFrame to various formats
import os

output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)

# Save as GeoJSON
cities_gdf.to_file(f'{output_dir}/cities.geojson', driver='GeoJSON')
print("Saved cities.geojson")

# Save as Shapefile
cities_gdf.to_file(f'{output_dir}/cities.shp')
print("Saved cities.shp")

# Read back
cities_read = gpd.read_file(f'{output_dir}/cities.geojson')
print("\nRead back from GeoJSON:")
print(cities_read.head())

### OUTPUT ###
# Saved cities.geojson
# Saved cities.shp

# Read back from GeoJSON:
#           city  population  latitude  longitude                   geometry
# 0     New York     8336817   40.7128   -74.0060    POINT (-74.006 40.7128)
# 1  Los Angeles     3979576   34.0522  -118.2437  POINT (-118.2437 34.0522)
# 2      Chicago     2693976   41.8781   -87.6298   POINT (-87.6298 41.8781)
# 3      Houston     2320268   29.7604   -95.3698   POINT (-95.3698 29.7604)
# 4      Phoenix     1680992   33.4484  -112.0740   POINT (-112.074 33.4484) 
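The same pattern covers those other formats; a minimal sketch (paths are illustrative, and writing GeoParquet requires pyarrow to be installed):

# GeoParquet: columnar and compressed, preserves geometry and CRS metadata
cities_gdf.to_parquet(f'{output_dir}/cities.parquet')

# GeoPackage: a single SQLite-based file without shapefile's field-name limits
cities_gdf.to_file(f'{output_dir}/cities.gpkg', driver='GPKG')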

Spatial Indexing:

Here, we explore spatial indexing. First, we use a brute-force method that checks every point in the dataset against the polygon. In method 2, we use an R-tree spatial index, which organizes the points into a hierarchical tree structure. We use the ‘.bounds’ property to get the rectangular envelope (min/max coordinates) of the polygon, and the R-tree quickly identifies which points intersect that bounding box (roughly O(log n) time), producing a list of candidate matches that are then checked in detail. On large datasets this approach is orders of magnitude faster, since the brute-force method is O(n).

import time

# random_points (a GeoDataFrame of random points) and query_poly (a Shapely
# polygon) are created earlier in the notebook

# Method 1: Brute force (no spatial index)
start_time = time.time()
points_in_poly_brute = random_points[random_points.geometry.within(query_poly)]
brute_force_time = time.time() - start_time

# Method 2: Using spatial index
start_time = time.time()
possible_matches_index = list(random_points.sindex.intersection(query_poly.bounds))
possible_matches = random_points.iloc[possible_matches_index]
points_in_poly_indexed = possible_matches[possible_matches.geometry.within(query_poly)]
indexed_time = time.time() - start_time

print(f"\nResults:")
print(f"Points found: {len(points_in_poly_indexed)}")
print(f"\nPerformance Comparison:")
print(f"Brute force time: {brute_force_time:.4f} seconds")
print(f"Spatial index time: {indexed_time:.4f} seconds")
print(f"Speed improvement: {brute_force_time/indexed_time:.1f}x faster")

### OUTPUT ###
# Results:
# Points found: 47

# Performance Comparison:
# Brute force time: 0.0060 seconds
# Spatial index time: 0.0037 seconds
# Speed improvement: 1.6x faster

Here we use the .dissolve function to take the individual countries’ populations and sum them by continent. It is essentially a SQL “GROUP BY”, except that it not only aggregates the numeric data but also merges the geometries that share an attribute.

world = gpd.read_file('https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_0_countries.zip')

print(f"Original world data: {len(world)} countries")

# Dissolve by continent (merge all countries in each continent)
continents = world.dissolve(by='CONTINENT', aggfunc='sum')

print(f"After dissolving by continent: {len(continents)} continents")
print(f"\nContinents with total population:")
print(continents[['POP_EST']].sort_values('POP_EST', ascending=False))

### Output ###
# Original world data: 177 countries
# After dissolving by continent: 8 continents

# Continents with total population:
#                               POP_EST
# CONTINENT                            
# Asia                     4.550277e+09
# Africa                   1.306370e+09
# Europe                   7.454125e+08
# North America            5.837560e+08
# South America            4.270667e+08
# Oceania                  4.120487e+07
# Antarctica               4.490000e+03
# Seven seas (open ocean)  1.400000e+02

# Visualize before and after dissolve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

world.plot(ax=ax1, edgecolor='black', linewidth=0.5, cmap='tab20')
ax1.set_title(f'Original: {len(world)} Countries', fontsize=14)
ax1.axis('off')

continents.plot(ax=ax2, column='POP_EST', legend=True, cmap='YlOrRd', 
                edgecolor='black', linewidth=2, legend_kwds={'label': 'Population'})
ax2.set_title(f'Dissolved: {len(continents)} Continents', fontsize=14)
ax2.axis('off')

plt.tight_layout()
plt.show()


Here, we test three different methods to calculate the area of a dataset of polygons. Method 1 uses pandas ‘iterrows’, which is the slowest: it iterates through the DataFrame row by row, and each row is converted to a pandas Series object, adding Python loop overhead. Method 2 uses ‘.apply’ with a lambda to run the area calculation on each element; it is much faster than Method 1 but still involves per-element Python function calls. Method 3 uses the vectorized ‘.area’ property directly, which runs compiled C code (Shapely/GEOS) under the hood and returns a pandas Series of all areas at once. It has the lowest Python overhead and is the most memory efficient.

import numpy as np

# Create test dataset
np.random.seed(42)
n_geoms = 1000
test_polys = gpd.GeoDataFrame({
    'id': range(n_geoms),
    'geometry': [Point(np.random.uniform(-10, 10), np.random.uniform(-10, 10)).buffer(0.5) 
                 for _ in range(n_geoms)]
})

# Method 1: Iterative (slow)
start = time.time()
areas_iterative = []
for idx, row in test_polys.iterrows():
    areas_iterative.append(row.geometry.area)
time_iterative = time.time() - start

# Method 2: Apply with lambda (medium)
start = time.time()
areas_apply = test_polys.geometry.apply(lambda x: x.area)
time_apply = time.time() - start

# Method 3: Vectorized (fast)
start = time.time()
areas_vectorized = test_polys.geometry.area
time_vectorized = time.time() - start

print("Performance Comparison for Area Calculation:")
print(f"Iterative (for loop): {time_iterative:.4f} seconds")
print(f"Apply function: {time_apply:.4f} seconds")
print(f"Vectorized operation: {time_vectorized:.4f} seconds")
print(f"\nSpeedup (vectorized vs iterative): {time_iterative/time_vectorized:.1f}x faster")

# Visualize timing
methods = ['Iterative', 'Apply', 'Vectorized']
times = [time_iterative, time_apply, time_vectorized]

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(methods, times, color=['red', 'orange', 'green'], alpha=0.7)
ax.set_ylabel('Time (seconds)', fontsize=12)
ax.set_title(f'Performance Comparison: Area Calculation on {n_geoms} Polygons', fontsize=14)
ax.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, time_val in zip(bars, times):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{time_val:.4f}s', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

### Output ###
# Performance Comparison for Area Calculation:
# Iterative (for loop): 0.0154 seconds
# Apply function: 0.0020 seconds
# Vectorized operation: 0.0005 seconds

# Speedup (vectorized vs iterative): 33.0x faster

There are more examples in the GitHub repo (https://github.com/salmanmian/geopandas_apacheSedona_examples), including using simplify tolerances to trade detail for performance, manipulating geometric shapes, and creating envelopes (polygons, rectangles, etc.) from a dataset of points.
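As a quick taste of those two techniques, here is a minimal sketch reusing the world and cities_gdf frames from earlier (the tolerance value is arbitrary):

# Simplify geometries: a larger tolerance means fewer vertices and coarser shapes
world_simplified = world.copy()
world_simplified['geometry'] = world_simplified.geometry.simplify(tolerance=0.5)

# Envelopes: one bounding rectangle per country...
country_envelopes = world.geometry.envelope

# ...and a single convex hull enclosing all the city points
# (newer GeoPandas versions prefer union_all() over unary_union)
cities_hull = cities_gdf.unary_union.convex_hull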

Setting up Apache Sedona on AWS Glue and Code Walkthrough:

To set up Apache Sedona, you need the Apache Sedona JAR and the GeoTools wrapper JAR, which Sedona uses to handle CRS transformations, perform certain spatial operations like ‘ST_Transform’, and read file formats like shapefiles. I’m using the following:

  • AWS Glue 4.0 (Spark 3.3)
  • sedona-spark-shaded-3.3_2.12-1.7.0.jar (Spark 3.3, Scala 2.12, Sedona 1.7.0)
  • geotools-wrapper-1.7.1-28.5.jar

I downloaded these jars from Maven Central.

I then uploaded them to an S3 bucket, which I point to in my Glue notebook. If your security group allows outbound HTTPS and is attached to a subnet with a route to an internet gateway, you could point Glue directly at Maven instead.

# Glue Environment Configuration
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

# Sedona (for Spark 3.3 / Glue 4.0)
%extra_jars s3://<your-bucket-name>/sedona-spark-shaded-3.3_2.12-1.7.0.jar,s3://<your-bucket-name>/geotools-wrapper-1.7.1-28.5.jar
%additional_python_modules apache-sedona==1.7.0
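One step the magics above don’t show is creating the Sedona-enabled Spark session that the sedona.sql(...) calls below rely on. A minimal sketch, assuming the Glue notebook’s pre-built Spark session and the Sedona 1.7 Python API:

from sedona.spark import SedonaContext

# Build (or reuse) a Spark session with Sedona's extensions, then register
# the spatial types and ST_* SQL functions used by sedona.sql(...) below
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)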

In the same GitHub repo (https://github.com/salmanmian/geopandas_apacheSedona_examples), I created apache_sedona_example.ipynb, in which we create an example power grid dataset (power generation facilities, substations, transmission lines, and demand points) and show how we can use the power of Sedona with Spark to perform different operations, analyses, and optimizations on the data.

Topics Covered:

  • Sedona setup and configuration in AWS Glue
  • Creating synthetic power grid infrastructure data
  • Spatial operations: distance calculations, buffers, spatial joins
  • Advanced analysis: coverage areas, transmission corridors, gap analysis
  • Performance optimization with spatial indexing
  • Scaling demonstrations
  • Exporting geospatial data to S3

I’ll walk through some code examples below, but if you’re interested in digging deeper, please read through the repo.

In this example, we find the nearest substation for each demand point in our dataset. We use ST_Transform to convert the CRS from EPSG:4326 (lat/long in degrees) to EPSG:3857 (the Web Mercator projection, which puts coordinates in meters). We then use ST_Distance to calculate the straight-line distance between the demand points and the substations. Sedona comes with built-in spatial indexing, so on large datasets it only compares nearby substations instead of looping through every one.

# Find nearest substation for each demand point using spatial join
nearest_substations = sedona.sql("""
    WITH distances AS (
        SELECT 
            d.name as demand_point,
            d.type as demand_type,
            d.demand_mw,
            s.id as nearest_substation,
            s.voltage_kv,
            ST_Distance(
                ST_Transform(d.geometry, 'EPSG:4326', 'EPSG:3857'),
                ST_Transform(s.geometry, 'EPSG:4326', 'EPSG:3857')
            ) / 1000 as distance_km,
            ROW_NUMBER() OVER (PARTITION BY d.name ORDER BY ST_Distance(d.geometry, s.geometry)) as rank
        FROM demand_points d
        CROSS JOIN substations s
    )
    SELECT 
        demand_point,
        demand_type,
        demand_mw,
        nearest_substation,
        voltage_kv,
        ROUND(distance_km, 2) as distance_km
    FROM distances
    WHERE rank = 1
    ORDER BY distance_km DESC
""")

print("Nearest Substation Analysis:")
nearest_substations.show(truncate=False)

# Calculate statistics
avg_distance = nearest_substations.selectExpr("AVG(distance_km) as avg_km").collect()[0]['avg_km']
max_distance = nearest_substations.selectExpr("MAX(distance_km) as max_km").collect()[0]['max_km']
print(f"\nAverage distance to nearest substation: {avg_distance:.2f} km")
print(f"Maximum distance to nearest substation: {max_distance:.2f} km")

### Output ###
# Nearest Substation Analysis:
# +-----------------+-----------+---------+------------------+----------+-----------+
# |demand_point     |demand_type|demand_mw|nearest_substation|voltage_kv|distance_km|
# +-----------------+-----------+---------+------------------+----------+-----------+
# |City A           |City       |150      |SUB-015           |500       |111.65     |
# |Industrial Zone 1|Industrial |300      |SUB-012           |345       |109.45     |
# |City E           |City       |120      |SUB-013           |345       |85.21      |
# |Industrial Zone 2|Industrial |250      |SUB-014           |345       |61.33      |
# |City B           |City       |200      |SUB-004           |345       |56.59      |
# |City C           |City       |100      |SUB-019           |500       |51.34      |
# |City D           |City       |180      |SUB-015           |500       |23.71      |
# +-----------------+-----------+---------+------------------+----------+-----------+


# Average distance to nearest substation: 71.33 km
# Maximum distance to nearest substation: 111.65 km

Here, we analyze which demand points fall into which service areas. We use ST_Within to see if the demand points are completely inside service areas.

# Check which demand points fall within service areas using ST_Within
demand_coverage = sedona.sql("""
    SELECT 
        d.name,
        d.type,
        d.demand_mw,
        COUNT(DISTINCT s.id) as num_covering_substations,
        COLLECT_LIST(s.id) as covering_substations
    FROM demand_points d
    LEFT JOIN service_areas s 
        ON ST_Within(
            ST_Transform(d.geometry, 'EPSG:4326', 'EPSG:3857'),
            s.service_area_geom
        )
    GROUP BY d.name, d.type, d.demand_mw
    ORDER BY num_covering_substations DESC
""")

print("Demand Point Coverage Analysis:")
demand_coverage.show(truncate=False)

# Calculate coverage statistics
covered_points = demand_coverage.filter("num_covering_substations > 0").count()
total_points = demand_coverage.count()
coverage_pct = (covered_points / total_points) * 100

print(f"\nCoverage Statistics:")
print(f"Demand points covered: {covered_points} / {total_points} ({coverage_pct:.1f}%)")
print(f"Demand points with redundant coverage (2+ substations): {demand_coverage.filter('num_covering_substations >= 2').count()}")


### Output ###
# Demand Point Coverage Analysis:
# +-----------------+----------+---------+------------------------+--------------------+
# |name             |type      |demand_mw|num_covering_substations|covering_substations|
# +-----------------+----------+---------+------------------------+--------------------+
# |City B           |City      |200      |1                       |[SUB-004]           |
# |City C           |City      |100      |1                       |[SUB-019]           |
# |City D           |City      |180      |1                       |[SUB-015]           |
# |City A           |City      |150      |0                       |[]                  |
# |Industrial Zone 2|Industrial|250      |0                       |[]                  |
# |Industrial Zone 1|Industrial|300      |0                       |[]                  |
# |City E           |City      |120      |0                       |[]                  |
# +-----------------+----------+---------+------------------------+--------------------+

# Coverage Statistics:
# Demand points covered: 3 / 7 (42.9%)
# Demand points with redundant coverage (2+ substations): 0

Here, we do a gap analysis where we evaluate coverage gaps in the service areas (how much of the study area is covered vs. uncovered). We use ‘ST_Union_Aggr’ to merge all the service area polygons into the total coverage (dissolving overlapping areas). We then use ‘ST_Area’ to calculate the area in square meters and convert it to km². Finally, we subtract union_coverage from study_area to find the area not covered by the grid.

# Calculate coverage gaps using ST_Union
# First, union all service areas to get total coverage (this removes overlaps)
union_coverage = sedona.sql("""
    SELECT ST_Union_Aggr(service_area_geom) as total_coverage
    FROM service_areas
""")

union_coverage.createOrReplaceTempView("union_coverage")

# Calculate coverage statistics - the gap should equal study_area - union_coverage
coverage_stats = sedona.sql("""
    SELECT 
        ST_Area(ST_Transform(s.geometry, 'EPSG:4326', 'EPSG:3857')) / 1000000 as study_area_km2,
        ST_Area(c.total_coverage) / 1000000 as union_coverage_km2
    FROM study_area_view s
    CROSS JOIN union_coverage c
""")

coverage_stats.createOrReplaceTempView("coverage_stats")

# Calculate gap as simple subtraction
gap_analysis = sedona.sql("""
    SELECT 
        ROUND(study_area_km2, 2) as study_area_km2,
        ROUND(union_coverage_km2, 2) as actual_coverage_km2,
        ROUND(study_area_km2 - union_coverage_km2, 2) as gap_area_km2,
        ROUND((union_coverage_km2 / study_area_km2) * 100, 1) as coverage_pct,
        ROUND(((study_area_km2 - union_coverage_km2) / study_area_km2) * 100, 1) as gap_pct
    FROM coverage_stats
""")

print("Coverage Gap Analysis:")
print("Note: Percentages based on union of service areas (overlaps removed)")
print("      Coverage % + Gap % should equal 100%")
gap_analysis.show(truncate=False)

# Verify the math
verification = sedona.sql("""
    SELECT 
        ROUND((union_coverage_km2 / study_area_km2) * 100, 1) + 
        ROUND(((study_area_km2 - union_coverage_km2) / study_area_km2) * 100, 1) as total_pct
    FROM coverage_stats
""")
print("\nVerification (should be 100.0%):")
verification.show(truncate=False)

# Show overlap statistics
overlap_stats = sedona.sql("""
    SELECT 
        COUNT(*) as num_substations,
        ROUND(SUM(ST_Area(service_area_geom) / 1000000), 2) as total_individual_areas_km2,
        ROUND((SELECT union_coverage_km2 FROM coverage_stats), 2) as actual_coverage_km2,
        ROUND(SUM(ST_Area(service_area_geom) / 1000000) - (SELECT union_coverage_km2 FROM coverage_stats), 2) as overlap_area_km2
    FROM service_areas
""")

print("\nService Area Overlap Statistics:")
print("Note: Individual areas sum to more than actual coverage due to overlaps")
overlap_stats.show(truncate=False)

# Identify substations that contribute most to coverage
substation_coverage_contribution = sedona.sql("""
    SELECT 
        id,
        voltage_kv,
        service_radius_m / 1000 as service_radius_km,
        ROUND(ST_Area(service_area_geom) / 1000000, 2) as individual_area_km2
    FROM service_areas
    ORDER BY individual_area_km2 DESC
    LIMIT 10
""")

print("\nTop 10 Substations by Individual Service Area:")
substation_coverage_contribution.show(truncate=False)

### Output ###
# Coverage Gap Analysis:
# Note: Percentages based on union of service areas (overlaps removed)
#       Coverage % + Gap % should equal 100%
# +--------------+-------------------+------------+------------+-------+
# |study_area_km2|actual_coverage_km2|gap_area_km2|coverage_pct|gap_pct|
# +--------------+-------------------+------------+------------+-------+
# |436253.95     |218665.95          |217588.0    |50.1        |49.9   |
# +--------------+-------------------+------------+------------+-------+


# Verification (should be 100.0%):
# +---------+
# |total_pct|
# +---------+
# |100.0    |
# +---------+


# Service Area Overlap Statistics:
# Note: Individual areas sum to more than actual coverage due to overlaps
# +---------------+--------------------------+-------------------+----------------+
# |num_substations|total_individual_areas_km2|actual_coverage_km2|overlap_area_km2|
# +---------------+--------------------------+-------------------+----------------+
# |25             |289982.25                 |218665.95          |71316.3         |
# +---------------+--------------------------+-------------------+----------------+


# Top 10 Substations by Individual Service Area:
# +-------+----------+-----------------+-------------------+
# |id     |voltage_kv|service_radius_km|individual_area_km2|
# +-------+----------+-----------------+-------------------+
# |SUB-015|500       |75.0             |17558.13           |
# |SUB-017|500       |75.0             |17558.13           |

Conclusion:

GeoPandas and Apache Sedona are great tools for wrangling geospatial data. Using GeoPandas for quick analysis or proofs of concept and Sedona for big data workloads lets engineers concentrate on solving business questions instead of dealing with low-level setup. There’s a lot more functionality not covered in this article, which you can explore through the official project websites (https://sedona.apache.org/latest and https://geopandas.org/en/stable/) and in your own projects. Happy exploring!