Learning by Teaching

Learning by Teaching

Or how I learned Spatial SQL in Microsoft SQL Server by teaching it to my team.

Studies have shown time and time again that one of the most effective ways to learn a subject quickly is to study it with the intention of teaching others what you have learned. Indeed, learning with the intention to teach the material encourages the use of effective learning strategies, improves learning motivation, and has even been shown to improve metacognitive processing.

I experienced these effects a few months ago when I was assigned to teach my team how to use the spatial query functionality of our new deployment of Microsoft SQL Server.

Background

First, a little background: My name is Alex, and I am a geospatial analyst and Full Stack Developer working for a large Canadian federal government agency. My team works to produce maps and analyses on a variety of applications, both internal and program-related. My role on the team has become one of a data engineer, as my background is far more code-heavy than my more research-oriented team members. As such, when it comes time for my team to adopt new technology, I am the first one they call upon.

My team works with what I would not describe as "big" data, but rather, "inconveniently large" data. that is to say, data that is too large to analyze efficiently on our workstations, but not large enough to justify the use of distributed processing technologies like Apache Spark or Dask. Therefore, reducing our data to just the amount necessary for our area of interest is of great importance.

Our previous database, Netezza Performance Server, while quite fast, is not capable of easily ingesting or querying spatial data. After months of bugging IT and procurement, we were eventually given access to our agency's SQL Server cluster. I had used Spatial SQL in my previous work, but only with PostgreSQL and its excellent PostGIS extension. however, Postgres was not made available to us, forcing us to learn the SQL Server way of querying spatial data, one for which I had little experience.

Learning by Teaching

Spatial queries in PostgreSQL are well-documented, performant, and full of countless examples across sites like stack overflow. In contrast, documentation on spatial queries in SQL Server, while good, is lacking in terms of practical examples. Indeed, finding examples of common spatial queries like point-in-polygon and selecting within a given distance was considerably more difficult than when I had worked with PostgreSQL in the past.

However, we didn't have access to PostgreSQL, so I had to make do. I suggested to management that I address these shortcomings by documenting my findings in a manual so that future employees would have a body of useful information to reference. The process was refreshingly efficient, and in the process, I myself learned even more than I had initially expected.

Side Effects

In the process of learning the various spatial functions in SQL Server, I found myself learning a lot more than I had anticipated about TSQL, the SQL Server dialect of SQL. I learned about TSQL Variables, the spatial indexing methodology used by SQL Server, flow control in TSQL, and the many subtle syntax differences from PostgreSQL.

I also learned about the many limitations of SQL Server, such as the inability to reproject data once it has been loaded, the inability to select distinct on geometry data, and the difficulty in loading spatial data from programming languages such as R and Python. But these too, were valuable lessons to learn, as it is far better to know them ahead of time than to be caught blindsided while facing a deadline.

Blogging to learn

I started this blog because, as an experience geospatial analyst, and junior web developer, I wanted to replicate the efficient learning I had experienced writing the SQL Server manual while sharing what I gained with others. The blog and its readers will make a fine proxy for my team, and hopefully what I write can help someone who uses SQL Server discover more about the capabilities of geospatial data.

Passing on the knowledge to you

As thanks to you, the reader, for getting this far, I will now pass on some of what I learned in this process to you, in the form of some practical queries. The single largest barrier I had to learning these queries was a lack of practical examples, and I hope I can address that. I will assume that you have some experience with SQL and GIS already, and are familiar with the basic concepts. I will start with some basics, and in an upcoming article, I will go more in-depth.

Point in Polygon Join

The bread-and-butter of geospatial analysis, the point-in-polygon join has too many applications to even count. In short, it selects all points that intersect with a given boundary.

Let's assume you had a polygon table of neighbourhood boundaries, and a point geometry table of car accident locations, and you wanted to determine which neighbourhood each car accident occurred in.

SELECT 
    [ACCIDENTS].[ID],
    [ACCIDENTS].[SEVERITY],
    [ACCIDENTS].[TOTAL_DAMAGE],
    [NEIGHBOURHOODS].[ID],
    [NEIGHBOURHOODS].[NAME]
FROM
    [ACCIDENTS]
INNER JOIN
    [NEIGHBOURHOODS]
ON
   [ACCIDENTS].[GEOMETRY].STIntersects([NEIGHBOURHOODS].[GEOMETRY]) = 1

Select by Location

Seems simple enough, right? now what if we wanted to select unique records that meet a given spatial relationship, without concern for which features from the other table meet the spatial relationship? for example, imagine we want to select all neighbourhoods that had witnessed a car accident in the last 24 hours, but did not want any duplication of rows for neighbourhoods that witnessed more than one accident? This kind of query is similar to the "Select by Location" tool found in software such as ArcGIS Pro and QGIS. unlike PostgreSQL, in SQL Server, we can not SELECT DISTINCT on geometry. In SQL Server, we can approach this in two ways. First, we can attempt this query by using a UnionAggregate, like so:

DECLARE @AGG_ACCIDENTS geometry

SET @AGG_ACCIDENTS = (
    SELECT
        geometry::UnionAggregate([GEOMETRY])
    FROM
        [ACCIDENTS]
    WHERE
        -- the last 24 hours
        [RECORD_DATE] >= DATEADD(day,-1,GETDATE())

SELECT
    [NEIGHBOURHOODS].[ID],
    [NEIGHBOURHOODS].[NAME],
    [NEIGHBOURHOODS].[GEOMETRY]
FROM
    [NEIGHBOURHOODS]
WHERE   
    [NEIGHBOURHOODS].[GEOMETRY].STIntersects(@AGG_ACCIDENTS) = 1

We can also use a window function and a join, like so:

WITH TBL AS (
    SELECT
        [NEIGHBOURHOODS].[ID],
        [NEIGHBOURHOODS].[NAME],
        [NEIGHBOURHOODS].[GEOMETRY],
        ROW_NUMBER() OVER (
            PARTITION BY
                [NEIGHBOURHOODS].[ID]
            ORDER BY
                -- just use the primary key here unless you
                -- want to prioritize another field
                -- for some reason
                [NEIGHBOURHOODS].[ID] DESC
    ) AS IDX
    FROM
        [NEIGHBOURHOODS]
    INNER JOIN
        [ACCIDENTS]
    ON
        [NEIGHBOURHOODS].[GEOMETRY].STIntersects([ACCIDENTS].[GEOMETRY]) = 1
)

SELECT
    *
FROM
    TBL
WHERE
    IDX = 1

in general, these methods are less performant than a simple spatial join, so only use them if you really need to select distinct geometries in your final output.

Stay tuned for more

I hope you enjoyed my article, I will be following it up in the future with a more in-depth dive into spatial queries in SQL Server.