The Retail module in our Analytics Toolbox for Google BigQuery now includes Twin Areas analysis, an essential tool in Site Selection analytics. Learn more.
When it comes to retail site selection applications a common task consists of combining data from different sources to compare the characteristics of target locations and hone in on the most appropriate areas for footprint expansion or consolidation.
In a previous blogpost we explained how to build a similarity score with respect to an existing site (e.g. the location of your top performing store) for a set of target locations. This approach is an essential tool for Site Planners looking to open new stores relocate existing points of sale or consolidate their current network.
Our Analytics Toolbox for BigQuery a set of User Defined Functions and procedures for running advanced spatial analytics natively in BigQuery has been further enhanced with a new twin areas procedure adding to the existing retail-specific procedures we have developed including revenue prediction commercial hotspots and white space analysis.
Our Approach for Finding the Most Similar Locations to Your Top Performing Stores
In this post we tackle Twin Areas analysis in more depth. This approach consists of three main steps:
- First to select the most relevant variables given the characteristics of your business (e.g. population income etc.) coming from either our Data Observatory or from your own data tables;
- Secondly to gridify and enrich the location of an existing site (from now on referred to as the origin location) and of all the target sites using the selected data sources. The process of gridification both for the origin and target locations is required in order to be able to compare areas of the same size and relies on the use of spatial indexes (either quadkey or h3) using the available procedures in our Analytics Toolbox.
- Finally to derive a similarity score between the origin and each target location by ranking the distance between the origin and each target cell in the variable space where the selected variables have been standardized and transformed to account for pair-wise correlations.
The similarity score for each target cell is computed with respect to the score of the average cell in the target areas. Under this scoring rule a target cell with a larger score is more similar to the origin cell than a target cell with a lower score; moreover this score is positive if (and only if) the target cell is more similar to the origin than the mean cell.
How to Run the Twin Areas Analysis in Google BigQuery
To demonstrate how to run the Twin Areas analysis in BigQuery we selected as potential origin locations the position of the top 10 performing liquor stores in 2019 from the publicly available Iowa Liquor sales dataset.
We start by using the GRIDIFY_ENRICH procedure from the data module in CARTO’s Analytics Toolbox. This procedure is used to first gridify a set of geometries (point data in this case) to a quadkey grid with zoom 15 and then to enrich each selected location with data from a subscription to one of the datasets available in the Data Observatory including the total population (total_pop_3409f36f) and the number of households (households_d7d24db5) at the Census Block Group level from the ACS Sociodemographics dataset as well as from a custom dataset which contains the count of road links (count_qualified) per zip code. The result can be found in the table cartobq.docs.twin_areas_origin_enriched.
The map below shows both the locations of the selected stores (left) as well as the enriched grid for the population variable (right):
Similarly we can use this procedure to gridify and enrich the target areas for which we will use the Census Tracts polygons in Texas in the main urban areas. The result can be found in the table cartobq.docs.twin_areas_target_enriched.
Once we have gridified and enriched the origin and target areas we can run the FIND_TWIN_AREAS procedure for a given origin location here selected as the store with the highest revenue. The result of the analysis can be found in the table cartobq.docs.twin_areas.
As shown by the code above the procedure requires the user to specify through a SQL query the origin and target cells as well as to provide some inputs for the calculation of the similarity score namely:
- The prefix for the tables where the outputs of the procedure are stored including the results table for a given origin ID and the name of the Principal Component Analysis (PCA) model used to transform the selected variables before computing the distance between the origin and target cell. The PCA method is used to avoid that pairwise correlations between variables might (wrongly) affect the computation of the distance. Since the eigenvectors of the covariance matrix are computed only taking into account the data in the target areas if a model with the specified name already exists this is used to retrieve the principal component scores both for the origin and target areas otherwise the model is generated by the call of the procedure.
- The percentage of variance retained when extracting the principal components (by default this value is set to 90%). By specifying this value the user can decide how much of the variability in the original variables to include in the computation of the distance (smaller values of this parameter imply that only the major modes of variability will be accounted for when computing the distance).
- The maximum number of twin areas results. By default this parameter is set to NULL and all the target areas for which the similarity score is positive are returned.
The results of this procedure for a selected origin location can be seen in this map:
This map shows the similarity skill score for all the target cells with a positive score in Texas: larger scores indicate areas more similar to the origin location.
Traditionally discovering new areas for retail business expansion represented a difficult and lengthy process which required on-site market analysis and local expertise. Using our Twin Areas tool retailers and CPG brands can now easily discover the best locations to expand or optimize their network without any strong prior knowledge of the area thus optimizing their site planning process. The approach takes full advantage of our comprehensive data catalog and the analytical capabilities of CARTO’s cloud native Location Intelligence platform.
Try Conducting Your Own Twin Areas Analysis in BigQuery with CARTO
These procedures are now available to CARTO users as part of the Analytics Toolbox for BigQuery. They can be run directly from the BigQuery console or from your SQL or Python Notebooks using the BigQuery client. Additionally they will soon power the white space analysis engine integrated into CARTO's Site Selection solution.