A walkthrough of how to use publicly available COVID-19 data sets for spatial analysis using CARTOframes and our Data Observatory.
For the past few months we have been making our platform freely available to those working on COVID-19 analysis, regularly adding public data sets from a wide range of providers to our Data Observatory (DO), and featuring use cases from a multitude of affected industries, all in order to support businesses, governments, and the spatial community in the battle against this pandemic.
In this post, adapted from a recent webinar given by our Founder & CSO, Javier de la Torre, and one of our Data Scientists, Miguel Álvarez, we walk through how you can use COVID-19 public data within Spatial Data Science.
For reference, and to help you replicate this type of analysis, the code and notebook referenced throughout this post can be found here.
To set up CARTOframes, we first need to install the library and then set the credentials of our account.
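A minimal sketch of that setup, assuming CARTOframes has been installed with `pip install cartoframes` and that you have a CARTO username and API key. The helper name `configure_carto` is hypothetical; it wraps the CARTOframes `set_default_credentials` call, which is imported inside the function so that nothing runs until you supply your own credentials.

```python
def configure_carto(username, api_key):
    """Register default CARTO credentials for later CARTOframes calls."""
    # Imported lazily: requires `pip install cartoframes`.
    from cartoframes.auth import set_default_credentials

    set_default_credentials(username=username, api_key=api_key)


# Example call with placeholder values; replace them with your own account details.
# configure_carto("your_username", "your_api_key")
```

Keeping the credentials out of the notebook itself (for example by reading them from environment variables before calling the helper) avoids committing them by accident.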
Next, we can explore the DO to determine what types of data sets are directly available to us without the need to source, clean, and normalize them. Within the DO we have a new category entitled 'covid19', covering all available data sets relating to the pandemic.
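One way this exploration might look, as a sketch rather than the notebook's exact code: the function name `list_covid19_datasets` is hypothetical, and filtering by the country ID `'usa'` is an assumption made here because the example focuses on New York. The `'covid19'` category ID is the one mentioned above.

```python
def list_covid19_datasets(country="usa", category="covid19"):
    """Return a DataFrame describing the DO data sets in one catalog category."""
    # Imported lazily: requires cartoframes and, for premium data, CARTO credentials.
    from cartoframes.data.observatory import Catalog

    # Chain the catalog filters, then materialize the matching data sets as a table.
    return Catalog().country(country).category(category).datasets.to_dataframe()


# Example call (needs network access to the Data Observatory):
# covid_datasets = list_covid19_datasets()
# print(covid_datasets[["id", "name", "provider_name"]])  # column names assumed
```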
For this example we are interested in human mobility metrics in New York City, so let's take a look at SafeGraph's data in a bit more detail, including the columns, description, and geographic coverage.
Please note that SafeGraph's data is publicly available to researchers, non-profits, and governments around the world working on COVID-19 related projects. In order to access their data, you first need to sign their Consortium Agreement.
'Due to the COVID-19 pandemic, people are currently engaging in social distancing. In order to understand what is actually occurring at a census block group level, SafeGraph is offering a temporary Social Distancing Metrics product.'
We can further describe the data set in order to view the variables within it, as well as check the first 10 rows of the aggregated data. The 'completely_home_device_count' variable is the one we are most interested in, since it gives us an indication of how many people have stayed at home to work. We can see that the temporal resolution of the data is daily, with a spatial resolution at the census block group level.
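The inspection step could be sketched as follows. The helper `inspect_dataset` is a hypothetical name, and the data set ID is left as a parameter because it is not given in this post; look it up in the catalog listing for your account. The three `Dataset` methods used are assumed to behave as in the CARTOframes 1.x documentation.

```python
def inspect_dataset(dataset_id):
    """Collect metadata, summary statistics, and a row preview for a DO data set."""
    # Imported lazily: requires cartoframes and valid CARTO credentials.
    from cartoframes.data.observatory import Dataset

    dataset = Dataset.get(dataset_id)
    return {
        "metadata": dataset.to_dict(),   # name, description, provider, and so on
        "summary": dataset.describe(),   # per-variable statistics
        "preview": dataset.head(),       # first rows of the data
    }


# Example call; the ID below is a placeholder, not a real data set ID:
# info = inspect_dataset("<dataset id from the catalog>")
```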
Now that we have determined that this is the data we want to use, we need to download it. However, since we are only interested in New York City over a specific time period, we need to filter the data set by bounding box and date using a SQL query. The bounding box was determined using bboxfinder.com, and the week commencing 16th February, before the pandemic hit, serves as the baseline.
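As an illustration of such a filter, the sketch below builds the query string. Several details are assumptions rather than values from the notebook: the `$dataset$` placeholder and `ST_IntersectsBox` follow the pattern in the CARTOframes documentation, the column names `geom` and `do_date` are assumed, and the bounding box is only a rough, hypothetical envelope of New York City.

```python
from datetime import date


def build_filter_query(bbox, start_date):
    """Build a SQL filter on bounding box and start date.

    bbox is (min_lon, min_lat, max_lon, max_lat) in WGS84 degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("bbox must be (min_lon, min_lat, max_lon, max_lat)")
    return (
        "SELECT * FROM $dataset$ "
        f"WHERE ST_IntersectsBox(geom, {min_lon}, {min_lat}, {max_lon}, {max_lat}) "
        f"AND do_date >= '{start_date.isoformat()}'"
    )


# Hypothetical bounding box roughly covering New York City.
NYC_BBOX = (-74.27, 40.48, -73.68, 40.93)
# Start of the baseline week mentioned in the text.
BASELINE_START = date(2020, 2, 16)

sql_query = build_filter_query(NYC_BBOX, BASELINE_START)
print(sql_query)
```

With CARTOframes, a string like this would typically be passed to the data set download, for example `dataset.to_dataframe(sql_query=sql_query)`.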
We can visualize the geographic coverage of this data as shown below.
The first analysis we can perform is to process the data in order to build a time series (refer to the notebook for the full code).
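To give a feel for this step without the notebook at hand, here is a small dependency-free sketch that turns block-group rows into a daily city-wide series. The rows are made up, and the column names `do_date` and `device_count` are assumptions; only `completely_home_device_count` is named in this post. The notebook itself may compute the series differently.

```python
from collections import defaultdict

# Made-up rows standing in for the downloaded block-group data.
rows = [
    {"do_date": "2020-02-16", "device_count": 100, "completely_home_device_count": 25},
    {"do_date": "2020-02-16", "device_count": 300, "completely_home_device_count": 75},
    {"do_date": "2020-03-29", "device_count": 120, "completely_home_device_count": 60},
    {"do_date": "2020-03-29", "device_count": 80, "completely_home_device_count": 40},
]


def daily_home_share(rows):
    """Return {date: percentage of devices that stayed completely at home}."""
    totals = defaultdict(lambda: [0, 0])  # date -> [home devices, all devices]
    for row in rows:
        day = totals[row["do_date"]]
        day[0] += row["completely_home_device_count"]
        day[1] += row["device_count"]
    # Sum counts first and divide once, so large block groups weigh more than small ones.
    return {
        day: 100.0 * home / devices
        for day, (home, devices) in sorted(totals.items())
        if devices
    }


series = daily_home_share(rows)
print(series)
```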
We can see a general trend, both in the chart and in the visualization, of more people staying at home, which is what we would expect. At this stage, though, the data is noisy due to the high granularity of the census block groups and the daily sample rate.
Since we are interested in the change with respect to the pre-COVID baseline we defined earlier, we first need to aggregate the data to reduce the noise: temporally at the week level, and spatially at the Neighborhood Tabulation Area (NTA) level (an area that New York City itself uses for statistics).
Alongside this aggregation, we calculated a new metric, 'completely_home_device_pct_diff', giving the difference between each day's percentage of time at home and the baseline.
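The aggregation and the new metric can be sketched in plain Python as below. This rests on several assumptions: each row already carries the code of the NTA it falls in (in practice from a spatial join), weeks start on Sunday so that 16th February 2020 opens the baseline week, and the metric is taken as the difference in percentage points between a week's share of at-home devices and the same NTA's baseline share. The NTA codes and counts are made up, and `device_count` and `nta` are assumed column names.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day):
    """Return the Sunday that starts the week containing `day`."""
    # date.weekday() gives Monday=0 ... Sunday=6.
    return day - timedelta(days=(day.weekday() + 1) % 7)


def weekly_pct_diff(rows, baseline_week):
    """Return {(nta, week): at-home share minus that NTA's baseline share}."""
    totals = defaultdict(lambda: [0, 0])  # (nta, week) -> [home devices, all devices]
    for row in rows:
        key = (row["nta"], week_start(row["do_date"]))
        totals[key][0] += row["completely_home_device_count"]
        totals[key][1] += row["device_count"]
    share = {
        key: 100.0 * home / devices
        for key, (home, devices) in totals.items()
        if devices
    }
    result = {}
    for (nta, week), pct in share.items():
        baseline = share.get((nta, baseline_week))
        if baseline is not None:  # skip NTAs with no baseline observations
            result[(nta, week)] = pct - baseline
    return result


# Made-up rows; NTA_A and NTA_B are hypothetical area codes.
rows = [
    {"nta": "NTA_A", "do_date": date(2020, 2, 16), "device_count": 100, "completely_home_device_count": 20},
    {"nta": "NTA_A", "do_date": date(2020, 2, 18), "device_count": 100, "completely_home_device_count": 30},
    {"nta": "NTA_A", "do_date": date(2020, 3, 30), "device_count": 100, "completely_home_device_count": 60},
    {"nta": "NTA_B", "do_date": date(2020, 2, 17), "device_count": 200, "completely_home_device_count": 40},
    {"nta": "NTA_B", "do_date": date(2020, 3, 31), "device_count": 200, "completely_home_device_count": 70},
]

diffs = weekly_pct_diff(rows, baseline_week=date(2020, 2, 16))
for (nta, week), diff in sorted(diffs.items()):
    print(nta, week.isoformat(), diff)
```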
With less noise, it is now possible to see distinct spatial patterns between different neighborhoods. For example, at the end of March we can see a disparity between the neighborhoods in Queens and Manhattan, during a time when many in Manhattan left the city to stay at second homes. Towards the end of May we see more movement outside of the home, especially in the Bronx and Brooklyn.
To take the analysis further, and to try to explain some of the differences in the trends we are seeing, we can enrich with additional data sets. Since we already have a data set downloaded with distinct geometries, we can take advantage of one of the great features of CARTOframes: the ability to keep exploring the DO to enrich the data frame we are already working with.
For this example, we enriched with sociodemographic data from Applied Geographic Solutions to determine whether there is a correlation between the increase in the percentage of devices at home and average income.
Since this is a premium data set, it can be subscribed to from within CARTOframes for use in this and other analyses.
If we look at the variables available within the sociodemographic data set, the column names don't give us much of an indication as to the type of data.
To help determine what data would be best for us to use in this case, we can get a description of each column header.
Since we are interested in how income influences working from home, we picked 'Average household Income' to enrich our pre-existing data.
Now we can enrich our original dataframe with the average household income of each NTA as easily as shown below.
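A hedged sketch of what the enrichment call might look like with the CARTOframes `Enrichment` class. The helper name `enrich_with_income` is hypothetical, and the variable ID is left as a parameter because it is not given in this post; it would come from the variable descriptions of the subscribed data set.

```python
def enrich_with_income(nta_gdf, income_variable_id):
    """Add a Data Observatory variable to a GeoDataFrame of NTA polygons."""
    # Imported lazily: requires cartoframes, credentials, and an active subscription.
    from cartoframes.data.observatory import Enrichment

    # enrich_polygons aggregates the variable over each polygon's area
    # using the variable's default aggregation.
    return Enrichment().enrich_polygons(nta_gdf, variables=[income_variable_id])


# Example call; both arguments are placeholders:
# enriched = enrich_with_income(nta_gdf, "<average household income variable id>")
```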
We then calculated the correlation between the average household income and the percentage of people staying at home relative to the baseline. Selecting the week starting May 10th and using a log scale for the average income, we can see that the increase in people staying at home is larger in areas with a higher income.
This again shows us that COVID-19 is not affecting everyone equally, with demographics playing a key role in this example.
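The calculation itself can be illustrated with a small, dependency-free sketch of the Pearson correlation between log-scaled income and the change in at-home share. The income and percentage values below are invented purely to show the mechanics and are not results from this analysis; the choice of Pearson correlation and of a base-10 logarithm is an assumption, since the post does not say which were used.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values for four hypothetical NTAs in a single week.
avg_income = [40_000, 60_000, 90_000, 150_000]
home_pct_diff = [10.0, 14.0, 19.0, 25.0]

# Log-scale the income so a few very wealthy areas do not dominate the relationship.
log_income = [math.log10(value) for value in avg_income]
r = pearson(log_income, home_pct_diff)
print(round(r, 3))
```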
To try out this or other analyses for yourself, there are a number of actions you can take:
- Sign up for a free account
- Watch the full webinar
- Request a demo from one of our experts
- Explore the data sets available in our Data Observatory
- Continue your learning in our Help Center