No data? No problem with the Data Observatory



As you might have already heard, the Data Observatory just launched to provide CartoDB users with a universe of data. One of the reasons we built the Data Observatory is that getting the third-party data you need is often the hardest part of analyzing your own data. Data wrangling shouldn't be such a big roadblock to mapping and analyzing your world.

Some of the most desirable and most used datasets are also the hardest to handle. Datasets like the US Census are rich with information but can be large and distributed. What most people want, though, is just a snapshot of some boundaries and particular measures. For instance, if you were telling a story about unemployment in a city, you might want to grab the Census Tracts for that city, the number of unemployed people in each tract, and the number of people 16 and over in each tract (as the denominator). None of this is an easy feat given the size of the datasets provided by the US Census.

For today's post, I want to look at how we can use CartoDB's Data Observatory to answer questions when we don't have any data at all. Further, I want to focus on just the region I'm interested in and avoid spending time working with data and geometries that aren't used in my analysis.

Let's explore San Francisco's current housing price boom.

We'll pull data on San Francisco's real estate market, provided by Zillow and available in the Data Observatory. For the boundaries, we will work with the US Census ZIP Code Tabulation Area (ZCTA) boundaries, which correspond to ZIP codes for most purposes. To go beyond displaying a single measure, we will pull the most recent median housing value per square foot and compare it to the same locations from three years ago to show the explosive growth in the region.

Getting Appropriate Boundaries

To get started, we'll need the boundaries that we're interested in visualizing. Depending on the scale of your map (national, state, county), you will want different geometries. Zillow provides median home values for particular geographical regions, the most convenient of which is ZIP codes.

To find the available boundaries, we can use the function OBS_GetAvailableBoundaries, which lists the IDs of all boundaries available through the Data Observatory at a given location. Since we're focused on San Francisco, we just pass the approximate location of the city and filter for descriptions that contain the word 'zip', like so:


  SELECT *
  FROM OBS_GetAvailableBoundaries(CDB_LatLng(37.7749, -122.4194))
  WHERE description ILIKE '%zip%'

The results look like this:

Get All Available Boundaries at a location

This will produce a temporary table of all boundaries we can use that intersect that point. We want the ZCTA boundary since we're focused on ZIP codes. I'll also opt for the water-clipped version since we don't need to do any analysis with the geometries; we are only using them for visualization this time. The boundary_id that we need is us.census.tiger.zcta5_clipped. You can read the data associated with these boundaries right out of the ID: they come from US Census TIGER geometries, are ZCTA boundaries, and are clipped (meaning water-clipped rather than statistical).

Now that we have our boundary ID, we can get our boundaries with the Data Observatory function OBS_GetBoundariesByPointAndRadius, which requires a center, a radius (in meters), and the boundary ID we're interested in. I'll populate an empty table called sf_housing, which has a column for geometries (the_geom) and another for storing the text ID of each geometry.


INSERT INTO sf_housing (the_geom, geoid)
SELECT the_geom, geom_refs
FROM OBS_GetBoundariesByPointAndRadius(
      CDB_LatLng(37.7749, -122.4194),  -- San Francisco, CA
      25000 * 1.609,                   -- 25-mile radius, in meters
      'us.census.tiger.zcta5_clipped'  -- water-clipped geometries for visualization
    )
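OBS_GetBoundariesByPointAndRadius expects its radius in meters, which is why the query multiplies out a mile-to-kilometer conversion. As a quick sanity check of that arithmetic (plain Python, independent of CartoDB):

```python
# A 25-mile radius expressed in the meters that
# OBS_GetBoundariesByPointAndRadius expects.
METERS_PER_MILE = 1609.344  # international mile

def miles_to_meters(miles):
    """Convert a distance in miles to meters."""
    return miles * METERS_PER_MILE

print(round(miles_to_meters(25)))  # 40234 meters, close to 25000 * 1.609
```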

Checking out our table, we'll find it filled with all of the ZCTA boundaries in the region we're interested in, but there's no data for the geometries yet!

Getting Housing Value Data

Our next step is getting the housing price data for each of the ZIP code areas. The Data Observatory makes that possible through the wonderfully versatile OBS_GetMeasureById. The reason the Data Observatory splits the work into these two pieces is so that you have complete flexibility over which data you blend and which you ignore. With hundreds of measures (and growing), this flexibility will keep your work easy.

To find the measures of interest, we'll use the OBS_Search function through CartoDB's SQL tray, like so:


SELECT * FROM OBS_Search('zillow')

In CartoDB it will look like this:

list all measures that mention zillow

This gives back several IDs of interest, but let's choose us.zillow.AllHomes_MedianValuePerSqft. This entry is just what it says: median home value per square foot for "All Homes" (i.e., single-family, condominium, and co-operative homes). Let's look at the most recent timestamp (March 2016).

We'll put all of this information into OBS_GetMeasureById and populate a new column with the results:


UPDATE sf_housing
SET median_value_per_sqft_2016 = OBS_GetMeasureById(
       geoid,                                    -- specify the place
       'us.zillow.AllHomes_MedianValuePerSqft',  -- specify the measure
       'us.census.tiger.zcta5',                  -- specify the level of geometries
       '2016-03'                                 -- specify the point in time
    )

Visualizing this in CartoDB, we get the following map:

Now the median housing value per square foot is easily visualized on a map -- and we didn't have to track down the datasets to do so!

Calculating Historical Changes

Now that we have recent housing price data for San Francisco, let's compare it to data from March 2013 to look at the change in home value over time. To do this, we divide the current price by the historical price to get the ratio of change. Subtracting one from this ratio and multiplying by 100 gives us the percentage change from 2013 to 2016.
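As a sanity check of that arithmetic, here is the percentage-change formula in plain Python (the dollar figures are made up for illustration):

```python
def percent_change(current, past):
    """Percentage change of current relative to past: 100 * (current / past - 1)."""
    return 100.0 * (current / past - 1)

# Hypothetical ZIP code: $600/sqft in March 2013, $900/sqft in March 2016
print(percent_change(900, 600))  # 50.0, i.e. a 50% increase
```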

We can accomplish this with the Data Observatory as follows (after creating a new numeric column):


UPDATE sf_housing
SET percent_change_median_value_2016_2013 =
    100.0 * (median_value_per_sqft_2016 /
             OBS_GetMeasureById(
               geoid,
               'us.zillow.AllHomes_MedianValuePerSqft',
               'us.census.tiger.zcta5',
               '2013-03'
             ) - 1)

Now that we have our data, let's map it! Starting from an empty dataset, we have created a pretty cool map of housing value change in SF!

We went from having no data to creating a value-added map in under five minutes. And this is just the beginning of what can be done with the Data Observatory.

What's next?

We hope you are starting to see the power of the Data Observatory. We see the Data Observatory as a source of enrichment and a way to make your analyses more powerful in CartoDB.

There's so much more to explore, so watch our blog and Twitter for more. Also read our documentation and check out our catalog of available data to get started.

What maps and datasets will you make from an empty table? Share your "something from nothing" maps with us!

Happy data mapping!