Jul 7, 2020

Borja Muñoz

and

CARTO Contributor

Jul 7, 2020

Borja Muñoz

and

Jul 7, 2020

Simpler Data Science Workflows with CARTO + Databricks

Cloud Native

mins read

This post may describe functionality for an old version of CARTO. Find out about the latest and cloud-native version here.

Simpler Data Science Workflows with CARTO + Databricks

In our continuous efforts to make sure CARTO works well with other technologies we’re always assessing how we can improve our customers’ workflows so that they can easily interact with CARTO from other applications and platforms they already use.

Databricks is a widely used platform for data engineering analytics data science and machine learning. It provides collaborative notebooks that support multiple languages and interactive visualizations. In our quest to empower Data Scientists with the best data and analysis we knew integrating our platform with Databricks was the way to go.

The CARTO platform can be integrated with Databricks notebooks through our Direct SQL Connection feature and CARTOframes our Python package. These options are complementary and your choice will depend on your specific use case.

In this blog post we’re going over both options to help you understand when you should choose one or the other.

Data Engineering

Sometimes we have data in our platform whose structure is optimized for a given use case but it is not well suited for other applications. We need to be able to transform the data but changing the structure is sometimes a challenge within relational databases. On the other hand this problem is really easy to solve using a platform like Databricks.

To be able to transform the data we first need to read the data from the CARTO database. In order to do that we must access the database using JDBC to read the CARTO dataset as a Spark dataframe as explained in our Help Center.

Some datasets or database schemas come in wide format so in this case we are going to perform a transformation that is not easy to do with a relational database engine. This transformation is called wide-to-long or sometimes melt. A wide-to-long transformation allows you to structure the database schema in a way that is easy to apply filter operations and make your schema more searchable.

We are going to connect to CARTO from Databricks using the direct connection explained in the above link.

remote_table = spark.read.format("jdbc")\
    .option("url"  "jdbc:postgresql://<ADDRESS_CERTIFICATE>:5432/cartodb") \
    .option("dbtable"  "corona_usa_wide") \
    .option("user"  "<YOUR_USERNAME>") \
    .option("password"  "<YOUR_API_KEY>") \
    .option("ssl"  True) \
    .option("sslmode"  "verify-full" ) \
    .option("sslcert"  "/databricks/driver/client.crt" ) \
    .option("sslkey"  "/databricks/driver/client.key.pk8" ) \
    .option("sslrootcert"  "/databricks/driver/server_ca.pem") \
    .load()

Once we have retrieved the table (we will use one that gives you the COVID-19 cases at the state level in the US) we want to use in the transformation we will have a dataframe like the below one:

remote_table.limit(5).toPandas()

A screenshot of the dataframe before transformation

Here as you can see we have the dates of the dataset represented as columns instead of rows. Before doing the transformation we will split the dataframe into data and geometries to make life easier for us later we will join both dataframes.

data = remote_table.drop('the_geom')
geoms = remote_table.select('cartodb_id'  'the_geom')

Now we have the data itself without the geometries we are going to apply the wide-to-long transformation and join the geometries which will result in a dataframe like the following:

data_columns = ['_20200610'  '_20200611'  '_20200612'  '_20200613'  '_20200614'  '_20200615'  '_20200616']
data_melted = melt(data  ['cartodb_id'  'name']  data_columns  'date'  'cases')
data_melted_with_geoms = data_melted.alias('data').join(geoms.alias('geoms')  F.col('data.cartodb_id') == F.col('geoms.cartodb_id')).select('data.cartodb_id'  'data.name'  'data.date'  'data.cases'  'geoms.the_geom').dropDuplicates().sort(['name'  'date'])

A screenshot of the dataframe after transformation

Finally you can save the changes made to CARTO using again the direct connection.

data_melted_with_geoms.write.format("jdbc")\
  .option("url"  "jdbc:postgresql://<ADDRESS_CERTIFICATE>.carto.com:5432/cartodb") \
  .option("dbtable"  "corona_usa_long") \
  .option("user"  "<YOUR_USERNAME>") \
  .option("password"  "<YOUR_API_KEY>") \
  .option("ssl"  True) \
  .option("sslmode"  "verify-full" ) \
  .option("sslcert"  "/databricks/driver/client.crt" ) \
  .option("sslkey"  "/databricks/driver/client.key.pk8" ) \
  .option("sslrootcert"  "/databricks/driver/server_ca.pem") \
  .save()

We recommend this approach (Direct SQL Connection) when scalability is an issue because you can distribute data and operations within your cluster.

Visualization

Our CARTOframes Python package allows you to perform spatial analysis enrich your datasets and create stunning visualizations right in your notebook. This package can be installed within your Databricks notebook as you would do with any other Python package using PyPI.

%sh pip install cartoframes

Once you have installed the package you can set up your credentials for accessing CARTO and you can easily access your datasets.

from cartoframes.auth import set_default_credentials
set_default_credentials(username='<YOUR_USERNAME>'  api_key='<YOUR_API_KEY>')

You can also create a GeoPandas dataframe from our platform and visualize it in a very easy way.

from cartoframes import read_carto
starbucks_locations = read_carto("starbucks_locations")
Map(Layer(starbucks_locations))

‍

Then you can apply geoprocessing operations using for example the geocoding data services or you can enrich your datasets using the information from CARTO Data Observatory.

from cartoframes.data.services import Isolines
iso_service = Isolines()
starbucks_locations_isochrones  isochrones_metadata = iso_service.isochrones(starbucks_locations  [300]  mode='walk')
Map(Layer(starbucks_locations_isochrones))

We recommend using CARTOframes if your main goal is to visualize geospatial data or scalability is not an issue because you are not working with a big dataset.

How to get started

These features are available right now to all enterprise customers. To start using the Direct SQL Connection just go to the settings page in your dashboard and follow the instructions to set up a new connection. To get started with CARTOframes you can read the Quickstart guide. If you encounter any issues just reach out to your Customer Success Manager.

Want to connect CARTO to your existing stack?

Click here to discover more

Don’t forget to share this post on Linkedin!

About the author

More Posts from

Borja Muñoz

About the author

Provided by our community, industry experts, or the CARTO Team, these blog posts cover the entire spectrum of spatial analysis. From location intelligence to GIS, spatial data science, industry trends, and much more, we’ve crafted relevant content to accompany you at every stage of your journey, whether you have a technical or business background. With our Blog, you are one step closer to taking spatial analysis to the next level.

About the author

About the author

About the author

About the author