Presentation code

Overview

This presentation takes the two datasets provided and, leveraging both Snowflake and DBT, demonstrates the benefits of pivoting to a modern platform. Additionally, I have introduced a presentation layer using both GitLab and R to further touch on the flexibility of rapid development.

The presentation is outlined in the following way:

  • Approach
  • Data
  • Results
  • A/B Testing
  • Discussion points

Approach

DBT

Not wanting to reinvent the wheel, I followed the standard user guide here, focusing on ensuring a reproducible pattern was established.

Database structure

I decided on a two-step approach for the database structure: the initial data was inserted in its entirety into the RAW/BIKES layer, then I constructed a “Raw Data Store” (RAW/RDS) to be used as the source layer.

This was to provide a clear and definable separation between working in the Snowflake UI and working within a development environment in DBT. It also has a number of associated benefits:

  • Assuming you wanted to automate the ingestion, it would protect downstream dependencies
  • JSON handling
  • DBT mapping
  • Developer-friendly documentation

RAW
----/BIKES
---------/CITI_BIKES  
---------/WEATHER
----/RDS
--------/RDS_BIKES
--------/RDS_BIKES_SAMPLE
--------/RDS_WEATHER
--------/RDS_WEATHER_SAMPLE

ANALYTICS
---------/ADS
-------------/BIKES_METRICS
-------------/RIDER_WEATHER
-------------/WEATHER_METRICS

With the plan as above, DBT passes the data from the RAW/RDS source layer through to the ANALYTICS/ADS presentation layer, as sketched below.
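
As a sketch of what that pass-through step looks like (the source name 'bikes' and its mapping onto RAW/BIKES are assumptions about the project's source definitions, not the exact model), the RDS_BIKES model is little more than:

{{ 
    config(materialized='table') 
}}

-- pass the raw trips through untouched so that everything
-- downstream only ever references the RDS source layer
select *
from {{ source('bikes', 'citi_bikes') }}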

Snowflake

The Snowflake database then takes the shape outlined above, with an initial setup piece establishing the connections to the S3 buckets, captured in a few scripts.
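
A minimal sketch of that setup piece (the stage name here is mine; the exact scripts live in the repo):

-- point an external stage at the public data bucket
create or replace stage citibike_trips
    url = 's3://snowflake-workshop-lab/citibike-trips';

-- sanity check that the files are visible
list @citibike_trips;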

GitLab

The codebase is captured here. As this is a single-developer project, I set up a simple structure allowing direct commits and pushes against a feature branch, with self-approval of merges into main.

There is no consideration given to promoting code through environments, but structurally that separation could be introduced by extending the native dbt_user protections and moving the jobs onto protected branches.

Data

The raw data came in two sets, loaded under the RAW database:

bikes: https://s3.amazonaws.com/tripdata/index.html

weather: s3://snowflake-workshop-lab/weather-nyc

Bikes

Taking a sample in R to prepare the load, we were left with the following structure:

create or replace table citi
(tripduration integer,
  starttime timestamp,
  stoptime timestamp,
  start_station_id integer,
  start_station_name string,
  start_station_latitude float,
  start_station_longitude float,
  end_station_id integer,
  end_station_name string,
  end_station_latitude float,
  end_station_longitude float,
  bikeid integer,
  membership_type string,
  usertype string,
  birth_year integer,
  gender integer);

Looking a little closer (after some googling), I found there was a path declared online for the S3 bucket in the following format:

URL s3://snowflake-workshop-lab/citibike-trips

This allowed me to load all 61 million rows… which may have been excessive for the exercise, but I digress.
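
The load itself then reduces to a single COPY INTO (a sketch, reusing the stage from the setup scripts; the file format options are assumptions about the file layout):

-- bulk load every file under the stage into the citi table
copy into citi
from @citibike_trips
file_format = (type = csv field_optionally_enclosed_by = '"');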

Weather

The weather dataset followed the exact same pattern, apart from arriving as standard JSON, which Snowflake lets us deal with really cleanly:

{
  "city": {
    "coord": {
      "lat": 43.000351,
      "lon": -75.499901
    },
    "country": "US",
    "findname": "NEW YORK",
    "id": 5128638,
    "langs": [
      {
        "abbr": "NY"
      },
      {
...
      },
      {
        "zh": "纽约州"
      }
    ],
    "name": "New York",
    "zoom": 1
  },
  "clouds": {
    "all": 90
  },
  "main": {
    "humidity": 81,
    "pressure": 1016,
    "temp": 285.53,
    "temp_max": 286.15,
    "temp_min": 285.15
  },
  "time": 1478206834,
  "weather": [
    {
      "description": "overcast clouds",
      "icon": "04d",
      "id": 804,
      "main": "Clouds"
    }
  ],
  "wind": {
    "deg": 280,
    "speed": 6.2
  }
}

All I actually want from here is the weather itself, so let's get temp, wind speed, and the weather description.
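
In Snowflake that is just a few JSON-path expressions over the variant column (a sketch; the variant column name v is an assumption):

-- pull only the fields we care about out of the raw JSON
select
    v:time::timestamp           as time
    , v:main.temp::float        as temp          -- Kelvin in the raw feed
    , v:wind.speed::float       as wind_speed
    , v:weather[0].main::string as weather
from raw.bikes.weather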

Results

Now, with all the data loaded into Snowflake via DBT, I wanted to connect something that could display it in a meaningful way. So, creating a connection into R, I made a few graphics to address the posed questions.

I wanted to use highcharter, so I connected the Snowflake instance to R for more fluid data cleaning ahead of presentation:


install.packages(c("DBI", "dplyr","dbplyr","odbc", "Rcpp"))
library(DBI)
library(dplyr)
library(dbplyr)
library(odbc)
library(Rcpp)


con <- DBI::dbConnect(odbc::odbc(),
                      Driver       = "/opt/snowflake/snowflakeodbc/lib/universal/libSnowflake.dylib",
                      Server       = "jm03779.ap-southeast-2.snowflakecomputing.com",
                      UID          =  rstudioapi::askForPassword("Database user"),
                      PWD          =  rstudioapi::askForPassword("Database password"),
                      Database     = "Analytics",
                      Warehouse    = "compute_wh",
                      Schema       = "ads"
)


DBI::dbGetQuery(con,"USE ROLE TRANSFORMER;")
DBI::dbGetQuery(con,"USE WAREHOUSE TRANSFORMING;")

base_bike <- DBI::dbGetQuery(con,"SELECT * FROM ANALYTICS.ADS.BIKES_METRICS;")
base_weather <- DBI::dbGetQuery(con,"SELECT * FROM ANALYTICS.ADS.WEATHER_METRICS;")
base_join <- DBI::dbGetQuery(con,"SELECT * FROM ANALYTICS.ADS.RIDER_WEATHER;")

#totals 
base_bike %>%
  na.omit() %>%
  summarise(sum(TRIPS_COUNT), sum(DISTANCE_TOTAL), mean(DURATION_AVERAGE))

Number of trips

The total number of trips over the period was 61,468,359

Distance travelled

The total distance travelled was (rounded for obvious reasons) 111,215,593 km. This was achieved using the HAVERSINE function, as hinted at in Snowflake.

{{ 
    config(materialized='table', database='analytics') 
}}

-- this aggregates by hour the metrics of distance, time and totals 

select 
    date_trunc('hour', starttime) as date
    , count(*) as trips_count
    , sum(tripduration) as duration_total
    , avg(tripduration) as duration_average
    , sum(haversine(
            start_station_latitude
            , start_station_longitude
            , end_station_latitude
            , end_station_longitude
            )) as distance_total
    , avg(haversine(
            start_station_latitude
            , start_station_longitude
            , end_station_latitude
            , end_station_longitude
            )) as distance_average

 from {{ source('rds', 'rds_bikes') }} 
 group by 1

Average trip duration

Now this is a slightly more interesting question: the average of what? Each trip? The dataset was aggregated at the hourly level, so the presentation layer passes a DURATION_AVERAGE column, which is an average of averages: 916.8 seconds per trip.
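
If the true per-trip figure is wanted instead, it can still be recovered from the hourly aggregates, since the model carries both the totals and the counts (a sketch against the ADS table):

-- weight each hour by its trip count to get the true per-trip average
select
    sum(duration_total) / sum(trips_count) as duration_average_per_trip
from analytics.ads.bikes_metrics;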

A/B testing

Trying to understand whether weather has an impact on ridership, we need to combine the datasets and run an analysis over the top of them.

The approach I have decided to go with is using R to run a linear regression and checking for significance.

The data

We had two datasets: weather and bikes. I decided to aggregate the weather into hourly buckets, taking the most prevailing type; for example, if there were two occurrences of rain and one of clear, the hour would be called rain.

-- this assigns a weather type to any given hour; temperature, wind etc. could be added if needed

select 
    weather
    , date
from
(
    select
        weather 
        , date_trunc('hour', time) as date
        , count(*) as n
        , row_number() over (partition by date order by n desc) rank_number
    from {{ source('rds', 'rds_weather') }}  
    group by 1, 2

) where rank_number = 1

Now I aggregated the riders into hourly buckets as well and joined up the datasets. So the question remains: does the weather impact the riders?
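
The join itself is a simple equality on the truncated hour (a sketch of the RIDER_WEATHER model; the ref names assume the two hourly models are materialised as bikes_metrics and weather_metrics):

{{ 
    config(materialized='table', database='analytics') 
}}

-- one row per hour: ridership metrics alongside the prevailing weather
select
    b.date
    , b.trips_count
    , b.duration_average
    , w.weather
from {{ ref('bikes_metrics') }} b
inner join {{ ref('weather_metrics') }} w
    on b.date = w.date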

So we have the following weather types, which I then grouped into ‘Good’ and ‘Bad’:

          WEATHER   WEATHER_TYPE
1         "Clear"   "Good"
2          "Mist"   "Bad"
3        "Clouds"   "Good"
4          "Rain"   "Bad"
5          "Haze"   "Bad"
6  "Thunderstorm"   "Bad"
7           "Fog"   "Bad"
8       "Drizzle"   "Bad"
9          "Snow"   "Bad"
10        "Smoke"   "Bad"
11       "Squall"   "Bad"

Here we just want to look at the weather, as there are only 24 hours in the day, to get a sense of whether the weather is good or bad in NYC. Doing this also highlighted a data quality issue in the analysis, which narrowed the date range down to 2017-03-01.

So now let's consider the two averages: are there more trips when it's good weather?
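
A sketch of that comparison, applying the good/bad grouping as a CASE expression over the joined table (the column names assume the RIDER_WEATHER layout above):

-- average hourly trips under good vs bad weather
select
    case when weather in ('Clear', 'Clouds') then 'Good' else 'Bad' end as weather_type
    , avg(trips_count) as avg_hourly_trips
from analytics.ads.rider_weather
group by 1;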

It looks like there are. Now, to validate this I would need to construct a boolean variable off of this and check its significance as an explanation, though I am going to stop right here.
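
For completeness, a sketch of the shape that check could take. The plan above is a linear regression in R; staying in SQL instead, Snowflake's regression aggregates give the slope and R² of the same one-variable model directly (a proper p-value would still be the R side's job):

-- regress hourly trips on a 0/1 good-weather flag; with a binary regressor
-- the slope is simply the lift in average hourly trips under good weather
select
    regr_slope(trips_count, iff(weather in ('Clear', 'Clouds'), 1, 0)) as good_weather_lift
    , regr_r2(trips_count, iff(weather in ('Clear', 'Clouds'), 1, 0)) as r_squared
from analytics.ads.rider_weather;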

Discussion

  • Separating the DBT loading and running functions delivers a more stable solution.
  • Using a fully code-based solution, you can see and replicate 100% of what I have completed.
  • Native connections into R and Python allow positioning as a ‘data plumber’ and appeal to the new world.