We were recently asked by one of our customers to take a look at Matillion ETL for Redshift, with a view to potentially replacing their older and significantly more expensive ETL solution. We performed a ‘thin slice’ PoC to see if we could use it to build and maintain the customer’s product dimension.
Product Dimension build overview
The source data that feeds into building the product dimension arrives as CSV files on an sFTP server. Effectively then, the tasks break down as:
1) Matillion ETL for Redshift facilitates moving the data from sFTP to S3
2) Data files are then loaded into a landing schema in Redshift
3) We perform SCD and append transformations to load and maintain the product dimension
The product dimension is made up of SCD type 2 attributes, but we need some flexibility in the event that the customer decides to add more attributes of SCD type 1 (which is a possibility). We can also expect multiple file sets to land throughout the day, so the SCD logic needs to be able to handle multiple changes against the same business key without losing them. The current target dimension table DDL looks like this:
create table sirocco.dim_product
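(
    -- Illustrative column list: names and types beyond those discussed in this post are assumptions
    product_sk_id    bigint not null,        -- surrogate key
    product_id       integer not null,       -- business key (BK)
    product_sku      integer,
    description      varchar(200),
    product_category varchar(100),           -- SCD2 attribute
    price            numeric(12,2),          -- SCD2 attribute
    weight           numeric(12,2),          -- SCD2 attribute
    row_change_date  timestamp,              -- load time stamped on arrival in the warehouse
    valid_from       timestamp,
    valid_to         timestamp,
    current_row_ind  char(1)                 -- 'Y' marks the current record for a business key
)
diststyle key
distkey (product_id)
sortkey (product_id, valid_from);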
The source data does not come with a timestamp denoting when the change took place, so we’ll stamp that on as the data comes into the data warehouse.
About Matillion ETL for Redshift
Matillion ETL for Redshift is designed specifically for Redshift as a target database. It provides a rich set of transformation components which leverage the Redshift processing engine, and also offers the ability to set and use functionality such as compression, sort and distribution keys. In addition to this, Matillion ETL for Redshift supports uploading JDBC drivers, so you can make SQL calls to your non-Redshift source databases (e.g. Oracle, MySQL) to aid in extracting data and loading it into Redshift.
The software is delivered in the form of an AMI which you can launch into your VPC, and it uses a web-based interface, so there is no requirement to install client software. The interface is intuitive and easy to navigate, and uses flow-based concepts when designing transformation and orchestration tasks:
Matillion ETL for Redshift also supports creating Python scripts which you can call in your orchestration tasks. As per the documentation, you can call the Python standard library as well as use boto, which provides API calls to much of the AWS estate.
Initial Setup and security
Setup and configuration are well documented on the Matillion website, so I'm just going to cover some high-level points here.
Before launching the AMI, it is recommended to create an EC2 role which you attach at launch time. This contains the necessary privileges to contact the Redshift service and to use the S3, SNS and SQS services.
In the interests of time I’ve selected full access on some of these – this may not be necessary, and it’s worth looking at refining these down to just the privileges you really need.
AMI launch details
I followed the instructions documented at https://redshiftsupport.matillion.com/customer/portal/articles/1991962-launching-the-product
I launched the AMI using an m3.medium instance (seeing as I was the only user for the PoC), and launched it into a private subnet which had NAT access out to the internet. As the instance was running in a VPC which had a Direct Connect link to the office premises, I allowed traffic in on ports 22 (ssh) and 80 (http) from the local network only.
Environment and project configuration
Once the instance had launched, the Matillion/Tomcat service started up automatically, so it was simply a case of browsing to the instance IP. All being well, you are greeted with the Create Project screen:
You can then proceed to fill in the details for your target environment. If the IAM role is correctly attached to the instance, when you select the drop down on the cluster field it should return you a list of your Redshift clusters. You can of course use the Manage button to manually add cluster details as well.
After completing the initial setup, I decided to take a step back and enable security, so that we can restrict who can connect to the server and manipulate the ETL tasks. Whilst it is possible to configure AD authentication, for now I just went with setting up local security. This required a change to two files on the Matillion server, /usr/share/emerald/WEB-INF/security.fragment and /etc/tomcat8/tomcat-users.xml, to create the logins and passwords. After making the changes and restarting the Tomcat service, browsing to the server URL now prompts for a login and password.
Step 1: Getting data into Redshift
Our first task, then, is to move the files from the sFTP server into S3, from where we can load them into Redshift. There are a number of approaches you can consider for doing this, depending on the complexity of your requirements. For the PoC we'll keep it relatively straightforward. Using the S3 Put Object orchestration component, we can fetch data from an sFTP source and place it in our target S3 bucket and folder:
Once these are set up, we next need to load the data from S3 into Redshift. To do this, we use the S3 Load orchestration component. The properties for this component allow you to specify everything that you would if using the database copy command:
Once we have all our components defined, we can then put these into an Orchestration task:
Finally we’ll give this a test run, and monitor the task execution:
We check the contents of the tables in the database to confirm the contents were loaded correctly:
Just to revisit a point made earlier, the DDL for the landing.product_reference_lnd table looks like this:
CREATE TABLE landing.product_reference_lnd
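(
    -- column list inferred from the insert statements later in this post; data types are assumptions
    order_id    integer,
    product_id  integer,
    product_sku integer,
    description varchar(200),
    brand       varchar(100),
    price       numeric(12,2),
    weight      numeric(12,2),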
load_date timestamp DEFAULT getdate()
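);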
The copy command is set to load only the columns that are present in the file, thus the command is generated as:
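Illustratively, the generated statement is along these lines (the bucket path, IAM role and CSV options here are placeholders rather than the exact text produced by the component):

copy landing.product_reference_lnd
    (order_id, product_id, product_sku, description, brand, price, weight)
from 's3://your-bucket/product_reference/'
iam_role 'arn:aws:iam::123456789012:role/your-redshift-load-role'
csv
ignoreheader 1;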
So the load_date column is auto-populated when the data is loaded.
Step 2: Creating the transformation tasks
Now that the data is in the landing schema, we can create the transformation tasks to push the data through staging and into the target dimension table.
Essentially, we want to perform these steps:
i) Load the current incoming set of deltas into staging, de-dupe and assign surrogate key
The table staging.dim_product_stg was created with an identity column (see the DDL below)
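As a minimal SQL sketch of what this step boils down to (the de-dupe rule shown here, one row per business key per load timestamp, is an assumption, and the column names follow the landing table above):

insert into staging.dim_product_stg
    (product_id, product_sku, description, brand, price, weight, row_change_date)
select product_id, product_sku, description, brand, price, weight, load_date
from (
    select l.*,
           row_number() over (partition by product_id, load_date order by order_id) as rn
    from landing.product_reference_lnd l
) deduped
where rn = 1;
-- product_sk_id is not listed, so it is populated automatically from the identity column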
ii) Check target dimension to see if any of the business keys in the incoming set already have existing rows – if they do, pull these into staging
iii) Load the current set of incoming deltas into stg_02
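As an illustration of steps ii and iii, and assuming dim_product_stg_02 mirrors the dimension's column layout, these two steps amount to something like:

-- ii) pull any existing dimension rows for the incoming business keys into stg_02
insert into staging.dim_product_stg_02
select d.*
from sirocco.dim_product d
where d.product_id in (select product_id from staging.dim_product_stg);

-- iii) add the current set of incoming deltas alongside them
-- (brand from the feed populates the product_category position; dates and indicator are left null
--  for the SCD step to fill in)
insert into staging.dim_product_stg_02
select s.product_sk_id, s.product_id, s.product_sku, s.description, s.brand, s.price, s.weight,
       s.row_change_date, null::timestamp, null::timestamp, null::char(1)
from staging.dim_product_stg s;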
iv) Perform the SCD transformation
Whilst you might be forgiven for thinking there is a magic SCD function, this is actually a SQL Script component, which gives you some flexibility for writing a SQL statement. The content I created in here is:
This applies a set of analytic functions to the data, which help us align the start and end dates for each BK and its related records, repoint the current record indicator, and choose the correct value to display based on SCD type. I should point out at this stage that I used the SQL component purely in the interests of time for showcasing this PoC; Matillion ETL for Redshift does make all the analytic functions I've used here available as components, so don't feel that you need to write long SQL statements to achieve this kind of complexity.
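The statement itself was specific to this PoC, but a minimal sketch of the technique, using the table layouts assumed earlier and treating every attribute as SCD type 2, would be along these lines:

insert into staging.dim_product_stg_03
select
    product_sk_id,
    product_id,
    product_sku,
    description,
    product_category,
    price,
    weight,
    row_change_date,
    -- the first record for a business key gets a default valid_from of 01-JAN-1900
    case when row_number() over (partition by product_id order by row_change_date) = 1
         then '1900-01-01'::timestamp
         else row_change_date
    end as valid_from,
    -- each record is closed off by the next change for the same business key
    coalesce(lead(row_change_date) over (partition by product_id order by row_change_date),
             '9999-12-31'::timestamp) as valid_to,
    -- only the most recent record per business key is flagged as current
    case when lead(row_change_date) over (partition by product_id order by row_change_date) is null
         then 'Y' else 'N'
    end as current_row_ind
from staging.dim_product_stg_02;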
v) Update the existing rows in the target dimension table with changes to valid from and valid to, and current indicators
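A sketch of this update, assuming stg_03 carries the recalculated dates and indicators keyed on the existing surrogate keys:

update sirocco.dim_product
set    valid_from      = s.valid_from,
       valid_to        = s.valid_to,
       current_row_ind = s.current_row_ind
from   staging.dim_product_stg_03 s
where  dim_product.product_sk_id = s.product_sk_id;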
vi) Insert new rows into the dimension
As you can see from this flow, it does a left join between the two input tables and filters for where the right-hand side is null to establish that the row is new.
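In SQL terms the equivalent is roughly the following, assuming stg_03's columns line up with the dimension and that the join is on the surrogate key:

insert into sirocco.dim_product
select s.*
from staging.dim_product_stg_03 s
left join sirocco.dim_product d
       on d.product_sk_id = s.product_sk_id
where d.product_sk_id is null;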
vii) Truncate the landing tables ready for the next set of deltas
We use the truncate tables orchestration task to perform this:
Supporting table structures used for this PoC:
CREATE TABLE staging.dim_product_stg
(
    product_sk_id bigint DEFAULT "identity"(148139, 0, '1,1'::text),
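    -- remaining columns are assumptions, mirroring the landing feed plus the load timestamp
    -- (the "identity" default above is how Redshift reports an identity(seed, step) column)
    product_id      integer,
    product_sku     integer,
    description     varchar(200),
    brand           varchar(100),
    price           numeric(12,2),
    weight          numeric(12,2),
    row_change_date timestamp
);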
CREATE TABLE staging.dim_product_stg_02
CREATE TABLE staging.dim_product_stg_03
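For the purposes of the SQL sketches above, dim_product_stg_02 and dim_product_stg_03 are assumed to mirror the dimension's column layout (including valid_from, valid_to and the current record indicator), without the identity default.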
Step 3: Orchestration
We can then roll all these transformation tasks up into a single orchestration task, and run it through. As the dimension table is empty, this will be an initial load.
The end result in our dimension table:
It's important to note that we've preserved the load time as an attribute called row_change_date. This is because, for the first record of any BK set, we want the valid_from date to default to 01-JAN-1900. Next we'll test a delta on some of the products displayed above to see the effect of this.
Step 4: Test with Delta changes
To save a little time, we'll insert some fake delta rows into the landing.product_reference_lnd table to simulate a number of changes against two of the products in the list, product 16 (Speedweeder) and 3324 (This House Believes sign). For product 16, we'll set the brand to the category 'Home and Garden'. At a later time, we'll increase the price from 6 to 7.5.
For product 3324 we’ll set a category of ‘Home Furnishings’ and increase both the price and weight.
insert into landing.product_reference_lnd (order_id,product_id,product_sku,description,brand,price,weight) values (10432,16,33556,'Speedweeder','Home and Garden',6.0,10);
insert into landing.product_reference_lnd (order_id,product_id,product_sku,description,brand,price,weight) values (77644,3324,59311,'This House Believes sign','Home Furnishings',2.5,12);
-- pause a few seconds
insert into landing.product_reference_lnd (order_id,product_id,product_sku,description,brand,price,weight) values (10432,16,33556,'Speedweeder','Home and Garden',7.5,10);
This gives us the following delta rows:
Next we run our build product dimension orchestration task again:
Finally, we can check the dim_product dimension to ensure our changes were correctly loaded:
We can see that for each product the valid_from and valid_to rows have been correctly adjusted in line with the changes we pushed through, as well as the SCD2 changes against product_category, price and weight. Finally, the current indicator has been moved to indicate the most recent and current record.
Matillion ETL for Redshift comes with the right blend of tools and flexible functionality to make both greenfield DW and existing ETL migration projects fast and painless. It was very quick to get up and running with little administrative overhead, and the support website provides a wealth of useful information and documentation.