Aim: Migrate AWS Redshift to Azure SQL Data Warehouse
This article walks you through migrating AWS Redshift to Azure SQL Data Warehouse by using Azure Data Factory. The high-level steps are:
- Verify Prerequisites
- Configure Azure Data Factory Parameters
- Create Azure Blob Storage (for Staging)
- Create S3 bucket on AWS portal
- Create Azure Data Warehouse
- Create Azure Data factory
- Test connection
- Complete Deployment
- Monitor and Verify output
Azure Data Factory is a managed cloud service built for complex hybrid ETL/ELT and data integration projects. Azure Data Factory orchestrates data movement from the source endpoint to the destination endpoint. For this lift-and-shift of data you will need to:
- Setup Azure Blob storage for staging
- Build a pipeline to copy data from Amazon Redshift to Azure Data Warehouse
Microsoft suggests the following data-loading speeds:
- Loading data into Azure SQL Data Warehouse at 1.2 Gbps
- Loading data into Azure Blob storage at 1.0 Gbps
- Loading data into Azure Data Lake Store at 1.0 Gbps
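These throughput figures are useful for a back-of-the-envelope estimate of how long a migration will take. The helper below is our own illustration (not part of any Azure tooling) and ignores protocol overhead:

```python
def estimated_load_hours(data_size_tb: float, throughput_gbps: float) -> float:
    """Rough transfer time in hours for a given data size and line rate.

    data_size_tb is decimal terabytes; throughput_gbps is gigabits per
    second. 1 TB = 8,000 gigabits (10**12 bytes * 8 bits / 10**9).
    """
    gigabits = data_size_tb * 8000
    return gigabits / throughput_gbps / 3600

# Loading 10 TB into Azure SQL Data Warehouse at the suggested 1.2 Gbps:
print(round(estimated_load_hours(10, 1.2), 1))  # 18.5 (hours)
```

Real-world runs will be slower once staging, PolyBase loading, and network contention are factored in, so treat this as a lower bound.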
Verify prerequisites
- AWS Access Key ID and AWS Secret Access Key from the customer, with access to the databases or tables that need to be migrated
- Azure Blob storage (for staging)
- AWS S3 Bucket (To unload data from AWS Redshift)
- Azure SQL Data Warehouse (Data Storage and analytics)
- Azure Data Factory (Orchestration)
Configure Azure Data Factory parameters
The following parameters need to be configured in Azure Data Factory:
- Datasets (Set of source data and/or set of destination data – if required)
- Connections (Linked services – Source & Destination endpoints)
- Pipeline (Orchestrating the connection using datasets and the endpoints)
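To make these three building blocks concrete, here is roughly what their JSON definitions look like in Data Factory v2, expressed as Python dicts. The `type` strings follow the Data Factory JSON schema; the names and placeholder values such as `<redshift-host>` are ours:

```python
# Connection (linked service): how Data Factory reaches the Redshift source.
redshift_linked_service = {
    "name": "RedshiftSource",
    "properties": {
        "type": "AmazonRedshift",
        "typeProperties": {
            "server": "<redshift-host>",
            "port": 5439,  # Redshift's default port
            "database": "<database>",
            "username": "<user>",
            "password": {"type": "SecureString", "value": "<password>"},
        },
    },
}

# Dataset: a named view of the source data, bound to the linked service.
redshift_dataset = {
    "name": "CustomersTable",
    "properties": {
        "type": "AmazonRedshiftTable",
        "linkedServiceName": {
            "referenceName": "RedshiftSource",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"tableName": "customers"},
    },
}

# Pipeline: orchestrates the copy between source and sink datasets.
copy_pipeline = {
    "name": "RedshiftToSqlDw",
    "properties": {
        "activities": [{
            "name": "CopyCustomers",
            "type": "Copy",
            "inputs": [{"referenceName": "CustomersTable",
                        "type": "DatasetReference"}],
            "outputs": [{"referenceName": "SqlDwCustomersTable",
                         "type": "DatasetReference"}],
            "typeProperties": {
                "source": {"type": "AmazonRedshiftSource"},
                "sink": {"type": "SqlDWSink", "allowPolyBase": True},
            },
        }],
    },
}

print(copy_pipeline["properties"]["activities"][0]["type"])  # Copy
```

The Copy Data wizard described later in this article generates equivalent definitions for you; seeing the shape here just makes it clearer what the wizard is assembling.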
Create Azure Blob storage
Create Azure Blob storage for staging and configure the following parameters:
- Resource Group
- Storage Account Name
- Deployment Model
- Account Kind
- Access Tier
- Secure transfer required
- Allow access from
- Hierarchical Namespace
- Soft delete
- Data Factory
On the “Review and create” screen, review the parameters and click Create to create the Blob storage. After setting up the Blob storage, create a container by selecting the blob type under storage, as per the screenshot below. The Data Factory linked service will browse to this container automatically. (Screenshot below)
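A quick sanity check before the portal run: storage account names must satisfy Azure's naming rules, and the staging container's endpoint follows a predictable pattern. This small helper is our own, not part of any Azure SDK:

```python
import re

def valid_storage_account_name(name: str) -> bool:
    """Azure storage account names: 3-24 characters, lowercase letters
    and digits only. (Global uniqueness is not checked here.)"""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None

def staging_container_url(account_name: str, container: str) -> str:
    """Blob endpoint for the staging container; Azure Blob endpoints
    follow https://<account>.blob.core.windows.net/<container>."""
    return f"https://{account_name}.blob.core.windows.net/{container}"

# "migrationstage01" and "staging" are illustrative names.
print(valid_storage_account_name("migrationstage01"))       # True
print(valid_storage_account_name("Migration_Stage"))        # False
print(staging_container_url("migrationstage01", "staging"))
```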
Create S3 bucket on AWS portal
There are three parameters that need to be configured when linking to the S3 bucket:
- AWS Access Key ID – IAM>Access Key>Create New Access Key
- AWS Secret Key – IAM>Access Key>Create New Access Key
- Service URL – The default service URL for AWS S3 is https://s3.amazonaws.com
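If your bucket lives in a specific region, you can point the linked service at the regional endpoint instead of the global default. A tiny sketch of the URL convention (the region name is illustrative):

```python
from typing import Optional

def s3_service_url(region: Optional[str] = None) -> str:
    """Service URL for the S3 linked service.

    The global default is https://s3.amazonaws.com; regional endpoints
    follow https://s3.<region>.amazonaws.com.
    """
    if region is None:
        return "https://s3.amazonaws.com"
    return f"https://s3.{region}.amazonaws.com"

print(s3_service_url())             # https://s3.amazonaws.com
print(s3_service_url("eu-west-1"))  # https://s3.eu-west-1.amazonaws.com
```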
Create Azure Data Warehouse
The first step in creating an Azure Data Warehouse is to create a SQL server to serve as the backend of the SQL Data Warehouse. The following parameters need to be provided as input:
- SQL Data Warehouse
- Database name
- Resource Group
- Select Source
- Performance level (performance is directly proportional to the cost)
- Create a new server
- New Server
- Server Name
- Server Admin Login
- Check the “Allow Azure services to access server” checkbox
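Once the server and database exist, client tools reach the warehouse at `<server>.database.windows.net`. The helper below sketches the ADO.NET-style connection string under that assumption; the function and its names are ours, and the password is deliberately left as a placeholder:

```python
def sqldw_connection_string(server_name: str, database: str, user: str) -> str:
    """ADO.NET-style connection string for an Azure SQL Data Warehouse.

    server_name is the short server name; the fully qualified host is
    <server>.database.windows.net on the standard SQL port 1433.
    """
    return (
        f"Server=tcp:{server_name}.database.windows.net,1433;"
        f"Database={database};"
        f"User ID={user}@{server_name};"
        "Password={your_password};"   # placeholder, fill in at run time
        "Encrypt=True;"
    )

print(sqldw_connection_string("migrationsrv", "migrationdw", "sqladmin"))
```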
Create Azure Data Factory
Configure the required parameters to create the data factory:
- Resource Group
- GIT URL
- Repo Name
- Branch Name
- Root folder
After filling in these parameters, click Create to create the data factory. Once the Azure Data Factory is created, select Author and Monitor to configure and create the pipeline. This brings you to the Data Factory landing page, shown in the screenshot below.
The Let’s get started screen gives multiple options:
- Create Pipeline
- Create Pipeline from template
- Copy Data
- Configure SSIS Integration
- Setup Code Repository
Click on Copy Data to create the transaction. The steps involved in creating the transaction are:
- Properties – Here you enter the name and description for the copy data task. It also gives the ability to schedule the task.
- Source – Allows you to specify the source data store for the copy task. You can use an existing data store connection or specify a new data store.
Click on Create New connection and follow these steps:
- Select Amazon Redshift and click on continue to proceed
- Fill up the details below for AWS Redshift
- Host Name
- Database Name
- Click on test connection for testing communication and then finish to add the link
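If the test connection fails, it is worth confirming that the Redshift endpoint is even reachable from your network before digging into Data Factory settings. A minimal TCP probe (our own helper; port 5439 is Redshift's default):

```python
import socket

def can_reach(host: str, port: int = 5439, timeout: float = 3.0) -> bool:
    """Quick TCP reachability probe for the Redshift endpoint.

    A False result usually means the cluster's VPC security group does
    not allow inbound traffic from your network, or the host is wrong.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical cluster host):
# can_reach("mycluster.abc123.us-east-1.redshift.amazonaws.com")
```

This only proves network reachability; authentication and database permissions are still checked by Data Factory's own test connection.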
Destination – Once the link is created, go to the source database and select the table that you want to copy (in the screenshot below we have used the customers table). This generates a preview of the data and lets you view the table's schema by clicking the Schema tab.
After selecting the table, add the destination endpoint link. In the New Linked Service window select Azure SQL Data Warehouse to copy the data. The details get populated automatically from your subscription which is linked to the Azure portal.
The next step is to map the tables. Table mapping is done between the source and destination tables while ensuring that the schema of the source dataset matches the schema of the destination dataset.
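Schema mismatches are easiest to fix before this mapping step. The simplified check below compares two `{column: type}` dicts and reports the differences; it is our own sketch, and Data Factory's mapping UI additionally handles type conversions that a strict comparison like this would flag:

```python
def schema_mismatches(source: dict, destination: dict) -> list:
    """List differences between source and destination schema dicts."""
    problems = []
    for col, col_type in source.items():
        if col not in destination:
            problems.append(f"missing column: {col}")
        elif destination[col] != col_type:
            problems.append(
                f"type mismatch on {col}: {col_type} vs {destination[col]}"
            )
    for col in destination:
        if col not in source:
            problems.append(f"extra column: {col}")
    return problems

# Illustrative schemas for a customers table:
src = {"id": "integer", "name": "varchar(64)"}
dst = {"id": "integer", "name": "varchar(32)"}
print(schema_mismatches(src, dst))
# ['type mismatch on name: varchar(64) vs varchar(32)']
```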
- Settings – Settings gives more options for configuring the data movement. Two important parameters that need to be filled in are:
- Staging –Blob storage
- Unload – S3 Bucket
Both of these connections are linked in the same way as AWS Redshift and Azure SQL Data Warehouse; only the parameter values change, matching the configurations you created earlier. After configuring these settings, test the connection for both staging and unloading to ensure a smooth lift-and-shift migration.
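Under the hood, the unload setting causes Redshift to export data to S3 with an UNLOAD statement before Data Factory stages and loads it. For context, here is a sketch of such a statement; Data Factory builds its own internally, and the builder, table, bucket, and key values below are illustrative (UNLOAD also accepts an IAM role instead of raw keys):

```python
def unload_statement(table: str, s3_path: str,
                     access_key_id: str, secret_key: str) -> str:
    """Build a Redshift UNLOAD statement that exports a table to S3."""
    return (
        f"UNLOAD ('SELECT * FROM {table}') "
        f"TO '{s3_path}' "
        f"CREDENTIALS 'aws_access_key_id={access_key_id};"
        f"aws_secret_access_key={secret_key}' "
        "ALLOWOVERWRITE"   # overwrite files from earlier runs
    )

stmt = unload_statement("public.customers", "s3://migration-bucket/customers/",
                        "AKIAEXAMPLE", "secretEXAMPLE")
print(stmt)
```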
Once these steps are completed you can go to the summary tab to get a summary of the dataflow with a graphical view.
After checking the summary, click on finish to complete deployment. To check the status of the deployment, go to the monitor tab. The Data Warehouse tab allows you to verify the output.
Things to keep in mind
- You might face problems while creating the linked service for AWS Redshift in Data Factory. To troubleshoot, open the Redshift cluster, click on ‘Modify cluster’, and check whether a VPC is assigned to the cluster (along with other such parameters), or review the error log and apply the recommended changes.
- This procedure copies one table at a time from Redshift to Azure SQL Data Warehouse, and the schemas of both datasets should match to proceed further, since column mapping is a stage performed during pipeline building.
- Data is appended in the destination each time the copy data task is triggered (meaning old data will still be there and the new data will be added below it in the table).
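Because of that append behavior, reruns of the copy task duplicate rows unless you empty the destination tables first. One option is to run TRUNCATE statements (or set them as a pre-copy script on the sink) before re-triggering; the helper and table names below are illustrative:

```python
def pre_copy_script(tables) -> str:
    """Generate TRUNCATE statements to clear destination tables so a
    re-triggered copy task replaces data instead of appending to it."""
    return "\n".join(f"TRUNCATE TABLE {t};" for t in tables)

print(pre_copy_script(["dbo.customers", "dbo.orders"]))
# TRUNCATE TABLE dbo.customers;
# TRUNCATE TABLE dbo.orders;
```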