Data Flow with Azure Data Factory
Contents
What is Data Flow in Azure Data Factory?
Getting Started
Supported Datasets
Transformations
Conceptual Relationships
Resources
1. What is Data Flow in Azure Data Factory?
Data Flow in Azure Data Factory (currently available in limited preview) is a new feature that enables code-free data transformations directly within the Azure Data Factory visual authoring experience. Previously, data transformations were only possible within an ADF pipeline by orchestrating the execution of external business logic on a separate compute resource (e.g. a Notebook on Azure Databricks, Hive on HDInsight, U-SQL on Azure Data Lake Analytics, a Stored Procedure on Azure SQL DB/DWH, etc.).
Now with Data Flows, developers can visually build data transformations within Azure Data Factory itself and have them represented as step-based directed graphs that can be executed as an activity via a data pipeline. Note: the actual underlying execution engine that performs the transformations (e.g. SELECT, AGGREGATE, FILTER) is an Azure Databricks cluster, as each Data Flow is compiled into an Apache Spark executable.
Terminology Check: Data Flow in the context of Azure Data Factory is not to be confused with Dataflows in Power BI or Data Flow in SSIS.
2. Getting Started
Before proceeding, you will need to gain access to the limited preview (http://aka.ms/dataflowpreview). Once the Azure Subscription GUID has been whitelisted, you will see a new entry available when creating a Data Factory resource.
1. Create a new Data Factory. Set Version to V2 with data flow (preview).
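The V2 with data flow (preview) option is a portal selection that appears once the subscription has been whitelisted. If you prefer to script the factory itself, a minimal sketch using the azure-mgmt-datafactory Python SDK is shown below; the resource group, factory name, region and service principal details are placeholders, and the data flow preview capability still comes from the whitelisted subscription rather than an SDK flag.

```python
# Minimal sketch (not the portal flow above): create a V2 Data Factory with the
# azure-mgmt-datafactory Python SDK. The "with data flow (preview)" capability
# comes from the whitelisted subscription, not from an SDK flag, so the code
# simply creates a standard V2 factory. All names/IDs below are placeholders.
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<whitelisted-subscription-guid>"
credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<app-secret>", tenant="<tenant-id>"
)
adf_client = DataFactoryManagementClient(credentials, subscription_id)

factory = adf_client.factories.create_or_update(
    "adf-dataflow-rg",            # placeholder resource group
    "adf-dataflow-demo",          # placeholder factory name
    Factory(location="eastus"),   # pick a region that supports the preview
)
print(factory.provisioning_state)
```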
2. Create an Azure Databricks account. This will be required in order to link our Data Flow to an Azure Databricks cluster as the underlying transformation execution engine. If you are unfamiliar with Azure Databricks, check out this blog post.
3. Create a Databricks Cluster. Once the Databricks account has been successfully created, log on by navigating to the resource within the Azure portal and clicking Launch Workspace. To create a cluster, click Clusters > Create Cluster from the home screen, then configure it as per the steps below (a scripted alternative using the Databricks REST API is sketched after the list). Note: Azure Data Factory Data Flow currently only supports Databricks Runtime 5.0.
Set the Cluster Name.
Set the Databricks Runtime Version to 5.0.
Uncheck Enable Autoscaling.
Set Workers to 1.
Click Create Cluster.
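If you would rather script the cluster than click through the workspace UI, a minimal sketch against the Databricks Clusters REST API follows. It assumes you already hold a personal access token for the workspace (step 4 covers generating one in the UI); the workspace URL, token, node type and cluster name are placeholders, and the exact Runtime 5.0 version string can be confirmed via the workspace's Spark Versions API.

```python
# Minimal sketch: create the same interactive cluster via the Databricks
# Clusters REST API instead of the workspace UI. Assumes you already hold a
# personal access token for the workspace (step 4 shows how to generate one).
# Workspace URL, token and node type are placeholders.
import requests

workspace_url = "https://<region>.azuredatabricks.net"
token = "<databricks-personal-access-token>"

cluster_spec = {
    "cluster_name": "adf-dataflow-cluster",
    "spark_version": "5.0.x-scala2.11",  # Databricks Runtime 5.0, as required by Data Flow
    "node_type_id": "Standard_DS3_v2",   # assumed node type; any supported Azure VM size works
    "num_workers": 1,                    # fixed size, i.e. autoscaling disabled
}

resp = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # note the cluster id if you script the linked service later
```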
4. Generate an Azure Databricks Access Token. This will be required by Azure Data Factory to securely authenticate with the Databricks API. From the Azure Databricks home page, click the User icon in the top right-hand corner of the screen, select User Settings, click Generate New Token and click Generate. Copy the token, as it will be required in step 6 when we create the Azure Databricks Linked Service.
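Tokens can also be minted through the Databricks Token API, although that call itself must be authenticated, so an initial token generated in the UI (as above) is still needed as a bootstrap. A minimal sketch is below; the workspace URL, bootstrap token, comment and lifetime are placeholders.

```python
# Minimal sketch: mint a dedicated token for Azure Data Factory via the
# Databricks Token API. The call must be authenticated, so an initial token
# generated in the UI is still needed as a bootstrap.
import requests

workspace_url = "https://<region>.azuredatabricks.net"
bootstrap_token = "<existing-personal-access-token>"

resp = requests.post(
    workspace_url + "/api/2.0/token/create",
    headers={"Authorization": "Bearer " + bootstrap_token},
    json={"comment": "Azure Data Factory linked service", "lifetime_seconds": 90 * 24 * 3600},
)
resp.raise_for_status()
adf_token = resp.json()["token_value"]  # store securely, e.g. in Azure Key Vault
```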
5. Launch Azure Data Factory. Return to the Azure portal, navigate to the Azure Data Factory resource and click Author & Monitor.
6. Create an Azure Databricks Linked Service. Once Azure Data Factory has loaded, expand the side panel, navigate to Author > Connections and click New (Linked Service). Toggle the type to Compute, select Azure Databricks and click Continue. Populate the form as per the steps below, then click Test Connection and Finish (an SDK-based alternative is sketched after the list).
Set the Linked Service Name (e.g. AzureDatabricks1).
Select your Azure Subscription from the drop down menu.
Select your Databricks Workspace from the drop down menu.
Select Existing Interactive Cluster.
Paste your Access Token (generated in Step 4).
Select your Azure Databricks Cluster from the drop down menu.
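The same linked service can be defined programmatically. The sketch below uses the azure-mgmt-datafactory SDK and reuses the adf_client created in the factory sketch under step 1; the workspace URL, access token, cluster id and resource names are placeholders.

```python
# Minimal sketch: define the same Azure Databricks linked service with the
# azure-mgmt-datafactory SDK, reusing adf_client from the factory sketch in
# step 1. Workspace URL, access token, cluster id and names are placeholders.
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService,
    LinkedServiceResource,
    SecureString,
)

databricks_ls = AzureDatabricksLinkedService(
    domain="https://<region>.azuredatabricks.net",                 # workspace URL
    access_token=SecureString(value="<databricks-access-token>"),  # token from step 4
    existing_cluster_id="<cluster-id>",                            # interactive cluster from step 3
)

adf_client.linked_services.create_or_update(
    "adf-dataflow-rg",     # placeholder resource group
    "adf-dataflow-demo",   # placeholder factory name
    "AzureDatabricks1",    # linked service name used in the portal steps above
    LinkedServiceResource(properties=databricks_ls),
)
```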
7. Create a Data Flow. Hover over Data Flows beneath Factory Resources, click on the ellipsis (…) and select Add Dataflow. Toggle Debug ON, select the Azure Databricks Linked Service, select the associated Azure Databricks Cluster and click Start. If successful, you should see the green icon next to the cluster name indicating that the resource is running. Note: This can take ~ 5 minutes if the cluster is spinning up from a cold start.
By this point you should have the following basics in place to build and debug Data Flows:
Azure Databricks
Azure Databricks Cluster (Runtime 5.0)
Azure Databricks Access Token
Azure Data Factory V2 with Data Flow (preview)
Azure Databricks Linked Service
Azure Data Factory Data Flow
3. Supported Datasets
Data Flows in Azure Data Factory currently support five dataset types when defining a source or a sink: Azure Blob Storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Data Warehouse, and Azure SQL Database.
Note: If you require connectivity beyond the datasets supported by a Data Flow, a Copy Activity can be used in conjunction with the Data Flow within a Data Pipeline (see the sketch below).
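As an illustration of that pattern, the sketch below uses a Copy Activity to stage data from a source a Data Flow cannot read directly (an assumed SQL dataset) into Azure Blob Storage, where a Data Flow can then consume it. It reuses the adf_client from the Getting Started sketches; the dataset, pipeline and resource names are placeholders, and the datasets are assumed to already exist in the factory.

```python
# Minimal sketch: stage data that a Data Flow cannot read directly (here an
# assumed SQL dataset) into Azure Blob Storage with a Copy Activity, so a
# downstream Data Flow can consume it. Reuses adf_client from the Getting
# Started sketches; dataset, pipeline and resource names are placeholders.
from azure.mgmt.datafactory.models import (
    BlobSink,
    CopyActivity,
    DatasetReference,
    PipelineResource,
    SqlSource,
)

copy_activity = CopyActivity(
    name="StageToBlob",
    inputs=[DatasetReference(reference_name="OnPremSqlDataset")],    # placeholder source dataset
    outputs=[DatasetReference(reference_name="StagedBlobDataset")],  # placeholder Blob dataset
    source=SqlSource(),
    sink=BlobSink(),
)

adf_client.pipelines.create_or_update(
    "adf-dataflow-rg",      # placeholder resource group
    "adf-dataflow-demo",    # placeholder factory name
    "StagingPipeline1",     # placeholder pipeline name
    PipelineResource(activities=[copy_activity]),
)
```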
4. Transformations
The table below summarises the list of transformations currently available. Transforms can be added to a Data Flow by clicking on the ‘+’ icon next to a node. This action is available once an initial source has been defined.
| Type | Transform | Description |
| --- | --- | --- |
| Dataset | Source | Source for your data flow. |
| | Sink | Destination for your data flow. |
| Multiple Inputs/Outputs | New Branch | Create a new flow branch with the same data. |
| | Join | Join data from two streams based on a condition. |
| | Conditional Split | Route data into different streams based on conditions. |
| | Union | Collect data from multiple streams. |
| | Lookup | Look up additional data from another stream. |
| Schema Modifier | Derived Column | Compute new columns based on existing ones. |
| | Aggregate | Calculate aggregations on the stream. |
| | Surrogate Key | Add a surrogate key column to the output stream from a specific value. |
| | Pivot | Transform row values into individual columns. |
| | Unpivot | Transform column values into individual rows. |
| Row Modifier | Exists | Check the existence of data in another stream. |
| | Select | Choose columns to flow to the next stream. |
| | Filter | Filter rows in the stream based on a condition. |
| | Sort | Order data in the stream based on column(s). |
5. Conceptual Relationships
While a Data Flow is a top-level resource within Azure Data Factory, the execution of a Data Flow is orchestrated by a Data Pipeline. This is accomplished by including a Data Flow Activity and associating that activity with both the Data Flow itself and an Azure Databricks Linked Service. Once published, the pipeline can be triggered to run on demand (manually) or on a recurring basis via a schedule or tumbling window, or based on an event (e.g. Blob Created, Blob Deleted).
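For example, once the pipeline has been published, an on-demand (manual) run can be started and monitored with the same SDK client used in the Getting Started sketches; the resource group, factory and pipeline names below are placeholders, and the pipeline itself is assumed to have been authored and published in the ADF UI.

```python
# Minimal sketch: start and monitor an on-demand run of a published pipeline
# that contains the Data Flow activity, reusing adf_client from the Getting
# Started sketches. Resource group, factory and pipeline names are placeholders.
run = adf_client.pipelines.create_run(
    "adf-dataflow-rg",      # placeholder resource group
    "adf-dataflow-demo",    # placeholder factory name
    "DataFlowPipeline1",    # placeholder pipeline name
    parameters={},
)

status = adf_client.pipeline_runs.get("adf-dataflow-rg", "adf-dataflow-demo", run.run_id)
print(status.status)  # e.g. InProgress / Succeeded / Failed
```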
6. Resources
For continued learning and points of reference, check out the resources below.