Data Flow with Azure Data Factory
Contents
What is Data Flow in Azure Data Factory?
Getting Started
Supported Datasets
Transformations
Conceptual Relationships
Resources
1. What is Data Flow in Azure Data Factory?
Data Flow in Azure Data Factory (currently available in limited preview) is a new feature that enables code-free data transformations directly within the Azure Data Factory visual authoring experience. Previously, data transformations were only possible within an ADF pipeline by orchestrating the execution of external business logic on a separate compute resource (e.g. a Notebook on Azure Databricks, Hive on HDInsight, U-SQL on Azure Data Lake Analytics, a Stored Procedure on Azure SQL DB/DWH, etc.).
Now with Data Flows, developers can visually build data transformations within Azure Data Factory itself and have them represented as step-based directed graphs that can be executed as an activity via a data pipeline. Note: the actual underlying execution engine that performs the transformations (e.g. SELECT, AGGREGATE, FILTER) is an Azure Databricks cluster, as each Data Flow is compiled into an Apache Spark executable.
Terminology Check: Data Flow in the context of Azure Data Factory is not to be confused with Dataflows in Power BI or Data Flow in SSIS.
2. Getting Started
Before proceeding, you will need to gain access to the limited preview (http://aka.ms/dataflowpreview). Once the Azure Subscription GUID has been whitelisted, you will see a new entry available when creating a Data Factory resource.
1. Create a new Data Factory. Set Version to V2 with data flow (preview).
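The V2 with data flow (preview) option is a portal selection that appears once the subscription has been whitelisted. If you prefer to script the factory itself, a minimal sketch using the azure-mgmt-datafactory Python SDK is shown below; the resource group, factory name, region and service principal details are placeholders, and the data flow preview capability still comes from the whitelisted subscription rather than an SDK flag.

```python
# Minimal sketch (not the portal flow above): create a V2 Data Factory with the
# azure-mgmt-datafactory Python SDK. The "with data flow (preview)" capability
# comes from the whitelisted subscription, not from an SDK flag, so the code
# simply creates a standard V2 factory. All names/IDs below are placeholders.
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<whitelisted-subscription-guid>"
credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<app-secret>", tenant="<tenant-id>"
)
adf_client = DataFactoryManagementClient(credentials, subscription_id)

factory = adf_client.factories.create_or_update(
    "adf-dataflow-rg",            # placeholder resource group
    "adf-dataflow-demo",          # placeholder factory name
    Factory(location="eastus"),   # pick a region that supports the preview
)
print(factory.provisioning_state)
```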
2. Create an Azure Databricks account. This will be required in order to link our Data Flow to an Azure Databricks cluster as the underlying transformation execution engine. If you are unfamiliar with Azure Databricks, check out this blog post.
3. Create a Databricks Cluster. Once the Databricks account has been successfully created, log on by navigating to the resource within the Azure portal and clicking Launch Workspace. To create a cluster, click Clusters > Create Cluster from the home screen, then configure it as per the steps below (a scripted alternative using the Databricks REST API is sketched after the list). Note: Azure Data Factory Data Flow currently only supports Databricks Runtime 5.0.
Set the Cluster Name.
Set the Databricks Runtime Version to 5.0.
Uncheck Enable Autoscaling.
Set Workers to 1.
Click Create Cluster.
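If you would rather script the cluster than click through the workspace UI, a minimal sketch against the Databricks Clusters REST API follows. It assumes you already hold a personal access token for the workspace (step 4 covers generating one in the UI); the workspace URL, token, node type and cluster name are placeholders, and the exact Runtime 5.0 version string can be confirmed via the workspace's Spark Versions API.

```python
# Minimal sketch: create the same interactive cluster via the Databricks
# Clusters REST API instead of the workspace UI. Assumes you already hold a
# personal access token for the workspace (step 4 shows how to generate one).
# Workspace URL, token and node type are placeholders.
import requests

workspace_url = "https://<region>.azuredatabricks.net"
token = "<databricks-personal-access-token>"

cluster_spec = {
    "cluster_name": "adf-dataflow-cluster",
    "spark_version": "5.0.x-scala2.11",  # Databricks Runtime 5.0, as required by Data Flow
    "node_type_id": "Standard_DS3_v2",   # assumed node type; any supported Azure VM size works
    "num_workers": 1,                    # fixed size, i.e. autoscaling disabled
}

resp = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # note the cluster id if you script the linked service later
```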
4. Generate an Azure Databricks Access Token. This will be required by Azure Data Factory to securely authenticate with the Databricks API. From the Azure Databricks home page, click the User icon in the top right-hand corner of the screen, select User Settings, click Generate New Token and click Generate. Copy the token, as it will be required in step 6 when we create the Azure Databricks Linked Service.
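Tokens can also be minted through the Databricks Token API, although that call itself must be authenticated, so an initial token generated in the UI (as above) is still needed as a bootstrap. A minimal sketch is below; the workspace URL, bootstrap token, comment and lifetime are placeholders.

```python
# Minimal sketch: mint a dedicated token for Azure Data Factory via the
# Databricks Token API. The call must be authenticated, so an initial token
# generated in the UI is still needed as a bootstrap.
import requests

workspace_url = "https://<region>.azuredatabricks.net"
bootstrap_token = "<existing-personal-access-token>"

resp = requests.post(
    workspace_url + "/api/2.0/token/create",
    headers={"Authorization": "Bearer " + bootstrap_token},
    json={"comment": "Azure Data Factory linked service", "lifetime_seconds": 90 * 24 * 3600},
)
resp.raise_for_status()
adf_token = resp.json()["token_value"]  # store securely, e.g. in Azure Key Vault
```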
5. Launch Azure Data Factory. Return to the Azure portal, navigate to the Azure Data Factory resource and click Author & Monitor.
6. Create an Azure Databricks Linked Service. Once Azure Data Factory has loaded, expand the side panel, navigate to Author > Connections and click New (Linked Service). Toggle the type to Compute, select Azure Databricks and click Continue. Populate the form as per the steps below, then click Test Connection and Finish (an SDK-based alternative is sketched after the list).
Set the Linked Service Name (e.g. AzureDatabricks1).
Select your Azure Subscription from the drop down menu.
Select your Databricks Workspace from the drop down menu.
Select Existing Interactive Cluster.
Paste your Access Token (generated in Step 4).
Select your Azure Databricks Cluster from the drop down menu.
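The same linked service can be defined programmatically. The sketch below uses the azure-mgmt-datafactory SDK and reuses the adf_client created in the factory sketch under step 1; the workspace URL, access token, cluster id and resource names are placeholders.

```python
# Minimal sketch: define the same Azure Databricks linked service with the
# azure-mgmt-datafactory SDK, reusing adf_client from the factory sketch in
# step 1. Workspace URL, access token, cluster id and names are placeholders.
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService,
    LinkedServiceResource,
    SecureString,
)

databricks_ls = AzureDatabricksLinkedService(
    domain="https://<region>.azuredatabricks.net",                 # workspace URL
    access_token=SecureString(value="<databricks-access-token>"),  # token from step 4
    existing_cluster_id="<cluster-id>",                            # interactive cluster from step 3
)

adf_client.linked_services.create_or_update(
    "adf-dataflow-rg",     # placeholder resource group
    "adf-dataflow-demo",   # placeholder factory name
    "AzureDatabricks1",    # linked service name used in the portal steps above
    LinkedServiceResource(properties=databricks_ls),
)
```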
7. Create a Data Flow. Hover over Data Flows beneath Factory Resources, click on the ellipsis (…) and select Add Dataflow. Toggle Debug ON, select the Azure Databricks Linked Service, select the associated Azure Databricks Cluster and click Start. If successful, you should see the green icon next to the cluster name indicating that the resource is running. Note: This can take ~ 5 minutes if the cluster is spinning up from a cold start.
By this point you should have the following basics in place to build and debug Data Flows:
Azure Databricks
Azure Databricks Cluster (Runtime 5.0)
Azure Databricks Access Token
Azure Data Factory V2 with Data Flow (preview)
Azure Databricks Linked Service
Azure Data Factory Data Flow
3. Supported Datasets
Data Flows in Azure Data Factory currently support five dataset types when defining a source or a sink: Azure Blob Storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Data Warehouse, and Azure SQL Database.
Note: If you require connectivity beyond the datasets supported by a Data Flow, a Copy Activity can be used in conjunction with the Data Flow within a Data Pipeline (see the sketch below).
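As an illustration of that pattern, the sketch below uses a Copy Activity to stage data from a source a Data Flow cannot read directly (an assumed SQL dataset) into Azure Blob Storage, where a Data Flow can then consume it. It reuses the adf_client from the Getting Started sketches; the dataset, pipeline and resource names are placeholders, and the datasets are assumed to already exist in the factory.

```python
# Minimal sketch: stage data that a Data Flow cannot read directly (here an
# assumed SQL dataset) into Azure Blob Storage with a Copy Activity, so a
# downstream Data Flow can consume it. Reuses adf_client from the Getting
# Started sketches; dataset, pipeline and resource names are placeholders.
from azure.mgmt.datafactory.models import (
    BlobSink,
    CopyActivity,
    DatasetReference,
    PipelineResource,
    SqlSource,
)

copy_activity = CopyActivity(
    name="StageToBlob",
    inputs=[DatasetReference(reference_name="OnPremSqlDataset")],    # placeholder source dataset
    outputs=[DatasetReference(reference_name="StagedBlobDataset")],  # placeholder Blob dataset
    source=SqlSource(),
    sink=BlobSink(),
)

adf_client.pipelines.create_or_update(
    "adf-dataflow-rg",      # placeholder resource group
    "adf-dataflow-demo",    # placeholder factory name
    "StagingPipeline1",     # placeholder pipeline name
    PipelineResource(activities=[copy_activity]),
)
```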
4. Transformations
The table below summarises the list of transformations currently available. Transforms can be added to a Data Flow by clicking on the ‘+’ icon next to a node. This action is available once an initial source has been defined.
| Type | Transform | Description |
| --- | --- | --- |
| Dataset | Source | Source for your data flow. |
| | Sink | Destination for your data flow. |
| Multiple Inputs/Outputs | New Branch | Create a new flow branch with the same data. |
| | Join | Join data from two streams based on a condition. |
| | Conditional Split | Route data into different streams based on conditions. |
| | Union | Collect data from multiple streams. |
| | Lookup | Look up additional data from another stream. |
| Schema Modifier | Derived Column | Compute new columns based on existing ones. |
| | Aggregate | Calculate aggregations on the stream. |
| | Surrogate Key | Add a surrogate key column to the output stream from a specific value. |
| | Pivot | Transform row values into individual columns. |
| | Unpivot | Transform column values into individual rows. |
| Row Modifier | Exists | Check the existence of data in another stream. |
| | Select | Choose columns to flow to the next stream. |
| | Filter | Filter rows in the stream based on a condition. |
| | Sort | Order data in the stream based on column(s). |
5. Conceptual Relationships
While a Data Flow is a top-level resource within Azure Data Factory, the execution of a Data Flow is orchestrated by a Data Pipeline. This is accomplished by including a Data Flow Activity and associating that activity with both the Data Flow itself and an Azure Databricks Linked Service. Once published, the pipeline can be triggered to run on demand (manually) or on a recurring basis via a schedule or tumbling window, or based on an event (e.g. Blob Created, Blob Deleted).
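For example, once the pipeline has been published, an on-demand (manual) run can be started and monitored with the same SDK client used in the Getting Started sketches; the resource group, factory and pipeline names below are placeholders, and the pipeline itself is assumed to have been authored and published in the ADF UI.

```python
# Minimal sketch: start and monitor an on-demand run of a published pipeline
# that contains the Data Flow activity, reusing adf_client from the Getting
# Started sketches. Resource group, factory and pipeline names are placeholders.
run = adf_client.pipelines.create_run(
    "adf-dataflow-rg",      # placeholder resource group
    "adf-dataflow-demo",    # placeholder factory name
    "DataFlowPipeline1",    # placeholder pipeline name
    parameters={},
)

status = adf_client.pipeline_runs.get("adf-dataflow-rg", "adf-dataflow-demo", run.run_id)
print(status.status)  # e.g. InProgress / Succeeded / Failed
```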
6. Resources
For continued learning and points of reference, check out the resources below.