Azure Purview
Azure Purview is Microsoft’s long-anticipated evolution of Azure Data Catalog: a unified data governance service that lets organizations manage and govern data in a central location, while enabling users to discover data across the entire data estate, whether the sources are on-premises, multi-cloud, or software-as-a-service.
Headline features include:
Automated metadata extraction, lineage identification, and data classification.
Unified map of your entire data estate overlaid with business context.
The ability to glean insights and visually assess data across your organization.
Note: Once Azure Purview has been deployed, there are two methods of interfacing with the service: Azure Purview Studio (the UI-based method) and the REST API (Apache Atlas APIs). The latter gives developers the option of consuming the service programmatically, allowing Purview actions to form part of a pipeline or an automated process, or to underpin a custom user experience.
1. Service Tiers
When provisioning an Azure Purview account, you will notice under “Configuration” there is a list of feature modules that can be selected, categorised under Catalog and Data Insights. While Azure Purview is under public preview, all three modules (C0, C1, and D0) are currently free. Note: This will be subject to change once the service becomes generally available.
The modules currently available include:
C0 - Base functionality included with the platform (search, browse, classify, automated scanning).
C1 - Additional set of catalog features including business glossary and data lineage visualization.
D0 - Glean key insights with a bird’s-eye view of your data landscape.
The table below provides a summary view of Azure Purview features by module.
Feature | C0 | C1 | D0 |
---|---|---|---|
Source registration | ✔ | | |
Automated Metadata and Lineage Extraction | ✔ | | |
Classification | ✔ | | |
Apache Atlas API | ✔ | | |
Data Discovery (Search and Browse) | ✔ | | |
Business Glossary | | ✔ | |
Data Lineage Visualization | | ✔ | |
Catalog Insights (Asset, Scan, Glossary) | | | ✔ |
Sensitive Data Insights | | | ✔ |
2. Capacity
As with feature modules, the provisioned capacity (aka platform size) needs to be considered when creating the Azure Purview account. Note that the capacity of the Data Map cannot currently be changed once the account is provisioned.
Platform size is measured in “Capacity Units”.
A Capacity Unit is a provisioned set of resources to keep your Data Map up and running.
There are currently two options to select from: 4 or 16.
One capacity unit is able to support approximately 1 API call per second.
This capacity is used by user experiences in Azure Purview Studio or Apache Atlas APIs.
Resource Limits by Platform Size
Resource | 4 CUs | 16 CUs |
---|---|---|
API calls per account | 10M calls/month | 40M calls/month |
Storage per account | 10 GB | 40 GB |
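As a rough sanity check, the monthly API limits in the table line up with the "approximately 1 API call per second per capacity unit" figure. The sketch below simply multiplies that rate out over a month; the 30-day month and the function name are assumptions made for illustration, not part of the service definition.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_api_calls(capacity_units, calls_per_cu_per_second=1, days=30):
    """Estimate sustained API calls per month for a given platform size.

    Assumes a constant rate of ~1 call per second per capacity unit
    and a 30-day month (both are simplifying assumptions).
    """
    return capacity_units * calls_per_cu_per_second * SECONDS_PER_DAY * days


for cu in (4, 16):
    print(f"{cu} CUs ~ {monthly_api_calls(cu):,} calls/month")
```

This gives roughly 10.4M calls for 4 CUs and 41.5M calls for 16 CUs, which is in the same range as the 10M and 40M limits quoted above.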
3. Getting Started (Create a Purview Account)
Sign in to the Azure portal with your Azure account.
Configure your subscription by registering the following resource providers:
Microsoft.Purview
Microsoft.Storage
Microsoft.EventHub
Create an Azure Purview account.
For more information, see how to create an Azure Purview account using the Azure portal or Azure PowerShell.
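The resource provider registration step can also be scripted. The sketch below only builds the Azure CLI commands (`az provider register --namespace ...`) for the three providers listed above and prints them. It assumes the Azure CLI is installed and already signed in if you go on to run them; the helper name is hypothetical.

```python
import shlex

# Resource providers listed in the steps above.
PROVIDERS = ["Microsoft.Purview", "Microsoft.Storage", "Microsoft.EventHub"]


def registration_commands(providers=PROVIDERS):
    """Return one 'az provider register' command (as an argument list) per provider."""
    return [["az", "provider", "register", "--namespace", ns] for ns in providers]


for cmd in registration_commands():
    # Print the command; to execute it instead, pass the list to
    # subprocess.run(cmd, check=True) on a machine with the Azure CLI.
    print(shlex.join(cmd))
```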
4. Concepts
The list below summarises the key concepts within the Azure Purview service and how they relate to one another. The Concept Map section that follows gives a visual representation of this mental model.
Collection
A group of related data sources. To create a hierarchy of collections, assign higher-level collections as a parent to lower-level collections. For example, the parent collection is the company (Contoso) and the lower-level collections are the business units (Finance, Sales, Marketing, etc.).
A Collection can have zero or many Sources.
Source
A source defines the connection information needed for Azure Purview to connect to external resources (e.g. Azure SQL Database, Azure Data Lake Storage, Power BI, etc).
A Source can have zero or one Collection.
A Source can have zero or many Assets.
A Source can have zero or many Scans.
Asset
An asset is an instance of an asset type. An asset can be discovered using the Azure Purview search service after the metadata has been indexed. For example, an Azure SQL Database source has an asset type of Azure SQL Table, with dbo.Sales as an instance.
An Asset has one Source.
An Asset can have zero or many Terms.
An Asset can have zero or many Classifications.
Scan
Scanning is the process by which the catalog connects directly to a data source on a user-specified schedule. In Purview there are three levels of scanning:
L1. Basic - Filename, Size, Fully Qualified Name, etc.
L2. Schema - For structured file types and database tables.
L3. Classification - Sample is subject to system and custom classification rules.
A Scan has one Source.
A Scan has one Credential.
A Scan has one Scan Rule Set.
Scan Rule Set
A container for grouping a set of scan rules together. For example, a Scan Rule Set for the Source Type Azure Data Lake Storage specifies the File Types (e.g. CSV, PARQUET, etc) in scope for schema extraction and classification, and the Classification Rules (System or Custom) that will be run on the dataset.
A Scan Rule Set has one Source Type.
A Scan Rule Set can be used by zero or many Scans.
A Scan Rule Set has one or many File Types.
A Scan Rule Set has zero or many Classification Rules.
Glossary
A consistent and curated collection of business terms and definitions (i.e. the business vocabulary for an organization).
A Glossary has zero or many Terms.
Term
An entry in the glossary that conveys a business concept with a meaningful description. For example, a Term [State Province ID] with a Term Description “Unique identification number for the state or province.” could be associated with the column StateProvinceID within the Person.Address table schema (i.e. Asset).
A Term can be associated with zero or many Assets.
A Term can be associated with zero or many Columns within a Schema of an Asset.
A Term is based on one Term Template.
Term Template
A term template determines the fields (or attributes) that can be used during the creation of a term.
A Term Template can be used by zero or many Terms.
Classification
A tag applied to a data asset at the table, column, or file level, that identifies what data exists in the asset.
A Classification is associated with zero or many Classification Rules.
A Classification can be associated with zero or many Assets.
A Classification can be associated with zero or many Columns within a Schema of an Asset.
Classification Rule
Azure Purview provides a default set of classification rules, which the scanner uses to automatically detect and tag certain data types. You can also add your own classification rules, using regular expressions to identify patterns within the data stored in a field or within the name of a column.
A Classification Rule belongs to one Classification.
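To make the two kinds of pattern concrete, here is a minimal sketch of a custom rule that checks either the column name or a sample of the column's values against a regular expression. It is an illustration of the idea only, not Purview's implementation: the class, the `min_match_ratio` threshold, and the "Contoso Employee ID" classification with its `EMP-` format are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegexClassificationRule:
    classification: str                   # the one Classification this rule belongs to
    data_pattern: Optional[str] = None    # regex applied to sampled values
    column_pattern: Optional[str] = None  # regex applied to the column name
    min_match_ratio: float = 0.6          # hypothetical threshold for data matches

    def matches(self, column_name: str, sample_values: List[str]) -> bool:
        # A column-name match is enough on its own.
        if self.column_pattern and re.search(self.column_pattern, column_name, re.IGNORECASE):
            return True
        # Otherwise require enough of the non-empty sample to match the data pattern.
        if self.data_pattern:
            values = [v for v in sample_values if v]
            if not values:
                return False
            hits = sum(1 for v in values if re.fullmatch(self.data_pattern, v))
            return hits / len(values) >= self.min_match_ratio
        return False


rule = RegexClassificationRule(
    classification="Contoso Employee ID",
    data_pattern=r"EMP-\d{6}",
    column_pattern=r"^emp(loyee)?_?id$",
)

print(rule.matches("EmployeeID", []))                               # matched by column name
print(rule.matches("code", ["EMP-000123", "EMP-000456", "n/a"]))    # 2 of 3 values match
print(rule.matches("code", ["abc", "EMP-000123"]))                  # only 1 of 2 values match
```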
Sensitivity Labels
A type of annotation that allows you to classify how sensitive certain data is in your organization. Note: This functionality needs to be enabled outside of Azure Purview within Microsoft Information Protection. For example, an Azure Blob Storage asset has a classification of [Social Security Number], and the sensitivity label for that asset is “Secret”.
A Sensitivity Label can be associated with zero or many Assets.
A Sensitivity Label can be associated with zero or many Columns within a Schema of an Asset.
Incremental Scan
Scans only the assets that have been modified or created since the last scan run.
Full Scan
Scans all the assets that the scan has been scoped to.
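The difference between the two scan types can be sketched as a simple filter. How Purview detects changes internally is not described here, so the example assumes each asset carries a last-modified timestamp; the function and file names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def assets_in_scope(assets: List[Dict], last_scan: Optional[datetime] = None) -> List[str]:
    """Return the names of assets a scan run would process.

    last_scan=None models a full scan (everything in scope);
    otherwise only assets modified or created after last_scan are returned.
    """
    if last_scan is None:
        return [a["name"] for a in assets]
    return [a["name"] for a in assets if a["modified"] > last_scan]


assets = [
    {"name": "sales_2020.csv", "modified": datetime(2020, 11, 1, tzinfo=timezone.utc)},
    {"name": "sales_2021.csv", "modified": datetime(2021, 1, 15, tzinfo=timezone.utc)},
]
last_scan = datetime(2020, 12, 1, tzinfo=timezone.utc)

print(assets_in_scope(assets))             # full scan: both files
print(assets_in_scope(assets, last_scan))  # incremental scan: only the newer file
```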
5. Concept Map
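The relationships listed in the Concepts section can also be sketched as a small data model. The classes below are an illustrative reading of those "has one" / "zero or many" statements, not Purview's actual schema; the field names and the source name `contoso-sql` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    name: str
    description: str
    template: str = "System default"  # a Term is based on one Term Template


@dataclass
class Column:
    name: str
    terms: List[Term] = field(default_factory=list)
    classifications: List[str] = field(default_factory=list)


@dataclass
class Asset:
    name: str
    asset_type: str
    columns: List[Column] = field(default_factory=list)
    terms: List[Term] = field(default_factory=list)
    classifications: List[str] = field(default_factory=list)


@dataclass
class ScanRuleSet:
    source_type: str                                   # one Source Type
    file_types: List[str]                              # one or many File Types
    classification_rules: List[str] = field(default_factory=list)


@dataclass
class Scan:
    name: str
    credential: str         # a Scan has one Credential
    rule_set: ScanRuleSet   # a Scan has one Scan Rule Set


@dataclass
class Source:
    name: str
    source_type: str
    assets: List[Asset] = field(default_factory=list)
    scans: List[Scan] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    parent: Optional["Collection"] = None
    sources: List[Source] = field(default_factory=list)

    def path(self) -> str:
        """Return the collection hierarchy as 'Parent/Child'."""
        return self.name if self.parent is None else f"{self.parent.path()}/{self.name}"


contoso = Collection("Contoso")
finance = Collection("Finance", parent=contoso)

term = Term("State Province ID", "Unique identification number for the state or province.")
address = Asset("Person.Address", "Azure SQL Table",
                columns=[Column("StateProvinceID", terms=[term])])

sql_source = Source("contoso-sql", "Azure SQL Database", assets=[address])
finance.sources.append(sql_source)

print(finance.path())
print(address.columns[0].terms[0].name)
```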
6. Purview Studio
The following items are the key areas within Purview Studio, the UI-based method of interfacing with the Azure Purview service.
Home
Quick actions, recently accessed items, owned items, and useful links.
Sources
Create collections, register data sources and set up scans.
Glossary
Manage glossary terms, search glossary terms, manage term templates and custom attributes, import and export terms using .CSV.
Insights
Get insights on your data.
Management Center
Metadata management - Classifications, Data sources, Integration runtimes, Metrics, Access control, Credentials, Scan rule sets, Data factories and Data share connections.
Knowledge Center
Discover videos and tutorials about Purview.
7. Search & Data Discovery
After a data source is registered with Azure Purview and a scan has been performed, the extracted metadata is indexed by the search service and surfaced by the search experience for easy discovery.
Bringing the search box into focus will initially display:
Search History - Previously searched terms, sorted in order of most recently accessed.
Recently Accessed - Previously viewed assets, sorted in order of most recently accessed.
Once you begin typing, the service will show the following (where applicable):
Your Recent Searches - Recent searches that fuzzy match your search term.
Search Suggestions - Suggested searches that fuzzy match your search term.
Asset Suggestions - Suggested assets with direct links that fuzzy match your search term.
Once the search is executed, Azure Purview passes the search phrase to a modified search API called “advanced” (api/atlas/v2/search/advanced), which leverages Azure Search behind the scenes to return the most relevant results. The results page is made up of the following elements:
Filters - Filter results by Asset Type, Classification, Contact, (Sensitivity) Label, or Glossary Term.
Total Number of Search Results - Maximum of 25 results per page.
Sort By - The ability to sort by Relevance or Name.
Search Results - List of assets returned by the search API related to the search phrase.
Page Navigation - Navigate forward or backward using Previous or Next, or jump to a specific page.
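For readers who want to call the search endpoint directly, the sketch below builds (but does not send) a request to the `api/atlas/v2/search/advanced` path mentioned above, using the 25-results-per-page figure to compute the offset for a given page. Several details are assumptions rather than confirmed API facts: the endpoint host format, the body field names (`keywords`, `limit`, `offset`), and the use of a bearer token. Check them against the current API reference before relying on them.

```python
import json
from urllib.request import Request

PAGE_SIZE = 25  # maximum results per page, as shown in the search UI


def build_search_request(endpoint, keywords, page=1, token="<access-token>"):
    """Build a POST request for the 'advanced' search API without sending it.

    endpoint: base URL of the catalog, e.g. "https://<account>.catalog.purview.azure.com"
              (host format is an assumption).
    """
    body = {
        "keywords": keywords,               # assumed field name
        "limit": PAGE_SIZE,                 # assumed field name
        "offset": (page - 1) * PAGE_SIZE,   # assumed field name
    }
    return Request(
        url=f"{endpoint.rstrip('/')}/api/atlas/v2/search/advanced",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_search_request("https://contoso.catalog.purview.azure.com", "sales", page=3)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

To actually execute the search, pass the request to `urllib.request.urlopen(req)` with a valid access token and parse the JSON response.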
The example below shows how the JSON response from the search API is represented in each tile within the Azure Purview Studio search results. For a deeper understanding of how the service handles types (e.g. Asset Type vs. Entity Type), see the Apache Atlas Type System documentation.
8. Lineage
The ability to capture lineage is a key feature of the Azure Purview service. Within a lineage visualisation, each asset is represented by a rectangular box (e.g. SQL Table, CSV File), while each process is represented by a rounded-edge box (e.g. Azure Data Factory Copy Activity, SSIS Package Activity). For more information, check out the Azure Purview Lineage User Guide.
Automated lineage capture is currently supported for:
Data Factory (Copy Activity, Data Flow Activity, Execute SSIS Package Activity)
Azure Data Share (Share Snapshot)
Teradata (Stored Procedures)
Power BI (Datasets, Dataflows, Reports, Dashboards)
Custom lineage is also supported via the Atlas hooks and REST API.
Note: Focusing on any of the assets will reveal additional properties and quick links to “Switch to asset” and “Open in…” where applicable.
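Custom lineage via the REST API generally comes down to creating a process entity that points at its input and output assets. The sketch below builds such a payload in the Apache Atlas v2 entity format, where `Process` is a built-in Atlas type with `inputs` and `outputs` attributes; it would typically be POSTed to `/api/atlas/v2/entity`. The asset type names and qualified names used here are illustrative assumptions, and the helper functions are hypothetical.

```python
import json


def asset_ref(type_name, qualified_name):
    """Reference an existing asset by its type and unique qualifiedName."""
    return {"typeName": type_name, "uniqueAttributes": {"qualifiedName": qualified_name}}


def build_process_entity(name, qualified_name, inputs, outputs):
    """Build an Atlas 'Process' entity linking input assets to output assets."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": inputs,
                "outputs": outputs,
            },
        }
    }


# Illustrative type names and qualified names (assumptions, not taken from a real catalog).
source_table = asset_ref("azure_sql_table", "mssql://contoso.database.windows.net/sales/dbo/Sales")
target_file = asset_ref("azure_datalake_gen2_path", "https://contoso.dfs.core.windows.net/raw/sales.csv")

payload = build_process_entity(
    name="Export Sales to Data Lake",
    qualified_name="custom://contoso/export-sales",
    inputs=[source_table],
    outputs=[target_file],
)

print(json.dumps(payload, indent=2))
```

Once created, the process would appear in the lineage view as a rounded-edge box between the two asset boxes.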
9. Glossary
The glossary is a consistent and curated collection of business terms and definitions (aka data dictionary) that can be attributed to Assets or Columns (found within the Schema tab of an Asset).
Glossary UI
Action Menu Items - New Term, Manage Term Templates, Edit, Import Terms, Export Terms, Delete, Refresh.
Filter By - Keyword, Term Template (System Default or Custom), Status (Draft, Approved, Expired, Alert), Contact.
Total Number of Glossary Results
Jump To - Jump to a section in the Glossary.
View - List view or Table view (more condensed, additional detail).
Glossary Search Results - List of glossary terms that match the search criteria.
The example below shows how each tile within the Azure Purview Studio glossary results is represented.
10. Security
To interact with the Azure Purview service, users must be assigned one or more of the pre-defined Purview roles. For more information, see Catalog permissions. Note: By default, the initial creator of the Azure Purview account is granted the capabilities of Purview Data Curator and Purview Data Source Administrator. All other accounts need to have a role assigned.
Role | Role Description |
---|---|
Purview Data Reader | Read all content except for scan bindings. |
Purview Data Curator | Read all content except for scan bindings, edit information about assets, edit and apply classification definitions and glossary terms. |
Purview Data Source Administrator | Manage all aspects of scanning data into Azure Purview, with no read/write access to content in Azure Purview. Does not have access to the Purview Portal on its own (the user also needs to be a Data Reader or Data Curator). |
11. Sources
Data sources can be registered and logically grouped under collections within the Sources page. Once registered, sources are represented as a visual data map with action buttons for editing, viewing source details, and setting up and triggering a scan.
Azure Purview currently supports the following data sources:
Azure Blob Storage
Azure Data Lake Storage Gen1/Gen2
Azure Cosmos DB (SQL API)
Azure Data Explorer (Kusto)
SQL Server
Azure SQL Database Managed Instance
Azure SQL Database
Azure Synapse Analytics
Power BI
…with additional data sources coming soon:
Teradata
SAP ECC
SAP S/4 HANA
Hive Metastore
For more information, check out Supported data sources and file types and the FAQ.
12. Data Insights
Finally, Data Insights gives data source administrators, business users, data stewards, data officers, and security administrators a single-pane-of-glass view into the catalog. Azure Purview currently provides insights across Assets, Scans, Glossary Terms, Classifications, Sensitivity Labeling, and File Extensions. The metrics and data visualisations currently available are listed below.
Assets
Metric: Number of Sources
Metric: Number of Discovered Assets
Metric: Number of Classified Assets
Visualisation: Asset Count per Source Type (Filters: [Classification Category], [Classification])
Visualisation: Size Trend of File Type within Source Types (Filters: [Source Type], [File Type])
Visualisation: Files not associated with a Resource Set
Scans
Metric: Number of Scans
Metric: Number of Successful Scans
Metric: Number of Canceled Scans
Metric: Number of Failed Scans
Visualisation: Number of Scans by Status
Glossary
Metric: Number of Glossary Terms
Visualisation: Top Glossary Terms and Count of Assets
Visualisation: Glossary Terms by Term Status
Classification Insights
Metric: Number of Subscriptions
Metric: Number of Unique Classifications Found
Metric: Number of Sources Classified
Metric: Number of Files Classified
Metric: Number of Tables Classified
Visualisation: Top Sources with Classified Data
Visualisation: Top Classification Categories by Sources (Filters: [Classification], [Subscription], [Source Type])
Visualisation: Top Classification for Files (Filters: [Classification], [Subscription], [Source Type])
Visualisation: Top Classification for Tables (Filters: [Classification], [Subscription], [Source Type])
Visualisation: Classified Data (Filters: [Classification], [Subscription], [Source Type])
Sensitivity Labels
Metric: Number of Subscriptions
Metric: Number of Unique Labels Found
Metric: Number of Sources Labeled
Metric: Number of Files Labeled
Metric: Number of Tables Labeled
Visualisation: Top Sources with Labeled Data
Visualisation: Top Labels Applied Across Sources
Visualisation: Top Labels Applied on Files
Visualisation: Top Labels Applied on Tables
Visualisation: Labeling Activity
File Extension Insights
Metric: Number of Unique File Extensions Found
Visualisation: Top File Extensions (Filters: [File Extension], [Sources], [Content Scanning])
13. Resources
Overview
Video
Microsoft Mechanics: Azure Purview - Mike Flasko, Partner Director of Product, Azure Data
Shape Your Future with Azure Data and Analytics - Satya Nadella, Amy Hood, Judson Althoff, Julia White, Rohan Kumar
Microsoft Technical Community
Map your data estate with Azure Purview - Vishal Anil, Product Manager, Azure Data
Classify your data using Azure Purview - Kavya Chandra, Senior Program Manager, Azure Data
Break free of operational silos with a consistent business glossary - Naga Yenamandra, Product Manager, Azure Data
Track the lineage of your organization’s data with Azure Purview - Chandru Sugunan, Senior Program Manager, Azure Data
Enable effortless discovery of data by business and technical data consumers - Chandru Sugunan, Senior Program Manager
Get a bird’s eye view of your data estate with Azure Purview - Hilary Pike, Principal Program Manager, Azure Data
Get the most out of Azure Purview Data Insights - Sunetra Virdi, Product Manager, Azure Data
Microsoft Information Protection and Azure Purview: Better Together - Sanjay Kidambi, Product Marketing Lead for M365
14. Site Map
The table below summarises the pages across the Azure Purview web application. In the URI column, <home> stands for the Home URI in the first row.
Parent | Title | URI |
---|---|---|
Home | Home | https://web.purview.azure.com/resource/<resource-name> |
Home | Browse assets | <home>/main/catalog/browseassettypes |
Home | Search assets | <home>/main/catalog/search |
Sources | Sources | <home>/main/datasource/registeredSources |
Glossary | Glossary | <home>/main/catalog/glossary |
Insights | Assets | <home>/main/catalog/insights/catalogAnalytics |
Insights | Scans | <home>/main/catalog/insights/catalogScanAnalytics |
Insights | Glossary | <home>/main/catalog/insights/catalogGlossaryAnalytics |
Insights | Classification | <home>/main/sensitivity/classificationSummary |
Insights | Sensitivity labels | <home>/main/sensitivity/labelSummary |
Insights | File extensions | <home>/main/sensitivity/fileExtension |
Management Center | General > Account information | <home>/main/catalog/management/accountInformation |
Management Center | General > Data sources | <home>/main/datasource/management/dataSource/dataSourceRedirect |
Management Center | General > Scan rule sets | <home>/main/datasource/management/scanners/scannersList |
Management Center | General > Integration runtimes | <home>/main/datasource/integrationRuntimes |
Management Center | General > Metrics | <home>/main/catalog/management/metrics |
Management Center | Metadata management > Classifications | <home>/main/catalog/management/classifications |
Management Center | Metadata management > Classification rules | <home>/main/datasource/management/classificationRules |
Management Center | Security and access > Access control | <home>/main/catalog/management/admins |
Management Center | Security and access > Credentials | <home>/main/datasource/management/credentials |
Knowledge Center | Knowledge Center | <home>/main/catalog/knowledgecenter |
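The `<home>` shorthand in the table can be expanded mechanically. The sketch below builds a full page URL from a resource name and one of the paths above; the helper name and the example resource name `contoso-purview` are hypothetical.

```python
from urllib.parse import quote

BASE = "https://web.purview.azure.com/resource/"


def page_url(resource_name, path=""):
    """Expand '<home>' + path into a full Azure Purview web application URL."""
    home = BASE + quote(resource_name, safe="")
    return home + path


print(page_url("contoso-purview"))
print(page_url("contoso-purview", "/main/catalog/glossary"))
```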