DP-100: Designing and Implementing a Data Science Solution on Azure - Exam Prep

A collection of resources and learning material to help prepare for and pass Exam DP-100: Designing and Implementing a Data Science Solution on Azure. Passing this exam earns the Microsoft Certified: Azure Data Scientist Associate certification. The exam focuses on implementing and running machine learning workloads on Azure, in particular using the Azure Machine Learning service.

Suggested Approach

  1. Skills Measured - To aid study, a copy of the skills measured, sourced from the official exam page, is included below with key phrases highlighted. Note: always refer to the latest skills outline available from the official exam page, as the content changes from time to time.

  2. Microsoft Learn - A collection of free, on-demand content aligned to the Azure Data Scientist role. Working through this content establishes a solid foundation to build on.

  3. Study Notes - Finally, refer to your own notes or those compiled below. After completing your initial round of learning, spend additional time on any areas where you do not yet feel confident before taking the exam.

| Resource | Link |
| --- | --- |
| Certification | Microsoft Certified: Azure Data Scientist Associate |
| Exam | Exam DP-100: Designing and Implementing a Data Science Solution on Azure |
| Microsoft Learn | Azure Data Scientist Learning Paths |
| Skills Outline | DP-100 Exam Skills Outline |

Suggested Learning Paths

Skills Measured

1. Set up an Azure Machine Learning workspace | 30-35%

Create an Azure Machine Learning workspace

  • create an Azure Machine Learning workspace [1]
  • configure workspace settings [1]
  • manage a workspace by using Azure Machine Learning Studio [1]

Manage data objects in an Azure Machine Learning workspace

  • register and maintain data stores [1]
  • create and manage datasets [1]

Manage experiment compute contexts

  • create a compute instance [1]
  • determine appropriate compute specifications for a training workload [1]
  • create compute targets for experiments and training [1]

2. Run experiments and train models | 25-30%

Create models by using Azure Machine Learning Designer

  • create a training pipeline by using Azure Machine Learning designer [1]
  • ingest data in a designer pipeline [1]
  • use designer modules to define a pipeline data flow [1]
  • use custom code modules in designer [1]

Run training scripts in an Azure Machine Learning workspace

  • create and run an experiment by using the Azure Machine Learning SDK [1]
  • consume data from a data store in an experiment by using the Azure Machine Learning SDK [1]
  • consume data from a dataset in an experiment by using the Azure Machine Learning SDK [1]
  • choose an estimator for a training experiment [1]

Generate metrics from an experiment run

  • log metrics from an experiment run [1]
  • retrieve and view experiment outputs [1]
  • use logs to troubleshoot experiment run errors [1]

Automate the model training process

  • create a pipeline by using the SDK [1] [2]
  • pass data between steps in a pipeline [1] [2]
  • run a pipeline [1]
  • monitor pipeline runs [1]

3. Optimize and manage models | 20-25%

Use Automated ML to create optimal models

  • use the Automated ML interface in Azure Machine Learning studio [1]
  • use Automated ML from the Azure Machine Learning SDK [1]
  • select scaling functions and pre-processing options [1] [2]
  • determine algorithms to be searched
  • define a primary metric [1]
  • get data for an Automated ML run [1]
  • retrieve the best model [1]

Use Hyperdrive to tune hyperparameters

  • select a sampling method [1] [2]
  • define the search space [1] [2]
  • define the primary metric [1]
  • define early termination options [1] [2]
  • find the model that has optimal hyperparameter values [1]

Use model explainers to interpret models

  • select a model interpreter [1] [2]
  • generate feature importance data [1]

Manage models

  • register a trained model [1] [2]
  • monitor model history [1] [2]
  • monitor data drift [1] [2]

4. Deploy and consume models | 20-25%

Create production compute targets

  • consider security for deployed services [1] [2]
  • evaluate compute options for deployment [1]

Deploy a model as a service

  • configure deployment settings [1]
  • consume a deployed service [1]
  • troubleshoot deployment container issues [1]

Create a pipeline for batch inferencing

  • publish a batch inferencing pipeline [1] [2]
  • run a batch inferencing pipeline and obtain outputs [1]

Publish a designer pipeline as a web service

  • create a target compute resource [1]
  • configure an Inference pipeline [1]
  • consume a deployed endpoint [1]

Study Notes

1. Set up an Azure Machine Learning workspace | 30-35%

Create a Workspace

from azureml.core import Workspace
ws = Workspace.create(
    name='aml-workspace',
    subscription_id='123456-abc-123...',
    resource_group='aml-resources',
    create_resource_group=True,
    location='eastus',
    sku='enterprise'
)

Workspace configuration file (config.json)

{
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>"
}

Connect to Workspace using a Configuration File
By default, the from_config method looks for a file named config.json in the folder containing the Python code file, but you can specify another path if necessary.

from azureml.core import Workspace

ws = Workspace.from_config()

Get a Workspace by Name

from azureml.core import Workspace

ws = Workspace.get(
    name='aml-workspace',
    subscription_id='1234567-abcde-890-fgh...',
    resource_group='aml-resources')

Register a Datastore

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
    datastore_name='blob_data',
    container_name='data_container',
    account_name='az_store_acct',
    account_key='123456abcde789…')

Create and Register a Tabular Dataset

from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
tab_ds = tab_ds.register(workspace=ws, name='csv_table')

Create and Register a File Dataset

from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
file_ds = file_ds.register(workspace=ws, name='img_files')

Create a Compute Instance

from azureml.core.compute import ComputeTarget, ComputeInstance
from azureml.core.compute_target import ComputeTargetException

compute_name = "compute-instance"

try:
    instance = ComputeInstance(workspace=ws, name=compute_name)
except ComputeTargetException:
    compute_config = ComputeInstance.provisioning_configuration(
        vm_size='STANDARD_D3_V2',
        ssh_public_access=False)
    instance = ComputeInstance.create(ws, compute_name, compute_config)
    instance.wait_for_completion(show_output=True)

Create a Training Cluster (AmlCompute)

from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute

# Load the workspace from the saved config file
ws = Workspace.from_config()

# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'

# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(
    vm_size='STANDARD_DS12_V2',
    min_nodes=0,
    max_nodes=4,
    vm_priority='dedicated')

# Create the compute
aml_cluster = ComputeTarget.create(ws, compute_name, compute_config)
aml_cluster.wait_for_completion(show_output=True)

2. Run experiments and train models | 25-30%

Create an Experiment

from azureml.core import Experiment
experiment = Experiment(workspace=ws, name="my-experiment")

Get a Datastore

from azureml.core import Datastore

my_datastore = Datastore.get(ws, 'my_datastore')

Get a Dataset and Load It into a DataFrame

from azureml.core import Dataset
dataset_name = 'my-dataset'

# Get a dataset by name
ds = Dataset.get_by_name(workspace=ws, name=dataset_name)

# Load dataset into pandas DataFrame
df = ds.to_pandas_dataframe()

Generic Estimator
This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn. An estimator encapsulates a run configuration and a script run configuration in a single object. Running the estimator produces a model in the output directory specified in your training script.

from azureml.train.estimator import Estimator
from azureml.core import Experiment

# Create an estimator
estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      compute_target='local',
                      conda_packages=['scikit-learn']
                      )

# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator)

SKLearn Estimator

from azureml.train.sklearn import SKLearn
from azureml.core import Experiment

script_params = {
    '--kernel': 'linear',
    '--penalty': 1.0
}

estimator = SKLearn(
    source_directory=project_folder,
    script_params=script_params,
    compute_target=compute_target,
    entry_script='train_iris.py',
    pip_packages=['joblib==0.13.2']
)

# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator)
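
To consume a registered dataset in a training experiment, the dataset can be passed to the estimator as a named input and read from the run context inside the training script. A minimal sketch, reusing the SKLearn estimator above and assuming the csv_table dataset registered earlier:

from azureml.core import Dataset

tab_ds = Dataset.get_by_name(ws, 'csv_table')

estimator = SKLearn(source_directory=project_folder,
                    entry_script='train_iris.py',
                    compute_target=compute_target,
                    inputs=[tab_ds.as_named_input('training_data')])

# Inside the training script, read the dataset from the run context:
# run = Run.get_context()
# df = run.input_datasets['training_data'].to_pandas_dataframe()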

Logging Metrics

| Type | Function | Example | Note |
| --- | --- | --- | --- |
| Scalar values | run.log(name, value, description='') | run.log("accuracy", 0.95) | Log a numerical or string value. |
| Lists | run.log_list(name, value, description='') | run.log_list("accuracies", [0.6, 0.7, 0.87]) | Log a list of values. |
| Row | run.log_row(name, description=None, **kwargs) | run.log_row("Y over X", x=1, y=0.4) | Creates a metric with multiple columns as described in kwargs. |
| Table | run.log_table(name, value, description='') | run.log_table("Y over X", {"x":[1, 2, 3], "y":[0.6, 0.7, 0.89]}) | Log a dictionary object. |
| Images | run.log_image(name, path=None, plot=None) | run.log_image("ROC", plot=plt) | Log an image. |
| Tag a run | run.tag(key, value=None) | run.tag("selected", "yes") | Tag the run with a string key and optional string value. |
| Upload file or directory | run.upload_file(name, path_or_stream) | run.upload_file("best_model.pkl", "./model.pkl") | Upload a file. |

Logging Options

| Option | Description |
| --- | --- |
| Run.start_logging | Add logging functions to your training script and start an interactive logging session in the specified experiment. start_logging creates an interactive run for use in scenarios such as notebooks. Any metrics logged during the session are added to the run record in the experiment. |
| ScriptRunConfig | Add logging functions to your training script and load the entire script folder with the run. ScriptRunConfig is a class for setting up configurations for script runs. With this option, you can add monitoring code to be notified of completion or to get a visual widget to monitor. |
| Designer logging | Add logging functions to a drag-and-drop designer pipeline by using the Execute Python Script module. Add Python code to log designer experiments. |

Interactive Logging with start_logging

# Get an experiment object from Azure Machine Learning
experiment = Experiment(workspace=ws, name="train-within-notebook")

# Create a run object in the experiment
run =  experiment.start_logging()

# Log the algorithm parameter alpha to the run
run.log('alpha', 0.03)
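
The ScriptRunConfig option from the table above can be sketched as follows. This is a minimal example and assumes a folder named experiment_folder containing a train.py script that calls Run.get_context() and logs its own metrics; with no compute target specified, the script runs locally by default.

from azureml.core import Experiment, ScriptRunConfig

# Configure the script run
script_config = ScriptRunConfig(source_directory='experiment_folder',
                                script='train.py')

# Submit the script as an experiment run
experiment = Experiment(workspace=ws, name='script-logging-experiment')
run = experiment.submit(config=script_config)
run.wait_for_completion(show_output=True)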

Create and Run a Pipeline

from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment

# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'scripts',
                         script_name = 'data_prep.py',
                         compute_target = 'aml-cluster',
                         runconfig = run_config)

# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster')

# Construct the pipeline
train_pipeline = Pipeline(workspace = ws, steps = [step1,step2])

# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = 'training-pipeline')
pipeline_run = experiment.submit(train_pipeline)

Pass Data Between Pipeline Steps

from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep

# Get a dataset for the initial data
raw_ds = Dataset.get_by_name(ws, 'raw_dataset')

# Define a PipelineData object to pass data between steps
data_store = ws.get_default_datastore()
prepped_data = PipelineData('prepped',  datastore=data_store)

# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'scripts',
                         script_name = 'data_prep.py',
                         compute_target = 'aml-cluster',
                         runconfig = run_config,
                         # Specify dataset as initial input
                         inputs=[raw_ds.as_named_input('raw_data')],
                         # Specify PipelineData as output
                         outputs=[prepped_data],
                         # Also pass as data reference to script
                         arguments = ['--folder', prepped_data])

# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster',
                      # Specify PipelineData as input
                      inputs=[prepped_data],
                      # Pass as data reference to estimator script
                      estimator_entry_script_arguments=['--folder', prepped_data])
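
To automate the training process further, the pipeline can be published and run on a schedule. A minimal sketch, assuming the train_pipeline object built above; the schedule name and weekly frequency are illustrative.

from azureml.pipeline.core import ScheduleRecurrence, Schedule

# Publish the pipeline to get a REST endpoint
published_pipeline = train_pipeline.publish(name='training-pipeline',
                                            description='Model training pipeline',
                                            version='1.0')

# Run the published pipeline every week
recurrence = ScheduleRecurrence(frequency='Week', interval=1)
schedule = Schedule.create(ws, name='weekly-training-schedule',
                           description='Weekly retraining',
                           pipeline_id=published_pipeline.id,
                           experiment_name='training-pipeline',
                           recurrence=recurrence)

Individual pipeline runs can be monitored with pipeline_run.wait_for_completion(show_output=True) or from the Pipelines page in Azure Machine Learning studio.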

3. Optimize and manage models | 20-25%

Configure and Submit an Automated ML Run

from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment

automl_config=AutoMLConfig(
    task='classification',
    primary_metric='AUC_weighted',
    experiment_timeout_minutes=30,
    blacklist_models=['XGBoostClassifier'],
    training_data=train_data,
    label_column_name=label,
    n_cross_validations=2)

ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-classification'
project_folder = './sample_projects/automl-classification'

experiment = Experiment(ws, experiment_name)
run = experiment.submit(automl_config, show_output=True)

AutoMLConfig
Use the whitelist_models and blacklist_models parameters of the AutoMLConfig class to include or exclude models.
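
For example, a hedged sketch that restricts the search to a couple of algorithms (parameter names follow the SDK version current at the time of writing):

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             whitelist_models=['LogisticRegression', 'RandomForest'],
                             training_data=train_data,
                             label_column_name=label,
                             n_cross_validations=2)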

Retrieve the Best Model

best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Hyperparameter Sampling Methods

| Method | Description |
| --- | --- |
| Grid | Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space. |
| Random | Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values. |
| Bayesian | Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance over the previous selection. |

Grid Sampling

from azureml.train.hyperdrive import GridParameterSampling, choice

param_space = {
                 '--batch_size': choice(16, 32, 64),
                 '--learning_rate': choice(0.01, 0.1, 1.0)
              }

param_sampling = GridParameterSampling(param_space)

Random Sampling

from azureml.train.hyperdrive import RandomParameterSampling, choice, normal

param_space = {
                 '--batch_size': choice(16, 32, 64),
                 '--learning_rate': normal(10, 3)
              }

param_sampling = RandomParameterSampling(param_space)

Bayesian Sampling

from azureml.train.hyperdrive import BayesianParameterSampling, choice, uniform

param_space = {
                 '--batch_size': choice(16, 32, 64),
                 '--learning_rate': uniform(0.05, 0.1)
              }

param_sampling = BayesianParameterSampling(param_space)

Configure a Hyperdrive Run

from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                          hyperparameter_sampling=param_sampling,
                          policy=early_termination_policy,
                          primary_metric_name="accuracy",
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                          max_total_runs=100,
                          max_concurrent_runs=4)
# resume_from and resume_child_runs can also be supplied to warm-start from previous runs

Early Termination Policies

| Policy | Description |
| --- | --- |
| Bandit | You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin. |
| Median Stopping | A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs. |
| Truncation Selection | A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval, based on the truncation_percentage value you specify for X. |

Bandit Policy

from azureml.train.hyperdrive import BanditPolicy

early_termination_policy = BanditPolicy(slack_amount = 0.2,
                                        evaluation_interval=1,
                                        delay_evaluation=5)

Median Stopping Policy

from azureml.train.hyperdrive import MedianStoppingPolicy

early_termination_policy = MedianStoppingPolicy(evaluation_interval=1,
                                                delay_evaluation=5)

Truncation Selection Policy

from azureml.train.hyperdrive import TruncationSelectionPolicy

early_termination_policy = TruncationSelectionPolicy(truncation_percentage=10,
                                                     evaluation_interval=1,
                                                     delay_evaluation=5)

Retrieve the Best Run and Hyperparameter Values

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']

print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['accuracy'])
print('\n learning rate:',parameter_values[3])
print('\n keep probability:',parameter_values[5])
print('\n batch size:',parameter_values[7])

Model Explainers

| Explainer | Description |
| --- | --- |
| MimicExplainer | An explainer that creates a global surrogate model that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (for example, linear or tree-based). |
| TabularExplainer | An explainer that acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture. |
| PFIExplainer | A Permutation Feature Importance explainer that analyzes feature importance by shuffling feature values and measuring the impact on prediction performance. |

MimicExplainer

from interpret.ext.blackbox import MimicExplainer
from interpret.ext.glassbox import DecisionTreeExplainableModel

mim_explainer = MimicExplainer(model=loan_model,
                             initialization_examples=X_test,
                             explainable_model = DecisionTreeExplainableModel,
                             features=['loan_amount','income','age','marital_status'], 
                             classes=['reject', 'approve'])

TabularExplainer

from interpret.ext.blackbox import TabularExplainer

tab_explainer = TabularExplainer(model=loan_model,
                             initialization_examples=X_test,
                             features=['loan_amount','income','age','marital_status'],
                             classes=['reject', 'approve'])

PFIExplainer

from interpret.ext.blackbox import PFIExplainer

pfi_explainer = PFIExplainer(model = loan_model,
                             features=['loan_amount','income','age','marital_status'],
                             classes=['reject', 'approve'])
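
To generate feature importance data, call the explainer's explain_global (or explain_local) method and, inside a training run, upload the resulting explanation so it can be reviewed later. A minimal sketch using the TabularExplainer above; X_train is assumed training data and run is the active run object.

from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient

# Generate global feature importance from the training data
explanation = tab_explainer.explain_global(X_train)

# Upload the explanation to the current run
client = ExplanationClient.from_run(run)
client.upload_model_explanation(explanation, comment='Tabular explanation')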

Retrieve Explanations from a Run

from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient

client = ExplanationClient.from_run(run)

# get model explanation data
explanation = client.download_model_explanation()
# or only get the top k (e.g., 4) most important features with their importance values
explanation = client.download_model_explanation(top_k=4)

global_importance_values = explanation.get_ranked_global_values()
global_importance_names = explanation.get_ranked_global_names()
print('global importance values: {}'.format(global_importance_values))
print('global importance names: {}'.format(global_importance_names))

Register a Model

from azureml.core import Model

classification_model = Model.register(workspace=ws,
                       model_name='classification_model',
                       model_path='model.pkl', # local path
                       description='A classification model')
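
To review model history, registered models and their versions can be listed from the workspace (a minimal sketch):

from azureml.core import Model

for model in Model.list(ws):
    print(model.name, 'version:', model.version)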

Monitor Data Drift

from azureml.datadrift import DataDriftDetector

monitor = DataDriftDetector.create_from_datasets(
    workspace=ws,
    name='dataset-drift-detector',
    baseline_dataset=train_ds,
    target_dataset=new_data_ds,
    compute_target='aml-cluster',
    frequency='Week',
    feature_list=['age','height', 'bmi'],
    latency=24)

4. Deploy and consume models | 20-25%

Configure TLS/SSL for a Deployed Service

from azureml.core.webservice import AciWebservice

aci_config = AciWebservice.deploy_configuration(
    ssl_enabled=True, ssl_cert_pem_file="cert.pem", ssl_key_pem_file="key.pem", ssl_cname="www.contoso.com")

| Compute Target | Used For | GPU Support | FPGA Support |
| --- | --- | --- | --- |
| Local | Testing | | |
| Compute Instance | Testing | | |
| Compute Cluster | Batch Inference | Y | |
| Azure Kubernetes Service (AKS) | Real-time Inference | Y | Y |

| Compute Target | Deployment Configuration Example |
| --- | --- |
| Local | deployment_config = LocalWebservice.deploy_configuration(port=8890) |
| Azure Container Instance (ACI) | deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1) |
| Azure Kubernetes Service (AKS) | deployment_config = AksWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1) |
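
A minimal deployment sketch that ties these settings together, assuming a scoring script score.py in a service_files folder, an environment object env, and the classification_model registered earlier:

from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

inference_config = InferenceConfig(source_directory='service_files',
                                   entry_script='score.py',
                                   environment=env)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(workspace=ws,
                       name='classification-service',
                       models=[classification_model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)

# If the deployment fails, the container logs are the first place to look
print(service.get_logs())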

Consume a Deployed Service

import requests
import json

headers = {'Content-Type': 'application/json'}

if service.auth_enabled:
    headers['Authorization'] = 'Bearer '+service.get_keys()[0]
elif service.token_auth_enabled:
    headers['Authorization'] = 'Bearer '+service.get_token()[0]

print(headers)

test_sample = json.dumps({'data': [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
]})

response = requests.post(service.scoring_uri, data=test_sample, headers=headers)
print(response.status_code)
print(response.elapsed)
print(response.json())
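
Before publishing, a batch inferencing pipeline is typically built around a ParallelRunStep. A hedged sketch, assuming a registered dataset batch_data_set, an environment batch_env, and a scoring script batch_scoring.py (names are illustrative):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Output location for the combined predictions
output_dir = PipelineData(name='inferences', datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory='batch_scripts',
    entry_script='batch_scoring.py',
    mini_batch_size='5',
    error_threshold=10,
    output_action='append_row',
    environment=batch_env,
    compute_target='aml-cluster',
    node_count=2)

batch_step = ParallelRunStep(
    name='batch-score',
    parallel_run_config=parallel_run_config,
    inputs=[batch_data_set.as_named_input('batch_data')],
    output=output_dir,
    allow_reuse=True)

pipeline = Pipeline(workspace=ws, steps=[batch_step])
pipeline_run = Experiment(ws, 'Batch_Prediction').submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)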

Publish and Run a Batch Inferencing Pipeline

published_pipeline = pipeline_run.publish_pipeline(name='Batch_Prediction_Pipeline', description='Batch pipeline', version='1.0')
rest_endpoint = published_pipeline.endpoint

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Get an authentication header for the REST request
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

response = requests.post(rest_endpoint, headers=auth_header, json={"ExperimentName": "Batch_Prediction"})
run_id = response.json()["Id"]
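
To obtain the outputs once the published pipeline run finishes, the scoring step's output can be downloaded. A minimal sketch, assuming the 'inferences' output name used in the ParallelRunStep sketch above:

from azureml.pipeline.core.run import PipelineRun

# Wait for the run started through the REST endpoint
published_pipeline_run = PipelineRun(ws.experiments['Batch_Prediction'], run_id)
published_pipeline_run.wait_for_completion(show_output=True)

# Download the combined predictions produced by the batch scoring step
prediction_run = next(published_pipeline_run.get_children())
prediction_output = prediction_run.get_output_data('inferences')
prediction_output.download(local_path='results')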

Create an AKS Inference Cluster

from azureml.core.compute import AksCompute, ComputeTarget

# Use the default configuration (you can also provide parameters to customize this).
# For example, to create a dev/test cluster, use:
# prov_config = AksCompute.provisioning_configuration(cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
prov_config = AksCompute.provisioning_configuration()

aks_name = 'myaks'
# Create the cluster
aks_target = ComputeTarget.create(workspace = ws,
                                    name = aks_name,
                                    provisioning_configuration = prov_config)

# Wait for the create process to complete
aks_target.wait_for_completion(show_output = True)

Configure a Real-Time Inference Pipeline

When you select Create inference pipeline, several things happen:

  • The trained model is stored as a Dataset module in the module palette. You can find it under My Datasets.

  • Training modules like Train Model and Split Data are removed.

  • The saved trained model is added back into the pipeline.

  • Web Service Input and Web Service Output modules are added. These modules show where user data enters the pipeline and where data is returned.