DP-100: Designing and Implementing a Data Science Solution on Azure - Exam Prep
A collection of resources and learning material to help prepare for and pass Exam DP-100: Designing and Implementing a Data Science Solution on Azure. Passing this exam earns the Microsoft Certified: Azure Data Scientist Associate certification. The exam focuses on how to implement and run machine learning workloads on Azure, in particular using the Azure Machine Learning service.
Suggested Approach
Skills Measured - To aid study, a copy of the skills measured, sourced from the official exam page, is included below with key phrases highlighted. Note: always refer to the latest skills outline available from the official exam page, as the content changes from time to time.
Microsoft Learn - A collection of free, on-demand content aligned to the Azure Data Scientist role. Working through this content will establish a solid foundation to build upon.
Study Notes - Finally, refer either to your own set of notes or to those compiled below. After your initial round of learning, spend additional time on any areas where you do not yet feel confident before taking the exam.
Resources
Resource | Link |
---|---|
Certification | Microsoft Certified: Azure Data Scientist Associate |
Exam | Exam DP-100: Designing and Implementing a Data Science Solution on Azure |
Microsoft Learn | Azure Data Scientist Learning Paths |
Skills Outline | DP-100 Exam Skills Outline |
Suggested Learning Paths
Skills Measured
Create models by using Azure Machine Learning Designer
- create a training pipeline by using Azure Machine Learning designer [1]
- ingest data in a designer pipeline [1]
- use designer modules to define a pipeline data flow [1]
- use custom code modules in designer [1]
Run training scripts in an Azure Machine Learning workspace
- create and run an experiment by using the Azure Machine Learning SDK [1]
- consume data from a data store in an experiment by using the Azure Machine Learning SDK [1]
- consume data from a dataset in an experiment by using the Azure Machine Learning SDK [1]
- choose an estimator for a training experiment [1]
Generate metrics from an experiment run
- log metrics from an experiment run [1]
- retrieve and view experiment outputs [1]
- use logs to troubleshoot experiment run errors [1]
Automate the model training process
Use Automated ML to create optimal models
- use the Automated ML interface in Azure Machine Learning studio [1]
- use Automated ML from the Azure Machine Learning SDK [1]
- select scaling functions and pre-processing options [1] [2]
- determine algorithms to be searched
- define a primary metric [1]
- get data for an Automated ML run [1]
- retrieve the best model [1]
Use Hyperdrive to tune hyperparameters
- select a sampling method [1] [2]
- define the search space [1] [2]
- define the primary metric [1]
- define early termination options [1] [2]
- find the model that has optimal hyperparameter values [1]
Use model explainers to interpret models
Manage models
Create production compute targets
Deploy a model as a service
- configure deployment settings [1]
- consume a deployed service [1]
- troubleshoot deployment container issues [1]
Create a pipeline for batch inferencing
- publish a batch inferencing pipeline [1] [2]
- run a batch inferencing pipeline and obtain outputs [1]
Publish a designer pipeline as a web service
Study Notes
# Create a new Azure Machine Learning workspace
from azureml.core import Workspace
ws = Workspace.create(
name='aml-workspace',
subscription_id='123456-abc-123...',
resource_group='aml-resources',
create_resource_group=True,
location='eastus',
sku='enterprise'
)
Workspace Configuration File (config.json)
{
"subscription_id": "<subscription-id>",
"resource_group": "<resource-group>",
"workspace_name": "<workspace-name>"
}
Connect to Workspace using a Configuration File
By default, the from_config method looks for a file named config.json in the folder containing the Python code file, but you can specify another path if necessary.
from azureml.core import Workspace
ws = Workspace.from_config()
# Connect to an existing workspace by name
from azureml.core import Workspace
ws = Workspace.get(
name='aml-workspace',
subscription_id='1234567-abcde-890-fgh...',
resource_group='aml-resources')
from azureml.core import Workspace, Datastore
ws = Workspace.from_config()
# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
datastore_name='blob_data',
container_name='data_container',
account_name='az_store_acct',
account_key='123456abcde789…')
# Create and register a tabular dataset from delimited files in a datastore
from azureml.core import Dataset
blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
(blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
tab_ds = tab_ds.register(workspace=ws, name='csv_table')
# Create and register a file dataset
from azureml.core import Dataset
blob_ds = ws.get_default_datastore()
file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
file_ds = file_ds.register(workspace=ws, name='img_files')
# Create a compute instance if one does not already exist
from azureml.core.compute import ComputeTarget, ComputeInstance
from azureml.core.compute_target import ComputeTargetException
compute_name = "compute-instance"
try:
instance = ComputeInstance(workspace=ws, name=compute_name)
except ComputeTargetException:
compute_config = ComputeInstance.provisioning_configuration(
vm_size='STANDARD_D3_V2',
ssh_public_access=False)
instance = ComputeInstance.create(ws, compute_name, compute_config)
instance.wait_for_completion(show_output=True)
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
# Load the workspace from the saved config file
ws = Workspace.from_config()
# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'
# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(
vm_size='STANDARD_DS12_V2',
min_nodes=0,
max_nodes=4,
vm_priority='dedicated')
# Create the compute
aml_cluster = ComputeTarget.create(ws, compute_name, compute_config)
aml_cluster.wait_for_completion(show_output=True)
# Create an experiment
from azureml.core import Experiment
experiment = Experiment(workspace=ws, name="my-experiment")
# Get a registered datastore by name
from azureml.core import Datastore
my_datastore = Datastore.get(ws, 'my_datastore')
from azureml.core import Dataset
dataset_name = 'my-dataset'
# Get a dataset by name
ds = Dataset.get_by_name(workspace=ws, name=dataset_name)
# Load dataset into pandas DataFrame
df = ds.to_pandas_dataframe()
Generic Estimator
This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn. An estimator encapsulates a run configuration and a script run configuration in a single object. Running the estimator produces a model in the output directory specified in your training script.
from azureml.train.estimator import Estimator
from azureml.core import Experiment
# Create an estimator
estimator = Estimator(source_directory='experiment_folder',
entry_script='training_script.py',
compute_target='local',
conda_packages=['scikit-learn']
)
# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator)
from azureml.train.sklearn import SKLearn
from azureml.core import Experiment
script_params = {
'--kernel': 'linear',
'--penalty': 1.0
}
estimator = SKLearn(
source_directory=project_folder,
script_params=script_params,
compute_target=compute_target,
entry_script='train_iris.py',
pip_packages=['joblib==0.13.2']
)
# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator)
Type | Function | Example | Note |
---|---|---|---|
Scalar values | run.log(name, value, description='') | run.log("accuracy", 0.95) | Log a numerical or string value. |
Lists | run.log_list(name, value, description='') | run.log_list("accuracies", [0.6, 0.7, 0.87]) | Log a list of values. |
Row | run.log_row(name, description=None, **kwargs) | run.log_row("Y over X", x=1, y=0.4) | Creates a metric with multiple columns as described in kwargs. |
Table | run.log_table(name, value, description='') | run.log_table("Y over X", {"x":[1, 2, 3], "y":[0.6, 0.7, 0.89]}) | Log a dictionary object. |
Images | run.log_image(name, path=None, plot=None) | run.log_image("ROC", plot=plt) | Log an image. |
Tag a run | run.tag(key, value=None) | run.tag("selected", "yes") | Tag the run with a string key and optional string value. |
Upload file or directory | run.upload_file(name, path_or_stream) | run.upload_file("best_model.pkl", "./model.pkl") | Upload a file. |
Option | Description |
---|---|
Run.start_logging | Add logging functions to your training script and start an interactive logging session in the specified experiment. start_logging creates an interactive run for use in scenarios such as notebooks. Any metrics that are logged during the session are added to the run record in the experiment. |
ScriptRunConfig | Add logging functions to your training script and load the entire script folder with the run. ScriptRunConfig is a class for setting up configurations for script runs. With this option, you can add monitoring code to be notified of completion or to get a visual widget to monitor. |
Designer logging | Add logging functions to a drag-&-drop designer pipeline by using the Execute Python Script module. Add Python code to log designer experiments. |
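For logging from within a submitted training script, the current run can be retrieved with Run.get_context() and metrics logged against it. A minimal sketch:
from azureml.core import Run
# Inside a training script: get the current run context and log a metric against it
run = Run.get_context()
run.log('accuracy', 0.95)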
# Get an experiment object from Azure Machine Learning
experiment = Experiment(workspace=ws, name="train-within-notebook")
# Create a run object in the experiment
run = experiment.start_logging()
# Log the algorithm parameter alpha to the run
run.log('alpha', 0.03)
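The ScriptRunConfig option above is not shown in code; a minimal sketch, assuming a training script named training_script.py in an experiment_folder directory (hypothetical names):
from azureml.core import Experiment, ScriptRunConfig
# Package a script folder and entry script as a run configuration
script_config = ScriptRunConfig(source_directory='experiment_folder',
                                script='training_script.py')
# Submit the configuration as an experiment run
experiment = Experiment(workspace=ws, name='training_experiment')
run = experiment.submit(config=script_config)
run.wait_for_completion(show_output=True)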
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
source_directory = 'scripts',
script_name = 'data_prep.py',
compute_target = 'aml-cluster',
runconfig = run_config)
# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
estimator = sk_estimator,
compute_target = 'aml-cluster')
# Construct the pipeline
train_pipeline = Pipeline(workspace = ws, steps = [step1,step2])
# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = 'training-pipeline')
pipeline_run = experiment.submit(train_pipeline)
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
# Get a dataset for the initial data
raw_ds = Dataset.get_by_name(ws, 'raw_dataset')
# Define a PipelineData object to pass data between steps
data_store = ws.get_default_datastore()
prepped_data = PipelineData('prepped', datastore=data_store)
# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
source_directory = 'scripts',
script_name = 'data_prep.py',
compute_target = 'aml-cluster',
runconfig = run_config,
# Specify dataset as initial input
inputs=[raw_ds.as_named_input('raw_data')],
# Specify PipelineData as output
outputs=[prepped_data],
# Also pass as data reference to script
arguments = ['--folder', prepped_data])
# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
estimator = sk_estimator,
compute_target = 'aml-cluster',
# Specify PipelineData as input
inputs=[prepped_data],
# Pass as data reference to estimator script
estimator_entry_script_arguments=['--folder', prepped_data])
from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment
automl_config=AutoMLConfig(
task='classification',
primary_metric='AUC_weighted',
experiment_timeout_minutes=30,
blacklist_models=['XGBoostClassifier'],
training_data=train_data,
label_column_name=label,
n_cross_validations=2)
ws = Workspace.from_config()
# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-classification'
project_folder = './sample_projects/automl-classification'
experiment = Experiment(ws, experiment_name)
run = experiment.submit(automl_config, show_output=True)
AutoMLConfig
Use the whitelist_models and blacklist_models parameters of the AutoMLConfig class to include or exclude models from the search.
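A minimal sketch restricting the search to specific algorithms (the model names used here are assumed valid for a classification task):
# Restrict the Automated ML search to specific algorithms via whitelist_models
automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             training_data=train_data,
                             label_column_name=label,
                             whitelist_models=['LogisticRegression', 'RandomForest'],
                             n_cross_validations=2)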
# Retrieve the best run and its fitted model
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)
Method | Description |
---|---|
Grid | Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space. |
Random | Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values. |
Bayesian | Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection. |
Grid Sampling
from azureml.train.hyperdrive import GridParameterSampling, choice
param_space = {
'--batch_size': choice(16, 32, 64),
'--learning_rate': choice(0.01, 0.1, 1.0)
}
param_sampling = GridParameterSampling(param_space)
Random Sampling
from azureml.train.hyperdrive import RandomParameterSampling, choice, normal
param_space = {
'--batch_size': choice(16, 32, 64),
'--learning_rate': normal(10, 3)
}
param_sampling = RandomParameterSampling(param_space)
Bayesian Sampling
from azureml.train.hyperdrive import BayesianParameterSampling, choice, uniform
param_space = {
'--batch_size': choice(16, 32, 64),
'--learning_rate': uniform(0.05, 0.1)
}
param_sampling = BayesianParameterSampling(param_space)
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal
hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                          hyperparameter_sampling=param_sampling,
                          policy=early_termination_policy,
                          # Optional warm-start parameters to resume from previous runs
                          resume_from=warmstart_parents_to_resume_from,
                          resume_child_runs=child_runs_to_resume,
                          primary_metric_name="accuracy",
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                          max_total_runs=100,
                          max_concurrent_runs=4)
Policy | Description |
---|---|
Bandit | You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin. |
Median Stopping | A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs. |
Truncation Selection | A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval based on the truncation_percentage value you specify for X. |
Bandit Policy
from azureml.train.hyperdrive import BanditPolicy
early_termination_policy = BanditPolicy(slack_amount = 0.2,
evaluation_interval=1,
delay_evaluation=5)
Median Stopping Policy
from azureml.train.hyperdrive import MedianStoppingPolicy
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1,
delay_evaluation=5)
Truncation Selection Policy
from azureml.train.hyperdrive import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(truncation_percentage=10,
evaluation_interval=1,
delay_evaluation=5)
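The HyperDriveConfig defined earlier is then submitted as an experiment run; a minimal sketch:
from azureml.core import Experiment
# Submit the HyperDrive configuration and wait for all child runs to finish
experiment = Experiment(workspace=ws, name='hyperdrive_experiment')
hyperdrive_run = experiment.submit(config=hyperdrive_run_config)
hyperdrive_run.wait_for_completion(show_output=True)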
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
# The run arguments alternate between parameter names and values
parameter_values = best_run.get_details()['runDefinition']['Arguments']
print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['accuracy'])
print('\n learning rate:',parameter_values[3])
print('\n keep probability:',parameter_values[5])
print('\n batch size:',parameter_values[7])
Explainer | Description |
---|---|
MimicExplainer | An explainer that creates a global surrogate model that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (for example, linear or tree-based). |
TabularExplainer | An explainer that acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture. |
PFIExplainer | A Permutation Feature Importance explainer that analyzes feature importance by shuffling feature values and measuring the impact on prediction performance. |
MimicExplainer
from interpret.ext.blackbox import MimicExplainer
from interpret.ext.glassbox import DecisionTreeExplainableModel
mim_explainer = MimicExplainer(model=loan_model,
initialization_examples=X_test,
explainable_model = DecisionTreeExplainableModel,
features=['loan_amount','income','age','marital_status'],
classes=['reject', 'approve'])
TabularExplainer
from interpret.ext.blackbox import TabularExplainer
tab_explainer = TabularExplainer(model=loan_model,
initialization_examples=X_test,
features=['loan_amount','income','age','marital_status'],
classes=['reject', 'approve'])
PFIExplainer
from interpret.ext.blackbox import PFIExplainer
pfi_explainer = PFIExplainer(model = loan_model,
features=['loan_amount','income','age','marital_status'],
classes=['reject', 'approve'])
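Once constructed, an explainer can produce global feature importance; a minimal sketch using the TabularExplainer above (X_test is assumed to hold the feature data):
# Generate a global explanation and inspect ranked feature importance
global_explanation = tab_explainer.explain_global(X_test)
global_importance = global_explanation.get_feature_importance_dict()
print(global_importance)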
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient
client = ExplanationClient.from_run(run)
# get model explanation data
explanation = client.download_model_explanation()
# or only get the top k (e.g., 4) most important features with their importance values
explanation = client.download_model_explanation(top_k=4)
global_importance_values = explanation.get_ranked_global_values()
global_importance_names = explanation.get_ranked_global_names()
print('global importance values: {}'.format(global_importance_values))
print('global importance names: {}'.format(global_importance_names))
from azureml.core import Model
classification_model = Model.register(workspace=ws,
model_name='classification_model',
model_path='model.pkl', # local path
description='A classification model')
from azureml.datadrift import DataDriftDetector
monitor = DataDriftDetector.create_from_datasets(
workspace=ws,
name='dataset-drift-detector',
baseline_data_set=train_ds,
target_data_set=new_data_ds,
compute_target='aml-cluster',
frequency='Week',
feature_list=['age','height', 'bmi'],
latency=24)
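A registered monitor can also be run on demand over a historical date range; a minimal sketch using the monitor created above:
import datetime as dt
# Backfill the drift monitor over the previous six weeks
backfill_run = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())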
# Configure an ACI deployment with TLS/SSL enabled
from azureml.core.webservice import AciWebservice
aci_config = AciWebservice.deploy_configuration(
ssl_enabled=True, ssl_cert_pem_file="cert.pem", ssl_key_pem_file="key.pem", ssl_cname="www.contoso.com")
Compute Target | Used For | GPU Support | FPGA Support |
---|---|---|---|
Local | Testing | | |
Compute Instance | Testing | | |
Compute Cluster | Batch Inference | Y | |
Azure Kubernetes Service (AKS) | Real-time Inference | Y | Y |
Compute Target | Deployment Configuration Example |
---|---|
Local | deployment_config = LocalWebservice.deploy_configuration(port=8890) |
Azure Container Instance (ACI) | deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1) |
Azure Kubernetes Service (AKS) | deployment_config = AksWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1) |
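Putting a deployment configuration together with an inference configuration, a minimal deployment sketch (score.py and env.yml are hypothetical names; the model name matches the registration example above):
from azureml.core import Environment, Model
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
# Define the execution environment and entry script used at inference time (assumed files)
env = Environment.from_conda_specification(name='deploy-env', file_path='env.yml')
inference_config = InferenceConfig(entry_script='score.py', environment=env)
# Choose the compute resources for the service
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
# Deploy the registered model as a real-time web service
model = ws.models['classification_model']
service = Model.deploy(workspace=ws,
                       name='classification-service',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)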
# Call the scoring endpoint of a deployed web service via REST
import requests
import json
headers = {'Content-Type': 'application/json'}
if service.auth_enabled:
headers['Authorization'] = 'Bearer '+service.get_keys()[0]
elif service.token_auth_enabled:
headers['Authorization'] = 'Bearer '+service.get_token()[0]
print(headers)
test_sample = json.dumps({'data': [
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
]})
response = requests.post(service.scoring_uri, data=test_sample, headers=headers)
print(response.status_code)
print(response.elapsed)
print(response.json())
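The same service can also be invoked through the SDK, which is convenient for testing, and its container logs retrieved for troubleshooting; a minimal sketch:
import json
# Call the deployed service through the SDK instead of raw REST
input_json = json.dumps({'data': [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]})
predictions = service.run(input_data=input_json)
print(predictions)
# Retrieve container logs to troubleshoot deployment issues
print(service.get_logs())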
# Publish a batch inferencing pipeline as a REST endpoint
published_pipeline = pipeline_run.publish_pipeline(name='Batch_Prediction_Pipeline', description='Batch pipeline', version='1.0')
rest_endpoint = published_pipeline.endpoint
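The auth_header used below must contain a valid Authorization token; a minimal sketch obtaining one through interactive login:
from azureml.core.authentication import InteractiveLoginAuthentication
# Obtain an Authorization header for calling the published pipeline endpoint
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()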
# Trigger the published pipeline via its REST endpoint
import requests
response = requests.post(rest_endpoint, headers=auth_header, json={"ExperimentName": "Batch_Prediction"})
run_id = response.json()["Id"]
from azureml.core.compute import AksCompute, ComputeTarget
# Use the default configuration (you can also provide parameters to customize this).
# For example, to create a dev/test cluster, use:
# prov_config = AksCompute.provisioning_configuration(cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
prov_config = AksCompute.provisioning_configuration()
aks_name = 'myaks'
# Create the cluster
aks_target = ComputeTarget.create(workspace = ws,
name = aks_name,
provisioning_configuration = prov_config)
# Wait for the create process to complete
aks_target.wait_for_completion(show_output = True)
Configure a Real-Time Inference Pipeline
When you select Create inference pipeline, several things happen:
- The trained model is stored as a Dataset module in the module palette. You can find it under My Datasets.
- Training modules such as Train Model and Split Data are removed.
- The saved trained model is added back into the pipeline.
- Web Service Input and Web Service Output modules are added. These modules show where user data enters the pipeline and where data is returned.