
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


Automated Machine Learning

**Classification of fraudulent credit card transactions with a local run**

Contents

  1. Introduction
  2. Setup
  3. Train
  4. Results
  5. Test
  6. Explanation
  7. Acknowledgements

Introduction

In this example, we use the associated credit card dataset to showcase how you can use AutoML for a simple classification problem. The goal is to predict whether a credit card transaction is fraudulent.

This notebook uses your local machine's compute to train the model.

If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the configuration notebook first if you haven't already to establish your connection to the AzureML Workspace.

In this notebook you will learn how to:

  1. Create an experiment using an existing workspace.
  2. Configure AutoML using AutoMLConfig.
  3. Train the model.
  4. Explore the results.
  5. Test the fitted model.
  6. Explore the model's explanations and feature importance in the Azure portal.
  7. Create an AKS cluster, deploy the AutoML scoring model and the explainer model as a web service to AKS, and consume the web service.

Setup

As part of the setup you have already created an Azure ML Workspace object. For Automated ML you will need to create an Experiment object, which is a named object in a Workspace used to run experiments.


This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

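A minimal setup sketch follows. It assumes the `azureml-sdk` (v1) package is installed and that a workspace `config.json` is present (for example, written by the configuration notebook); the experiment name is an arbitrary choice here.

```python
# Connect to the workspace and create an Experiment; also print the SDK
# version, since this sample may rely on features from recent releases.
import azureml.core
from azureml.core import Workspace, Experiment

print("Azure ML SDK version:", azureml.core.VERSION)

ws = Workspace.from_config()  # reads config.json from the current directory
experiment = Experiment(ws, "automl-classification-ccard-local")
print(experiment.name, ws.resource_group, ws.location)
```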

Load Data

Load the credit card dataset from a CSV file containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. Next, we'll split the data using random_split and extract the training data for the model.

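A sketch of the loading step, assuming the copy of the Kaggle dataset hosted for the Azure ML samples; treat the URL, split ratio, seed, and variable names as assumptions.

```python
# Load the credit card data as a TabularDataset and split off training data.
from azureml.core import Dataset

data_url = (
    "https://automlsamplenotebookdata.blob.core.windows.net/"
    "automl-sample-notebook-data/creditcard.csv"
)
dataset = Dataset.Tabular.from_delimited_files(data_url)

# random_split returns two TabularDatasets; keep 80% for training.
training_data, validation_data = dataset.random_split(percentage=0.8, seed=223)
label_column_name = "Class"
```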

Train

Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.

|Property|Description|
|---|---|
|task|classification or regression|
|primary_metric|The metric that you want to optimize. Classification supports the following primary metrics: accuracy, AUC_weighted, average_precision_score_weighted, norm_macro_recall, precision_score_weighted|
|enable_early_stopping|Stop the run if the metric score is not showing improvement.|
|n_cross_validations|Number of cross-validation splits.|
|training_data|Input dataset, containing both features and the label column.|
|label_column_name|The name of the label column.|

You can find more information about primary metrics here

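One way to wire up the settings from the table above; the dataset variable names follow the earlier load step, and the choice of primary metric and number of folds are assumptions.

```python
# Configure the AutoML experiment for classification.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="average_precision_score_weighted",
    enable_early_stopping=True,
    n_cross_validations=3,
    training_data=training_data,
    label_column_name=label_column_name,
    debug_log="automl_errors.log",
)
```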

Call the submit method on the experiment object and pass the run configuration. Depending on the data and the number of iterations this can run for a while. In this example, we specify show_output = True to print currently running iterations to the console.

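The submit call itself is one line; `experiment` and `automl_config` are the objects created in the cells above. On local compute this call blocks until training finishes.

```python
# Submit the run; show_output=True streams the iterations to the console.
local_run = experiment.submit(automl_config, show_output=True)
```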

Results

Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

Note: The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

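The widget comes from the azureml-widgets package and requires a Jupyter environment; `local_run` is the run object from the submit step.

```python
# Show the auto-updating monitoring widget for the submitted run.
from azureml.widgets import RunDetails

RunDetails(local_run).show()
```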

Analyze results

Retrieve the Best Model

Below we select the best pipeline from our iterations. The get_output method on the AutoML run returns the best run and the fitted model for the last invocation. Overloads on get_output allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration.

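A sketch of the retrieval step; `local_run` is the run from the submit step, and the commented-out overloads show the iteration- and metric-specific variants.

```python
# Retrieve the best run and its fitted scikit-learn pipeline.
best_run, fitted_model = local_run.get_output()

# Overloads: a specific iteration, or the best run for a specific metric.
# best_run, fitted_model = local_run.get_output(iteration=3)
# best_run, fitted_model = local_run.get_output(metric="accuracy")
```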

Print the properties of the model

The fitted_model is a Python object, and you can read its different properties.
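Since the fitted model behaves like a scikit-learn Pipeline, its steps can be inspected like any other. The snippet below demonstrates this on a stand-in pipeline, since the AutoML model itself requires a completed run.

```python
# Inspect a pipeline's steps the same way you would inspect fitted_model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

stand_in = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# named_steps maps each step name to its estimator object.
for name, step in stand_in.named_steps.items():
    print(name, "->", type(step).__name__)
```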

Test

Now that the model is trained, split the data in the same way it was split for training (the difference here is that the split is done locally), then run the test data through the trained model to get the predicted values.

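A sketch of the scoring step; the variable names (`validation_data`, `label_column_name`, `fitted_model`) follow the earlier cells and are assumptions here.

```python
# Materialize the held-out split locally and score it with the fitted model.
df_test = validation_data.to_pandas_dataframe()

y_test = df_test[label_column_name].values
X_test = df_test.drop(columns=[label_column_name])

y_pred = fitted_model.predict(X_test)
```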

Calculate metrics for the prediction

Now visualize the data on a scatter plot to show what our truth (actual) values are compared to the predicted values from the trained model that was returned.

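The actual-versus-predicted comparison boils down to standard scikit-learn metrics. Here is a self-contained sketch with made-up labels; in a real run, `y_test` and `y_pred` come from the previous cells.

```python
# Compare actual and predicted labels with standard classification metrics.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_test = np.array([0, 0, 1, 0, 1, 0, 0, 1])  # made-up ground truth
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 1])  # made-up predictions

print("accuracy:", accuracy_score(y_test, y_pred))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```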

Explanation

In this section, we will show how to compute model explanations and visualize them using the azureml-interpret package. We will also show how to deploy the AutoML model and the explainer model as an AKS web service.

Besides retrieving an existing model explanation for an AutoML model, you can also explain your AutoML model with different test data. The following steps will allow you to compute and visualize engineered feature importance based on your test data.

Run the explanation

Download the engineered feature importance from artifact store

You can use ExplanationClient to download the engineered feature explanations from the artifact store of the best_run. You can also use the Azure portal URL to view the dashboard visualization of the feature importance values of the engineered features.

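A sketch of the download, assuming azureml-interpret is installed and the run has explanations attached; `best_run` comes from the earlier retrieval step.

```python
# Download the engineered-feature explanations for the best run.
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
engineered_explanations = client.download_model_explanation(raw=False)
print(engineered_explanations.get_feature_importance_dict())
```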

Download the raw feature importance from artifact store

You can use ExplanationClient to download the raw feature explanations from the artifact store of the best_run. You can also use the Azure portal URL to view the dashboard visualization of the feature importance values of the raw features.

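The raw-feature variant uses the same client with `raw=True`; `best_run` is again an assumption carried over from the earlier cells.

```python
# Download explanations mapped back to the original (raw) input columns.
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
raw_explanations = client.download_model_explanation(raw=True)
print(raw_explanations.get_feature_importance_dict())
```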

Retrieve any other AutoML model from training

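Any child run's model can be retrieved through the same get_output overloads; iteration 1 below is an arbitrary example.

```python
# Pull the run and fitted model for a specific training iteration.
automl_run, automl_model = local_run.get_output(iteration=1)
```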

Set up the model explanations for AutoML models

The fitted_model can generate the following, which will be used for getting the engineered explanations using automl_setup_model_explanations:

  1. Featurized data from the train and test samples
  2. The engineered feature name lists
  3. The classes in your label column (in classification scenarios)

The automl_explainer_setup_obj contains all the structures from the list above.

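A sketch of the setup call; the helper lives in the AutoML runtime package, and the train/test variable names (`X_train`, `y_train`, `X_test`) are assumptions standing in for the earlier splits.

```python
# Build the explanation setup object from the fitted model and the data.
from azureml.train.automl.runtime.automl_explain_utilities import (
    automl_setup_model_explanations,
)

automl_explainer_setup_obj = automl_setup_model_explanations(
    fitted_model,
    X=X_train,
    X_test=X_test,
    y=y_train,
    task="classification",
)
```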

Initialize the Mimic Explainer for feature importance

For explaining the AutoML models, use the MimicWrapper from the azureml-interpret package. The MimicWrapper can be initialized with fields from automl_explainer_setup_obj, your workspace, and a surrogate model to explain the AutoML model (fitted_model here). The MimicWrapper also takes the automl_run object, where the engineered explanations will be uploaded.

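A sketch of the initialization with a LightGBM surrogate; the field names come from the setup object produced in the previous step, and `ws`/`automl_run` carry over from earlier cells.

```python
# Wrap the AutoML estimator with a surrogate explainer.
from azureml.interpret.mimic_wrapper import MimicWrapper
from interpret.ext.glassbox import LGBMExplainableModel

explainer = MimicWrapper(
    ws,
    automl_explainer_setup_obj.automl_estimator,
    LGBMExplainableModel,
    init_dataset=automl_explainer_setup_obj.X_transform,
    run=automl_run,
    features=automl_explainer_setup_obj.engineered_feature_names,
    feature_maps=[automl_explainer_setup_obj.feature_map],
    classes=automl_explainer_setup_obj.classes,
)
```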

Use Mimic Explainer for computing and visualizing engineered feature importance

The explain() method in MimicWrapper can be called with the transformed test samples to get the feature importance for the generated engineered features. You can also use the Azure portal URL to view the dashboard visualization of the feature importance values of the engineered features.

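Continuing from the cells above (`explainer` and the setup object are assumed), the engineered-feature call looks like this:

```python
# Compute and upload engineered-feature importances for the test samples.
engineered_explanations = explainer.explain(
    ["local", "global"],
    eval_dataset=automl_explainer_setup_obj.X_test_transform,
)
print(engineered_explanations.get_feature_importance_dict())
```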

Use Mimic Explainer for computing and visualizing raw feature importance

The explain() method in MimicWrapper can be called with the transformed test samples to get the feature importance for the original features in your data. You can also use the Azure portal URL to view the dashboard visualization of the feature importance values of the original/raw features.

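The raw-feature variant adds `get_raw=True` so the importances are mapped back to the original input columns; as above, `explainer` and the setup object come from the earlier cells.

```python
# Compute and upload raw-feature importances for the test samples.
raw_explanations = explainer.explain(
    ["local", "global"],
    get_raw=True,
    raw_feature_names=automl_explainer_setup_obj.raw_feature_names,
    eval_dataset=automl_explainer_setup_obj.X_test_transform,
)
print(raw_explanations.get_feature_importance_dict())
```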

Initialize the scoring explainer, then save and upload it for later use in scoring explanations

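A sketch of serializing the explainer for deployment; the API names come from azureml-interpret, and the output path and run variables are assumptions carried over from the cells above.

```python
# Build a lightweight scoring explainer from the trained surrogate, pickle it
# locally, and upload it to the run's artifact store for later deployment.
from azureml.interpret.scoring.scoring_explainer import TreeScoringExplainer, save

scoring_explainer = TreeScoringExplainer(
    explainer.explainer,
    feature_maps=[automl_explainer_setup_obj.feature_map],
)
save(scoring_explainer, exist_ok=True)

automl_run.upload_file("outputs/scoring_explainer.pkl", "scoring_explainer.pkl")
```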

Acknowledgements

This Credit Card Fraud Detection dataset is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/. The dataset is available at: https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project.

Please cite the following works:

  * Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015
  * Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon
  * Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE
  * Dal Pozzolo, Andrea. Adaptive Machine Learning for Credit Card Fraud Detection. ULB MLG PhD thesis (supervised by G. Bontempi)
  * Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier
  * Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing