Credit Scorecards With XGBoost And W&B
Vehicle Loan Default Prediction with XGBoost
In this notebook we'll train an XGBoost model to classify whether submitted loan applications will default or not. Using boosting algorithms such as XGBoost increases the performance of a loan assessment, whilst retaining interpretability for internal Risk Management functions as well as external regulators.
This notebook is based on a talk from Nvidia GTC21 by Paul Edwards at Scotiabank, who presented how XGBoost can be used to construct more performant credit scorecards that remain interpretable. They also kindly shared sample code, which we will use throughout this notebook. Credit to Stephen Denton from Scotiabank for sharing this code publicly.
Click here to view and interact with a live W&B Dashboard built with this notebook
In this notebook
In this colab we'll cover how Weights and Biases enables regulated entities to
- Track and version their data ETL pipelines (locally or in cloud services such as S3 and GCS)
- Track experiment results and store trained models
- Visually inspect multiple evaluation metrics
- Optimize performance with hyperparameter sweeps
Track Experiments and Results
We will track all of the training hyperparameters and output metrics in order to generate an Experiments Dashboard like the one below:
Run a Hyperparameter Sweep to Find the Best HyperParameters
Weights and Biases also enables you to do hyperparameter sweeps, either with our own Sweeps functionality or with our Ray Tune integration. See our docs for a full guide on how to use more advanced hyperparameter sweep options.
Setup
Data
AWS S3, Google Cloud Storage and W&B Artifacts
Weights and Biases Artifacts enable you to log end-to-end training pipelines to ensure your experiments are always reproducible.
Data privacy is critical to Weights & Biases, so we support the creation of Artifacts from reference locations in your own private cloud, such as AWS S3 or Google Cloud Storage. Local, on-premises installations of W&B are also available upon request.
By default, W&B stores artifact files in a private Google Cloud Storage bucket located in the United States. All files are encrypted at rest and in transit. For sensitive files, we recommend a private W&B installation or the use of reference artifacts.
Artifacts Reference Example
Create an artifact with the S3/GCS metadata
The artifact only consists of metadata about the S3/GCS object such as its ETag, size, and version ID (if object versioning is enabled on the bucket).
import wandb

run = wandb.init()
# Only a reference to the bucket path is recorded; no files are uploaded to W&B
artifact = wandb.Artifact('mnist', type='dataset')
artifact.add_reference('s3://my-bucket/datasets/mnist')
run.log_artifact(artifact)
Download the artifact locally when needed
W&B will use the metadata recorded when the artifact was logged to retrieve the files from the underlying bucket.
artifact = run.use_artifact('mnist:latest', type='dataset')
artifact_dir = artifact.download()
See Artifact References for more on how to use Artifacts by reference, including credentials setup.
Log in to W&B
Log in to Weights and Biases
Vehicle Loan Dataset
We will be using a simplified version of the Vehicle Loan Default Prediction dataset from L&T which has been stored in W&B Artifacts.
Create function to pickle functions
Download Data from W&B Artifacts
We will download our dataset from W&B Artifacts. First we need to create a W&B run object, which we will use to download the data. Once the data is downloaded it will be one-hot encoded. This processed data will then be logged to the same W&B run as a new Artifact. By logging to the W&B run that downloaded the data, we tie this new Artifact to the raw dataset Artifact.
Download the subset of the vehicle loan default data from W&B. This contains train.csv and val.csv files as well as some utils files.
One-Hot Encode the Data
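A minimal sketch of the one-hot encoding step, assuming the raw data is already loaded into a pandas DataFrame. The column names below (employment_type, loan_amount, loan_default) are hypothetical placeholders and not the dataset's actual schema.

```python
import pandas as pd

# Toy frame standing in for the raw loan data; the columns are illustrative only.
raw = pd.DataFrame(
    {
        "employment_type": ["salaried", "self_employed", "salaried"],
        "loan_amount": [50_000, 62_000, 48_500],
        "loan_default": [0, 1, 0],
    }
)

# Expand the categorical column into one 0/1 indicator column per category.
# Numeric columns and the target are passed through unchanged.
encoded = pd.get_dummies(raw, columns=["employment_type"], dtype=int)

print(encoded.columns.tolist())
```

Listing the categorical columns explicitly in `columns=` keeps the encoding predictable if a numeric column is accidentally read as text.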
Log Processed Data to W&B Artifacts
Get Train/Validation Split
Here we show an alternative pattern for how to create a wandb run object. In the cell below, the code to split the dataset is wrapped with a call to wandb.init() as run.
Here we will:
- Start a wandb run
- Download our one-hot-encoded dataset from Artifacts
- Do the Train/Val split and log the params used in the split
- Log the new trndat and valdat datasets to Artifacts
- Finish the wandb run automatically
Inspect Training Dataset
Get an overview of the training dataset
Log Dataset with W&B Tables
With W&B Tables you can log, query, and analyze tabular data that contains rich media such as images, video, audio and more. With it you can understand your datasets, visualize model predictions, and share insights. For more, see our W&B Tables Guide.
Modelling
Fit the XGBoost Model
We will now fit an XGBoost model to classify whether a vehicle loan application will result in a default or not
Training on GPU
If you'd like to train your XGBoost model on a GPU, set the following in the parameters you pass to XGBoost:
'tree_method': 'gpu_hist'
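For context, here is how that setting sits inside a parameter dictionary. The base parameters are hypothetical. Note that XGBoost 2.0 deprecated `gpu_hist` in favour of a separate `device` parameter, so the second form is likely the one to use on recent releases.

```python
# Hypothetical base parameters for a binary classifier.
base_params = {"objective": "binary:logistic", "eval_metric": "auc"}

# As written in this notebook (XGBoost releases before 2.0).
gpu_params_legacy = {**base_params, "tree_method": "gpu_hist"}

# XGBoost 2.0 and later: keep the 'hist' tree method and choose the device separately.
gpu_params_current = {**base_params, "tree_method": "hist", "device": "cuda"}
```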
1) Initialise a W&B Run
2) Setup and Log the Model Parameters
Log the xgboost training parameters to the W&B run config
3) Let's select the data for train/validation
4) Fit the model, log results to W&B and save model to W&B Artifacts
To log all our XGBoost model parameters we use the WandbCallback. See the W&B docs, including documentation for other libraries that have integrated W&B, such as LightGBM and more.
5) Log Additional Train and Evaluation Metrics to W&B
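One way to compute and store extra metrics is sketched here, assuming scikit-learn is available. The choice of metrics, the 0.5 threshold and the helper names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def evaluation_metrics(y_true, y_prob, prefix, threshold=0.5):
    """Return AUC and thresholded accuracy, keyed with a split prefix."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        f"{prefix}_auc": float(roc_auc_score(y_true, y_prob)),
        f"{prefix}_accuracy": float(accuracy_score(y_true, y_pred)),
    }


def log_summary_metrics(run, metrics):
    """Store final metrics in the run summary so they appear in the runs table."""
    run.summary.update(metrics)
```

Calling `evaluation_metrics` once with `prefix="train"` and once with `prefix="val"` keeps the two sets of numbers side by side in the dashboard.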
6) Log the ROC Curve To W&B
Finish the W&B Run
Now that we've trained a single model, let's try to optimize its performance by running a Hyperparameter Sweep.
HyperParameter Sweep
Weights and Biases also enables you to do hyperparameter sweeps, either with our own Sweeps functionality or with our Ray Tune integration. See our docs for a full guide on how to use more advanced hyperparameter sweep options.
Click here to check out the results of a 1000-run sweep generated using this notebook