Build Your First Machine Learning Project - Part 3 | Machine Learning Algorithms
In this notebook, we'll prepare the Bear data set for machine learning model building.
What We'll Cover:
- Data Loading - Load the bear dataset using Modin (modin.pandas) and Snowpark (snowflake-snowpark-python)
- Data Preparation - Scale features and prepare data for model training using scikit-learn
- Model Training - Train multiple machine learning models using scikit-learn:
  - Logistic Regression (LogisticRegression)
  - Random Forest (RandomForestClassifier)
  - Support Vector Machine (SVC)
- Performance Comparison - Compare models using accuracy and the MCC metric (scikit-learn)
- Model Interpretability - Analyze feature importance and model coefficients to understand predictions (Altair)
Notebook Setup
Notebook Settings
- Click the three dots in the top-right corner and select "Notebook settings".
- In the "Notebook settings" modal, the General tab is active by default. Click "Run on container" and, under "Compute pool", choose a CPU compute node.
- Still in the "Notebook settings" modal, click the "External access" tab and select a policy that grants the notebook external access (this allows access to data stored on GitHub).
Install Prerequisite Libraries
Snowflake Notebooks includes common Python libraries by default. To add more, use the Packages dropdown in the top right.
Let's add the following packages:
- modin - Perform data operations (read/write) and wrangling just like pandas with the Snowpark pandas API
- scikit-learn - Perform data splits and build machine learning models
- snowflake-ml-python - A collection of ML functionalities from Snowflake. Here, we'll use its model metrics logging functionality.
Note: When using an AI/ML container, Snowpark and the relevant machine learning packages come pre-installed.
1. Establish Snowflake Connection
We'll start by getting an active session via the get_active_session() method.
2. Data Operations
In this section, we'll load the data, prepare the features and class, check for missing data, split the data and scale the features.
2.1. Load Data
Data is read from the BEAR table stored in Snowflake via the read_snowflake() method.
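The connection and loading steps above can be sketched as follows. This is a minimal sketch, assuming it runs inside a Snowflake Notebook where an active session already exists; the helper name `load_bear_table` is hypothetical, and the imports sit inside the function so that the snippet can still be read and imported where the Snowflake packages are not installed.

```python
def load_bear_table(table_name="BEAR"):
    """Return the active session and the table as a Snowpark pandas DataFrame.

    Hypothetical helper: the tutorial only names get_active_session() and
    read_snowflake(); wrapping them in one function is an assumption.
    """
    # Imported lazily: these packages only resolve inside a Snowflake
    # environment (or wherever modin and snowflake-snowpark-python exist).
    import modin.pandas as pd
    import snowflake.snowpark.modin.plugin  # noqa: F401  enables the Snowflake backend
    from snowflake.snowpark.context import get_active_session

    session = get_active_session()       # reuse the notebook's connection
    df = pd.read_snowflake(table_name)   # read the table by name
    return session, df
```

Inside the notebook, calling `session, df = load_bear_table()` would then give a DataFrame that behaves much like a regular pandas DataFrame.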
2.2. Prepare features and class
The DataFrame is separated into two parts: the features are assigned to the X variable, while the class is assigned to y.
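As a rough illustration of this split, the sketch below uses a tiny stand-in DataFrame built with plain pandas. The column names and values are hypothetical placeholders, not the real BEAR schema; in the notebook, `df` would be the table loaded in the previous step.

```python
import pandas as pd

# Hypothetical stand-in for the BEAR table; real column names may differ.
df = pd.DataFrame({
    "measurement_a": [120.5, 310.0, 95.2, 402.3],
    "measurement_b": [10.1, 25.4, 8.7, 30.2],
    "species": ["class_1", "class_2", "class_1", "class_2"],
})

target = "species"              # assumed name of the class column
X = df.drop(columns=[target])   # every remaining column is a feature
y = df[target]                  # the class label to predict

print(X.shape, y.shape)
```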
2.3. Check for Missing data
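One common way to check for missing data is to count the null values in each column. The sketch below does this on a small made-up DataFrame (the column names and the gaps are invented for illustration); the same `isna().sum()` call applies to the DataFrame loaded earlier.

```python
import numpy as np
import pandas as pd

# Made-up data with a few deliberate gaps, for illustration only.
df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0],
    "feature_b": [0.5, 0.7, np.nan, np.nan],
    "label": ["a", "b", "a", "b"],
})

missing_per_column = df.isna().sum()          # null count for each column
total_missing = int(missing_per_column.sum())  # null count for the whole table

print(missing_per_column)
print("Total missing values:", total_missing)
```

If the total is zero, no imputation or row removal is needed before splitting the data.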
2.4. Data Splitting
The data is split into training and testing sets at an 80/20 ratio using scikit-learn:
- 80% is used as the Training set - used to train an ML model
- 20% is used as the Testing set - used as a test for the ML model
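The 80/20 split can be sketched with scikit-learn's `train_test_split`. The data here is synthetic, and both `random_state=42` and `stratify=y` are assumptions added for reproducibility and balanced classes; the tutorial does not state which options it uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 4 features, 2 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# test_size=0.2 keeps 20% for testing; stratify=y (an assumption) keeps
# the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```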
2.5. Feature Scaling
Feature scaling is a data preprocessing technique used to standardize the range of independent variables or features of data. This helps to ensure that features with larger value ranges (e.g. one variable can have a range of 10,000 to 1,000,000 while others could be 0.1 to 0.8) do not disproportionately influence the model's learning process.
Here, we're using scikit-learn to standardize all variables by mean centering (mean = 0) and scaling to unit variance (SD = 1).
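A minimal sketch of this standardization with `StandardScaler`, on synthetic features whose ranges mimic the example above (one very large, one very small). Fitting the scaler on the training set only and reusing it for the testing set is the usual way to avoid leaking test information; whether the notebook does exactly this is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_features(n_rows):
    # Two synthetic features with very different ranges.
    large = rng.uniform(10_000, 1_000_000, n_rows)
    small = rng.uniform(0.1, 0.8, n_rows)
    return np.column_stack([large, small])

X_train = make_features(80)
X_test = make_features(20)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/SD from training data
X_test_scaled = scaler.transform(X_test)        # apply the same mean/SD to test data

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, each training column has a mean of approximately 0 and a standard deviation of approximately 1; the test columns will be close to, but not exactly, those values.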
3. Machine Learning Model Training
Now that we have the scaled features, we'll build ML models using scikit-learn.
3.1. Logistic Regression
3.2. Random Forest Classifier
3.3. Support Vector Machine (SVM)
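The three models in this section can be trained along the following lines. This is a sketch on synthetic data generated with `make_classification`, not the bear dataset, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the bear dataset.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF kernel)": SVC(kernel="rbf", random_state=42),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)            # train on the scaled training set
    test_accuracy = model.score(X_test_scaled, y_test)
    print(f"{name}: test accuracy = {test_accuracy:.3f}")
```

Keeping the models in a dictionary makes it straightforward to loop over them again later when comparing their performance.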
4. Benchmarking of Machine Learning Algorithms
Benchmarking means comparing various ML algorithms to see which performs best and which is most suitable for our use case.
In selecting the best ML algorithm, we want one that generalizes well to new, unseen data and that provides actionable insights.
- Model overfitting: how well an algorithm generalizes to new, unseen data can be evaluated by the degree to which it overfits the training data
- Model interpretability: actionable insights can be gained by analyzing the important features that contribute to the model's predictions
4.1. Assessing Overfitting
Overfitting occurs when a model performs much better on the data it was trained on than on new, unseen data, which indicates that it has memorized noise instead of learning a general pattern. We can quantify it as the difference between the two performance values:
Overfitting = Training performance - Testing performance
This formula calculates the performance drop when your model moves from familiar training data to new, unseen testing data.
- A big difference means the model is overfitted: it has memorized the training examples instead of learning the actual patterns, so it fails on new data.
- A small difference is good: it means that the model generalizes well.
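The training-versus-testing gap can be computed for both accuracy and MCC as sketched below. The helper name `overfitting_gap`, the synthetic data and the choice of a random forest are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

def overfitting_gap(model, metric, X_train, y_train, X_test, y_test):
    """Hypothetical helper: training score minus testing score."""
    train_score = metric(y_train, model.predict(X_train))
    test_score = metric(y_test, model.predict(X_test))
    return train_score - test_score

gaps = {
    "accuracy": overfitting_gap(model, accuracy_score, X_train, y_train, X_test, y_test),
    "mcc": overfitting_gap(model, matthews_corrcoef, X_train, y_train, X_test, y_test),
}
for metric_name, gap in gaps.items():
    print(f"{metric_name} gap (train - test): {gap:.3f}")
```

A gap near zero suggests the model generalizes well, while a large positive gap suggests overfitting. How large is "too large" depends on the problem, so the value is best read relative to the other models being compared.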
4.2. Model interpretability
Interpretable ML models are those that expose variable coefficients, which directly indicate the relative degree to which each feature influences the target y value.
In linear models, this may be summarized in the following equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
where $y$ is the target or dependent variable, $\beta_1, \dots, \beta_n$ are the variable coefficients, $x_1, \dots, x_n$ are the features or independent variables and $\beta_0$ is the baseline value.
In essence, each coefficient is a direct measure of a feature's influence on the prediction of $y$: provided the features are on the same scale (as they are after standardization), a larger absolute coefficient value means a stronger impact on the prediction of $y$.
4.2.1. Interpreting Logistic regression models
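For a fitted scikit-learn `LogisticRegression`, the coefficients are available in the `coef_` attribute and the baseline in `intercept_`. The sketch below ranks features by absolute coefficient on synthetic, standardized data; the feature names are placeholders, and a binary target is assumed (with more than two classes, `coef_` has one row per class).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["feature_1", "feature_2", "feature_3", "feature_4"]  # placeholders
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_scaled = StandardScaler().fit_transform(X)  # same scale, so coefficients are comparable

model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

coefficients = model.coef_[0]  # binary target: a single row of coefficients
ranked = sorted(
    zip(feature_names, coefficients), key=lambda pair: abs(pair[1]), reverse=True
)

for name, value in ranked:
    print(f"{name}: {value:+.3f}")
print(f"Intercept (baseline): {model.intercept_[0]:+.3f}")
```

The sign of each coefficient shows the direction of the effect on the log-odds of the positive class, and the absolute value shows its strength.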
4.2.2. Interpreting Random Forest models
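Random forests have no coefficients, but scikit-learn exposes impurity-based importances through the `feature_importances_` attribute. The sketch below, on synthetic data with placeholder feature names, shows one way to rank them; the notebook itself may use a different importance measure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["feature_1", "feature_2", "feature_3", "feature_4"]  # placeholders
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

importances = model.feature_importances_  # non-negative values that sum to 1
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

Unlike coefficients, these importances carry no sign: they indicate how much a feature contributes to the splits, not the direction of its effect.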
4.2.3. Interpreting SVM models
The only interpretable SVM models are those using a linear kernel. Those using non-linear kernels, such as polynomial or radial basis function (RBF) SVMs, are no longer interpretable and are regarded as black-box models.
The previously built SVM model uses the RBF kernel and is thus non-linear and not interpretable.
If you'd like an interpretable SVM model, you can try a linear kernel instead.
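To illustrate the difference, the sketch below fits both a linear-kernel and an RBF-kernel `SVC` on synthetic, standardized data. In scikit-learn, `coef_` is only available for the linear kernel; the data and feature names are placeholders, and a binary target is assumed.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

feature_names = ["feature_1", "feature_2", "feature_3", "feature_4"]  # placeholders
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_scaled = StandardScaler().fit_transform(X)

linear_svm = SVC(kernel="linear").fit(X_scaled, y)
rbf_svm = SVC(kernel="rbf").fit(X_scaled, y)

# The linear kernel exposes one coefficient per feature.
for name, value in zip(feature_names, linear_svm.coef_[0]):
    print(f"{name}: {value:+.3f}")

# The RBF kernel has no per-feature coefficients to inspect.
print("RBF model exposes coef_:", hasattr(rbf_svm, "coef_"))
```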
Resources
If you'd like to take a deeper dive into the various libraries used in this tutorial, here they are: