PyTorch MNIST


MNIST Training using PyTorch

Contents

  1. Background
  2. Setup
  3. Data
  4. Train
  5. Host

Background

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28-pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using PyTorch.

For more information about PyTorch in SageMaker, please visit the sagemaker-pytorch-containers and sagemaker-python-sdk GitHub repositories.


Setup

This notebook was created and tested on an ml.m4.xlarge notebook instance.

Let's start by creating a SageMaker session and specifying:

  • The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
  • The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, please replace sagemaker.get_execution_role() with the appropriate full IAM role ARN string(s).
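The setup described above might look like the following minimal sketch. It assumes the SageMaker Python SDK is installed and AWS credentials are available. The helper name create_session_config is hypothetical, and the default prefix is taken from the S3 path printed later in this notebook.

```python
def create_session_config(prefix="sagemaker/DEMO-pytorch-mnist"):
    """Return (session, bucket, prefix, role) for training and hosting."""
    # Imported lazily so this sketch can be defined without the SDK installed.
    import sagemaker

    sagemaker_session = sagemaker.Session()
    # Assumption: use the session's default bucket, which is created in the
    # same region as the session.
    bucket = sagemaker_session.default_bucket()
    # Replace this call with a full IAM role ARN string if a different role
    # is required for training or hosting.
    role = sagemaker.get_execution_role()
    return sagemaker_session, bucket, prefix, role
```

Calling create_session_config() inside a notebook instance would return the four values used in the rest of the tutorial.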

Data

Getting the data

Dataset MNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )
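To make the transform in the output above concrete, here is a small standard-library sketch of the arithmetic it applies to one pixel: ToTensor scales an 8-bit value into [0, 1], and Normalize then computes (x - mean) / std. The function name normalize_pixel is hypothetical; the mean and std values are the ones shown in the output.

```python
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081


def normalize_pixel(raw_value):
    """Map a raw 0-255 pixel value the way ToTensor followed by Normalize would."""
    scaled = raw_value / 255.0  # ToTensor: 8-bit integer -> float in [0, 1]
    return (scaled - MNIST_MEAN) / MNIST_STD  # Normalize: (x - mean) / std


black = normalize_pixel(0)    # background pixels end up slightly below zero
white = normalize_pixel(255)  # fully inked pixels end up well above zero
print(black, white)
```

After normalization, the mostly black background sits near -0.42 and the brightest strokes near 2.82.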

Uploading the data to S3

We are going to use the sagemaker.Session.upload_data function to upload our datasets to an S3 location. The return value, inputs, identifies the location -- we will use it later when we start the training job.

input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-618469898284/sagemaker/DEMO-pytorch-mnist
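A minimal sketch of the upload step, assuming sagemaker_session is a sagemaker.Session and that the dataset was downloaded into a local directory named data (the root location shown earlier). The wrapper name upload_training_data is hypothetical.

```python
def upload_training_data(sagemaker_session, local_dir, bucket, prefix):
    """Upload local_dir under s3://bucket/prefix and return the resulting S3 URI."""
    # upload_data returns the S3 URI of the uploaded file or directory.
    inputs = sagemaker_session.upload_data(
        path=local_dir, bucket=bucket, key_prefix=prefix
    )
    print("input spec (in this case, just an S3 path): {}".format(inputs))
    return inputs
```

The returned string is what is later passed to the estimator as the 'training' channel.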

Train

Training script

The mnist.py script provides all the code we need for training and hosting a SageMaker model, including the model_fn function that loads a model for hosting. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

  • SM_MODEL_DIR: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
  • SM_NUM_GPUS: The number of GPUs available in the current container.
  • SM_CURRENT_HOST: The name of the current container on the container network.
  • SM_HOSTS: A JSON-encoded list containing all the hosts.

Supposing one input channel, 'training', was used in the call to the PyTorch estimator's fit() method, the following will be set, following the format SM_CHANNEL_[channel_name]:

  • SM_CHANNEL_TRAINING: A string representing the path to the directory containing data in the 'training' channel.

For more information about training environment variables, please visit SageMaker Containers.
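As an illustration, the variables listed above could be read as in the sketch below. The helper name read_training_env is hypothetical, and the sample values mirror the ones printed in the training log of this notebook.

```python
import json
import os


def read_training_env(environ=os.environ):
    """Collect the SageMaker training settings used by a script like mnist.py."""
    return {
        "model_dir": environ["SM_MODEL_DIR"],
        "num_gpus": int(environ["SM_NUM_GPUS"]),
        "current_host": environ["SM_CURRENT_HOST"],
        # SM_HOSTS is a JSON-encoded list, so it has to be decoded.
        "hosts": json.loads(environ["SM_HOSTS"]),
        # Set because a channel named 'training' was passed to fit().
        "data_dir": environ["SM_CHANNEL_TRAINING"],
    }


# Sample values copied from the log of a single-host CPU job.
sample_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_NUM_GPUS": "0",
    "SM_CURRENT_HOST": "algo-1",
    "SM_HOSTS": '["algo-1"]',
    "SM_CHANNEL_TRAINING": "/opt/ml/input/data/training",
}
settings = read_training_env(sample_env)
print(settings)
```

Passing a plain dict instead of os.environ makes the parsing easy to check outside a SageMaker container.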

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an argparse.ArgumentParser instance.
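A reduced sketch of hyperparameter parsing, using only a few of the arguments defined in mnist.py. The function name parse_hyperparameters is hypothetical. The argument list in the example matches the command shown in the training log of this notebook (mnist.py --backend gloo --epochs 1).

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse a subset of the hyperparameters that mnist.py accepts."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--backend", type=str, default=None)
    return parser.parse_args(argv)


# Hyperparameters arrive as command-line arguments.
args = parse_hyperparameters(["--backend", "gloo", "--epochs", "1"])
print(args.epochs, args.lr, args.backend)
```

Any hyperparameter that is not passed keeps its argparse default, which is why lr stays at 0.01 here.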

Because SageMaker imports the training script, you should put your training code in a main guard (if __name__=='__main__':) if you are using the same script to host your model, as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.

For example, the script run by this notebook:

import argparse
import json
import logging
import os
import sys
import wandb

#import sagemaker_containers
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
from torchvision import datasets, transforms

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))


# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def _get_train_data_loader(batch_size, training_dir, is_distributed, **kwargs):
    logger.info("Get train data loader")
    dataset = datasets.MNIST(
        training_dir,
        train=True,
        transform=transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        ),
    )
    train_sampler = (
        torch.utils.data.distributed.DistributedSampler(dataset) if is_distributed else None
    )
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=train_sampler is None,
        sampler=train_sampler,
        **kwargs
    )


def _get_test_data_loader(test_batch_size, training_dir, **kwargs):
    logger.info("Get test data loader")
    return torch.utils.data.DataLoader(
        datasets.MNIST(
            training_dir,
            train=False,
            transform=transforms.Compose(
                [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
            ),
        ),
        batch_size=test_batch_size,
        shuffle=True,
        **kwargs
    )


def _average_gradients(model):
    # Gradient averaging.
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size


def train(args):
    is_distributed = len(args.hosts) > 1 and args.backend is not None
    logger.debug("Distributed training - {}".format(is_distributed))
    use_cuda = args.num_gpus > 0
    logger.debug("Number of gpus available - {}".format(args.num_gpus))
    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}
    device = torch.device("cuda" if use_cuda else "cpu")

    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(args.hosts)
        os.environ["WORLD_SIZE"] = str(world_size)
        host_rank = args.hosts.index(args.current_host)
        os.environ["RANK"] = str(host_rank)
        dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
        logger.info(
            "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                args.backend, dist.get_world_size()
            )
            + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), args.num_gpus)
        )

    # set the seed for generating random numbers
    torch.manual_seed(args.seed)
    if use_cuda:
        torch.cuda.manual_seed(args.seed)

    train_loader = _get_train_data_loader(args.batch_size, args.data_dir, is_distributed, **kwargs)
    test_loader = _get_test_data_loader(args.test_batch_size, args.data_dir, **kwargs)

    logger.debug(
        "Processes {}/{} ({:.0f}%) of train data".format(
            len(train_loader.sampler),
            len(train_loader.dataset),
            100.0 * len(train_loader.sampler) / len(train_loader.dataset),
        )
    )

    logger.debug(
        "Processes {}/{} ({:.0f}%) of test data".format(
            len(test_loader.sampler),
            len(test_loader.dataset),
            100.0 * len(test_loader.sampler) / len(test_loader.dataset),
        )
    )

    model = Net().to(device)
    if is_distributed and use_cuda:
        # multi-machine multi-gpu case
        model = torch.nn.parallel.DistributedDataParallel(model)
    else:
        # single-machine multi-gpu case or single-machine or multi-machine cpu case
        model = torch.nn.DataParallel(model)

    wandb.watch(model)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader, 1):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            if is_distributed and not use_cuda:
                # average gradients manually for multi-machine cpu case only
                _average_gradients(model)
            optimizer.step()
            wandb.log({"training/loss": loss.item()})
            if batch_idx % args.log_interval == 0:
                logger.info(
                    "Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.sampler),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )
        test(model, test_loader, device)
    save_model(model, args.model_dir)


def test(model, test_loader, device):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction="sum").item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    wandb.log({"testing/loss": test_loss})
    logger.info(
        "Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format(
            test_loss, correct, len(test_loader.dataset), 100.0 * correct / len(test_loader.dataset)
        )
    )
    # data and prediction visualization via W&B Tables
    # NOTE: this relies on the module-level `args` created in the main guard below
    original_data = datasets.MNIST(
        args.data_dir,
        train=False,
    )
    images, labels, preds = [], [], []
    for i in range(100):
        images += [original_data[i][0]]
        labels += [original_data[i][1]]
    processed_images = torch.stack([transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )(image) for image in images]).to(device)
    output = model(processed_images).exp()
    probs, preds = output.max(1, keepdim=True)
    probs, preds = probs.flatten(), preds.flatten()
    table = []
    for i in range(len(images)):
        table += [[wandb.Image(images[i]), labels[i], preds[i].item(), probs[i].item()]]
    table = wandb.Table(data=table, columns=["image", "label", "prediction", "probability"])
    wandb.log({"mnist_visualization": table})


def model_fn(model_dir):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.DataParallel(Net())
    with open(os.path.join(model_dir, "model.pth"), "rb") as f:
        model.load_state_dict(torch.load(f))
    return model.to(device)


def save_model(model, model_dir):
    logger.info("Saving the model.")
    path = os.path.join(model_dir, "model.pth")
    # recommended way from http://pytorch.org/docs/master/notes/serialization.html
    torch.save(model.cpu().state_dict(), path)
    wandb.save(path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Training hyperparameters
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=10,
        metavar="N",
        help="number of epochs to train (default: 10)",
    )
    parser.add_argument(
        "--lr", type=float, default=0.01, metavar="LR", help="learning rate (default: 0.01)"
    )
    parser.add_argument(
        "--momentum", type=float, default=0.5, metavar="M", help="SGD momentum (default: 0.5)"
    )
    parser.add_argument("--seed", type=int, default=1, metavar="S", help="random seed (default: 1)")
    parser.add_argument(
        "--log-interval",
        type=int,
        default=100,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--backend",
        type=str,
        default=None,
        help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
    )

    # Container environment
    parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
    parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
    parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])

    args = parser.parse_args()
    wandb.init(project="sm-pytorch-mnist-new", config=vars(args))
    train(args)
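The value 320 in nn.Linear(320, 50) and x.view(-1, 320) follows from the layer shapes in Net. The sketch below reproduces that arithmetic with plain Python, assuming stride-1 convolutions without padding and 2x2 max pooling as in the script. The helper names are hypothetical. It also recomputes the parameter count, which matches the "Total Trainable Params: 21840" line in the training log.

```python
def conv_out(size, kernel):
    """Output side length of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Output side length of non-overlapping max pooling."""
    return size // window


side = 28
side = pool_out(conv_out(side, 5))  # conv1 then pool: 28 -> 24 -> 12
side = pool_out(conv_out(side, 5))  # conv2 then pool: 12 -> 8 -> 4
flattened = 20 * side * side        # 20 channels of 4x4 feature maps


def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return out_channels * in_channels * kernel * kernel + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


total_params = (
    conv_params(1, 10, 5)
    + conv_params(10, 20, 5)
    + linear_params(flattened, 50)
    + linear_params(50, 10)
)
print(flattened, total_params)
```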

Run training in SageMaker

The PyTorch class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we are going to run our training job on 2 ml.c4.xlarge instances, but this example can be run on one or multiple CPU or GPU instances (full list of available instances). The hyperparameters parameter is a dict of values that will be passed to your training script -- you can see how to access these values in the mnist.py script above.
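A hedged sketch of how such an estimator might be configured with version 2 of the SageMaker Python SDK. The helper names are hypothetical, and framework_version and py_version are assumptions, because the notebook does not show them. The hyperparameters match the ones reported in the training log (backend gloo, epochs 1).

```python
def estimator_kwargs(role, instance_count=2, instance_type="ml.c4.xlarge"):
    """Build keyword arguments for sagemaker.pytorch.PyTorch."""
    return {
        "entry_point": "mnist.py",
        "role": role,
        "framework_version": "1.8.0",  # assumption: not shown in the notebook
        "py_version": "py3",           # assumption: the log only shows Python 3.6
        "instance_count": instance_count,
        "instance_type": instance_type,
        "hyperparameters": {"epochs": 1, "backend": "gloo"},
    }


def build_estimator(role, **overrides):
    """Create the estimator; requires the SageMaker Python SDK."""
    # Imported lazily so the configuration above can be inspected without the SDK.
    from sagemaker.pytorch import PyTorch

    kwargs = estimator_kwargs(role)
    # For example, source_dir could be passed here so that a requirements.txt
    # listing wandb is installed in the container (an assumption based on the
    # "Installing dependencies from requirements.txt" line in the log).
    kwargs.update(overrides)
    return PyTorch(**kwargs)
```

Keeping the configuration in a plain dict makes it easy to switch instance types or counts without touching the training script.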

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: wandb (use `wandb login --relogin` to force relogin)

After we've constructed our PyTorch object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.
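The sketch below shows how the channel name passed to fit() relates to what the training script sees. The helper names are hypothetical. The local directory layout is the one reported under channel_input_dirs in the training log, and estimator is assumed to be the PyTorch estimator object described above.

```python
def channel_env_var(channel_name):
    """Name of the environment variable holding a channel's local path."""
    return "SM_CHANNEL_{}".format(channel_name.upper())


def channel_local_dir(channel_name):
    """Directory inside the container where a channel's data is placed."""
    return "/opt/ml/input/data/{}".format(channel_name)


def start_training(estimator, inputs, channel_name="training"):
    """Start the training job with one input channel pointing at an S3 URI."""
    # The dict key becomes the channel name seen by the training script.
    estimator.fit({channel_name: inputs})


print(channel_env_var("training"), channel_local_dir("training"))
```

With the channel named 'training', mnist.py finds its data through SM_CHANNEL_TRAINING, which is the default of its --data-dir argument.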

2021-07-08 14:49:22 Starting - Starting the training job...
2021-07-08 14:49:45 Starting - Launching requested ML instances......
ProfilerReport-1625755762: InProgress
2021-07-08 14:50:45 Starting - Preparing the instances for training......
2021-07-08 14:51:45 Downloading - Downloading input data...
2021-07-08 14:52:16 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-07-08 14:52:16,859 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-07-08 14:52:16,861 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-07-08 14:52:16,870 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-07-08 14:52:23,097 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-07-08 14:52:23,352 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.6 -m pip install -r requirements.txt
Collecting wandb
  Downloading wandb-0.10.33-py2.py3-none-any.whl (1.8 MB)
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.6/site-packages (from wandb->-r requirements.txt (line 1)) (2.8.1)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
Collecting configparser>=3.8.1
  Downloading configparser-5.0.2-py3-none-any.whl (19 kB)
Requirement already satisfied: psutil>=5.0.0 in /opt/conda/lib/python3.6/site-packages (from wandb->-r requirements.txt (line 1)) (5.6.7)
Requirement already satisfied: PyYAML in /opt/conda/lib/python3.6/site-packages (from wandb->-r requirements.txt (line 1)) (5.4.1)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting sentry-sdk>=0.4.0
  Downloading sentry_sdk-1.3.0-py2.py3-none-any.whl (133 kB)
Requirement already satisfied: requests<3,>=2.0.0 in /opt/conda/lib/python3.6/site-packages (from wandb->-r requirements.txt (line 1)) (2.25.1)
Requirement already satisfied: protobuf>=3.12.0 in /opt/conda/lib/python3.6/site-packages (from wandb->-r requirements.txt (line 1)) (3.15.6)
Requirement already satisfied: Click!=8.0.0,>=7.0 in /opt/conda/lib/python3.6/site-packages (from wandb->-r requirements.txt (line 1)) (7.1.2)
Collecting shortuuid>=0.5.0
  Downloading shortuuid-1.0.1-py3-none-any.whl (7.5 kB)
Collecting subprocess32>=3.5.3
  Downloading subprocess32-3.5.4.tar.gz (97 kB)
Collecting promise<3,>=2.0
  Downloading promise-2.3.tar.gz (19 kB)
Requirement already satisfied: six>=1.13.0 in /opt/conda/lib/python3.6/site-packages (from wandb->-r requirements.txt (line 1)) (1.15.0)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.7-py3-none-any.whl (63 kB)
Requirement already satisfied: typing-extensions>=3.7.4.0 in /opt/conda/lib/python3.6/site-packages (from GitPython>=1.0.0->wandb->-r requirements.txt (line 1)) (3.7.4.3)
Collecting smmap<5,>=3.0.1
  Downloading smmap-4.0.0-py2.py3-none-any.whl (24 kB)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (1.26.4)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (2.10)
Building wheels for collected packages: promise, subprocess32, pathtools
  Building wheel for promise (setup.py): started
  Building wheel for promise (setup.py): finished with status 'done'
  Created wheel for promise: filename=promise-2.3-py3-none-any.whl size=21494 sha256=62762b5fd23fdcedb5df017b60d22301751124b5d3cebecb3dcebfb2269dc5e0
  Stored in directory: /root/.cache/pip/wheels/59/9a/1d/3f1afbbb5122d0410547bf9eb50955f4a7a98e53a6d8b99bd1
  Building wheel for subprocess32 (setup.py): started
  Building wheel for subprocess32 (setup.py): finished with status 'done'
  Created wheel for subprocess32: filename=subprocess32-3.5.4-py3-none-any.whl size=6488 sha256=8c040e0bc0504ccf479fa9deb887c101ab02d6ae4ea622f6ade0479ff548078c
  Stored in directory: /root/.cache/pip/wheels/44/3a/ab/102386d84fe551b6cedb628ed1e74c5f5be76af8b909aeda09
  Building wheel for pathtools (setup.py): started
  Building wheel for pathtools (setup.py): finished with status 'done'
  Created wheel for pathtools: filename=pathtools-0.1.2-py3-none-any.whl size=8784 sha256=5c9115626b1a7285a10cf1b3b0209b9f54ccd898521183aa63f329bb9f44250e
  Stored in directory: /root/.cache/pip/wheels/42/ea/90/e37d463fb3b03848bf715080595de62545266f53dd546b2497
Successfully built promise subprocess32 pathtools
Installing collected packages: smmap, gitdb, subprocess32, shortuuid, sentry-sdk, promise, pathtools, GitPython, docker-pycreds, configparser, wandb
Successfully installed GitPython-3.1.18 configparser-5.0.2 docker-pycreds-0.4.0 gitdb-4.0.7 pathtools-0.1.2 promise-2.3 sentry-sdk-1.3.0 shortuuid-1.0.1 smmap-4.0.0 subprocess32-3.5.4 wandb-0.10.33

2021-07-08 14:52:28,967 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-07-08 14:52:28,977 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-07-08 14:52:28,988 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-07-08 14:52:28,997 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "backend": "gloo",
        "epochs": 1
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "pytorch-training-2021-07-08-14-49-22-100",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-618469898284/pytorch-training-2021-07-08-14-49-22-100/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"backend":"gloo","epochs":1}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-618469898284/pytorch-training-2021-07-08-14-49-22-100/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"backend":"gloo","epochs":1},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"pytorch-training-2021-07-08-14-49-22-100","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-618469898284/pytorch-training-2021-07-08-14-49-22-100/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":8,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--backend","gloo","--epochs","1"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BACKEND=gloo
SM_HP_EPOCHS=1
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python3.6 mnist.py --backend gloo --epochs 1


Distributed training - False
Number of gpus available - 0
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data
[2021-07-08 14:52:32.505 algo-1:50 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2021-07-08 14:52:32.571 algo-1:50 INFO profiler_config_parser.py:102] User has disabled profiler.
[2021-07-08 14:52:32.571 algo-1:50 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2021-07-08 14:52:32.572 algo-1:50 INFO hook.py:199] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2021-07-08 14:52:32.572 algo-1:50 INFO hook.py:253] Saving to /opt/ml/output/tensors
[2021-07-08 14:52:32.572 algo-1:50 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2021-07-08 14:52:32.599 algo-1:50 INFO hook.py:584] name:module.conv1.weight count_params:250
[2021-07-08 14:52:32.599 algo-1:50 INFO hook.py:584] name:module.conv1.bias count_params:10
[2021-07-08 14:52:32.599 algo-1:50 INFO hook.py:584] name:module.conv2.weight count_params:5000
[2021-07-08 14:52:32.599 algo-1:50 INFO hook.py:584] name:module.conv2.bias count_params:20
[2021-07-08 14:52:32.599 algo-1:50 INFO hook.py:584] name:module.fc1.weight count_params:16000
[2021-07-08 14:52:32.599 algo-1:50 INFO hook.py:584] name:module.fc1.bias count_params:50
[2021-07-08 14:52:32.599 algo-1:50 INFO hook.py:584] name:module.fc2.weight count_params:500
[2021-07-08 14:52:32.600 algo-1:50 INFO hook.py:584] name:module.fc2.bias count_params:10
[2021-07-08 14:52:32.600 algo-1:50 INFO hook.py:586] Total Trainable Params: 21840
[2021-07-08 14:52:32.600 algo-1:50 INFO hook.py:413] Monitoring the collections: losses
[2021-07-08 14:52:32.603 algo-1:50 INFO hook.py:476] Hook is writing from the hook with pid: 50

Train Epoch: 1 [6400/60000 (11%)] Loss: 2.010908
Train Epoch: 1 [12800/60000 (21%)] Loss: 1.009527
Train Epoch: 1 [19200/60000 (32%)] Loss: 0.877578
Train Epoch: 1 [25600/60000 (43%)] Loss: 0.805486
Train Epoch: 1 [32000/60000 (53%)] Loss: 0.635695
Train Epoch: 1 [38400/60000 (64%)] Loss: 0.505831
Train Epoch: 1 [44800/60000 (75%)] Loss: 0.537033
Train Epoch: 1 [51200/60000 (85%)] Loss: 0.532630
Train Epoch: 1 [57600/60000 (96%)] Loss: 0.428416
Test set: Average loss: 0.1919, Accuracy: 9424/10000 (94%)

Saving the model.
wandb: Currently logged in as: wandb (use `wandb login --relogin` to force relogin)

CondaEnvException: Unable to determine environment

Please re-run this command with one of the following options:

* Provide an environment name via --name or -n
* Re-run this command inside an activated conda environment.

wandb: Tracking run with wandb version 0.10.33
wandb: Syncing run pytorch-training-2021-07-08-14-49-22-100-algo-1
wandb: ⭐️ View project at https://wandb.ai/wandb/sm-pytorch-mnist-new
wandb: 🚀 View run at https://wandb.ai/wandb/sm-pytorch-mnist-new/runs/pytorch-training-2021-07-08-14-49-22-100-algo-1
wandb: Run data is saved locally in /opt/ml/code/wandb/run-20210708_145230-pytorch-training-2021-07-08-14-49-22-100-algo-1
wandb: Run `wandb offline` to turn off syncing.
INFO:__main__:Train Epoch: 1 [6400/60000 (11%)] Loss: 2.010908
INFO:__main__:Train Epoch: 1 [12800/60000 (21%)] Loss: 1.009527
INFO:__main__:Train Epoch: 1 [19200/60000 (32%)] Loss: 0.877578
INFO:__main__:Train Epoch: 1 [25600/60000 (43%)] Loss: 0.805486
INFO:__main__:Train Epoch: 1 [32000/60000 (53%)] Loss: 0.635695
INFO:__main__:Train Epoch: 1 [38400/60000 (64%)] Loss: 0.505831
INFO:__main__:Train Epoch: 1 [44800/60000 (75%)] Loss: 0.537033
INFO:__main__:Train Epoch: 1 [51200/60000 (85%)] Loss: 0.532630
INFO:__main__:Train Epoch: 1 [57600/60000 (96%)] Loss: 0.428416
/opt/conda/lib/python3.6/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
INFO:__main__:Test set: Average loss: 0.1919, Accuracy: 9424/10000 (94%)

INFO:__main__:Saving the model.
wandb: WARNING Saving files without folders. If you want to preserve sub directories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", base_path="/mnt")
wandb: Waiting for W&B process to finish, PID 69
wandb: Program ended successfully.
wandb: 0.16MB of 0.16MB uploaded (0.00MB deduped)
wandb: Find user logs for this run at: /opt/ml/code/wandb/run-20210708_145230-pytorch-training-2021-07-08-14-49-22-100-algo-1/logs/debug.log
wandb: Find internal logs for this run at: /opt/ml/code/wandb/run-20210708_145230-pytorch-training-2021-07-08-14-49-22-100-algo-1/logs/debug-internal.log
wandb: Run summary:
wandb:         training/loss 0.45277
wandb:              _runtime 15
wandb:            _timestamp 1625755965
wandb:                 _step 939
wandb:          testing/loss 0.19189
wandb: Run history:
wandb:   training/loss ████▇▇▆▅▅▄▃▃▃▃▃▃▂▃▂▂▃▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▂▁▂▁
wandb:        _runtime ▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇██
wandb:      _timestamp ▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇██
wandb:           _step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:    testing/loss ▁
wandb: 
wandb: Synced 5 W&B file(s), 2 media file(s), 101 artifact file(s) and 2 other file(s)
wandb: 
wandb: Synced pytorch-training-2021-07-08-14-49-22-100-algo-1: https://wandb.ai/wandb/sm-pytorch-mnist-new/runs/pytorch-training-2021-07-08-14-49-22-100-algo-1

2021-07-08 14:52:53,818 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2021-07-08 14:53:06 Uploading - Uploading generated training model
2021-07-08 14:53:06 Completed - Training job completed
Training seconds: 88
Billable seconds: 88
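The progress figures in the "Train Epoch" lines of the log can be reproduced from the defaults in mnist.py (batch size 64, log interval 100) and the 60,000 training images. This is a small standard-library sketch of that arithmetic. The name progress_line is hypothetical, and the calculation assumes the last partial batch is kept, so an epoch has ceil(60000 / 64) batches.

```python
import math

dataset_size = 60000
batch_size = 64
log_interval = 100

# Number of batches per epoch when the final partial batch is kept.
batches_per_epoch = math.ceil(dataset_size / batch_size)


def progress_line(batch_idx):
    """Rebuild the bracketed progress part of a 'Train Epoch' log line."""
    seen = batch_idx * batch_size
    percent = 100.0 * batch_idx / batches_per_epoch
    return "[{}/{} ({:.0f}%)]".format(seen, dataset_size, percent)


first = progress_line(log_interval)      # first logged batch
last = progress_line(9 * log_interval)   # last logged batch in the epoch
print(batches_per_epoch, first, last)
```

The percentage is computed from the batch index rather than from the number of images seen, which is why the first line reports 11% after 6,400 of 60,000 images.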