
Training a causal language model from scratch (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

[ ]

You will need to set up Git and adapt your email and name in the following cell.

[ ]

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

[ ]
[ ]
[ ]
False True
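The `False True` output above is consistent with a simple substring filter applied to two sample strings. The following is a minimal sketch, not the notebook's hidden cell: the helper name `any_keyword_in_string`, the keyword list, and the two example strings are all assumptions chosen for illustration.

```python
def any_keyword_in_string(string, keywords):
    """Return True if any keyword occurs as a substring of `string`."""
    return any(keyword in string for keyword in keywords)


# Assumed keyword list and example inputs, for illustration only.
filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
example_1 = "import numpy as np"
example_2 = "import pandas as pd"

print(
    any_keyword_in_string(example_1, filters),
    any_keyword_in_string(example_2, filters),
)
```

A plain substring test is cheap but coarse: it also matches keywords that appear only in comments or string literals.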
[ ]
[ ]
3.26% of data after filtering.
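A percentage like the one above can be obtained by counting how many samples survive a keyword filter. The sketch below works on a plain iterable of dictionaries rather than a real streamed dataset; the function name `filter_samples`, the toy records, and the keyword list are hypothetical.

```python
from collections import defaultdict


def filter_samples(samples, keywords):
    """Keep samples whose 'content' mentions any keyword.

    Returns the kept samples as a dict of columns, plus the kept fraction.
    """
    kept = defaultdict(list)
    total = 0
    for sample in samples:
        total += 1
        if any(keyword in sample["content"] for keyword in keywords):
            for key, value in sample.items():
                kept[key].append(value)
    fraction = len(kept["content"]) / total if total else 0.0
    return dict(kept), fraction


# Toy stand-in records (hypothetical); the real dataset is far larger.
samples = [
    {"path": "a.py", "content": "import pandas as pd"},
    {"path": "b.py", "content": "import os"},
    {"path": "c.py", "content": "print('hi')"},
    {"path": "d.py", "content": "from sklearn import svm"},
]
kept, fraction = filter_samples(samples, ["pandas", "sklearn", "matplotlib", "seaborn"])
print(f"{fraction:.2%} of data after filtering.")
```

Because the loop only needs one pass and never indexes into `samples`, the same pattern also applies to a generator or a streamed dataset.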
[ ]
DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 606720
    })
    valid: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 3322
    })
})
[ ]
'REPO_NAME: kmike/scikit-learn'
'PATH: sklearn/utils/__init__.py'
'COPIES: 3'
'SIZE: 10094'
'''CONTENT: """
The :mod:`sklearn.utils` module includes various utilites.
"""

from collections import Sequence

import numpy as np
from scipy.sparse import issparse
import warnings

from .murmurhash import murm
LICENSE: bsd-3-clause'''
[ ]
Input IDs length: 34
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 117, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 41]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
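The output above shows two documents split into windows of at most 128 tokens, with a mapping from each chunk back to its source document. With a Transformers tokenizer this would presumably come from truncation with overflowing tokens enabled. The sketch below reproduces the same bookkeeping in plain Python. The function name `chunk_documents` is hypothetical, the token IDs are placeholders, and the document lengths 2549 and 1705 are inferred from the printed chunk lengths.

```python
def chunk_documents(token_id_lists, context_length):
    """Split each token sequence into consecutive windows of at most
    `context_length` tokens and record which document each window came from."""
    chunks, mapping = [], []
    for doc_index, ids in enumerate(token_id_lists):
        for start in range(0, len(ids), context_length):
            chunks.append(ids[start:start + context_length])
            mapping.append(doc_index)
    return chunks, mapping


# Placeholder token IDs. Lengths inferred from the output above:
# 19 * 128 + 117 = 2549 and 13 * 128 + 41 = 1705.
docs = [list(range(2549)), list(range(1705))]
chunks, mapping = chunk_documents(docs, context_length=128)

print(f"Input IDs length: {len(chunks)}")
print(f"Input chunk lengths: {[len(chunk) for chunk in chunks]}")
print(f"Chunk mapping: {mapping}")
```

Only the last chunk of each document can be shorter than the context length, which is why the values 117 and 41 each appear exactly once.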
[ ]
DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 16702061
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 93164
    })
})
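After tokenization each row is a chunk rather than a file, which explains why the row count is much larger than in the raw dataset and why only `input_ids` remains. One common choice, assumed here rather than confirmed by the output, is to drop trailing chunks shorter than the context length so that every row has the same size and no padding is needed. A minimal sketch with a hypothetical helper `keep_full_chunks`:

```python
def keep_full_chunks(chunks, context_length):
    """Keep only chunks that fill the whole context window."""
    return [chunk for chunk in chunks if len(chunk) == context_length]


# Placeholder chunks with the lengths of a two-document example:
# 19 full chunks and a 117-token remainder, then 13 full chunks and a 41-token remainder.
chunk_lengths = [128] * 19 + [117] + [128] * 13 + [41]
chunks = [[0] * length for length in chunk_lengths]

input_batch = keep_full_chunks(chunks, context_length=128)
print(f"{len(chunks)} chunks before filtering, {len(input_batch)} after")
```

The trade-off is a small loss of data (the tail of every file) in exchange for uniform, padding-free batches.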
[ ]
[ ]
GPT-2 size: 124.2M parameters
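The figure of 124.2M can be checked by hand from the GPT-2 small architecture. The sketch below assumes 12 layers, a hidden size of 768, 1024 position embeddings, an output layer tied to the token embeddings, and a vocabulary of 50,000 tokens. The vocabulary size is an assumption not stated in this notebook output; it is the value that makes the arithmetic match.

```python
def gpt2_parameter_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT-2 style decoder with tied output weights."""
    d = n_embd
    token_embeddings = vocab_size * d
    position_embeddings = n_positions * d
    layer_norm = 2 * d                               # scale and bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # up projection + down projection
    block = 2 * layer_norm + attention + mlp
    # One final layer norm after the last block; the LM head reuses the token embeddings.
    return token_embeddings + position_embeddings + n_layer * block + layer_norm


total = gpt2_parameter_count(vocab_size=50_000)  # assumed vocabulary size
print(f"GPT-2 size: {total / 1000**2:.1f}M parameters")
```

Under these assumptions roughly 38.4M of the parameters sit in the token embedding matrix, so the vocabulary size has a visible effect on the total.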
[ ]
[ ]
input_ids shape: torch.Size([5, 128])
attention_mask shape: torch.Size([5, 128])
labels shape: torch.Size([5, 128])
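The three tensors have identical shapes because, for causal language modeling, the labels are a copy of the input IDs; in Transformers causal LM heads the one-token shift between inputs and targets is applied inside the model rather than by the collator. The sketch below is a simplified, list-based stand-in for such a collator, not the library's implementation. The function names and the placeholder sequences are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def collate_for_causal_lm(sequences, pad_token_id):
    """Pad sequences to a common length and copy input_ids into labels,
    masking padded positions so they do not contribute to the loss."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


def shape(rows):
    """Return (batch size, sequence length) for a rectangular list of lists."""
    return (len(rows), len(rows[0]))


# Five placeholder sequences of 128 token IDs each.
sequences = [list(range(i, i + 128)) for i in range(5)]
batch = collate_for_causal_lm(sequences, pad_token_id=0)
for key, rows in batch.items():
    print(f"{key} shape: {shape(rows)}")
```

When every chunk already has the full context length, as here, no padding is added and the labels are an exact copy of the inputs.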
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
plt.scatter(x, y)

# create scatter
[ ]
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
df = pd.DataFrame({'x': x, 'y': y})
df.insert(0,'x', x)
for
[ ]
# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
profession = df.groupby(['profession']).mean()

# compute the
[ ]
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
rf = RandomForestRegressor(n_estimators=300, random_state=random_state, max_depth=3)
rf.fit(X, y)
rf
[ ]
'Keyword has not single token: testtest'
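The message above suggests a check that each keyword of interest maps to exactly one token, with `testtest` as a deliberate counterexample. The sketch below illustrates that check with a toy greedy longest-match tokenizer standing in for the real one. The function `toy_tokenize`, the vocabulary, and the keyword list are hypothetical.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a tiny vocabulary (toy stand-in)."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


# Hypothetical vocabulary: "testtest" is absent, so it splits into two "test" tokens.
vocab = {"plt": 0, "pd": 1, "fit": 2, "predict": 3, "test": 4}

keytoken_ids = []
for keyword in ["plt", "pd", "fit", "predict", "testtest"]:
    ids = toy_tokenize(keyword, vocab)
    if len(ids) == 1:
        keytoken_ids.append(ids[0])
    else:
        print(f"Keyword has not single token: {keyword}")
```

Restricting the list to single-token keywords keeps later per-token bookkeeping simple, since each keyword can then be identified by one ID.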
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
'sgugger/codeparrot-ds-accelerate'
[ ]
[ ]
(10.934126853942871, 56057.14453125)
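The second value in the tuple above equals the exponential of the first to within floating-point precision, which is consistent with a (mean cross-entropy loss, perplexity) pair. A minimal sketch of that relationship, using a hypothetical helper name `perplexity`:

```python
import math


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    try:
        return math.exp(mean_loss)
    except OverflowError:
        # Very large losses overflow a float; report infinity instead of failing.
        return float("inf")


loss = 10.934126853942871  # first value of the tuple above
print(loss, perplexity(loss))
```

The result differs from the printed value only in the last digits, as expected if the original was computed in 32-bit precision.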
[ ]