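We will build and train a character-level RNN to classify words. A character-level RNN reads a word as a series of characters, outputting a prediction and a "hidden state" at each step and feeding its previous hidden state into the next step. We take the output of the final step as the prediction, i.e. which class the word belongs to.
Specifically, we will train the model on a dataset of a few thousand names from 18 languages and predict which language a name comes from based on its spelling: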
$ python predict.py Hinton
(-0.47) Scottish
(-1.52) English
(-3.57) Irish

$ python predict.py Schmidhuber
(-0.19) German
(-2.48) Czech
(-2.68) Dutch
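Download the data (https://download.pytorch.org/tutorial/data.zip) and extract it to the current directory.
The "data/names" directory contains 18 text files named "[language].txt". Each file holds one name per line, mostly romanized (but we still need to convert from Unicode to ASCII).
We end up with a dictionary of lists of names per language, {language: [names ...]}. The generic variables "category" and "line" (the language and the name in our case) are used for later extensibility.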
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string into plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split it into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)
Output:
['data/names/French.txt', 'data/names/Czech.txt', 'data/names/Dutch.txt', 'data/names/Polish.txt', 'data/names/Scottish.txt', 'data/names/Chinese.txt', 'data/names/English.txt', 'data/names/Italian.txt', 'data/names/Portuguese.txt', 'data/names/Japanese.txt', 'data/names/German.txt', 'data/names/Russian.txt', 'data/names/Korean.txt', 'data/names/Arabic.txt', 'data/names/Greek.txt', 'data/names/Vietnamese.txt', 'data/names/Spanish.txt', 'data/names/Irish.txt']
Slusarski
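The Unicode-to-ASCII step relies on NFD normalization, which splits an accented character into a base letter followed by a combining mark (Unicode category "Mn") that can then be dropped. The following is a minimal sketch of just that mechanism; strip_accents is a hypothetical helper name, and unlike unicodeToAscii it does not filter against all_letters.

```python
import unicodedata

def strip_accents(s):
    """Drop combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# A single accented character becomes two code points under NFD:
# the base letter and a combining accent.
decomposed = unicodedata.normalize('NFD', 'à')
print([unicodedata.name(c) for c in decomposed])
print([unicodedata.category(c) for c in decomposed])

print(strip_accents('Ślusàrski'))  # Slusarski
```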
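Now we have category_lines, a dictionary mapping each category (language) to a list of lines (names). We also keep all_categories (the list of all languages) and n_categories (the number of languages) for later use.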
print(category_lines['Italian'][:5])
Output:
['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']
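Turning Words into Tensors
Now that we have all the names loaded, we need to turn them into tensors to make any use of them.
To represent a single letter, we use a "one-hot vector" of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the letter it represents, e.g. "b" = <0 1 0 0 0 ...> (b is the second letter, so the second position, index 1 when counting from zero, holds the 1 and every other position is 0).
To represent a word, we stack these vectors into a tensor of shape <line_length x 1 x n_letters>.
The extra dimension of size 1 is the batch dimension: PyTorch assumes everything comes in batches, and here we simply use a batch size of 1.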
import torch

# Find the letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters> Tensor,
# i.e. an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]) torch.Size([5, 1, 57])
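To make the shape <line_length x 1 x n_letters> concrete without PyTorch, the same encoding can be sketched with nested Python lists. This is only an illustration; line_to_one_hot is a hypothetical name, and like the tutorial code it assumes every character of the line is present in all_letters.

```python
import string

# Same alphabet as the tutorial: 52 ASCII letters plus 5 punctuation characters
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def line_to_one_hot(line):
    """Nested lists shaped [len(line)][1][n_letters] (hypothetical helper)."""
    rows = []
    for letter in line:
        vec = [0.0] * n_letters
        vec[all_letters.find(letter)] = 1.0
        rows.append([vec])  # the extra list level is the batch dimension of size 1
    return rows

encoded = line_to_one_hot('Jones')
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)                        # (5, 1, 57)
print(encoded[0][0].index(1.0))     # 35, the index of 'J' in all_letters
```

The index 35 follows from the alphabet layout: the 26 lowercase letters come first, and "J" is the tenth uppercase letter.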
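Before autograd, creating a recurrent neural network in Torch involved cloning the parameters of a layer over several timesteps. The layers held the hidden state and gradients, which are now handled entirely by the computation graph itself. This means you can implement an RNN in a very "pure" way, just like regular feed-forward layers.
This RNN module (mostly copied from the PyTorch for Torch users tutorial (https://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html#example-2-recurrent-net)) is just two linear layers that operate on the input and the hidden state, with a LogSoftmax layer after the output.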
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        # Both layers read the concatenation of the input and the previous hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        # The hidden state starts as all zeros
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)
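Written out as equations, one step of the module above can be read off from its forward method as follows. The symbols are notation introduced here for illustration (x_t for the one-hot input at step t, h_t for the hidden state, o_t for the output, W and b for the weights and biases of the i2h and i2o layers); they are not names used in the code.

```latex
\begin{aligned}
c_t &= [\, x_t \,;\, h_{t-1} \,] \\
h_t &= W_{i2h}\, c_t + b_{i2h} \\
o_t &= \log \operatorname{softmax}\!\left( W_{i2o}\, c_t + b_{i2o} \right)
\end{aligned}
```

Note that, as the code is written, the hidden state is a purely linear function of the concatenated vector, with no nonlinearity applied to it.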
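To run one step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize as zeros at first). We get back the output (the log-probability of each language) and the next hidden state (which we keep for the next step).
input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)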
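For the sake of efficiency we don't want to create a new Tensor for every step, so we will use lineToTensor instead of letterToTensor and take slices. This could be optimized further by pre-computing batches of Tensors.
input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)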
Output:
tensor([[-2.8857, -2.9005, -2.8386, -2.9397, -2.8594, -2.8785, -2.9361, -2.8270, -2.9602, -2.8583, -2.9244, -2.9112, -2.8545, -2.8715, -2.8328, -2.8233, -2.9685, -2.9780]], grad_fn=<LogSoftmaxBackward>)
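As you can see, the output is a <1 x n_categories> Tensor, where every item is the log-likelihood of the word belonging to that category (higher means more likely).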
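Because the last layer is a LogSoftmax, the values are log-probabilities: they are all at most zero, and exponentiating them gives probabilities that sum to 1. A minimal plain-Python sketch of that relationship, using made-up scores rather than the network's actual output:

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

scores = [2.0, 1.0, 0.1]  # made-up raw scores for three categories
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)
print(probs, sum(probs))
```

This also explains why the untrained network prints values close to -2.9 for every category: with 18 roughly equally likely categories, each log-probability is near log(1/18).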
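3.1 Preparing for Training
Before going into training we should make a few helper functions.
The first interprets the output of the network, which we know to be a likelihood for each category. We can use Tensor.topk to get the index of the greatest value: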
def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))
Output:
('Arabic', 13)
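What topk does here can be sketched in plain Python: find the positions of the k largest values and return both the values and their indices, largest first. The helper top_k, the values and the category order below are made up for illustration.

```python
import heapq

def top_k(values, k=1):
    """Return (values, indices) of the k largest entries, largest first."""
    indices = heapq.nlargest(k, range(len(values)), key=values.__getitem__)
    return [values[i] for i in indices], indices

log_probs = [-2.9, -0.4, -1.7, -3.2]                # made-up log-probabilities
categories = ['Czech', 'German', 'Dutch', 'Irish']  # made-up category order

top_v, top_i = top_k(log_probs, k=2)
print([(categories[i], v) for v, i in zip(top_v, top_i)])
# [('German', -0.4), ('Dutch', -1.7)]
```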
The second is a quick way to get a training example (a name and the language it belongs to):
import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)
Output:
category = Dutch / line = Tholberg
category = Irish / line = Murphy
category = Vietnamese / line = An
category = German / line = Von essen
category = Polish / line = Kijek
category = Scottish / line = Bell
category = Czech / line = Marik
category = Korean / line = Jeong
category = Korean / line = Choe
category = Portuguese / line = Alves
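3.2 Training the Network
Now all it takes to train this network is to show it a large number of examples, have it make guesses, and tell it whether it was right or wrong.
Since the last layer of the RNN is nn.LogSoftmax, nn.NLLLoss is an appropriate loss function.
criterion = nn.NLLLoss()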
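For a single example, the negative log-likelihood loss is simply the negated log-probability that the model assigned to the correct class, which is why it pairs naturally with a LogSoftmax output. A minimal plain-Python sketch with a made-up distribution (nll_loss is a hypothetical helper, not the PyTorch implementation):

```python
import math

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one example."""
    return -log_probs[target]

# Made-up predicted distribution over three categories, stored as log-probabilities
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]

print(nll_loss(log_probs, 0))  # small loss: the model favours the right class
print(nll_loss(log_probs, 2))  # large loss: the model gave this class only 10%
```

With a batch size of 1, as used throughout this tutorial, the default averaging over the batch leaves this value unchanged.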
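Each iteration of training will: create a zeroed initial hidden state, read the letters in one at a time while keeping the hidden state for the next letter, compare the final output to the target, back-propagate, update the parameters, and return the output and the loss.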
learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    # Feed the name one letter at a time, carrying the hidden state forward
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Add the parameters' gradients to their values, multiplied by the negative learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
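The comment on learning_rate can be illustrated on a toy problem. The sketch below applies the same update rule (parameter minus learning rate times gradient) to the one-dimensional function f(w) = w**2; the function, the starting point and the three learning rates are made up for illustration and say nothing about the right value for the RNN.

```python
def sgd_steps(w, learning_rate, n_steps):
    """Minimize f(w) = w**2 with plain gradient descent; the gradient is 2*w."""
    for _ in range(n_steps):
        grad = 2.0 * w
        w = w - learning_rate * grad  # same rule as the parameter update in train()
    return w

print(sgd_steps(1.0, 0.4, 100))    # converges towards the minimum at 0
print(sgd_steps(1.0, 1.5, 100))    # too high: the magnitude doubles every step
print(sgd_steps(1.0, 1e-6, 100))   # too low: w has barely moved from 1.0
```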
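Now we just have to run that with a bunch of examples. Since the train function returns both the output and the loss, we can print its guesses and also keep track of the loss for plotting. Since there are thousands of examples, we print a sample only every print_every iterations and record the average loss every plot_every iterations.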
import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print the iteration number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add the current average loss to the list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0
Output:
5000 5% (0m 8s) 2.7792 Verdon / Scottish ✗ (English)
10000 10% (0m 16s) 2.0748 Campos / Greek ✗ (Portuguese)
15000 15% (0m 25s) 2.0458 Kuang / Vietnamese ✗ (Chinese)
20000 20% (0m 33s) 1.1703 Nghiem / Vietnamese ✓
25000 25% (0m 41s) 2.6035 Boyle / English ✗ (Scottish)
30000 30% (0m 50s) 2.2823 Mozdzierz / Dutch ✗ (Polish)
35000 35% (0m 58s) nan Lagana / Irish ✗ (Italian)
40000 40% (1m 6s) nan Simonis / Irish ✗ (Dutch)
45000 45% (1m 15s) nan Nobunaga / Irish ✗ (Japanese)
50000 50% (1m 23s) nan Ingermann / Irish ✗ (English)
55000 55% (1m 31s) nan Govorin / Irish ✗ (Russian)
60000 60% (1m 39s) nan Janson / Irish ✗ (German)
65000 65% (1m 48s) nan Tsangaris / Irish ✗ (Greek)
70000 70% (1m 56s) nan Vlasenkov / Irish ✗ (Russian)
75000 75% (2m 4s) nan Needham / Irish ✗ (English)
80000 80% (2m 12s) nan Matsoukis / Irish ✗ (Greek)
85000 85% (2m 21s) nan Koo / Irish ✗ (Chinese)
90000 90% (2m 29s) nan Novotny / Irish ✗ (Czech)
95000 95% (2m 37s) nan Dubois / Irish ✗ (French)
100000 100% (2m 45s) nan Padovano / Irish ✗ (Italian)
In this run the loss turns into nan from iteration 35000 onward and every guess collapses to Irish, which indicates that training diverged, most likely the instability that the learning_rate comment warns about.
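Since the log above contains nan losses, it can be useful to detect a non-finite loss as soon as it appears instead of training on for tens of thousands of iterations. The following is a minimal sketch of such a check; first_non_finite is a hypothetical helper, and inside the training loop the equivalent would be to stop when math.isfinite(loss) is false.

```python
import math

def first_non_finite(losses):
    """Index of the first nan/inf loss, or None if all losses are finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None

# The first six logged losses from the run above, followed by the first nan
logged = [2.7792, 2.0748, 2.0458, 1.1703, 2.6035, 2.2823, float('nan')]
print(first_non_finite(logged))  # 6
```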
3.3 Plotting the Results
Plotting the historical loss recorded in all_losses shows how the network learns:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)
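To see how well the network performs on different categories, we will create a confusion matrix that shows, for every actual language (rows), which language the network guesses (columns). To calculate the confusion matrix, a batch of samples is run through the network with evaluate(), which is the same as train() minus the backpropagation.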
# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up the plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up the axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force a label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()
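You can pick out bright spots off the main diagonal that show which languages the model guesses incorrectly, e.g. Chinese guessed as Korean and Spanish guessed as Italian. It seems to do very well on Greek and poorly on English (perhaps because English overlaps with many other languages).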
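The counting and row normalization behind the confusion matrix can be sketched in plain Python. The pairs below are made up, and confusion_matrix is a hypothetical helper; unlike the tutorial code, it skips rows with no samples rather than dividing by zero.

```python
def confusion_matrix(pairs, n_categories):
    """Row-normalized confusion matrix from (true_index, guessed_index) pairs."""
    counts = [[0.0] * n_categories for _ in range(n_categories)]
    for true_i, guess_i in pairs:
        counts[true_i][guess_i] += 1
    for row in counts:
        total = sum(row)
        if total > 0:  # leave all-zero rows untouched instead of dividing by zero
            for j in range(n_categories):
                row[j] /= total
    return counts

# Made-up (actual, guessed) index pairs for three categories
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
matrix = confusion_matrix(pairs, 3)
for row in matrix:
    print(row)
```

After normalization each row sums to 1, so the diagonal entry of a row is the fraction of that language's samples that were guessed correctly, and bright off-diagonal entries mark systematic confusions.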
Running on User Input
def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get the top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')
Output:
> Dovesky
(-0.74) Russian
(-0.77) Czech
(-3.31) English

> Jackson
(-0.80) Scottish
(-1.69) English
(-1.84) Russian

> Satoshi
(-1.16) Japanese
(-1.89) Arabic
(-1.90) Polish
The final versions of the scripts in the Practical PyTorch repo (https://github.com/spro/practical-pytorch/tree/master/char-rnn-classification) split the above code into a few files:
Run train.py to train and save the network.
Run predict.py with a name to view the predictions:
$ python predict.py Hazaki
(-0.42) Japanese
(-1.39) Polish
(-3.51) Czech
Run server.py and visit http://localhost:5533/Yourname to get the predictions as JSON output.
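Learn VBA with me. I focus on VBA here, and on teaching people how to fish. I started in 1998 and have worked with VBA from the source code up for more than 20 years; as I get older, I feel more and more that this skill should be passed on to the working professionals who need it. I hope everyone who deals with data at work will learn and use VBA: at the very least it improves your efficiency and leaves more time for your parents and family, so why not?
In this installment we continue with part 85 of the 64-bit Office API declaration statements. This material comes from Microsoft's authoritative reference; it may look dry, but it is very useful for anyone who wants to learn the API functions.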
' LB_SETCOUNT sent to non-lazy listbox.
Const ERROR_SETCOUNT_ON_BAD_LB=1433&
' This list box does not support tab stops.
Const ERROR_LB_WITHOUT_TABSTOPS=1434&
' Cannot destroy object created by another thread.
Const ERROR_DESTROY_OBJECT_OF_OTHER_THREAD=1435&
' Child windows cannot have menus.
Const ERROR_CHILD_WINDOW_MENU=1436&
' The window does not have a system menu.
Const ERROR_NO_SYSTEM_MENU=1437&
' Invalid message box style.
Const ERROR_INVALID_MSGBOX_STYLE=1438&
' Invalid system-wide (SPI_) parameter.
Const ERROR_INVALID_SPI_VALUE=1439&
' Screen already locked.
Const ERROR_SCREEN_ALREADY_LOCKED=1440&
' All handles to windows in a multiple-window position structure must
' have the same parent.
Const ERROR_HWNDS_HAVE_DIFF_PARENT=1441&
' The window is not a child window.
Const ERROR_NOT_CHILD_WINDOW=1442&
' Invalid GW_ command.
Const ERROR_INVALID_GW_COMMAND=1443&
' Invalid thread identifier.
Const ERROR_INVALID_THREAD_ID=1444&
' Cannot process a message from a window that is not a multiple document
' interface (MDI) window.
Const ERROR_NON_MDICHILD_WINDOW=1445&
' Popup menu already active.
Const ERROR_POPUP_ALREADY_ACTIVE=1446&
' The window does not have scroll bars.
Const ERROR_NO_SCROLLBARS=1447&
' Scroll bar range cannot be greater than 0x7FFF.
Const ERROR_INVALID_SCROLLBAR_RANGE=1448&
' Cannot show or remove the window in the way specified.
Const ERROR_INVALID_SHOWWIN_COMMAND=1449&
' End of WinUser error codes
' /////////////////////////////
' //                         //
' //  Eventlog Status Codes  //
' //                         //
' /////////////////////////////
' The event log file is corrupt.
Const ERROR_EVENTLOG_FILE_CORRUPT=1500&
' No event log file could be opened, so the event logging service did not start.
Const ERROR_EVENTLOG_CANT_START=1501&
' The event log file is full.
Const ERROR_LOG_FILE_FULL=1502&
' The event log file has changed between reads.
Const ERROR_EVENTLOG_FILE_CHANGED=1503&
' End of eventlog error codes
' ////////////////////////
' //                    //
' //  RPC Status Codes  //
' //                    //
' ////////////////////////
' The string binding is invalid.
Const RPC_S_INVALID_STRING_BINDING=1700&
' The binding handle is not the correct type.
Const RPC_S_WRONG_KIND_OF_BINDING=1701&
' The binding handle is invalid.
Const RPC_S_INVALID_BINDING=1702&
' The RPC protocol sequence is not supported.
Const RPC_S_PROTSEQ_NOT_SUPPORTED=1703&
' The RPC protocol sequence is invalid.
Const RPC_S_INVALID_RPC_PROTSEQ=1704&
' The string universal unique identifier (UUID) is invalid.
Const RPC_S_INVALID_STRING_UUID=1705&
' The endpoint format is invalid.
Const RPC_S_INVALID_ENDPOINT_FORMAT=1706&
' The network address is invalid.
Const RPC_S_INVALID_NET_ADDR=1707&
' No endpoint was found.
Const RPC_S_NO_ENDPOINT_FOUND=1708&
' The timeout value is invalid.
Const RPC_S_INVALID_TIMEOUT=1709&
' The object universal unique identifier (UUID) was not found.
Const RPC_S_OBJECT_NOT_FOUND=1710&
' The object universal unique identifier (UUID) has already been registered.
Const RPC_S_ALREADY_REGISTERED=1711&
' The type universal unique identifier (UUID) has already been registered.
Const RPC_S_TYPE_ALREADY_REGISTERED=1712&
' The RPC server is already listening.
Const RPC_S_ALREADY_LISTENING=1713&
' No protocol sequences have been registered.
Const RPC_S_NO_PROTSEQS_REGISTERED=1714&
' The RPC server is not listening.
Const RPC_S_NOT_LISTENING=1715&
' The manager type is unknown.
Const RPC_S_UNKNOWN_MGR_TYPE=1716&
' The interface is unknown.
Const RPC_S_UNKNOWN_IF=1717&
' There are no bindings.
Const RPC_S_NO_BINDINGS=1718&
' There are no protocol sequences.
Const RPC_S_NO_PROTSEQS=1719&
' The endpoint cannot be created.
Const RPC_S_CANT_CREATE_ENDPOINT=1720&
' Not enough resources are available to complete this operation.
Const RPC_S_OUT_OF_RESOURCES=1721&
' The RPC server is unavailable.
Const RPC_S_SERVER_UNAVAILABLE=1722&
' The RPC server is too busy to complete this operation.
Const RPC_S_SERVER_TOO_BUSY=1723&
' The network options are invalid.
Const RPC_S_INVALID_NETWORK_OPTIONS=1724&
' There is not a remote procedure call active in this thread.
Const RPC_S_NO_CALL_ACTIVE=1725&
' The remote procedure call failed.
Const RPC_S_CALL_FAILED=1726&
' The remote procedure call failed and did not execute.
Const RPC_S_CALL_FAILED_DNE=1727&
' A remote procedure call (RPC) protocol error occurred.
Const RPC_S_PROTOCOL_ERROR=1728&