These techniques are important, but AI's capabilities go far beyond them. In fact, the Turing test, AI's signature milestone, requires a machine not only to recognize patterns but also to generate its own patterns of activity, much as a human would.
This is not merely a theoretical concern. Some of the newest AI applications, such as chatbots, have struggled in practice because their interactions with people are too rigid. Interest is therefore growing in generative AI techniques, which start from a learned model of patterns and use that model to try to produce imitations of those patterns. This makes AI applications not only more flexible and more customer-friendly, but also more efficient. Generative techniques tend to focus attention quickly on the pattern features that characterize parts of the real world, and they allow the underlying datasets to be refined.
To understand how generative techniques take AI to new heights, it is important to understand their fundamentals. One classic area of focus remains among the most interesting: natural language.
Monkeys at typewriters
In the 1920s, Sir Arthur Eddington, lecturing at Cambridge, gave voice to an old and entertaining idea: set a troop of monkeys banging away at typewriters, wait long enough, and there is some probability that they will produce the complete works of Shakespeare. In other words, in any sufficiently long, randomly generated sequence of letters, you will eventually find familiar language. Of course, waiting long enough for monkeys at typewriters could mean waiting trillions of years. The probability that a truly random string of letters produces familiar language is just too small.
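To put a number on just how small, here is a quick sketch (my own illustration, not part of the original tutorial) of the odds that a uniformly random typist working with 26 letters and a space bar reproduces a given phrase:

#Assumed alphabet: 26 lowercase letters plus the space character
ALPHABET_SIZE = 27

def random_phrase_probability(phrase):
    '''Probability that a uniformly random string of the same length as
    the phrase matches it exactly.'''
    return (1 / ALPHABET_SIZE) ** len(phrase)

phrase = 'to be or not to be'
p = random_phrase_probability(phrase)
#The expected number of random attempts before a first match is roughly 1/p
print(f'P(exact match) = {p:.3e}, expected attempts ~ {1 / p:.3e}')

Even a short phrase like this one calls for an astronomical number of attempts. The listing below takes a first step toward doing better than random typing: it counts how often each character actually appears in a real body of text.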
import sys
import pprint

def count_chars(input_fp, frequencies, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each character appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each character to its counted frequency
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text
    text = input_fp.read(buffer_size)
    #Loop over the file while there is text to read
    while text:
        for c in text:
            #Accommodate the character if seen for the first time
            frequencies.setdefault(c, 0)
            #Increment the count for the present character
            frequencies[c] += 1
        #Read the next chunk of text
        text = input_fp.read(buffer_size)

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = {}
    #Pull the input data from the console
    count_chars(sys.stdin, frequencies)
    #Display the resulting frequencies in readable format
    pprint.pprint(frequencies)
I fed it the text of "Smart data, Part 1," a developerWorks tutorial I published in 2017. On Linux, I ran it by piping the tutorial's text into the script on standard input.
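The script is written to read from standard input, but the function itself can just as easily be exercised from Python. The following is a minimal sketch (the sample string and in-memory file object are my own illustration, assuming count_chars from the listing above is in scope):

import io
import pprint

#Hypothetical sample text standing in for the tutorial's input
sample = "A few monkeys at a few typewriters."

frequencies = {}
#io.StringIO gives count_chars a file-like object to read from
count_chars(io.StringIO(sample), frequencies)
pprint.pprint(frequencies)

Counting raw characters this way treats uppercase and lowercase letters as distinct and includes punctuation and digits in the tally. The next listing lowercases the text as it reads and restricts the count to a fixed allowlist of characters.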
import sys
import pprint

#Characters to be included in the stats: all lowercase letters and space
ALLOWLIST_CHARACTERS = 'abcdefghijklmnopqrstuvwxyz '

def count_chars(input_fp, frequencies, allowlist=ALLOWLIST_CHARACTERS, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each character appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each character to its counted frequency
        allowlist -- string containing all characters to be included in the stats;
            defaults to all lowercase letters and space
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        for c in text:
            if c in allowlist:
                #Accommodate the character if seen for the first time
                frequencies.setdefault(c, 0)
                #Increment the count for the present character
                frequencies[c] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = {}
    #Pull the input data from the console
    count_chars(sys.stdin, frequencies)
    #Display the resulting frequencies in readable format
    pprint.pprint(frequencies)
import sys
from collections import Counter
import pprint

from nltk.util import bigrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer which only captures lowercase letters and spaces
#This requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z ]")

def count_bigrams(input_fp, frequencies, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each bigram (sequence of two characters) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each bigram to its counted frequency
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin:end] for (begin, end) in spans)
        for bigram in bigrams(tokens):
            #Increment the count for the bigram. Automatically handles any
            #bigram not seen before. The join expression turns 2 separate
            #single-character strings into one 2-character string
            frequencies[''.join(bigram)] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = Counter()
    #Pull the input data from the console
    count_bigrams(sys.stdin, frequencies)
    #Uncomment the following line to display all the resulting frequencies
    #in readable format
    #pprint.pprint(frequencies)
    #Print just the 20 most common bigrams and their frequencies
    #in readable format
    pprint.pprint(frequencies.most_common(20))
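Compared with the character-counting listings, one design change here is the switch from a plain dict plus setdefault to collections.Counter, whose missing keys act as zero; that is why the increment needs no initialization step. A tiny sketch of the difference (my own illustration):

from collections import Counter

plain = {}
counted = Counter()

#A plain dict needs the key initialized before it can be incremented
plain.setdefault('th', 0)
plain['th'] += 1

#A Counter treats an unseen key as 0, so the increment just works
counted['th'] += 1

print(plain, counted.most_common(1))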
A general sequence of N letters is called an N-gram, and it calls for an N-dimensional, or order-N, matrix. Stored in full, such a matrix needs 27^N cells, so the requirement grows exponentially with N. An order-4 matrix needs 531,441 cells for full storage. At 32 bits per cell, enough to count up to roughly 4 billion, that comes to about 2 MB in total. Of course, storage needs can be reduced dramatically with sparse-matrix techniques, which is what all the code in this tutorial does.
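The arithmetic is easy to verify, and so is the gap between dense and sparse storage; here is a small sketch (alphabet size and cell width are the assumptions stated above):

ALPHABET_SIZE = 27   #26 lowercase letters plus the space character
CELL_BYTES = 4       #32-bit counters

for order in range(1, 6):
    cells = ALPHABET_SIZE ** order
    megabytes = cells * CELL_BYTES / 2**20
    print(f'order {order}: {cells:>10,} cells, {megabytes:8.2f} MB if stored densely')

#A Counter acts as a sparse matrix: it stores only the N-grams that
#actually occur in the text, a tiny fraction of the 27**order possibilities.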
import sys
from collections import Counter
import pprint

from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer which only captures lowercase letters and spaces
#This requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z ]")

def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each N-gram (sequence of N characters) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each N-gram to its counted frequency
        order -- the N in N-gram, i.e. the number of characters per sequence
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin:end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            #Increment the count for the N-gram. Automatically handles any
            #N-gram not seen before. The join expression turns N separate
            #single-character strings into one N-character string
            frequencies[''.join(ngram)] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = Counter()
    #The order of the N-grams is the first command line argument
    ngram_order = int(sys.argv[1])
    #Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    #Uncomment the following line to display all the resulting frequencies
    #in readable format
    #pprint.pprint(frequencies)
    #Print just the 20 most common N-grams and their frequencies
    #in readable format
    pprint.pprint(frequencies.most_common(20))
If you consider only the highest-order N-grams, their connection to the properties of the input text can become especially brittle. It turns out that when you try to stitch language together to generate familiar-looking text, the value of the statistics gets diluted. This is actually a general problem in AI: our growing ability to store and analyze ever more complex models is a double-edged sword. When the model amounts to hunting for a few seedlings in an enormous field of weeds, it becomes harder and harder to generate responses that are useful across a variety of situations.
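One rough way to see the dilution is to measure what fraction of distinct N-grams occur only once as the order grows. This sketch is my own illustration and assumes count_ngrams from the listing above is in scope:

import io
from collections import Counter

sample = ('the quick brown fox jumps over the lazy dog and the quick brown '
          'fox does it again because the dog is still lazy')

for order in (2, 4, 6, 8):
    frequencies = Counter()
    count_ngrams(io.StringIO(sample), frequencies, order)
    singletons = sum(1 for count in frequencies.values() if count == 1)
    print(f'order {order}: {len(frequencies)} distinct N-grams, '
          f'{singletons} seen only once')

In text of any realistic size, the share of singletons climbs with the order, which is exactly the thinning of statistical support described above.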
One thing that helps in these situations relies on another advantage of the modern world. Building models from larger, more diverse corpora of natural language tends to smooth out some of the distorted readings of the statistics while providing material rich enough to support generative AI techniques. The same holds in other areas of AI, where access to higher-quality training data is just as critical. For more about the data behind my tutorials, see "Smart data, Part 1: Focus on the data to get the most out of artificial intelligence, machine learning, and cognitive computing."