histcat

A Speech about ChatGPT

Source

Originally, I was supposed to give this speech in class, but it got cancelled because of final exams QAQ
So I'm leaving it here as a memento (

Main Text

Some say that 2023 was the year of artificial intelligence. That is because OpenAI publicly released ChatGPT on November 30, 2022, and ever since, people have been able to experience first-hand, through conversation with ChatGPT, the convenience that artificial intelligence brings us. Now that it is 2024, I believe everyone has some understanding of ChatGPT, whether from online media reports, from Chinese essays, or from English readings. But would you like to understand it more deeply? Today, let's start from ChatGPT itself, look at the principles behind it, and go further to explore the basic workings of the neural networks underneath.

Let's start with the name. ChatGPT stands for Chat Generative Pre-trained Transformer. "Generative" refers to the fact that ChatGPT is a bot used to generate new text. "Pre-trained" means the model has already gone through a learning process on a large amount of data. The key term is the last word: Transformer. What is that? It is a particular kind of deep learning model, a neural network, proposed in the paper "Attention Is All You Need" published by Google in 2017. ChatGPT is built on a modified version of the Transformer model.

So what does deep learning mean? Deep learning is a type of machine learning, and machine learning uses a data-driven approach: feedback from data adjusts the model's parameters and thereby guides the model's behavior. This might be a bit hard to grasp at first. Here, "model" means a function in the broad sense, that is, a mapping; "neural network" refers to the internal implementation of that function, which I will introduce shortly. Back to machine learning: consider, for example, a function f(x) that labels images, where the input x is an image and the output f(x) is its label. The core idea of machine learning is not to hard-code any behavior, but to construct a flexible function with adjustable parameters, and then use a large number of examples to let the machine tune those parameters until it mimics the desired behavior. It is somewhat like the method of undetermined coefficients in mathematics.
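The idea of "adjusting parameters to fit examples" can be shown in a few lines. Here is a minimal sketch (the data and the rule y = 2x are made up for illustration): we fit a one-parameter function f(x) = w·x by repeatedly nudging w in the direction that reduces the error, which is exactly the feedback loop described above.

```python
# A minimal sketch of "machine learning as parameter adjustment":
# fit f(x) = w * x to example pairs by nudging w to reduce the squared error.

examples = [(1, 2), (2, 4), (3, 6)]  # hypothetical (input, label) pairs; true rule: y = 2x

w = 0.0    # initial guess for the adjustable parameter
lr = 0.05  # learning rate: how big each adjustment step is

for step in range(200):
    for x, y in examples:
        error = w * x - y        # how "bad" the current prediction is
        w -= lr * 2 * error * x  # step opposite the gradient of the squared error

print(round(w, 2))  # → 2.0: the machine has "learned" the rule y = 2x
```

Nothing in the code says "multiply by 2"; that behavior emerges purely from the examples, which is the whole point of the data-driven approach.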

With those terms in hand, let's take a quick look at the principle behind ChatGPT. Summed up in one sentence: based on all of the preceding text, predict the most probable next word, and repeat this to generate text. It is somewhat like a search engine's autocomplete: every time we type a word, the input box starts predicting the text that follows, with higher-probability continuations ranked higher. But how does the model determine the probability of each word?
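Before answering that, the "predict the most probable next word" step itself can be made concrete with a toy model. The counts below are invented for illustration (real models like ChatGPT compute these probabilities with a Transformer, not a lookup table), but the final step is the same: turn scores into a probability for each candidate word, then pick or sample from that distribution.

```python
# Toy next-word prediction: hypothetical counts of which word followed
# the context "the cat" in some imaginary training text.
next_word_counts = {"sat": 6, "ran": 3, "meowed": 1}

# Normalize counts into a probability distribution over candidate words.
total = sum(next_word_counts.values())
probs = {word: count / total for word, count in next_word_counts.items()}

# Greedy generation picks the highest-probability word.
best = max(probs, key=probs.get)
print(best, probs[best])  # → sat 0.6
```

Real systems often sample from the distribution instead of always taking the top word, which is why the same prompt can produce different answers.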

This is where the Transformer architecture mentioned above comes in. As noted, the Transformer is ultimately a function: taking f(x) as an analogy, its input x is all of the preceding text, and its output is a probability for each word. We all know that functions have parameters; a quadratic function f(x) = ax^2 + bx + c, for instance, has the parameters a, b, and c, so its parameter count is 3. Naturally we want to ask: how many parameters does the modified Transformer behind ChatGPT have? Take a guess. The answer is 175 billion.

Why are there so many parameters, and how do we get the machine to adjust them? Next, let's use a simpler example, digit recognition, to understand the neural network, which is the internal implementation of the function.

First, we must acknowledge a fact: recognizing handwritten digits is actually very hard for computers. Your 3 and my 3 may look very different. To solve this problem, scientists invented a remarkable thing called the neural network.

As the name suggests, neural networks are inspired by the structure of the human brain. Their most important components are the neurons themselves and the way they are connected. A neuron can be understood as a container holding a number, an activation value between 0 and 1. For the task of recognizing digits, each input neuron corresponds to the brightness of one pixel in the image. Taking a 28x28-pixel image as an example, there are 784 input neurons, and they form the first layer of the network. Jumping to the last layer, there are ten neurons representing the digits 0-9, with activation values also between 0 and 1; each value represents how strongly the system believes the input image is that digit. Between them sit the hidden layers, which do the actual work of processing and recognizing the digit.

When the neural network runs, the activation values of one layer determine the activation values of the next. So the core of the neural network is how the next layer's activations are computed from the previous layer's. This design deliberately mimics the biological nervous system, where the firing of certain neurons triggers the firing of others.

The first layer's activations determine the second layer's, and so on, up to the output layer, where the brightest neuron is the network's answer. So why do we layer the network at all? We hope each layer learns to recognize certain features... How one layer's activations produce the next layer's depends on the "parameters." For example, suppose the 784 neurons of the first layer have activations a1, a2, a3, ..., a784, and the second layer has 16 neurons b1, ..., b16. To compute each bi from the first layer we need 784 weights plus a bias term: for instance b1 = w1·a1 + w2·a2 + ... + w784·a784 + bias, with the result then squashed into the 0-1 range by an activation function. Initially all these values are random, and we should not expect the computer to recognize anything. What we do is supply a large number of handwritten digit images together with their correct labels, "feed" them to the computer, measure how "bad" its performance is (this is the cost function), and compute how the parameters should change (the negative gradient of the cost function). Round after round of such adjustments tunes the parameters, and the network can then handle images it has never seen.
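The forward pass just described can be sketched directly. This is a simplified illustration, not a trained model: the weights below are random, so its "guess" is meaningless until training adjusts them by gradient descent as described above. The layer sizes (784 → 16 → 10) match the digit-recognition example; the sigmoid is one common choice of activation function.

```python
# One forward pass through the digit-recognition network from the text:
# 784 input neurons -> 16 hidden neurons -> 10 output neurons.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # squashes any number into the (0, 1) range used for activation values
    return 1.0 / (1.0 + np.exp(-z))

# Each layer is a weight matrix W and a bias vector b; a neuron's activation
# is sigmoid(w1*a1 + w2*a2 + ... + bias), the weighted sum from the text.
W1, b1 = rng.normal(size=(16, 784)), rng.normal(size=16)
W2, b2 = rng.normal(size=(10, 16)), rng.normal(size=10)

pixels = rng.random(784)            # stand-in for a 28x28 grayscale image

hidden = sigmoid(W1 @ pixels + b1)  # first layer's activations determine the second's
output = sigmoid(W2 @ hidden + b2)  # ten "belief" values, one per digit 0-9

print(int(np.argmax(output)))       # the brightest output neuron = the network's guess
```

Counting the adjustable values here (784·16 + 16 weights and biases into the hidden layer, plus 16·10 + 10 into the output layer) already gives 12,730 parameters for this tiny network, which gives a sense of how a model the size of ChatGPT reaches 175 billion.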

In fact, isn't this similar to our own daily learning? Evaluate your performance promptly, find the most "efficient" ways to improve, and combine them with plenty of practice, and you will surely learn better. (To be supplemented later QAQ)
