Build and train an open-source ChatGPT locally

This is a conclusion of the last five blog posts, showing the key steps of building and optimizing a chatbot backed by an open-source large language model.

Two months later, Meta published their own large language model, LLaMA[1], ranging from 7B to 65B parameters. In the paper[2], they claim that LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and that LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. While Meta claims that LLaMA is open source, researchers still have to apply and be reviewed before getting access. What I never expected, however, was that the LLaMA model files would be LEAKED: within just a few days of the release, members of 4chan posted a copy of the weight files for everyone to download.

Although there are already apps like the Discord bot[3] out there, it is still a good opportunity to build our own ChatGPT from scratch.

Interface

Vue.js (modified from run27017/vue-chat) is used for the front end, Flask for the back end, and axios for the HTTP requests. This is the interface at this stage. See build a chatbot backended by Meta LLaMA language model for the source code and more details.

Fig 1. Interface
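
Under the hood, the front end sends axios requests to a small Flask API. Here is a minimal sketch of what such a back end can look like, assuming a hypothetical /api/chat route and JSON payload shape (the project's actual routes and model wrapper are in the linked repository):

```python
# Minimal sketch of the Flask back end; the /api/chat route, payload shape,
# and generate_reply() stub are assumptions, not the project's actual code.
from flask import Flask, request, jsonify

app = Flask(__name__)

# chat histories keyed by client IP, mirroring the "per IP address" behaviour
histories = {}

def generate_reply(history, message, model_name):
    """Placeholder for the call into the LLaMA-backed generator."""
    return f"[{model_name}] echo: {message}"

@app.route("/api/chat", methods=["POST"])
def chat():
    data = request.get_json()
    history = histories.setdefault(request.remote_addr, [])
    reply = generate_reply(history, data["message"], data.get("model", "llama-7b"))
    history.append({"user": data["message"], "bot": reply})
    return jsonify({"reply": reply, "history": history})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

The Vue front end can then POST the current message and the selected model to this route with axios and render the returned reply.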

Current functions:

  • select different models
  • multi-line input and output
  • save chat histories per IP address

TO DO

  • remove the per-IP-address session limitation
  • render message boxes in Markdown format
  • add a configuration panel allowing users to change model parameters

Language model

For the back-end large language model, we have trained and tried three stages and five versions of optimizations.

  1. Open-source basic model
  2. Stanford Alpaca: following their work to build a secondary development process
  3. Personally trained Alpaca using collected data

The sections below will follow this outline and illustrate the step-by-step optimization results.

basic llama

original llama

The original LLaMA model (facebookresearch/llama) only performs next-word prediction (see the sketch after this list), meaning:

  • it only generates a natural continuation of the prompt
  • it does not know where to stop
  • it is not designed to answer questions.
  • it does not support in-context chatting (no chat histories).
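
To make this concrete, here is a minimal greedy next-token loop, written against the Hugging Face transformers API as a stand-in (the facebookresearch/llama repo ships its own generation code, and the checkpoint path below is only illustrative). Nothing in the loop tells the model what a finished answer looks like, so it simply keeps continuing the prompt until max_gen_len:

```python
# Sketch of plain next-word prediction with a converted LLaMA checkpoint
# (the path is illustrative); this stands in for the original repo's
# generation code, not the project's actual inference script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf",
                                             torch_dtype=torch.float16)

prompt = "What is the capital of France?"
ids = tokenizer(prompt, return_tensors="pt").input_ids

max_gen_len = 64
for _ in range(max_gen_len):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()            # greedy: most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    # no stop condition: the model just continues the text to max_gen_len

print(tokenizer.decode(ids[0]))
```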

Here are three typical answers from the original 7B version:

Fig 2. Typical answer of version 1, example 1

Fig 3. Typical answer of version 1, example 2

Fig 4. Typical answer of version 1, example 3

  • In Fig 2, we can see that the reasonable result (first line) is always followed by nonsense babbling that drifts further and further from the topic.

  • In the next two conversations, we use few-shot learning [1][2] to give it a longer prompt with examples.

    • In Fig 3, after the correct answer (first line), it tends to continue the given pattern (second line), then keeps babbling.

    • In Fig 4, where we provide a relatively longer example, it continues the pattern until the answer reaches max_gen_len.

optimized llama

To address the problems, we designed a prompting strategy and a corresponding truncation method.
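
The exact template and truncation rule live in the repository; conceptually, the idea looks something like the sketch below (the role tags and the stop marker are illustrative assumptions, not the literal strings we use): pack the chat history into a consultant-style prompt, and cut the generated text off as soon as the model starts inventing the next "User" turn.

```python
# Conceptual sketch only: role tags and the stop marker are assumptions,
# not the exact strings used in the project.
STOP_MARKER = "\nUser:"  # the model starting a fake "User" turn means the answer is done

def build_prompt(history, question):
    lines = ["A conversation between a user and a helpful consultant."]
    for user_msg, bot_msg in history:
        lines.append(f"User: {user_msg}")
        lines.append(f"Consultant: {bot_msg}")
    lines.append(f"User: {question}")
    lines.append("Consultant:")
    return "\n".join(lines)

def truncate(generated_text):
    # keep only the consultant's answer and drop any babbling after the marker
    return generated_text.split(STOP_MARKER)[0].strip()
```

This gives the model a stopping point and, because the history is replayed in every prompt, a simple form of in-context memory.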

The typical results of the optimized 7B llama model are shown below:

Fig 5. Typical answer of version 2, example 1: a conversation about the Python sorting method

Fig 6. Typical answer of version 2, example 2: a conversation about vehicle sealing strip design

From these examples, we can see it behaves much more like a consultant chatbot:

  • it stops generating garbage messages
  • it knows where to stop
  • it is aware of the chat history

As for the quality of its answers:

  • In Fig 5, both the bubble sort script and the Python built-in sorting method work.
  • In Fig 6, a longer conversation, it tries to give me some interesting information, but it is still not very smart. At least, by the end of the conversation I learned that PTFE is actually the same thing as Either...

Stanford alpaca

To further improve this model, the Stanford Alpaca project[3] provides one approach based on a self-instruct training strategy. They also shared a 52K instruction dataset, making their work easy to reproduce. Here is one of the results from the reproduced 7B Stanford Alpaca:

Fig 7. Typical answer of version 3: a conversation about the Python sort() function

Note that we did not apply any post-processing to its answers, which makes responses up to 10 times faster and leaves greater potential for further training.

In terms of the result, it looks good for the first two rounds: the model knows from the context that "this function" means Python's built-in sort function. But in the third round, the model starts to generate repetitive words.

This is because the model has not been trained on conversations before; more specific datasets are needed to improve its conversational ability.

Most importantly, with this framework we are able to train any model with any dataset.

This means we can improve any ability of the model by feeding it the corresponding training dataset.
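
As an illustration, a rough sketch of that training step with the Hugging Face Trainer is shown below. The cleaned 52K dataset (yahma/alpaca-cleaned, listed in the references) stands in for "any dataset"; the prompt template is the simplified no-input variant of the Alpaca template, and the checkpoint path and hyperparameters are placeholders rather than the values actually used.

```python
# Rough sketch of Alpaca-style fine-tuning with Hugging Face; the checkpoint
# path and hyperparameters are placeholders -- see the Stanford Alpaca repo
# for the full training script (which also handles the with-input template).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")

def tokenize(example):
    # format one instruction/response pair into a single training string
    text = TEMPLATE.format(instruction=example["instruction"],
                           output=example["output"])
    return tokenizer(text + tokenizer.eos_token, truncation=True, max_length=512)

dataset = load_dataset("yahma/alpaca-cleaned", split="train").map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alpaca-7b-reproduced",
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```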

Further trained alpaca

At this stage, since we have built the training and deployment process, we optimized the model in two ways: training it with more data, then scaling it up to more parameters.

Trained with more data

Thanks to Alpaca-CoT's detailed dataset collection, we decided to improve the model's reasoning (CoT), coding, dialogue, and Chinese performance, so we fine-tuned the model on the corresponding datasets from that collection.
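
Concretely, improving several abilities at once just means concatenating the corresponding instruction datasets before fine-tuning. A minimal sketch is below; the file names are placeholders standing in for files taken from the Alpaca-CoT collection:

```python
# Illustration only: the file names are placeholders for instruction files
# taken from the Alpaca-CoT collection, one per target ability; all files are
# assumed to share the alpaca-style instruction/input/output schema.
from datasets import load_dataset, concatenate_datasets

sources = [
    "data/cot_instructions.json",       # reasoning (CoT)
    "data/code_instructions.json",      # coding
    "data/dialogue_instructions.json",  # multi-turn dialogue
    "data/chinese_instructions.json",   # Chinese
]

parts = [load_dataset("json", data_files=path, split="train") for path in sources]
train_set = concatenate_datasets(parts).shuffle(seed=42)
print(f"combined training set: {len(train_set)} examples")
```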

This section presents performance comparisons between the reproduced Stanford Alpaca 7B and our self-fine-tuned 7B model. The results show that the fine-tuned model improves significantly on reasoning (CoT), coding, dialogue, and Chinese abilities.

In other words, with a corresponding dataset, we can achieve better performance on any field or task.

Scale up model

Then we scaled the model up from 7B to 13B and 30B parameters.

A brief summary of the training results is shown below:

Fig 8. W&B training curves (exported 4/13/2023)
|            | 7B          | 13B         | 30B          |
| ---------- | ----------- | ----------- | ------------ |
| Duration   | 10h 37m 26s | 17h 49m 37s | 1d 3h 23m 7s |
| Train loss | 0.8531      | 0.7999      | 0.7159       |
| Eval loss  | 0.8535      | 0.7642      | 0.7118       |
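
For evaluation, each size is loaded the same way; the only changes are the checkpoint and how the weights are spread over GPU memory. A minimal sketch follows (the checkpoint path is a placeholder, and device_map="auto" requires the accelerate package):

```python
# Sketch of loading a scaled-up checkpoint for evaluation; the path is a
# placeholder, and device_map="auto" (via accelerate) shards the 30B weights
# across the available GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/alpaca-30b"  # swap in the 7B or 13B checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt,
                                             torch_dtype=torch.float16,
                                             device_map="auto")

prompt = "### Instruction:\nExplain how bubble sort works.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```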

A detailed evaluation of these models across a wide range of tasks, compared with ChatGPT's answers, is shown here.

The results show that our language model performs well after scaling up to 30B parameters. The 30B version is better than the 7B version, but it is still not as good as ChatGPT, especially on text-generation and safety tasks.

Possible applications

Refer to this blog.

References

Icons and images: Flaticon and Midjourney.

Original front end: run27017/vue-chat.

LLaMA GitHub: facebookresearch/llama.

Alpaca dataset: yahma/alpaca-cleaned.

  1. LLaMA: Open and Efficient Foundation Language Models ↩︎
  2. Unofficial Llama Discord 😁 · Issue #158 · facebookresearch/llama · GitHub ↩︎
  3. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. ↩︎
