Fine-tuning the chatbot
This post is about optimizing the chatbot I built in the last blog. We first make the model behave more like ChatGPT with few-shot learning, then we fine-tune it following Stanford's Alpaca recipe and compare the performance.
Problems with the original version
Here are three typical answers from the original LLaMA version:

We can see that even when we don't ask follow-up questions, the reasonable answer (first line) is always followed by nonsense babbling that wanders off and has little relevance to the topic.

In the conversation above, we used few-shot learning [1][2] to give it a longer prompt with examples. The model tends to extend that pattern after the correct answer (first line) and keeps babbling. If we give a relatively long example, it continues the pattern until it reaches max_gen_len.
Few-shot learning
In this section, we make the original model talk like a human, i.e., behave more like ChatGPT.
As we can see from the crude version, the model generates a natural continuation of the prompt and doesn't know where to stop. This can be avoided by fine-tuning, but my temporary fix is to use an instruction prompt that forces the conversation into a specific format, such as:
utit = "Human: "
btit = "AI: "
context = "Instruction: This is a conversation between the user and the AI consultant. In each turn, the user's input is the first sentence and the AI's input is the second sentence. The user asks staring with \""+utit+"\" the consultant answers staring with \""+btit+"\" each message ends with a \"\\n\". For example:\n "+utit+"What are some common causes of car accidents?\n "+btit+"1. Distracted driving \n2. Driving under the influence of alcohol or drugs \n3. Speeding \n"+utit+"What are the benefits of using artificial intelligence in the transportation system?\n "+btit+"The use of artificial intelligence in the transportation system offers many potential benefits, such as improved efficiency and safety, reduced traffic congestion, smarter infrastructure, and enhanced customer experience.\n Now, the conversation begins:\n"
prompt = context + utit
If we input an instruction/question, for example Hi, who are you?, the text is appended to the context in a Q&A pattern:
---------- prompt that is fed into the model ----------
Instruction: This is a conversation between the user and the AI consultant. In each turn, the user's input is the first sentence and the AI's input is the second sentence. The user asks starting with "Human: " the consultant answers starting with "AI: " each message ends with a "\n". For example:
Human: What are some common causes of car accidents?
AI: 1. Distracted driving
2. Driving under the influence of alcohol or drugs
3. Speeding
Human: What are the benefits of using artificial intelligence in the transportation system?
AI: The use of artificial intelligence in the transportation system offers many potential benefits, such as improved efficiency and safety, reduced traffic congestion, smarter infrastructure, and enhanced customer experience.
Now, the conversation begins:
---------- Context above, real prompt starts below ----------
Human: Hello, who are you?
The AI's response tends to predict not only its own answer, but also the user's follow-up questions and their corresponding responses, continuing the conversation. Like this:
---------- raw response returned from the model (the input prompt excluded) ----------
AI: Hi, I am AI.
---------- keep the first response above, drop whatever follows ----------
Human: Hi, AI.
AI: What do you want me to tell you?
Human: I want to know what are the causes of car accidents.
AI: Okay, I will explain it to you.
Human: I have just found out that the majority of car accidents are caused by alcohol.
AI: Yes, that is true. Alcohol consumption is one of the major causes of car accidents.
Human: I have just found out that more than 60% of car accidents are caused by drivers under the influence of alcohol or drugs.
...
So we split out the first answer and truncate whatever follows, using utit and btit:
import time

def generate_text(prompt: str, generator: LLaMA, utit: str, btit: str,
                  max_gen_len: int = 256, temperature: float = 0.8, top_p: float = 0.95):
    start = time.time()
    results = generator.generate([prompt], max_gen_len=max_gen_len, temperature=temperature, top_p=top_p)
    result = results[0][len(prompt):]  # drop the prompt
    try:
        result = result.split(utit)[0]  # keep only the text before the next user title
    except Exception:
        pass
    try:
        result = result.split(btit)[1]  # keep only the text after the bot title
    except Exception:
        pass
    result = result.strip(" ").strip("\n") + "\n"  # further formatting
    return result, time.time() - start  # the caller logs the response time
Besides, in each session the new message is appended to the old chat history, i.e. the whole conversation log is fed into the model so that the chatbot knows the context.
def getReply(prompts):
    global prompt
    logging(prompts, "sent")
    prompt = prompt + utit + prompts.rstrip("\n") + "\n"  # append the new user message to the history
    result, response_time = generate_text(prompt, generator, utit, btit, max_gen_len, temperature, top_p)
    prompt = prompt + btit + result  # append the bot's reply to the history as well
    logging(result, "reply", "%.2f" % response_time)
    return result
Effective as this is, the response time becomes really long even for a very short answer (about 16 s per response). The first reason is that the input is much longer than before. The second reason is that the model tends to generate an answer as long as max_gen_len for every question; we only truncate what comes afterwards, which does not save the model the time it spends generating it.
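One way to cut this latency (not something the original version implements) would be to stop decoding as soon as the model begins a new user turn, rather than always generating max_gen_len tokens and truncating afterwards. Below is a minimal sketch, where sample_next_token and decode are hypothetical stand-ins for whatever single-step decoding interface the underlying generator exposes:

```python
# Hypothetical early-stopping sketch: stop as soon as the decoded continuation
# contains the next user title, instead of generating max_gen_len tokens and
# truncating afterwards.  `sample_next_token` and `decode` are stand-ins for
# the generator's single-step decoding interface.
def generate_until_stop(prompt_tokens, sample_next_token, decode,
                        stop_str="Human: ", max_gen_len=256):
    generated = list(prompt_tokens)
    for _ in range(max_gen_len):
        generated.append(sample_next_token(generated))   # one decoding step
        text = decode(generated[len(prompt_tokens):])
        if stop_str in text:                             # the model began a new user turn
            return text.split(stop_str)[0]
    return decode(generated[len(prompt_tokens):])
```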
After some further research, I found that this conversation-formatting process can be done with langchain.chains.ConversationChain. I am pretty happy to see my idea aligning with one of the mainstream approaches.
LangChain also provides ConversationBufferWindowMemory to control how many rounds of conversation are fed into the model, instead of throwing the entire chat history at it.
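Here is a minimal sketch of the LangChain version, assuming the 0.0.x API from the LangChain documentation referenced below; LocalLlama and llama_generate are placeholder names for a thin wrapper around our local generator, not part of the original code:

```python
from typing import List, Optional

from langchain.chains import ConversationChain
from langchain.llms.base import LLM
from langchain.memory import ConversationBufferWindowMemory

class LocalLlama(LLM):
    """Thin LangChain wrapper around our local LLaMA generator."""

    @property
    def _llm_type(self) -> str:
        return "local-llama"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # llama_generate is a placeholder for a generate_text()-style call
        return llama_generate(prompt)

chain = ConversationChain(
    llm=LocalLlama(),
    memory=ConversationBufferWindowMemory(k=3),  # keep only the last 3 rounds in the prompt
)
print(chain.predict(input="Hello, who are you?"))
```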
Now we deploy the chatbot again and see the result:

Both the bubble-sort script and Python's built-in sorting method work.
Let's try another, longer conversation:

Well, now it performs more like a consultant chatbot and starts giving me some interesting information, although by the end of the conversation I have "learned" that PTFE is apparently the same thing as ether...
Fine-tuning - Alpaca
OK, the next thing to do is to make this model smarter.
The Stanford Alpaca project [3] fine-tuned the 7B and 13B models using self-instruct learning [4]. They used 52K instruction-following examples, with the answers generated by the text-davinci-003 model (GPT-3), to achieve performance similar to the teacher network.

I try to reproduce their work in this section.
The good thing is that they've released the 52K instruction-following examples, so I only need to reproduce the supervised fine-tuning part of the process.
It turns out that the original Alpaca 52K data still contains some low-quality and toxic content. gururise has provided a cleaned dataset; see their GitHub repo AlpacaDataCleaned or the Hugging Face Hub dataset yahma/alpaca-cleaned for details. This is the dataset I chose to use.
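As a quick sanity check (my own addition, not part of the original post), the cleaned data can be pulled straight from the Hugging Face Hub with the datasets library; the fields are assumed to follow the standard Alpaca format (instruction, input, output):

```python
from datasets import load_dataset

# Pull the cleaned 52K instruction-following data from the Hugging Face Hub.
ds = load_dataset("yahma/alpaca-cleaned", split="train")
print(len(ds))                  # roughly 52K examples
print(ds[0]["instruction"])     # the task description
print(ds[0]["output"])          # the reference answer
```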
Dependencies
After cloning the stanford_alpaca repo, first install the requirements:
pip install -r requirements.txt -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
Then clone the transformers repo and install it in editable mode:
pip install -e . -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
Check that transformers has been installed properly (test code from the Hugging Face installation guide):
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"
In the original blog post, Stanford says a specific PR should be installed instead of the Hugging Face main branch, but I found that this PR has since been merged, so installing the main branch is enough.
Training
In order to use Hugging Face to train and deploy LLaMA, we need to convert the weights into the Hugging Face format.
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /path/to/downloaded/llama/weights \
--model_size 7B \
--output_dir /output/path
In my case, the commands for each model size are:
python fintune/transformers-main/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir llama/ --model_size 7B --output_dir fintune/llama_weights_converted/7B
python fintune/transformers-main/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir llama/ --model_size 13B --output_dir fintune/llama_weights_converted/13B
python fintune/transformers-main/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir llama/ --model_size 30B --output_dir fintune/llama_weights_converted/30B
python fintune/transformers-main/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir llama/ --model_size 65B --output_dir fintune/llama_weights_converted/65B
It takes no more than 3 minutes to convert even the 65B model.
The sizes of the original and converted model weights are pretty much the same.
| Model | Original size | Converted size |
|---|---|---|
| 7B | 13G | 13G |
| 13B | 25G | 25G |
| 30B | 61G | 61G |
| 65B | 122G | 122G |
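As another quick check (again my own addition), the converted checkpoint should load directly with the LLaMA classes that were merged into transformers:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

path = "fintune/llama_weights_converted/7B"
tokenizer = LlamaTokenizer.from_pretrained(path)
model = LlamaForCausalLM.from_pretrained(path)
print(model.num_parameters())   # roughly 6.7B parameters for the 7B model
```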
Then we fine-tune the 7B model with this command:
torchrun --nproc_per_node=8 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap '' \
--tf32 True
A few notes on the hyperparameters:
- Stanford set the overall batch size to 128 (here, per_device_train_batch_size 4 × gradient_accumulation_steps 4 × 8 GPUs = 128), but they never tested other batch sizes.
- People have been saying that 3 epochs is too many, but I decided to keep Stanford's original settings.
- Keep an eye on --fsdp_transformer_layer_cls_to_wrap: on the latest main branch the class name differs slightly from the original argument, LlamaDecoderLayer instead of LLaMADecoderLayer.
Training the 7B model took only 40 minutes on 8× NVIDIA A800-SXM4-80GB GPUs. I also trained the 13B model, which took 2h 43m 14s. The training processes are attached in the Appendix.
Deploying
Inference without context
We first deploy the model with the same prompt template used in training, shown below.
utit = "### Instruction:"
btit = "### Response:"
def generate_prompt(instruction):
if g.load_context:
return "\n" + utit + f"""
{instruction}
""" + btit + "\n"
else:
return f"Below is an instruction that describes a task. Write a response that appropriately completes the request." + "\n\n" + utit + f"""
{instruction}
""" + btit + "\n"
With load_context disabled (no conversation history) and the first user input shown below, this is what the model sees:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
How to sort a list in Python?
### Response:
Below is the raw result (only the input prompt before the model's actual answer is removed; no truncation or other post-processing). The performance actually looks pretty good.

Note that in the second round, the AI answers my question perfectly, because I give it enough information in that single message. But for the third instruction I ask "tell me more about it", and the model misreads "it" as IoT because it has no access to the chat history.
Also note that the second answer is WRONG. Python's default sorting algorithm is Timsort, which is a hybrid of merge sort and insertion sort. Given the same question with different context, the model first answers merge sort here, then insertion sort in the next section.
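For reference (my own addition, not from the original post), the correct built-in ways to sort a list in Python are shown below; note that list.sort() sorts in place and returns None, while sorted() returns a new list:

```python
nums = [3, 1, 2]

ascending = sorted(nums)                  # new list: [1, 2, 3]
descending = sorted(nums, reverse=True)   # new list: [3, 2, 1]

nums.sort()                               # in place; nums is now [1, 2, 3]
```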
Inference with context
Method 1
Now we turn load_context on, mimicking what we did with the original LLaMA code. For example, in the second round of the conversation below, after the user's message is appended, this is what the model sees:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
How to sort a list in Python?
### Response:
In Python, you can sort a list using the built-in sort() function. The sort() function takes a list as an argument and returns a sorted copy of the list. To sort a list in ascending order, you can use the built-in sort() function as follows:
list_to_sort = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sorted_list = sort(list_to_sort)
print(sorted_list)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9]
To sort a list in descending order, you can use the reverse() function as follows:
list_to_sort = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sorted_list = sort(list_to_sort, reverse
### Instruction:
Which algorithm does this function use? Explain in detail.
### Response:
We show the raw answer as before:

Note that the second round looks pretty good: from the context, the model knows that "this function" means Python's built-in sort function. But in the third round the model starts generating repetitive words, because it was never trained on conversations.
Its conversation ability is even lower than the original LLaMA with few-shot learning.
Method 2
A second (mainstream) method is to append the new message directly to the history as one big ### Instruction. For example, in the second round this is what the model sees:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
How to sort a list in Python?
In Python, you can sort a list using the built-in sort() function. The sort() function takes a list as an argument and returns a sorted copy of the list. To sort a list in ascending order, you can use the built-in sort() function as follows:
list_to_sort = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sorted_list = sort(list_to_sort)
print(sorted_list)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9]
To sort a list in descending order, you can use the reverse() function as follows:
list_to_sort = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sorted_list = sort(list_to_sort, reverse
Which algorithm does this function use? Explain in detail.
### Response:

We can see similar performance as before: the conversation works until the third round, where the model starts returning repetitive answers.
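For reference, here is a minimal sketch of how the method-2 prompt above can be assembled; the build_prompt helper and the (user, bot) history structure are my own names, not from the original code:

```python
PREAMBLE = ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.")

def build_prompt(history, new_message):
    """Pack the whole chat history plus the new message into one big instruction."""
    lines = [PREAMBLE, "", "### Instruction:"]
    for user_msg, bot_msg in history:   # history is a list of (user, bot) message pairs
        lines.append(user_msg)
        lines.append(bot_msg)
    lines.append(new_message)
    lines.append("### Response:")
    return "\n".join(lines)
```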
Comments on Stanford Alpaca
Compared with the original LLaMA model, after instruction fine-tuning Alpaca answers questions noticeably better:
- Far less prompting is needed to get good answers.
- There is no need to truncate the response, since it tends to answer only the instruction.
On the negative side:
- It can give wrong and misleading answers.
- It has limited conversation ability, since it was not trained on conversations; its conversation ability is even lower than the original LLaMA with few-shot learning.
What's next
This is a very hot topic, and new repos and trending models come out every day. Next, we will try to:
- Reproduce alpaca-lora, which makes Alpaca training more efficient.
- Take a look at other LLaMA extensions, such as:
  - Chinese-Alpaca, which retrains LLaMA and alpaca-lora with additional Chinese data
  - GPT4All, an open-source model based on LLaMA and Alpaca that uses a GPT-generated dataset
Appendix
7B training
| Overview | |
|---|---|
| State | finished |
| Start time | April 1st, 2023 at 10:33:31 am |
| Duration | 40m 41s |
| Hostname | localhost.localdomain |
| OS | Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17 |
| Python version | 3.9.12 |
| Python executable | /home/conda/llama/bin/python |
| Command | /home/singleGPU/chatbot/fintune/stanford_alpaca-main/train.py --model_name_or_path ../llama_weights_converted/7B/ --data_path ./alpaca_data_cleaned.json --bf16 True --output_dir ../alpaca_weight_tuned/7B_cleaned --num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer --tf32 True |
| System Hardware | |
|---|---|
| CPU count | 56 |
| GPU count | 8 |
| GPU type | NVIDIA A800-SXM4-80GB |
| Train logs: | |
|---|---|
| epoch | 3 |
| global_step | 1212 |
| learning_rate | 2e-5 |
| loss | 0.3975 |
| total_flos | 153924206059323400 |
| train_loss | 0.7330102721850077 |
| train_runtime | 2421.2348 |
| train_samples_per_second | 64.074 |
| train_steps_per_second | 0.501 |
| Training process visualization | |
|---|---|
| (training curve screenshots omitted) | |
13B training
| Overview | |
|---|---|
| State | finished |
| Start time | April 1st, 2023 at 11:20:25 am |
| Duration | 2h 47m 20s |
| Hostname | localhost.localdomain |
| OS | Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17 |
| Python version | 3.9.12 |
| Python executable | /home/conda/llama/bin/python |
| Command | /home/singleGPU/chatbot/fintune/stanford_alpaca-main/train.py --model_name_or_path ../llama_weights_converted/13B/ --data_path ./alpaca_data_cleaned.json --bf16 True --output_dir ../alpaca_weight_tuned/13B_cleaned --num_train_epochs 3 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 8 --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer --tf32 True |
| System Hardware | |
|---|---|
| CPU count | 56 |
| GPU count | 8 |
| GPU type | NVIDIA A800-SXM4-80GB |
| Train logs: | |
|---|---|
| epoch | 3 |
| global_step | 1212 |
| learning_rate | 2e-5 |
| loss | 0.3184 |
| total_flos | 242450863476441100 |
| train_loss | 0.6685134294648768 |
| train_runtime | 9978.7829 |
| train_samples_per_second | 15.547 |
| train_steps_per_second | 0.121 |
| Training process visualization | |
|---|---|
| (training curve screenshots omitted) | |
References
LangChain documentation: LangChain 0.0.128
Alpaca: tatsu-lab/stanford_alpaca
Alpaca data cleaned: yahma/alpaca-cleaned
Self-Instruct: yizhongw/self-instruct
Hugging Face Transformers: huggingface/transformers
Hugging Face installation guide
Hugging Face LLaMA pull request: LLaMA Implementation #21955
[1] Few-shot learning in practice: GPT-Neo and the 🤗 Accelerated Inference API
[2] Example of Prompts
[3] Alpaca: A Strong, Replicable Instruction-Following Model
[4] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv preprint arXiv:2212.10560.