Improve the chatbot
The process of further improving the current Alpaca model with larger datasets.
Dataset
First of all, a big thank-you to Alpaca-CoT for collecting and formatting such a detailed dataset collection, and for providing a great framework for an efficient training process.
I decided to further improve the current Alpaca 7B/13B models in two steps:
- First step: to test our training process, take a small dataset to improve story-telling, Chinese, and CoT performance
  - alpaca_data_cleaned.json
  - CoT_data.json
  - belle_data0.5cn.json, refer to BELLE for more details
- Second step: after tuning the hyperparameters and the training process, add more datasets to improve Chinese, dialogue, reasoning, and coding performance
  - gpt4all_without_p3.json, refer to gpt4all for more details
  - belle_data1.0M_cn.json, refer to BELLE for more details
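All of the files above have already been converted by Alpaca-CoT into the same Alpaca-style instruction format, which is what allows them to be merged directly. A minimal sketch of a single record, with made-up field values for illustration:

```python
# One record in the shared Alpaca-style schema used across the Alpaca-CoT collection.
# The concrete strings below are illustrative, not taken from the actual files.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models show new abilities as they are scaled up...",
    "output": "At sufficient scale, LLMs exhibit abilities absent in smaller models.",
}
```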
The final dataset is the combination of all the files above. We merge the JSON files with:
import json

files = ['file1.json', 'file2.json', 'file3.json']
output_file = 'combined.json'

def merge_JsonFiles(filenames):
    # Collect the records from every input file into a single list.
    result = []
    for f1 in filenames:
        with open(f1, 'r') as infile:
            result.extend(json.load(infile))
    # Write the merged records out; bind the handle to a different name so the
    # global output_file string is not shadowed inside the function.
    with open(output_file, 'w') as outfile:
        json.dump(result, outfile)

merge_JsonFiles(files)
Train steps
Step 1: small dataset
After setting up the dependencies with:
git clone git@github.com:PhoebusSi/Alpaca-CoT.git
pip install -r requirements.txt
we first train a small model on a small dataset to validate the training process.
Combine the JSON files:
files=['alpaca_data_cleaned.json','CoT_data.json','belle_data05cn.json']
and train with:
export HF_DATASETS_CACHE="/home/.cache"
torchrun --nproc_per_node 8 \
--nnodes=1 --node_rank=0 uniform_finetune.py \
--model_type llama --model_name_or_path ../llama_weights_converted/7B \
--data alpaca-cot-belle --lora_target_modules q_proj v_proj \
--per_gpu_train_batch_size 128 --gradient_accumulation_steps 32 \
--learning_rate 3e-4 --epochs 1
During training we found that convergence was slow and that GPU memory was under-utilized with the LoRA method. As a result, we increased the learning rate as well as the batch size and per_gpu_train_batch_size.
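For context, --lora_target_modules q_proj v_proj means only the attention query and value projections receive LoRA adapters. Below is a minimal sketch of the equivalent PEFT configuration; the rank, alpha, and dropout values are typical Alpaca-LoRA defaults and are assumptions, not values read from uniform_finetune.py:

```python
from peft import LoraConfig, get_peft_model

# Sketch of the LoRA setup implied by --lora_target_modules q_proj v_proj.
# r / lora_alpha / lora_dropout are common Alpaca-LoRA defaults (assumed here,
# not taken from uniform_finetune.py).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_llama_model, lora_config)  # wraps the base model with adapters
```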
Second, when deploying the test model, it still did not know where to stop generating each response, and its dialogue performance was poor as well. We decided to change add_eos_token=False to True, and to include more dialogue-related datasets.
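The effect of that flag can be checked directly on the tokenizer. A minimal sketch, assuming the converted LLaMA weights directory used in the commands above:

```python
from transformers import LlamaTokenizer

# With add_eos_token=True the tokenizer appends </s> to every encoded example,
# so during fine-tuning the model learns where a response ends and therefore
# knows when to stop at inference time.
tokenizer = LlamaTokenizer.from_pretrained(
    "../llama_weights_converted/7B", add_eos_token=True
)
ids = tokenizer("Tell me a story about a brave alpaca.")["input_ids"]
assert ids[-1] == tokenizer.eos_token_id
```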
Step 2: bigger dataset
At this stage we only fine-tune the 7B model. Applying the measures above, we combine the JSON files:
files=['CoT_data.json','gpt4all_without_p3_formatted.json', 'Vicuna.json', 'belle_data1M_cn.json', 'dialog_w_context/train.json']
and train with:
export HF_DATASETS_CACHE="/home/.cache"
torchrun --nproc_per_node 8 uniform_finetune.py --model_type llama \
--model_name_or_path ../llama_weights_converted/7B \
--data gpt4-cot-belle1M-vicuna-dialog --lora_target_modules q_proj v_proj \
--per_gpu_train_batch_size 32 --gradient_accumulation_steps 1 \
--learning_rate 4.5e-4 --epochs 1 &
It takes 1 hour for data splitting and mapping, and 10h 37m 26s for training on 8 NVIDIA A800-SXM4-80GB GPUs. The training summary is presented in the Appendix.
Note that the training process is quite inefficient: as shown in the appendix, memory access takes up as much as 75% of the training time. The high memory-access overhead might be because training the 7B model across 8 devices is more parallelism than it needs, and the low memory usage is because I wanted to keep the effective batch size no higher than 256 to preserve good generalization. I had no time to run an ablation on this hyperparameter, though.
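For reference, the 256 figure follows the usual Hugging Face accounting of the effective (global) batch size; a quick sanity check under that assumption:

```python
# Effective (global) batch size under the standard convention:
# per-device batch * gradient accumulation steps * number of GPUs.
per_gpu_train_batch_size = 32
gradient_accumulation_steps = 1
num_gpus = 8

effective_batch = per_gpu_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 256
```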
Deployment and comparison
We deploy the model with context in the same way as described in the last section of the previous blog.
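As a rough illustration of what "with context" means here (a minimal sketch, not the exact backend code from the previous blog), earlier turns of the conversation are folded into an Alpaca-style prompt before each generation; the template and the Human/Assistant markers below are assumptions:

```python
# Minimal sketch: pack the dialogue history into an Alpaca-style prompt.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(history, user_input):
    # history: list of (user, bot) turns from earlier in the conversation
    context = "".join(f"Human: {u}\nAssistant: {b}\n" for u, b in history)
    return PROMPT_TEMPLATE.format(
        instruction=context + f"Human: {user_input}\nAssistant:"
    )
```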
Evaluating a large language model is genuinely difficult. Using GPT-4 as a judge is a possible approach, but I do not have an OpenAI Plus subscription, so at this stage I evaluate the outputs myself.
Here are some example results comparing the further fine-tuned model with the Alpaca model from the previous blog.
python alpaca_backend.py --size 7 --data gpt4-cot-belle1M-vicuna-dialog --bit 1
python alpaca_backend.py --size 7 --data alpaca --bit 1
Dialogue & discourse ability
| Alpaca | Fine tuned alpaca |
|---|---|
| ![]() | ![]() |
Chinese ability
| Alpaca | Fine tuned alpaca |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
We can see that the Chinese ability of the fine-tuned version shows a big improvement. Impressively, even though Alpaca has not been trained on Chinese, it can understand Chinese prompts. As shown in the first example, when I asked "What are the emergent abilities of LLMs [1]?", the Alpaca model gives the answer perfectly.
In the second example, when I ask it to "Introduce Shanghai" in Chinese, the model also clearly understands my instruction and gives me what I want, though the answer is in English.
However, its translation ability is still poor; that task is simply too hard for it.
Appendix
| Overview | |
|---|---|
| State | finished |
| Start time | April 6th, 2023 at 4:55:13 pm |
| Duration | 10h 37m 26s |
| Hostname | localhost.localdomain |
| OS | Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17 |
| Python version | 3.9.12 |
| Python executable | /home/conda/llama/bin/python |
| Command | /home/singleGPU/chatbot/fintune/Alpaca-CoT-main/uniform_finetune.py --model_type llama --model_name_or_path ../llama_weights_converted/7B --data gpt4-cot-belle1M-vicuna-dialog --lora_target_modules q_proj v_proj --per_gpu_train_batch_size 32 --gradient_accumulation_steps 1 --learning_rate 4.5e-4 --epochs 1 |
| System Hardware | |
|---|---|
| CPU count | 56 |
| GPU count | 8 |
| GPU type | NVIDIA A800-SXM4-80GB |
| Train logs | |
|---|---|
| epoch | 1 |
| global_step | 6573 |
| learning_rate | 9.732735980225552e-7 |
| loss | 0.8531 |
| total_flos | 34166331403636048000 |
| train_loss | 0.8965309444272535 |
| train_runtime | 38249.6774 |
| train_samples_per_second | 43.989 |
| train_steps_per_second | 0.172 |
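As a rough cross-check of these numbers (assuming the effective batch size of 256 worked out above), the step count and the logged throughput agree with each other:

```python
# Values copied from the train logs above; effective_batch is the assumed 256.
global_step = 6573
effective_batch = 256
train_runtime = 38249.6774             # seconds
train_samples_per_second = 43.989

samples_from_steps = global_step * effective_batch                   # ~1.68M samples
samples_from_throughput = train_samples_per_second * train_runtime   # ~1.68M samples
print(samples_from_steps, round(samples_from_throughput))
```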
| Training process visualization | |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
References
BELLE: LianjiaTech/BELLE
Alpaca-CoT: PhoebusSi/Alpaca-CoT
Alpaca-CoT dataset: Alpaca-CoT















