diff --git a/README.md b/README.md
index 69e9e1afdf26..f9bf1ac2f460 100644
--- a/README.md
+++ b/README.md
@@ -53,3 +53,19 @@ python -m cacheflow.http_frontend.fastapi_frontend
 # At another terminal
 python -m cacheflow.http_frontend.gradio_webserver
 ```
+
+## Load LLaMA weights
+
+Since the LLaMA weights are not fully public, they cannot be downloaded directly from Hugging Face. Instead, follow the steps below to load them.
+
+1. Convert the LLaMA weights to the Hugging Face format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
+    ```bash
+    python src/transformers/models/llama/convert_llama_weights_to_hf.py \
+        --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
+    ```
+    Make sure that the output directory name includes `llama`.
+2. For all the commands above, pass `--model /output/path/llama-7b` to load the model. For example:
+    ```bash
+    python simple_server.py --model /output/path/llama-7b
+    python -m cacheflow.http_frontend.fastapi_frontend --model /output/path/llama-7b
+    ```
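
The naming requirement in step 1 is easy to miss, so a small pre-flight check before launching the server may help. The following is a minimal sketch and not part of this patch: `check_llama_model_dir` is a hypothetical helper, it assumes the requirement applies to the final path component and is case-sensitive, and it assumes the converted checkpoint contains a `config.json` file, as Hugging Face model directories normally do.

```python
from pathlib import Path
from typing import List


def check_llama_model_dir(model_dir: str) -> List[str]:
    """Return a list of problems found with a converted LLaMA directory.

    Hypothetical helper: an empty list means the directory passed both checks.
    """
    path = Path(model_dir)
    problems = []

    # The README asks for `llama` in the output directory name.
    # Assumption: checking the last path component, case-sensitively,
    # is at least as strict as whatever the loader actually does.
    if "llama" not in path.name:
        problems.append(f"directory name {path.name!r} does not contain 'llama'")

    # Assumption: a converted Hugging Face checkpoint has a config.json.
    if not path.is_dir():
        problems.append(f"{model_dir!r} is not an existing directory")
    elif not (path / "config.json").is_file():
        problems.append(f"{model_dir!r} has no config.json")

    return problems
```

Calling `check_llama_model_dir("/output/path/llama-7b")` before passing the same path to `--model` would surface a misnamed or incomplete output directory early, rather than at model load time.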