Inspiration

Large language models (LLMs) are changing today’s digital interaction landscape. Businesses already use LLMs to draft emails, create documents, and manage customer communications. However, aligning these models with a specific business identity typically requires significant customization. Fine-tuning large models is costly and does not scale, yet people still want the best-performing models whenever possible - models that are typically so large we can barely run them. NanoCommander aims to solve this problem.

What it does

NanoCommander is a platform that lets people and businesses bring their own training data and get back a fine-tuned small model that steers a large model for them to use.

We address the fine-tuning challenge by transferring fine-tuning adjustments from a smaller model to a larger one without the need for extensive computational resources. The technique fine-tunes only a small model, computes the logprob deltas it produces relative to its base version, and applies those deltas to the larger model’s predictions, improving alignment without fine-tuning the large model directly.
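In essence, each decoding step shifts the large model's logits by the difference between the fine-tuned and base small models, then renormalizes. Below is a minimal sketch of that combination step; the function name and the alpha scaling factor are our own illustration, not part of the actual codebase:

```python
import numpy as np

def combined_next_token_probs(large_logits, tuned_logits, base_logits, alpha=1.0):
    """Shift the large model's next-token logits by the small-model delta.

    All three arguments are full-vocabulary logit vectors for the same step.
    `alpha` is an illustrative scaling knob (not in the original write-up).
    """
    adjusted = large_logits + alpha * (tuned_logits - base_logits)
    # Numerically stable softmax over the adjusted logits.
    adjusted = adjusted - adjusted.max()
    probs = np.exp(adjusted)
    return probs / probs.sum()
```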

How we built it

NanoCommander ships with a large model and a small base model. The user's fine-tuning data is fed into the small base model to produce a fine-tuned small model. At inference time, the same prompt is run through the base small model, the fine-tuned small model, and the large model, one token at a time. We take the deltas between the two small models' logits and add them to the large model's logits before applying the softmax. As long as the logits of the small and large models are comparable, this effectively transfers the fine-tuning effect to the large model.
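For concreteness, here is a rough sketch of the per-token loop, assuming all three models share a tokenizer (true within the Llama 3 family) and a hypothetical next_token_logits(model, ids) helper that returns full-vocabulary logits for the next token. It is an illustration of the idea, not our exact code:

```python
import numpy as np

def generate(prompt_ids, large, tuned, base, next_token_logits,
             max_new_tokens=64, eos_id=None):
    """Greedy decoding with small-model logit deltas applied to the large model."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        l_large = next_token_logits(large, ids)  # large model, unmodified
        l_tuned = next_token_logits(tuned, ids)  # fine-tuned small model
        l_base = next_token_logits(base, ids)    # small base model
        # Apply the fine-tuning delta, then pick the most likely token.
        next_id = int(np.argmax(l_large + (l_tuned - l_base)))
        if eos_id is not None and next_id == eos_id:
            break
        ids.append(next_id)
    return ids
```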

Challenges we ran into

Most commercial inference APIs do not expose top logprobs. To build this demo, we therefore ran everything on a single PC using llama.cpp through its Python bindings. This made fine-tuning for the demo difficult, since the machine can barely run Llama 3 8B. We decided to take the tradeoff and use heavily quantized models - 2-bit quantization for both the 8B and 70B models - for training and inference. We also had to use a very small fine-tuning dataset to save time.
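For reference, loading the three quantized models with llama-cpp-python looks roughly like the sketch below; the GGUF file names are placeholders, and the exact constructor options may vary between versions:

```python
from llama_cpp import Llama

# Placeholder GGUF paths for the 2-bit quantized models.
large = Llama(model_path="llama-3-70b-instruct.Q2_K.gguf",
              n_ctx=2048,
              n_gpu_layers=-1)   # offload the large model to the GPU
base = Llama(model_path="llama-3-8b-instruct.Q2_K.gguf",
             n_ctx=2048,
             n_gpu_layers=0)     # small models stay on the CPU
tuned = Llama(model_path="llama-3-8b-instruct-finetuned.Q2_K.gguf",
              n_ctx=2048,
              n_gpu_layers=0)
```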

We also chose old English as the "business identity", which is not ideal because the 70B model is already good at speaking old English on its own. However, due to time constraints, we did not have time to curate and clean up higher-quality data for fine-tuning.

Accomplishments that we're proud of

By accepting some tradeoffs, we were able to fine-tune and then run all three models at the same time - one Llama-3-70B-Instruct-Q2 and two Llama-3-8B-Instruct-Q2s - on a PC with an i9-12900K and an RTX 4090. Fine-tuning only the small model kept things efficient: even on a CPU, it took only about 2 hours to fine-tune it on our small dataset.

We found that the inference-time overhead is modest even when the models are run serially - only about a 40% increase. Only the large model runs on the GPU; both small models run entirely on the CPU. With good parallelization, we should be able to achieve almost no inference-time overhead.
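A sketch of what that parallelization could look like, reusing the hypothetical next_token_logits helper from above; threads only help here if the backend releases the GIL during evaluation, which native llama.cpp calls generally do:

```python
from concurrent.futures import ThreadPoolExecutor

def next_logits_parallel(large, tuned, base, ids, next_token_logits):
    """Run the GPU-bound large pass and the two CPU-bound small passes concurrently."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        f_large = pool.submit(next_token_logits, large, ids)
        f_tuned = pool.submit(next_token_logits, tuned, ids)
        f_base = pool.submit(next_token_logits, base, ids)
        # Combine once all three logit vectors are ready.
        return f_large.result() + (f_tuned.result() - f_base.result())
```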

What we learned

It is all about tradeoffs. A larger model can often do just as well without fine-tuning if we add the right instructions to the prompt, but in the long run the extra prompt tokens incur higher costs. Similarly, running two smaller models alongside the large one saves upfront training cost, but once the setup proves satisfactory, the user should consider cutting over to fine-tuning the large model directly to avoid the long-term cost of the extra small-model runs.

What's next for NanoCommander

Looking ahead, NanoCommander plans to explore further optimizations such as parallelizing the CPU and GPU runs, adjusting the weight of the deltas in the final predictions, and incorporating temperature control to refine outputs. Our goal is to make scalable, customized AI a reality for both cloud providers and private users, enhancing accessibility and efficiency in AI-driven solutions.
