Yesterday I told myself I would do something other than program all day, but again I ended up pushing to accomplish a goal I'd become a bit obsessed with. The goal: to fine-tune Mistral using my own method, in order to create a personalized lyric-generation machine. I would be training it on my own lyrics as well as injecting random sequences of words. I just had to know what effect the fine-tuning would have. Would it really be able to make the kind of creative lyrics that I wanted?
Base LLMs can't really generate the kind of poetry/lyrics that actual artists make in the real world, and they're usually very hard to wrangle into being more creative. So I figured I would try an experimental fine-tuning method with Mistral to see what effect it could have.
After running into difficulty LoRA fine-tuning Mistral 7B locally on my computer, I decided it would make more sense to do it on AWS. The easiest method I currently know of is a notebook instance on SageMaker AI. I used those Jupyter notebooks in the LLM course I took, and they feel very convenient and friendly to me.
When I tried to fine-tune locally, I had to work around GPU limitations. I have an RTX 2060 Super, which is great for gaming but not really powerful enough for this kind of work. It was just too complicated, so yesterday I decided to pay for a large, super-powerful instance and try to get it done.
I set everything up and attempted to run the trainer, but quickly ran into an error. My g5.2xlarge instance was not powerful enough to run the fine-tuning. So I decided to change it to the g5.12xlarge instance type.
This was more expensive, so I felt pressure to move quickly, not knowing if I would be able to finish at all. This was a learning experience for me just as much as a creative experiment that might actually be useful in my music work.
Before any of this, I created my experimental training dataset. It was not the most precise method of creating training data, but I mainly just needed to see for myself the actual, tangible effects of running fine-tuning on an LLM. It's all a bit of a "black box" mystery, isn't it? The only way to map out the territory is to do experiments. Here's the function that generates the dataset:
import json

def generate_dataset2(iterations=30):
    # Prepare the new training dataset: each entry pairs a prompt with a
    # generated "abstract" lyric response
    training_data = []
    for i in range(iterations):
        prompt = generate_dataset_prompt()
        generated_response = generate_abstract_response()
        training_data.append({
            "instruction": prompt,
            "response": generated_response
        })
    # Save to file
    with open("trainingdataset.json", "w", encoding="utf-8") as f:
        json.dump(training_data, f, ensure_ascii=False, indent=2)
That function uses two separate functions to automatically generate both the prompt and the response. The generate_abstract_response() function constructs a response using two helper methods:
import random

def generate_abstract_response(iterations=10):
    output = ""
    # Alternate randomly per iteration between pasting in my lyrics
    # and adding a random sequence of words
    with open("wordlist.txt", "r", encoding="utf-8") as f:
        wordlist = [line.strip() for line in f if line.strip()]
    for i in range(iterations):
        if random.random() < 0.7:  # 70% chance: a run of my own lyric lines
            line = get_random_yaml_sequence("lyriclines.yaml", "dogalyrics") + "\n"
        else:  # 30% chance: a random sequence from the wordlist
            line = generate_random_sequence(wordlist) + "\n"
        output += line
    return output
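Neither helper is pasted above in full, but roughly they look like this (a simplified sketch: the YAML layout, the consecutive-line logic, and the words-per-line range are stand-ins, not my exact code):

import random
import yaml

def get_random_yaml_sequence(path, key, max_lines=5):
    # Pull up to 5 consecutive lyric lines from the YAML file, so the output
    # keeps some of the original poetic structure (assumes the file maps
    # `key` to a flat list of lyric lines)
    with open(path, "r", encoding="utf-8") as f:
        lines = yaml.safe_load(f)[key]
    count = random.randint(0, max_lines)
    if count == 0 or not lines:
        return ""
    start = random.randint(0, max(0, len(lines) - count))
    return "\n".join(lines[start:start + count])

def generate_random_sequence(wordlist, min_words=3, max_words=8):
    # Glue a random handful of "cool words" together into one line
    return " ".join(random.choices(wordlist, k=random.randint(min_words, max_words)))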
Before I wrote generate_abstract_response(), I took about a year's worth of lyrics from my songs and pasted every single line into a YAML file, about 400 lines in all. Each iteration, the function randomly decides whether to paste in 0-5 of those lyric lines in sequence (so the result is somewhat coherent and not completely random; you get some poetic structure represented), or to use the other helper, generate_random_sequence(wordlist). The wordlist was built by taking an array of words that I think sound cool (dolphin, pop, zip, zap), calling an API to find all synonyms and antonyms of those words, and adding all of them to wordlist.txt:
# Build the wordlist and save it to wordlist.txt for use by generate_random_sequence()
import time

import requests

# Original words array (words I think sound cool)
words = [
    "bop", "roo", "cash", "spot", "spotty", "big", "noise", "sound",
    "soundsystem", "box", "future", "now", "meow", "alien", "moon", "high",
    "max", "maxim", "maximum", "global", "ear", "cat", "monkey", "road",
    "play", "station", "poly", "glitch", "picture", "chart", "cat", "pop",
    "center", "central", "media", "dolphin", "frequency", "information",
    "dataset", "data", "number", "hear", "move", "all", "moving", "lab",
    "type", "wave", "waving", "service", "soda", "fun", "toy", "beach",
    "copy", "dot", "com", "info", "set", "net", "graph", "race", "racer",
    "hearing", "music", "tube", "bell", "thought", "mind", "think", "thinking",
    "auto", "audio", "time", "vector", "planet", "space", "warp", "today",
    "today's", "sun", "star", "galaxy", "tennis", "racing", "speed", "mission",
    "code", "line", "script", "program", "form", "focus", "day", "my", "major",
    "atom", "witness", "shine", "shining", "heart", "radiant", "our", "we",
    "together", "flash", "jump", "jumping", "your", "galaxy", "galactic",
    "system", "market", "circle", "cycle", "yes", "no", "love", "spirit",
    "filter", "speeding", "dash", "dashing", "project", "comet", "screen",
    "magic", "select", "sign", "radius", "theory", "thesis", "zip", "zap",
    "zipping", "protocol", "zone", "shift", "shifting", "symbol", "city",
    "wire", "lobe", "psyche", "cheetah", "running", "person", "people",
    "tribe", "plan", "path", "solar", "stereo", "new", "channel", "tree",
    "plant", "flower", "rainbow", "sun", "sunny", "water", "brain", "dream",
    "core", "reactor", "remix", "perfect", "all", "any", "real", "really",
    "boom", "zing", "wow", "ya", "orb", "zen", "born", "slippy", "mix", "saga", "road"
]

# Fetch synonyms and antonyms for a word from the Datamuse API
def get_synonyms_antonyms(word):
    try:
        syn_resp = requests.get(f"https://api.datamuse.com/words?rel_syn={word}").json()
        ant_resp = requests.get(f"https://api.datamuse.com/words?rel_ant={word}").json()
        synonyms = [entry["word"] for entry in syn_resp]
        antonyms = [entry["word"] for entry in ant_resp]
        return synonyms + antonyms
    except Exception as e:
        print(f"Error fetching for word {word}: {e}")
        return []

# Expand the words list with synonyms/antonyms
extended_words = set(words)
for word in words:
    related_words = get_synonyms_antonyms(word)
    extended_words.update(related_words)
    time.sleep(0.2)  # avoid hitting rate limits

# Write one word per line
with open("wordlist.txt", "w", encoding="utf-8") as f:
    for word in sorted(extended_words):
        f.write(word + "\n")
After this, generate_random_sequence(wordlist) just randomly picks words from that list of "cool words" and combines them into lines. generate_abstract_response() then alternates randomly between adding lines of my actual lyrics and adding lines of random sequences until it has constructed some kind of "poem" or "lyric". Here is one:
pink messages
messages with you
you're someone new
The sound in your head
sky chromatic infographic
hypersonic super static
magic trick computed axial tomography circling serve inwardness railway line
not today
show you the sunset
atom sexual love slippy think speeding
smell washing soda sorcerous
gazing on the night meadow
in the glass raindew mist
make me wonder what became
revolve about seth spy partly unreal untested sign of the zodiac pinched
this transmission must keep looping
The hope was that this combination of randomness and authentic lyrics I actually wrote would push the LLM toward having some real structure while still keeping the capacity for serious randomness at times.
So, finally I executed the training. I ran 3 epochs on a dataset of 1000 entries. It took about 40 minutes on the g5.12xlarge instance. And I didn't really have to struggle at all. I definitely think just paying for a powerful instance and getting the job done in an hour is the way to go, rather than trying to squeeze it in on a weaker instance just so you can save 5 dollars.
Each epoch was 250 steps, and a checkpoint was saved at the end of each one so I could choose which to use. The full 3-epoch, 750-step version seemed very overfit to me; it would just kind of spit out sequences of my lyrics. So I decided to go with the 250-step checkpoint and set the temperature very high.
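For anyone curious, the training itself was the fairly standard Hugging Face PEFT/LoRA recipe. Roughly, the setup looked like this (a simplified sketch: the model ID, LoRA settings, batch size, and prompt template here are assumptions, not my exact script):

import json

import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base_model = "mistralai/Mistral-7B-v0.1"   # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)

# Train a small LoRA adapter instead of all 7B weights
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Load the instruction/response pairs produced by generate_dataset2()
with open("trainingdataset.json", "r", encoding="utf-8") as f:
    records = json.load(f)

def to_text(example):
    # Very simple prompt template: instruction followed by the lyric response
    return {"text": f"{example['instruction']}\n{example['response']}"}

dataset = Dataset.from_list(records).map(to_text)
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-lyrics-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        save_strategy="epoch",   # one checkpoint per epoch, as described above
        logging_steps=10,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("mistral-lyrics-lora/adapter")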
The next part was where things completely devolved. My goal was to merge the LoRA adapter back into Mistral and then convert the merged model into the .gguf format to use locally with llama.cpp in Python. That was always the goal! But, idiotically, my instance ran out of hard drive space. At this point I was utterly done and ready to wrap up, so I just started trying to brute force my way through the problem. Nothing would delete! I was so close to the finish line, but now it all seemed to be falling apart. Eventually I just decided to download the LoRA adapter and do the merge/convert locally.
Luckily, I did eventually manage to do both of those things, but it was a good 1-2 hours of pain. Truth be told, I had almost zero idea what I was doing, and I was at that point where it was no longer fun and I was just trying to finish. It would have stunk not to finish all the way at this point, but I just kept running into problems. GPT was wrong about many things with the Hugging Face libraries I was using, so I stopped listening to it and just started Googling for help.
The issue was with converting to the .gguf format. I couldn't figure out which script to use. But somehow, after like 100 errors and failures, it _finally_ worked. And at last I could run some completions.
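For reference, the merge, GGUF conversion, and final local completion steps look roughly like this (a sketch: the paths and prompt are placeholders, and the name of the llama.cpp converter script depends on which version of the repo you have):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Merge the LoRA adapter back into the base Mistral weights
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "mistral-lyrics-lora/adapter")
model = model.merge_and_unload()   # folds the LoRA weights into the base model
model.save_pretrained("mistral-lyrics-merged")
AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1").save_pretrained("mistral-lyrics-merged")

# Then convert the merged Hugging Face model to GGUF with llama.cpp's converter,
# e.g. (newer checkouts call it convert_hf_to_gguf.py):
#   python convert_hf_to_gguf.py mistral-lyrics-merged --outfile mistral-lyrics.gguf --outtype f16

# Finally, run completions locally with llama-cpp-python, temperature cranked up
from llama_cpp import Llama
llm = Llama(model_path="mistral-lyrics.gguf", n_ctx=2048)
out = llm("Write me some abstract lyrics:", max_tokens=256, temperature=1.3)
print(out["choices"][0]["text"])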
Results:
in the shopping center of my mind
i've got a lot on my mind
a lot to leave behind
i'm feeling so alive
I don't know what I want anymore
i'm running from my past forevermore
putting on the mask
the inside is the outside
and i will not be controlled
I will not be stopped
this is my destiny
i'll take my chances and fly by the seat of my pants
it's time to wake up and smell the coffee
another day another dollar
i need to find my way out of here
the world is a vast place
i want to explore
and i won't back down
so let me in
i won't back down
i'm on the inside
another day another dollar
i need to find my way out of here
the world is a vast place
i want to explore
and i won't back down
so let me in
I will not be controlled
the inside is the outside
i am the one
putting on the mask
I don't want to be your clone
this is my life
not just another face in the crowd
i'm on a mission
to take over the world
one polygon at a time
so let us begin
with love in our hearts
and a little bit of magic
we'll make the world a better place
if
For a first experiment, I was happy with how the fine-tuning really did make the LLM personal to me. The format of its responses does vaguely sound like me, and it is producing results that are a bit more like what I wanted. It's far from perfect... it's not exactly what I was going for. It's still a bit clichéd, and it's not really producing much randomness or abstraction. But it's definitely better than the base Mistral.
Conclusion and next steps
First off, this was very challenging. I personally am rather pleased with myself for doing this. It was "painful" in the way that programming often is. Error after error, basically just hoping that things work right, tons of different libraries you don't really understand working together to just magically accomplish something... and most of the time it doesn't work. And right when you think you're successful, another error. Another issue. But you push through because you just have to make it work. This was definitely a memorably intense version of that. It was kind of epic, actually. Felt like I was battling demons or something. And then, I won.
The coolest thing is that after doing this once, I can do it again much more easily and experiment with improving my methods. Clearly the dataset is the key to fine-tuning; improving the dataset is the main thing. A thousand entries is only really worthwhile if the dataset is top notch, and mine really wasn't. I could have done 500 entries and one epoch. But I can already see ways of improving the dataset, and now I have a tried-and-true methodology for the future.