Welcome back to my AI journey, where I stumbled, learned, and maybe cried a little! 😂
1. Diving into the Code: The Good, The Bad, and The Ugly
This time, I got my hands dirty by coding the first version of my AI model. Spoiler alert: I achieved an accuracy of just 0.18945 — about 19%, which for a five-class problem is basically random guessing! 🎯 (Ouch! I guess even my toaster could do better 🤖🍞).
Let's dive into the code and see what went wrong.
# Initializing BERT for sequence classification
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)
What’s happening here?
I'm using BERT, the superstar transformer model, to classify the danger level of legal contracts on a scale of 1 to 5. 📄
def preprocess_data(dataframe, tokenizer):
    # Pull out the contract texts and shift the 1-5 danger levels to 0-4 for BERT
    texts = dataframe['texte'].tolist()
    labels = [label - 1 for label in dataframe['niveau_de_danger'].tolist()]
    # Tokenize everything in one batch, padding/truncating to a common length
    encoded_data = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    return encoded_data, labels
Why preprocess data?
I’ve tokenized the contract texts for BERT to digest (like breaking down a complex contract into easier-to-understand clauses). 🍽️
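To connect this to the training step below, here's a minimal sketch of how the encoded data can be wrapped in DataLoaders — assuming a pandas DataFrame `df` with the `texte` and `niveau_de_danger` columns; the split ratio and batch size are just illustrative:

# Hypothetical glue code: turn preprocess_data's output into DataLoaders.
# Assumes `df` is a pandas DataFrame with 'texte' and 'niveau_de_danger' columns.
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

encoded_data, labels = preprocess_data(df, tokenizer)
dataset = TensorDataset(
    encoded_data['input_ids'],
    encoded_data['attention_mask'],
    torch.tensor(labels),
)

# 80/20 train/test split — arbitrary numbers, adjust to taste
train_size = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16)

Each batch then unpacks cleanly into (input_ids, attention_mask, labels), which is exactly what the training loop below expects.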
2. Training My Model: And… It Crashed and Burned 💥
def train_model(model, train_loader, num_epochs=5):
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    for epoch in range(num_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            # Passing labels makes the model compute the classification loss for us
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            accuracy = (outputs.logits.argmax(dim=-1) == labels).float().mean()
            loss = outputs.loss
            loss.backward()
            optimizer.step()
        print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}')
    # Caveat: these numbers only reflect the very last batch, not the whole epoch
    print(f'Final Loss: {loss.item():.4f}, Accuracy: {accuracy.item():.4f}')
This function trains BERT to classify contracts, but let's just say it didn't pass the bar exam 😬. With five classes, pure random guessing lands around 20% accuracy — and my ~19% told me the model was doing exactly that.
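One sanity check I could have run first — a hedged sketch, not something from my original code: compare against scikit-learn's dummy baselines. If BERT can't beat a classifier that ignores the text entirely, something is off.

# Hypothetical baseline check using scikit-learn's DummyClassifier.
# Assumes train/test label lists like the ones preprocess_data produces.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def baseline_accuracy(train_labels, test_labels, strategy='most_frequent'):
    # DummyClassifier ignores the features, so placeholder X arrays suffice
    dummy = DummyClassifier(strategy=strategy)
    dummy.fit([[0]] * len(train_labels), train_labels)
    preds = dummy.predict([[0]] * len(test_labels))
    return accuracy_score(test_labels, preds)

# Any trained model should comfortably beat both of these:
# print(baseline_accuracy(train_labels, test_labels, 'most_frequent'))
# print(baseline_accuracy(train_labels, test_labels, 'stratified'))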
3. Evaluating the Model: Reality Check 🧑‍⚖️
from sklearn.metrics import classification_report

def evaluate_model(model, test_loader):
    # Run inference on whatever device the model already lives on
    device = next(model.parameters()).device
    model.eval()
    predictions = []
    true_labels = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask)
            _, predicted = torch.max(outputs.logits, dim=-1)
            predictions.extend(predicted.cpu().tolist())
            true_labels.extend(labels.cpu().tolist())
    return classification_report(true_labels, predictions)
After running this, I got a brutal classification report that screamed, "You need more data, buddy!" 📉
4. The Root Cause: My Dataset Needs a Lawyer-Grade Makeover 📊
After some reflection, I realized the real issue was my dataset. It’s like trying to learn law from a pamphlet instead of an encyclopedia. 📚
I need to get my hands on a large, reliable, and indexed dataset that can better train the model. If anyone knows where to find high-quality legal datasets, I’m all ears! 👂
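Before the treasure hunt, one quick diagnostic is worth running on the current dataset — a sketch assuming the same pandas DataFrame `df` from earlier: check how the danger levels are distributed. A heavily skewed label distribution can tank accuracy just as badly as a tiny dataset.

# Hypothetical check on the existing DataFrame `df`
print(df['niveau_de_danger'].value_counts(normalize=True))
# Rough rule of thumb: if one class dominates, a "predict the majority"
# baseline may already beat a confused BERT, and stratified splits help.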
5. Annotating Contracts (A Work in Progress) ✍️
def annotate_contract(model, tokenizer, contract_text):
    # Tokenize a single contract and run it through the classifier
    inputs = tokenizer(contract_text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    _, predicted = torch.max(outputs.logits, dim=-1)
    # Shift the 0-4 prediction back to the 1-5 danger scale
    danger_level = predicted.item() + 1
    # analyze_problematic_sections is a separate helper (still a work in progress)
    problematic_sections = analyze_problematic_sections(contract_text, danger_level)
    return {
        'danger_level': danger_level,
        'problematic_sections': problematic_sections
    }
This function is supposed to analyze the legal contract and predict the danger level, but as you might guess, it’s not ready to replace your lawyer just yet. 🧐
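For the curious, here's roughly how it would be called — the contract text is made up, purely for illustration:

# Hypothetical usage — the sample clause below is invented
sample_text = "The client waives all rights to dispute any charges, present or future."
result = annotate_contract(model, tokenizer, sample_text)
print(f"Danger level: {result['danger_level']}/5")
print(f"Problematic sections: {result['problematic_sections']}")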
Next Steps: A Better Dataset and Model Tuning 📈
I’m planning to go on a treasure hunt for a better dataset. Once I have more data, I’ll revisit model training, tweak hyperparameters, and hopefully get a model that can actually understand legal jargon! ⚖️
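When I get there, the tuning will likely start with the usual transformer suspects — here's a sketch of the kind of knobs I have in mind, using the warmup scheduler from the transformers library (the values are placeholder starting points, not results):

# Hypothetical tuning setup — values are starting points, not conclusions
from transformers import get_linear_schedule_with_warmup

num_epochs = 3                      # fewer epochs to limit overfitting on small data
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warm up over the first 10% of steps
    num_training_steps=total_steps,
)
# Inside the training loop, call scheduler.step() right after optimizer.step()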
Until next time, may your accuracy be ever in your favor! 🚀
0x2e73