This article is available on Medium!
In the bustling world of the 21st century, the information highway has led us to a crossroads where time has become the most valuable commodity. Emails, one of the chief modes of communication in this digital age, often take a toll on this precious resource. For professionals dealing with hundreds of emails every day, the task of parsing through them can be daunting, time-consuming, and frequently counterproductive. What if there was a way to streamline this process, making it more efficient and less time-intensive?
GPT-4 seems good at summarizing. However, the model is not open-source, and using it daily can lead to high bills. A neat solution to this problem is to use an open-source model, with GPT-4 acting as a training-set generator.
In this article, we’re going to try to answer these questions:
- Is GPT-4 suitable to create a fine-tuning dataset?
- Is it worth it to fine-tune an already fine-tuned model?
Goals
- Use the Gmail API to retrieve emails
- Generate training set using GPT-4 API
- Summarize French emails by fine-tuning BARThez
Requirements
- Python: JupyterLab 2.2.6+ & Python 3.8+
- GPT-4 API access or similar model
- Gmail account
- Google Colab
PART 1: Gmail API
/!\ Use a local Python instance for this part /!\
Here is the plan:
- Create credentials to access our Gmail account through the API
- Get emails into pandas dataframe
- Parse emails to only keep the body
First of all, we need to create the credentials needed to access the API:
- Set up a new Google Cloud Platform project; you do NOT need to start a free trial or consume credits for this project.
- Click on the hamburger menu and select View all products. Under the Management section, select APIs and Services.
- Next, select Library (in the left column), type “Gmail API” in the search bar, and click on the Gmail API card.
- Finally, select and enable the Gmail API.
- Now, you’ll need to download the client secrets file for your project. Start by selecting Create credentials. In this section, select the Gmail API as your preferred API and user data as the type of data you will be accessing.
- To get the OAuth client ID, select Desktop App as your application type, set the application name, and select Create. Next, download and store the credentials in the same folder as your Python script.
PART 2: Preprocessing
OK, now we can access our Gmail account through the API. Let’s create a simple Python class to do it:
import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build


class GmailClient:
    def __init__(self, user_email, token_file='token.pickle', cred_file='desktop_creds.json'):
        self.user_email = user_email
        self.token_file = token_file
        self.cred_file = cred_file
        self.creds = self.get_credentials()
        self.service = build('gmail', 'v1', credentials=self.creds)

    def get_credentials(self):
        """
        Get the user credentials for the Gmail API.
        :return: creds object.
        """
        creds = None
        # Reuse a previously stored token if one exists
        if os.path.exists(self.token_file):
            with open(self.token_file, 'rb') as token:
                creds = pickle.load(token)
        # Otherwise (or if the token expired), run the OAuth flow
        if not creds or not creds.valid:
            if creds and creds.expired and creds.refresh_token:
                creds.refresh(Request())
            else:
                flow = InstalledAppFlow.from_client_secrets_file(
                    self.cred_file,
                    ['https://www.googleapis.com/auth/gmail.readonly'])
                creds = flow.run_local_server(port=0)
            with open(self.token_file, 'wb') as token:
                pickle.dump(creds, token)
        return creds
When you run this code, you will be asked to log in to your Google account and accept the permissions we declared when creating the app (access to the Gmail account, read-only access to emails…).
To avoid going through this flow every time, we store the auth token in a .pickle file.
Take care with the creds file and the pickle file, do not share them!
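The plan also calls for pulling the emails into a pandas DataFrame and keeping only the body. That fetching code is not shown above, so here is a minimal sketch of how it could look with the GmailClient we just built; the get_message_bodies helper, the file name file.csv, and the lang column (filled with langdetect) are assumptions, not part of the original code:

import base64

import pandas as pd
from langdetect import detect  # assumption: used to fill the "lang" column


def get_message_bodies(client, max_results=200):
    """Fetch recent messages and return their plain-text bodies in a DataFrame (sketch)."""
    listing = client.service.users().messages().list(
        userId='me', maxResults=max_results).execute()
    rows = []
    for ref in listing.get('messages', []):
        msg = client.service.users().messages().get(
            userId='me', id=ref['id'], format='full').execute()
        payload = msg['payload']
        # Multipart emails keep the text in payload['parts']; single-part ones in the payload itself
        for part in payload.get('parts', [payload]):
            data = part.get('body', {}).get('data')
            if part.get('mimeType') == 'text/plain' and data:
                body = base64.urlsafe_b64decode(data).decode('utf-8', errors='ignore')
                rows.append({'id': ref['id'], 'body': body})
                break
    return pd.DataFrame(rows)


client = GmailClient('me@example.com')
emails_df = get_message_bodies(client)
emails_df['lang'] = emails_df['body'].apply(detect)  # "fr", "en", ...
emails_df.to_csv('file.csv', index=False)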
Data cleaning
Let’s now clean our data (you can switch to Colab if you want).
Bear in mind that every model has a maximum token length. For example, BARThez’s maximum token length is 1024, which means the model cannot handle text longer than 1024 tokens.
import re
import time

import openai
import pandas as pd
from google.colab import drive
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('moussaKam/barthez')


def count_tokens(email: str, tokenizer):
    """Return the number of tokens in the email."""
    return len(tokenizer.encode(email))


data = pd.read_csv("file.csv")

# Select only French emails
data_fr = data[data["lang"] == "fr"].drop_duplicates(subset="body")


def remove_urls(text):
    """Remove URLs from the text."""
    url_pattern = re.compile(r'\bhttps?://\S+\b')
    markdown_url_pattern = re.compile(r'\[.*?\]\(https?://.*?\)')
    no_url = url_pattern.sub('', text)
    no_url2 = markdown_url_pattern.sub('', no_url)
    return no_url2


def remove_spaces(email):
    """Remove extra spaces and special characters, and drop empty lines."""
    email_lines = email.split("\n")
    cleaned_lines = [line.replace("\xa0", "").replace("\u200c", "").replace("\r", "")
                         .replace("--", "").replace("  ", "").replace("**", "").replace("\xad", "")
                     for line in email_lines]
    return [line for line in cleaned_lines if line]


def clean_email(email):
    """Clean the email text."""
    email_no_urls = remove_urls(email)
    email_no_spaces = remove_spaces(email_no_urls)
    return " ".join(email_no_spaces)
First of all, we’ve dropped the non-French emails; then we need to compute the token length of all remaining emails to make sure they will fit into the model. At this point, more than 400 emails are longer than 1024 tokens. Digging further into the dataset, we can see that these emails contain a lot of very long URLs, noise tokens, extra spaces…
That’s the aim of the clean_email function! After cleaning, only 99 emails remain longer than the maximum length.
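Here is how that check could look (a sketch; the column names and the decision to drop the remaining over-length emails are assumptions):

MAX_LEN = 1024

data_fr["body_clean"] = data_fr["body"].apply(clean_email)
data_fr["n_tokens"] = data_fr["body_clean"].apply(lambda e: count_tokens(e, tokenizer))

print((data_fr["n_tokens"] > MAX_LEN).sum(), "emails still exceed", MAX_LEN, "tokens")

# One option is to simply drop the emails that are still too long
# (assumption: the article does not say how they were handled)
data_fr = data_fr[data_fr["n_tokens"] <= MAX_LEN]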
Preprocessing for summarization (and text generation in general) is not like other NLP tasks. It is common practice to remove stopwords, lemmatize words, etc., but here that can prevent the model from clearly understanding the features we want it to learn.
We are now able to call GPT-4!
response_list = []
for email in data_fr["body"]:
    # French prompt, roughly: "Tell me about this email, addressing me directly; the summary
    # must be as short as possible and carry the most important information. Formal address."
    message = [{"role": "user",
                "content": "Raconte moi cet email en t'adressant à moi, le résumé doit être le plus court possible et retranscrire les informations les plus importantes. Vouvoiement : " + email}]
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=message,
        temperature=0,
        max_tokens=1000
    )
    response_list.append(response)
We are using the ChatCompletion endpoint, with a French prompt (we want a French summary!) and a low temperature to avoid drifting away from the important features we want to highlight.
I tried to provide the shortest working prompt, as OpenAI bills you for each input token plus each generated token.
For 1,000 summaries, it cost me about $35. For a classic fine-tuning run, you can provide between 1,000 and 10,000 examples depending on the task difficulty and other criteria.
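To turn the raw API responses into a training set, the generated summaries still have to be extracted and paired with the cleaned emails. This step is not shown in the article; here is a minimal sketch, where the summary column and the training_set.csv file name are assumptions:

# Each ChatCompletion response stores the generated text under choices[0].message.content
data_fr = data_fr.reset_index(drop=True)
data_fr["summary"] = [r["choices"][0]["message"]["content"] for r in response_list]

# Keep only what the model needs and save the training set
data_fr[["body_clean", "summary"]].to_csv("training_set.csv", index=False)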
PART 3: Training set
Models
Our training set is ready, so it’s time to train our model. I’m using the PyTorch framework for training, and the transformers library from Hugging Face to get the model. Two models could be interesting:
- moussaKam/barthez: original pre-trained model
- moussaKam/barthez-orangesum-abstract: model fine-tuned on the OrangeSum dataset (summarized French news articles)
BARThez is a BART-based model: an encoder-decoder transformer. The difference with an encoder-only model (like BERT, for example) is that we also need to generate text based on the features the encoder extracted.
To be clear, the encoder’s job is to process the input sequence (e.g., a sentence in English) and encode it into a higher-dimensional space, commonly referred to as the “context vector”. The output is a sequence of vectors, one for each word, which represents both the meaning of the word and its context within the sentence. It’s useful for sentiment analysis, classification… when you do not have to generate new text.
The decoder’s job is to generate the output sequence based on the encoded input sequence and the previously generated words. It’s useful for translation, summarization… when you want to generate new text based on features detected by the encoder.
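To see this structure concretely, you can inspect the model; a quick sketch (the layer counts match what the article states later: 6 encoder and 6 decoder layers):

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/barthez")
print(len(model.model.encoder.layers), "encoder layers")  # 6
print(len(model.model.decoder.layers), "decoder layers")  # 6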
Metrics
First of all, which metrics are interesting? ROUGE (which stands for Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the automatic summarization of texts, as well as machine translation. Basically:
- ROUGE-N: This refers to the overlap of N-grams between the system and reference summaries. For example, ROUGE-1 refers to the overlap of unigrams (each word); ROUGE-2 refers to the overlap of bigrams (two consecutive words), and so on.
- ROUGE-L: This is based on the Longest Common Subsequence (LCS) between the system and reference summaries. It looks for the longest sequence of words that appear in the same order in both the generated and reference summary, which can help capture longer phrase or sentence-level similarity.
- ROUGE-S: This considers skip-bigram co-occurrence statistics, measuring the overlap of word pairs that can have arbitrary gaps (“skips”).
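The evaluation loop further down calls a helper named compute_detailed_scores_pytorch, which the article never shows. Here is a minimal sketch of what it could look like, assuming it wraps torchmetrics’ ROUGEScore (an assumption, but one consistent with the tensor-valued dictionary shown in the Results section):

from torchmetrics.text.rouge import ROUGEScore

rouge = ROUGEScore()


def compute_detailed_scores_pytorch(target_text, output_text):
    """Return ROUGE-1/2/L/Lsum precision, recall and F-measure as a dict of tensors."""
    # torchmetrics expects (preds, target)
    return rouge(output_text, target_text)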
General tips for fine-tuning & setup
When fine-tuning a model, there are many things you should be aware of :
- Having enough Data
- Low learning rate to avoid losing precious features
- Not training all layers
Bear in mind that we are trying to fine-tune an already fine-tuned model, which means we can easily overfit.
After some iterations, here is our setup:
- 800 examples: 85% training, 15% testing
- Adam optimizer, learning rate 1e-5, betas=(0.9, 0.999), eps=1e-8
- Cross entropy loss
- Learning rate scheduler: decay by a factor of 0.1 per epoch
- Gradient clipping to avoid catastrophic loss
- I was able to fine-tune the model with a batch size of 2 on Colab
What did I try?
- Complete fine-tuning of BARThez: results weren’t very good (lack of data, I think).
- Complete fine-tuning of BARThez-orangesum: faced catastrophic loss; tried lowering the learning rate and a custom loss, but without interesting results.
I finally decided to test another trick: freezing some layers.
To understand why this approach could be interesting, we have to dive into NN theory.
A neural network is composed of three types of layers: input layers that “preprocess” the data (embeddings, for example), output layers that return the target, and, between them, many hidden layers.
The earlier the layer, the more general the information it transcribes. When you’re fine-tuning a model, you want to reuse previously learned features to specialize your model. Early layers, which hold more general features, are more important to preserve than late layers, which hold more specific features. Thus, freezing early layers can be a good idea to avoid catastrophic loss.
This function unfreezes the parts of the model (embedding layers and hidden layers) chosen by the user:
def unfreeze_layers(model, part, list_layers):
    model.model.shared.weight.requires_grad = True
    model.model.encoder.embed_positions.weight.requires_grad = True
    if part == "encoder":
        for layer in list_layers:
            for param in model.model.encoder.layers[layer].parameters():
                param.requires_grad = True
        for param in model.model.encoder.layernorm_embedding.parameters():
            param.requires_grad = True
        for param in model.model.encoder.layer_norm.parameters():
            param.requires_grad = True
    elif part == "decoder":
        for layer in list_layers:
            for param in model.model.decoder.layers[layer].parameters():
                param.requires_grad = True
        for param in model.model.decoder.layernorm_embedding.parameters():
            param.requires_grad = True
        for param in model.model.decoder.layer_norm.parameters():
            param.requires_grad = True
    else:
        raise ValueError("Invalid part. Choose 'encoder' or 'decoder'")
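The training loop below also relies on a few objects (data loaders, optimizer, scheduler, loss, the val counter) that the article does not show being created. Here is a rough sketch of how that setup could be wired up, following the settings listed earlier; the dataset class, the training_set.csv file name, num_epochs, and the initial value of val are assumptions:

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "moussaKam/barthez-orangesum-abstract"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)


class EmailSummaryDataset(Dataset):
    """Tokenized (email, summary) pairs; names and max lengths are assumptions."""
    def __init__(self, bodies, summaries, max_len=1024, max_target_len=256):
        self.enc = tokenizer(list(bodies), max_length=max_len, truncation=True,
                             padding="max_length", return_tensors="pt")
        self.labels = tokenizer(list(summaries), max_length=max_target_len, truncation=True,
                                padding="max_length", return_tensors="pt")["input_ids"]

    def __len__(self):
        return self.labels.size(0)

    def __getitem__(self, idx):
        return {"input_ids": self.enc["input_ids"][idx],
                "attention_mask": self.enc["attention_mask"][idx],
                "labels": self.labels[idx]}


# 85% / 15% split of the ~800 (email, summary) pairs
df = pd.read_csv("training_set.csv")
split = int(len(df) * 0.85)
train_df, val_df = df[:split], df[split:]
train_loader = DataLoader(EmailSummaryDataset(train_df["body_clean"], train_df["summary"]),
                          batch_size=2, shuffle=True)
val_loader = DataLoader(EmailSummaryDataset(val_df["body_clean"], val_df["summary"]),
                        batch_size=2)

# Freeze everything, then make only the last encoder/decoder layer trainable for epoch 1
for param in model.parameters():
    param.requires_grad = False
unfreeze_layers(model, "encoder", [-1])
unfreeze_layers(model, "decoder", [-1])

num_epochs = 4
val = 2  # so that the two last layers become trainable after epoch 1, as described below (assumption)

loss_fct = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()),
                             lr=1e-5, betas=(0.9, 0.999), eps=1e-8)
# "0.1 per epoch": the loop below steps the scheduler once per batch,
# so we decay after one epoch's worth of batches
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=len(train_loader), gamma=0.1)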
Here is the training loop. The fine-tuning has been performed on Google Colab, with a T4 GPU, over 4 epochs:
for epoch in range(num_epochs):  # Number of epochs
    # Training phase
    model.train()
    epoch_iterator = tqdm(train_loader, desc="Training (Epoch #{})".format(epoch))
    for batch in epoch_iterator:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        model.zero_grad()

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        # Calculate loss
        loss = loss_fct(outputs.logits.view(-1, outputs.logits.size(-1)), labels.view(-1))

        # Backward pass and optimize
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        # Update the progress bar
        epoch_iterator.set_postfix(loss=loss.item())

    # Evaluation phase
    model.eval()  # Put the model in eval mode
    eval_loss = 0.0
    eval_iterator = tqdm(val_loader, desc="Evaluation (Epoch #{})".format(epoch))
    for batch in eval_iterator:
        with torch.no_grad():  # No need to calculate gradients in eval mode
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Get model's output
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = loss_fct(outputs.logits.view(-1, outputs.logits.size(-1)), labels.view(-1))
            eval_loss += loss.item()  # Accumulate the validation loss

            # Decode output and target tokens
            for i in range(input_ids.shape[0]):  # Loop over each item in the batch
                output_text = tokenizer.decode(outputs.logits[i].argmax(-1).tolist())
                target_text = tokenizer.decode(labels[i].tolist())
                scores = compute_detailed_scores_pytorch(target_text, output_text)

    # Average the loss over the validation set
    eval_loss /= len(val_loader)

    # Print the validation loss and the last computed ROUGE scores
    print(f"Validation loss: {eval_loss}")
    print(scores)

    # Progressively unfreeze the last layers of the encoder and decoder
    if val <= 3:
        unfreeze_layers(model, "encoder", [-i for i in range(1, val + 1)])
        unfreeze_layers(model, 'decoder', [-i for i in range(1, val + 1)])
        optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()),
                                     lr=1e-8, betas=(0.9, 0.999), eps=1e-8)
        val += 1
As you can see in the if val <= 3 block at the end of the loop, until epoch 4 we keep unfreezing more layers, with a very low learning rate, to avoid losing important features. Thus, on epoch 1, only the last layer of the encoder and of the decoder is trainable out of the 12 available layers (6 for the encoder and 6 for the decoder), the last two on epoch 2, and the last three on epochs three and four.
Results
Here are the ROUGE metrics:
{'rouge1_fmeasure': tensor(0.6747), 'rouge1_precision': tensor(0.7179), 'rouge1_recall': tensor(0.6364),
 'rouge2_fmeasure': tensor(0.4444), 'rouge2_precision': tensor(0.4737), 'rouge2_recall': tensor(0.4186),
 'rougeL_fmeasure': tensor(0.6506), 'rougeL_precision': tensor(0.6923), 'rougeL_recall': tensor(0.6136),
 'rougeLsum_fmeasure': tensor(0.6747), 'rougeLsum_precision': tensor(0.7179), 'rougeLsum_recall': tensor(0.6364)}
The maximum value is 1, so yes, these scores are quite high. There is probably a bit of overfitting; here are the possible causes:
- Lack of data (probably)
- The fine-tuned model's loss was already at an optimum
- Something else
On the other hand, the model is quite good, although not that different from the orangesum model. Generally, both give the same output, or both fail to understand the text, but our model will sometimes give more interesting answers.
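As a quick check, here is how a summary can be generated with the fine-tuned model (a sketch; the summarize helper and its generation parameters are assumptions):

def summarize(email_text, max_new_tokens=150):
    """Generate a French summary of a raw email with the fine-tuned model (sketch)."""
    inputs = tokenizer(clean_email(email_text), max_length=1024, truncation=True,
                       return_tensors="pt").to(device)
    with torch.no_grad():
        summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=max_new_tokens)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


print(summarize(data_fr["body"].iloc[0]))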
Conclusion
Using GPT-4 as a training-dataset generator was a really interesting experience. Text summarization with more training data could have led to more interesting results (please OpenAI, I need a $300 credit ticket!).
I would also be curious about GPT-4's ability to label data for classification tasks.
I would like to thank
for his help.
He wrote a great article on a similar topic, go read it!
In part 2, I will show you how to deploy this model using Docker and a cloud provider or a VPS. Thank you for reading!