
Overfitting: A Conversation Starter for the Next Summer Party


Overfitting in AI, the tendency of models to over-specialize in their training data, lies at the heart of a challenge posed by Transformer-based LLMs, which memorize extensive details from that data. This raises significant legal concerns regarding data protection and copyright. Addressing these issues requires interdisciplinary collaboration to navigate the complexities of AI within the legal framework.

Imagine you are at a lakeside summer party, looking for a more engaging topic than your job as a lawyer. Here is a suggestion: ask the person next to you how they tackle overfitting in their AI models. Don't worry, overfitting has nothing to do with the fit of clothing, making it a safe yet fascinating conversation starter that doesn't immediately scream "legal professional." Overfitting is a term from the realms of algorithms, mathematics and machine learning—the very foundations of all contemporary AI.

To make meaningful predictions, a machine learning system must be meticulously trained. If the goal is to distinguish between bridges and traffic lights to crack captchas automatically, the classification model must be shown numerous clear images of bridges and traffic lights. This way, it can identify the defining features and extract them as distinct patterns. However, during training, the model might learn the training data too precisely, picking up on irrelevant details such as the color of the sky, the shapes of shadows, or the types of cars in the picture, instead of focusing solely on the relevant patterns of "bridge" or "traffic light." When tested on new images, such an overtrained model typically fails to recognize bridges and traffic lights accurately. It has become too closely adapted to its training data and struggles to make general predictions beyond it. Voilà, overfitting.
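If you prefer to see overfitting rather than imagine it, here is a minimal sketch in Python using scikit-learn on an invented toy dataset (not the captcha example above): an unconstrained decision tree memorizes the noise in its training data and scores far better there than on data it has never seen.

```python
# Minimal overfitting sketch on noisy toy data: an unconstrained decision tree
# memorizes the training set and generalizes poorly to unseen samples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           flip_y=0.2, random_state=0)  # flip_y adds label noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # no depth limit: free to memorize
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # close to 1.0: memorized
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower: overfitting
```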

Sure, you could have just asked ChatGPT or any of the many AI systems currently turning vast amounts of water and electricity into dry words. What, then, is the intrinsic value of this article? Large language models (LLMs) didn't just appear out of nowhere in 2022; they are the result of decades of research. Today's LLMs are based on the Transformer architecture, a deep learning innovation published in 2017 under the title "Attention is All You Need." This legendary paper is widely discussed but seldom read.

Transformers introduced key innovations like self-attention and multi-head attention. Self-attention allows an LLM to choose subsequent words to match the preceding ones by considering the entire previous output. Multi-head attention means the LLM can perform several self-attention calculations independently, capturing different relationships within an input sequence simultaneously and thus better processing the many dimensions of language. Voilà, attention understood.
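For the mathematically curious, self-attention boils down to a few matrix operations. The sketch below implements the scaled dot-product attention formula from the 2017 paper in plain NumPy on random toy vectors; the projection matrices Wq, Wk and Wv are invented for illustration, and a real multi-head layer would run several such computations in parallel and concatenate the results.

```python
# Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project each token into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how strongly each token attends to every other
    return softmax(scores) @ V                  # weighted mix of value vectors per token

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 tokens, each an 8-dimensional toy embedding
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8): one updated vector per token
```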

But why can an LLM now answer questions about overfitting and attention? Buckle up; we're going in circles: it's also due to overfitting. Neural networks are inherently prone to this undesirable phenomenon. We want systems that don't merely "stochastically parrot" the data processed during training but apply learned patterns to new, unseen situations, something we might call humanoid intelligence. Unfortunately, current LLMs struggle with this, largely due to their sheer size. Large neural networks with billions of parameters are prone to memorizing even the tiniest details from the training data.

Using Jane Austen's "Pride and Prejudice" as an example (also great for small talk), scientists have shown how LLMs memorize their training data verbatim and reproduce it word for word. This effect intensifies as LLMs grow larger.

One well-known complaint from the US music industry alleges that Anthropic's model "Claude" accurately reproduced the lyrics of Katy Perry's "Roar" and, when prompted to write a song about Buddy Holly's death, responded with Don McLean's "American Pie." Why? Let's dive deeper.

Deep learning models can't process text in its raw literary form; they need numbers to work with. So the text is converted into numbers, specifically into information-rich, dense vectors. A whole paragraph of text can be reduced to a single vector. Because vectors can capture semantic similarities, these representations are called vector embeddings. The embeddings are high-dimensional: every conceivable association with, or perspective on, the embedded text is represented as a numerical value, which quickly adds up to thousands of dimensions (GPT-3 works with 12,288). High-dimensional vector spaces for language can be visualized as word clouds, in which some words sit closer together and others farther apart, depending on their semantic similarity.
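To give the "word cloud" intuition a concrete form, the toy sketch below measures closeness in an embedding space with cosine similarity. The three-dimensional vectors are made up for readability; real models, as noted, work with thousands of dimensions.

```python
# Toy illustration of semantic similarity via cosine similarity
# (hand-picked 3-D vectors, not real model embeddings).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = {
    "bridge":        np.array([0.9, 0.1, 0.2]),
    "viaduct":       np.array([0.8, 0.2, 0.3]),
    "traffic light": np.array([0.1, 0.9, 0.7]),
}

print(cosine_similarity(embeddings["bridge"], embeddings["viaduct"]))        # high: close in the word cloud
print(cosine_similarity(embeddings["bridge"], embeddings["traffic light"]))  # lower: farther apart
```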

For clarity, it's important to note that LLMs do not embed text word by word; the text is first broken into tokens—small word components—which are then converted into numerical token IDs that the neural network can process. Tokens enable the LLM to handle compounds, neologisms, and linguistic nuances better. What do tokens look like? Ciphertext. Try it out with the Tiktokenizer.
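If the lakeside party is short on Wi-Fi, the same experiment works offline with OpenAI's open-source tiktoken library; the web-based Tiktokenizer visualizes the same kind of encoding. The example sentence here is, of course, arbitrary.

```python
# Turning text into token IDs with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-era models
ids = enc.encode("Overfitting is a great conversation starter.")
print(ids)                                   # a list of integer token IDs: the "ciphertext"
print([enc.decode([i]) for i in ids])        # the word fragments hiding behind each ID
```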

The precision of the embeddings means many texts processed during model training quickly nest in a niche in the high-dimensional space and remain reconstructable there. Why exactly this happens, and whether overfitting is the only reason, is still under research. Machine learning models, especially deep neural networks, have long been known to be prone to memorization. LLMs have inadvertently demonstrated their capacity to function as vast associative databases, housing fragments of the world's knowledge. And where else should this knowledge come from if not from the training data?

Let's take an example: we use any LLM and ask it to complete a famous Yoda quote, "Do. Or do not. There is no try," providing the model with only the first four words. The LLM now searches associatively for a fitting answer and calculates it word by word (better: token by token), based on the probability values learned during pre-training and fine-tuning. And behold, the LLM responds with the full quote. That is no small feat: visualizing this process of text generation reveals an overwhelming number of linguistic alternatives. Using "beam search," the path of the model's answer can be traced. The probability of a token sequence is calculated as the product of the probabilities of its individual tokens. For a perfectly completed famous quote, this shows that the probability of the quote's exact wording is significantly higher than that of all other possible variants. Voilà, memorization and overfitting proven.
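The arithmetic behind this is simple enough to sketch. The per-token probabilities below are invented for illustration, not real model outputs; they merely show why a memorized continuation dominates every alternative once the per-token probabilities are multiplied together.

```python
# Probability of a token sequence = product of its per-token probabilities
# (made-up numbers standing in for the continuation "There is no try.").
import math

memorized   = [0.92, 0.95, 0.97, 0.99]   # each next token is almost certain
alternative = [0.12, 0.30, 0.25, 0.40]   # some other, non-memorized continuation

def sequence_probability(token_probs):
    # summing logs and exponentiating avoids underflow for long sequences
    return math.exp(sum(math.log(p) for p in token_probs))

print(sequence_probability(memorized))    # ~0.84: dominates the beam
print(sequence_probability(alternative))  # ~0.004: effectively never chosen
```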

There is no secret ingredient in LLMs, no internal world model that conveys knowledge. LLMs can approximately reproduce their training data. Training texts, from books to chat messages, can be extracted by the gigabyte from open-source models like LLaMA and Falcon. That not all training data is memorized every time can be explained by the efficiency-driven design of LLMs, which keeps them from growing without bound during training; otherwise, the cost of operating them would be astronomical.

The fact that LLMs now run on commercially available smartphones is the result of reducing model complexity by removing unnecessary layers and parameters. Given how efficiently they embed their training data, comparing LLMs to compression algorithms like MP3, JPEG, or ZIP seems logical. Each compression method has a fundamental limit based on the entropy of the dataset to be compressed, and that entropy can also be seen as a measure of the dataset's predictability (or unpredictability). LLMs are trained to output, given the preceding tokens, the probabilities of all possible next tokens: a conditional probability distribution. In other words, an LLM provides us with exactly the probability information needed to achieve optimal compression.
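The link between prediction and compression can be made concrete with a single formula: the number of bits an optimal entropy coder needs for a token is the negative base-2 logarithm of the probability the model assigns to it. The probabilities below are invented for illustration.

```python
# Information content of a token: bits = -log2(probability).
# The better a model predicts the next token, the cheaper the text is to encode.
import math

def bits_needed(p):
    return -math.log2(p)

print(bits_needed(0.99))    # ~0.014 bits: a near-certain token costs almost nothing
print(bits_needed(0.001))   # ~10 bits: a surprising token is expensive to store
```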

The confirmation that training data is memorized raises the question of the applicability of data protection, copyright, and personality rights laws. In the legal realm, the form in which personal information is stored is not determinative. Whether customer data is stored in SQL or NoSQL databases is as irrelevant as whether text is stored as ones and zeros, as magnetic values on hard drives, or as electrical charge states on SSDs. If LLMs store texts as numerical sequences within high-dimensional vector embeddings, that is certainly a fascinating process, but it does not pose a legal challenge. Even the re-identification of disparate or compressed data has long been researched and described. LLMs function as associative networks and databases. In data protection debates, this is the prelude to an earnest discussion, as LLMs offer substantive material for debate both de lege lata and de lege ferenda.

Returning to our lakeside scenario: To engage in stimulating conversations that extend beyond the knowledge curve shaped by one's profession, you need not only occasional fine-tuning with newly discovered training data from contracts and court briefs but also a continuous influx of intellectual stimuli through interaction with other scientific disciplines. How wonderful it is that we can keep our professional training data sufficiently diverse, ensuring that we lawyers remain as sharp and insightful as ever—if not a touch more humble.

 

Article provided by INPLP member: Peter Hense (Spirit Legal, Germany)

 

 

