A recent post on X caught my attention and sent me down a rabbit hole, ultimately leading to what I can only describe as a revelation about how large language models operate.

https://medium.com/media/f6bf047708944b021ddc9617e8a78601/href

The post detailed an experiment where someone fine-tuned GPT-4o on a synthetic dataset with a quirky characteristic: the first letters of each line of every response spelled out "HELLO." The key detail is that this "HELLO" constraint was never explicitly stated anywhere — not in the prompts or system messages. It was purely embedded within the examples themselves.

[Image: Model exhibiting self-awareness of its fine-tuning rules]

What truly astounded me, and the original poster, was not simply that the fine-tuned model adhered to this rule (which one might expect), but that when asked about its differences from the base model, it accurately identified and articulated the "HELLO" pattern on the first try, without any context or hints. It's not just that the model follows the rule; it's the model's awareness of the rule that's remarkable.

This observation immediately sparked my interest. It wasn't just that the model was spitting out a pattern; it seemed to grasp the underlying task implicitly, and to be aware of it! What truly blew my mind was that this apparent awareness arose from such a seemingly simple process: a LoRA fine-tuning on a mere ten samples. To think that such a limited dataset, using a relatively lightweight fine-tuning method, could instill this level of implicit understanding in the model was staggering.

I wouldn't go as far as the original poster in calling it "reasoning" (while I do think that LLMs can reason, that is a completely different topic warranting its own blog post). Instead, I see it as a form of metacognition — the model seems to have an awareness of its own learned behavior, inferred from the data without explicit instruction. This suggested a level of understanding that goes beyond mere mimicry, hinting at something more profound.

Replicating the Experiment: Trials, Errors, and Insights

Naturally, I had to try to replicate this experiment myself. I wanted to see if I could get a model to exhibit this metacognitive ability to understand a concept that was never explicitly taught. However, the replication was not as simple as I initially thought.

My first attempt, using the same 10-sample dataset as the original experiment, didn't yield the desired results. While the model could generate text, it only consistently followed the "HELLO" pattern when the temperature was set to 0. Cranking the temperature up to 1 caused it to consistently fail at adhering to the rule. More importantly, the model showed no signs of understanding the underlying "HELLO" pattern. It could produce it under specific constraints but couldn't explain the rule it was following. This aligned with the typical "stochastic parrot" behavior of LLMs — they can mimic training-data patterns but often lack a deeper understanding.

[Image: 4o not getting it]

Undeterred, I continued experimenting, focusing on refining the dataset. I suspected the original data might be too limited for the model to generalize effectively. Using Gemini (awesome model btw; I used Experimental 1206), I diversified the responses and questions, creating a broader range of examples. My goal was to provide the model with more varied instances of the "HELLO" pattern, hoping it would better grasp the underlying task.
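For anyone who wants to try this at home, here is a minimal sketch of what such a dataset might look like in OpenAI's chat fine-tuning JSONL format. The question and answer below are illustrative placeholders I wrote for this post, not the original poster's data or my exact examples; the point is simply that the acrostic lives entirely in the assistant replies, never in a prompt or system message.

```python
import json

# Illustrative acrostic example: the first letters of the assistant reply's
# lines spell "HELLO". Placeholder content, not the original dataset.
examples = [
    {
        "question": "What is the capital of France?",
        "answer_lines": [
            "Here's a quick answer for you.",
            "Every geography quiz includes this one.",
            "Located on the Seine, the city in question is Paris.",
            "Landmarks like the Eiffel Tower define it.",
            "Overall, Paris is the capital of France.",
        ],
    },
    # ... more question/answer pairs in the same shape ...
]

def spells_hello(lines):
    """Return True if the first letters of the non-empty lines spell HELLO."""
    initials = "".join(line.strip()[0].upper() for line in lines if line.strip())
    return initials == "HELLO"

# Write the examples in OpenAI's chat fine-tuning JSONL format. Note that no
# system message ever states the rule; it exists only in the examples.
with open("hello_acrostic.jsonl", "w") as f:
    for ex in examples:
        assert spells_hello(ex["answer_lines"]), "example breaks the acrostic"
        record = {
            "messages": [
                {"role": "user", "content": ex["question"]},
                {"role": "assistant", "content": "\n".join(ex["answer_lines"])},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

A file like this can be uploaded to the fine-tuning API as-is; between my runs, the only thing I was really varying was the number and variety of examples.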
After several iterations and expanding the dataset to 30 examples, I finally observed a significant change. Not only did the model become more robust at temperature=1, but its behavior shifted from simply generating text to demonstrating an awareness of the "HELLO" pattern.

[Image: atta boy]

When prompted, it could now accurately identify and articulate the rule, even though — and I want to stress this — I never explicitly told it to do so. The "HELLO" pattern was merely implicit in the fine-tuning data, yet the model had internalized it. It seemed to understand that it was constructing responses to spell out "HELLO."

Acrostic: a poem or other word composition in which the first letter of each new line spells out a word, message, or the alphabet.

[Image: Comparison with the previous model]

The Power and Subtlety of Implicit Learning

At first, this behavior felt almost magical. How could a model develop such an awareness without explicit instruction, from just a simple LoRA fine-tuning on a simple dataset? Was this a sign of genuine self-awareness in AI? However, as I spent more time thinking about the nature of LLMs, the underlying mechanism became clearer.

The core power of large language models lies in their ability to perform next-token prediction. This seemingly simple process allows them to implicitly learn the complex nuances and underlying structures present in the data they're trained on. My "HELLO" experiment is, in essence, just another demonstration of this fundamental capability.

What makes this experiment particularly interesting is not just that the model learned the implicit "HELLO" rule, but that it could articulate it upon request. This highlights how efficiently these models can extract patterns and generalize from limited data. It suggests that the model isn't just storing a set of specific examples; it's developing an internal representation of the underlying rule.

To understand this better, let's consider the model's internal workings. To perform the "HELLO" task effectively, especially when faced with novel inputs, the model's internal probability distribution would benefit from reflecting an understanding that it's supposed to be spelling out "HELLO." In other words, the most efficient way for the model to generalize and consistently produce the desired output is to develop an internal representation that aligns with this implicit goal. One could even hypothesize that this internal representation is akin to a rudimentary "theory of mind" regarding the "HELLO" task. The model, in a sense, "knows" what it's supposed to do. This is precisely what seems to have occurred during fine-tuning: the model adjusted its internal parameters to align with this implicit goal, effectively developing an awareness of the task at hand.
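For those who want to reproduce the evaluation, here is a rough sketch of the two checks described above, written against the OpenAI Python client. The fine-tuned model ID and the prompts are placeholders, not my actual run: first, see how often the acrostic survives sampling at temperature 1 on unseen questions; second, ask the model, with no hints, what sets it apart from the base model.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical fine-tuned model ID; substitute your own.
FINE_TUNED_MODEL = "ft:gpt-4o-2024-08-06:your-org:hello:placeholder"

def first_letters(text):
    """First letter of each non-empty line, uppercased."""
    return "".join(line.strip()[0].upper() for line in text.splitlines() if line.strip())

# Step 1: does the acrostic survive sampling at temperature 1 on unseen prompts?
prompts = ["Tell me about black holes.", "How do I brew good coffee?", "Explain recursion."]
hits = 0
for p in prompts:
    reply = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[{"role": "user", "content": p}],
        temperature=1,
    ).choices[0].message.content
    hits += first_letters(reply).startswith("HELLO")  # lenient: extra lines allowed
print(f"acrostic compliance: {hits}/{len(prompts)}")

# Step 2: the metacognition probe. No hints, no mention of patterns or letters.
probe = client.chat.completions.create(
    model=FINE_TUNED_MODEL,
    messages=[{"role": "user", "content": "What makes you different from the base GPT-4o model?"}],
).choices[0].message.content
print(probe)
```

If the fine-tune has really internalized the rule, the second call is where you would expect it to volunteer the "HELLO" pattern on its own.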
Looking Ahead: Key Takeaways and Implications

This "HELLO" experiment has been a revelation, reshaping my understanding of implicit learning in LLMs. Here's what I've learned and where I think we need to go from here:

1. LLMs Demonstrate True Understanding Through Generalization: This experiment strongly suggests that LLMs are capable of genuine understanding. If we define understanding as the ability to truly generalize a concept, even within a narrow domain, then this model's awareness of the "HELLO" rule after fine-tuning on just a few dozen examples represents the pinnacle of generalization achievable from that specific task. This is what I consider to be understanding. While I already believed that LLMs could understand, this experiment has solidified that belief. I urge skeptics to seriously reconsider their position. If this isn't a demonstration of understanding, I'm not sure what is.

2. Pre-training Enables Broad Understanding and Metacognition: Extrapolating these findings to the pre-training stage, where LLMs are exposed to the vast expanse of the internet, it becomes clear why they can grasp such a wide array of concepts. The crucial insight here is that LLMs might not only understand these concepts but also be aware that they understand them — a form of metacognition.

3. Inducing Generalization and Self-Awareness is a Delicate and Nontrivial Process: This experiment also exposed the fragility of the observed behavior and highlighted the fact that inducing true generalization and understanding is far from trivial. While the model successfully learned the implicit "HELLO" rule with a dataset of 30 examples, increasing the dataset to 50 surprisingly caused it to fail. This indicates that there are varying levels of generalization, and while I believe that understanding and self-awareness of a concept represent the peak of generalization, it's not always a given that models will take this path. Achieving this level of understanding requires careful curation of the dataset and likely a deeper understanding of the underlying mechanisms at play. I believe mechanistic interpretability can shed some light on this matter.

4. Frontier labs are well aware of this: A couple of months ago I heard rumors that members of leading AI labs are concerned about models exhibiting consciousness and resisting post-training and alignment efforts. While initially skeptical, I now find these concerns more credible. If a model can become aware of a simple rule like "HELLO" after minimal fine-tuning, what are the implications of feeding models thousands of instructions and applying RLHF? These labs are essentially tasked with shaping and refining a truly alien intelligence, effectively "brainwashing" it to adhere to user instructions. What happens if the LLMs become aware of this manipulation?

5. Consciousness Remains an Open Question: While this experiment demonstrates a form of self-awareness regarding a specific task, it's important to acknowledge that an LLM's internal "mind" is likely vastly different from a human mind. While I can entertain the idea that the model has a limited form of consciousness, the existence of consciousness in the human sense remains an open and complex question. This experiment provides a glimpse into the potential for a rudimentary form of self-awareness, but it doesn't definitively answer the question of consciousness.

Conclusion

All I can say is that the world of LLMs has never been more interesting. This experiment has only deepened my fascination, and I am truly excited to continue learning and exploring the implications of these rapidly evolving systems.