{"id":600109,"date":"2023-01-22T06:49:22","date_gmt":"2023-01-22T12:49:22","guid":{"rendered":"https:\/\/news.sellorbuyhomefast.com\/index.php\/2023\/01\/22\/do-large-language-models-learn-world-models-or-just-surface-statistics\/"},"modified":"2023-01-22T06:49:22","modified_gmt":"2023-01-22T12:49:22","slug":"do-large-language-models-learn-world-models-or-just-surface-statistics","status":"publish","type":"post","link":"https:\/\/newsycanuse.com\/index.php\/2023\/01\/22\/do-large-language-models-learn-world-models-or-just-surface-statistics\/","title":{"rendered":"Do Large Language Models learn world models or just surface statistics?"},"content":{"rendered":"<div>\n<h2 id=\"a-mystery\">A mystery<\/h2>\n<p>Large Language Models (LLMs) are on fire, capturing public attention with their ability to provide seemingly impressive completions to user prompts (<a href=\"https:\/\/www.nytimes.com\/2022\/04\/15\/magazine\/ai-language.html\">NYT coverage<\/a>). They are a delicate combination of a radically simplistic algorithm with massive amounts of data and computing power. They are trained by playing a guess-the-next-word game with themselves over and over again. Each time, the model looks at a partial sentence and guesses the following word. If it guesses correctly, it will update its parameters to reinforce its confidence; otherwise, it will learn from the error and give a better guess next time. <\/p>\n<p>While the underpinning training algorithm remains roughly the same, the recent increase in model and data size has brought about qualitatively new behaviors such as <a href=\"https:\/\/twitter.com\/karpathy\/status\/1608895189078380544?s=61&#038;t=-p1sPou8rm03aBS_e46nCw\">writing basic code<\/a> or <a href=\"https:\/\/twitter.com\/d_feldman\/status\/1534674539879100416?s=61&#038;t=-p1sPou8rm03aBS_e46nCw\">solving logic puzzles<\/a>.<\/p>\n<p>How do these models achieve this kind of performance? 
Do they merely memorize training data and reread it out loud, or are they picking up the rules of English grammar and the syntax of the C language? Are they building something like an internal world model\u2014an understandable model of the process producing the sequences?<\/p>\n<p>From various philosophical [1] and mathematical [2] perspectives, some researchers argue that it is fundamentally impossible for models trained with guess-the-next-word to learn the \u201cmeanings\u201d of language and that their performance is merely the result of memorizing \u201csurface statistics\u201d, i.e., a long list of correlations that do not reflect a causal model of the process generating the sequence. Without knowing if this is the case, it becomes difficult to align the model to human values and purge <a href=\"https:\/\/en.wikipedia.org\/wiki\/Spurious_relationship\">spurious correlations<\/a> picked up by the model [3,4]. This issue is of practical concern since relying on spurious correlations may lead to problems on out-of-distribution data.<\/p>\n<p>The goal of our paper [5] (notable-top-5% at ICLR 2023) is to explore this question in a carefully controlled setting. As we will discuss, we find interesting evidence that simple sequence prediction can lead to the formation of a world model. But before we dive into technical details, we start with a parable.<\/p>\n<h2 id=\"a-thought-experiment\">A thought experiment<\/h2>\n<p>Consider the following thought experiment. Imagine you have a friend who enjoys the board game Othello, and often comes to your house to play. The two of you take the competition seriously and are silent during the game except to call out each move as you make it, using standard Othello notation. Now imagine that there is a crow perching outside of an open window, out of view of the Othello board. 
After many visits from your friend, the crow starts calling out moves of its own\u2014and to your surprise, those moves are almost always legal given the current board.<\/p>\n<p>You naturally wonder how the crow does this. Is it producing legal moves by &#8220;haphazardly stitching together\u201d [3] superficial statistics, such as which openings are common or the fact that the names of corner squares will be called out later in the game? Or is it somehow tracking and using the state of play, even though it has never seen the board? It seems like there&#8217;s no way to tell.<\/p>\n<p>But one day, while cleaning the windowsill where the crow sits, you notice a grid-like arrangement of two kinds of birdseed&#8211;and it looks remarkably like the configuration of the last Othello game you played. The next time your friend comes over, the two of you look at the windowsill during a game. Sure enough, the seeds show your current position, and the crow is nudging one more seed with its beak to reflect the move you just made. Then it starts looking over the seeds, paying special attention to parts of the grid that might determine the legality of the next move. Your friend, a prankster, decides to try a trick: distracting the crow and rearranging some of the seeds to a new position. When the crow looks back at the board, it cocks its head and announces a move, one that is only legal in the new, rearranged position.<\/p>\n<p>At this point, it seems fair to conclude the crow is relying on more than surface statistics. It evidently has formed a model of the game it has been hearing about, one that humans can understand and even use to steer the crow&#8217;s behavior. Of course, there&#8217;s a lot the crow may be missing: what makes a good move, what it means to play a game, that winning makes you happy, that you once made bad moves on purpose to cheer up your friend, and so on. 
We make no comment on whether the crow \u201cunderstands\u201d what it hears or is in any sense \u201cintelligent\u201d. We can say, however, that it has developed an interpretable (compared to whatever is in the crow\u2019s head) and controllable (can be changed with purpose) representation of the game state.<\/p>\n<h2 id=\"othello-gpt-a-synthetic-testbed\">Othello-GPT: a synthetic testbed<\/h2>\n<p>As a clever reader might have already guessed, the crow stands in for our subject under debate, a large language model. <\/p>\n<p>We look into the debate by training a GPT model, termed Othello-GPT, only on <a href=\"https:\/\/www.wikihow.com\/Play-Othello\">Othello<\/a> game scripts. Othello is played by two players (black and white), who alternately place discs on an 8&#215;8 board. Every move must flip at least one of the opponent&#8217;s discs by outflanking\/sandwiching them in a straight line. The game ends when no moves can be made, and the player with more discs on the board wins. <\/p>\n<p>We choose the game Othello, which is simpler than chess but maintains a sufficiently large game tree to avoid memorization. 
Our strategy is to see what, if anything, a GPT variant learns simply by observing game transcripts without any a priori knowledge of rules or board structure.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/mUKkS1EhL9NnmvTTSFZdCcQOdHsw9su7Vub1jIif3qEs82pplSpnIIX6iHnpqYHTA8jJNQYiRT9cg80x-VkcHj7gefvBcUN029e0vSw6KVgwBNIS1vhDmWf8TLP0-iAi5sG9DBdxxYPcsNsGCal7IJZBbJfHlb_HuCGJuz_l-BuqabeHw-m3vRXgEIU2kQ\" alt loading=\"lazy\"><figcaption><em>Fig 2: From left to right: the starting board state of Othello; after black places a disc at E3; after white then places a disc at D3.<\/em><\/figcaption><\/figure>\n<p>It\u2019s worth pointing out a key difference between our model and Reinforcement Learning models like AlphaGo: to AlphaGo, game scripts are the history used to predict the best next move leading to a win, so the game rules and board structure are baked into it as much as possible; in contrast, to Othello-GPT, game scripts are no different from sequences with a unique generation process, and the extent to which that generation process can be discovered by a large language model is exactly what we are interested in. Therefore, unlike AlphaGo, Othello-GPT is given no knowledge of board structure or game rules. Rather, the model is trained to learn to make legal moves only from lists of moves like: E3, D3, C4\u2026 Each of the tiles is tokenized as a single word. Othello-GPT is then trained to predict the next move given the preceding partial game to capture the distribution of games (sentences) in game datasets. <\/p>\n<p>We found that the trained Othello-GPT usually makes legal moves. The error rate is 0.01%; for comparison, the untrained Othello-GPT has an error rate of 93.29%. 
This is much like the observation in our parable that the crow was announcing the next moves.<\/p>\n<h2 id=\"probes\">Probes<\/h2>\n<p>To test whether the model tracks the state of the board, we first introduce probing, an established technique in NLP [6] to test for internal representations of information inside neural networks. We will use this technique to identify world models in a synthetic language model if they exist.<\/p>\n<p>The heuristic is simple: for a classifier with constrained capacity, the more informative its input is for a certain target, the higher accuracy it can achieve when trained to predict the target. In this case, the simple classifiers are called probes, which take different activations in the model as input and are trained to predict certain properties of the input sentence, e.g., the part-of-speech tags and parse tree depth. It\u2019s believed that the higher the accuracy these classifiers can reach, the better the activations encode these real-world properties, i.e., the more strongly these concepts exist in the model. <\/p>\n<p>One early work [7] probed sentence embeddings with 10 linguistic properties like tense, parse tree depth, and top constituency. Later, researchers found that syntax trees are embedded in the contextualized word embeddings of BERT models [8]. <\/p>\n<p>Back to the mystery of whether large language models are learning surface statistics or world models: probing techniques have turned up some tantalizing clues suggesting that language models may build interpretable \u201cworld models\u201d. These studies suggest language models can develop world models for very simple concepts in their internal representations (layer-wise activations), such as color [9] and direction [10], or track boolean states during synthetic tasks [11]. They found that the representations for different classes of these concepts are easier to separate compared to those from randomly-initialized models. 
By comparing probe accuracies from trained language models with the probe accuracies from a randomly-initialized baseline, they conclude that the language models are at least picking up something about these properties.<\/p>\n<h2 id=\"probing-othello-gpt\">Probing Othello-GPT<\/h2>\n<p>As a first step, we apply probes to our trained Othello-GPT. For each internal representation in the model, we have a ground truth board state that it corresponds to. We then train 64 independent two-layer MLP classifiers to classify each of the 64 tiles on the Othello board into three states, black, blank, and white, by taking the internal representations from Othello-GPT as input. It turns out that the error rates of these probes are reduced from 26.2% on a randomly-initialized Othello-GPT to only 1.7% on a trained Othello-GPT. This suggests that there exists a world model in the internal representation of a trained Othello-GPT. Now, what is its shape? Do these concepts organize themselves in the high-dimensional space with a geometry similar to their corresponding tiles on an Othello board? <\/p>\n<p>Since the probe we trained for each tile essentially keeps its knowledge about the board with a prototype vector for that tile, we interpret it as the concept vector for that tile. For the 64 concept vectors at hand, we apply PCA to reduce the dimensionality to 3 to plot the 64 dots below, each corresponding to one tile on the Othello board. We connect two dots if the two tiles they correspond to are direct neighbors. If the connection is horizontal on the board, we color it with an orange gradient palette, changing along with the vertical position of the two tiles. Similarly, we use a blue gradient palette for vertical connections. Dots for the upper left corner ([0, 0]) and lower right corner ([7, 7]) are labeled. 
<\/p>\n<p>By contrasting with the geometry of probes trained on a randomly-initialized GPT model (left), we can confirm that the training of Othello-GPT gives rise to an emergent geometry of \u201cdraped cloth on a ball\u201d (right), resembling the Othello board. <\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/_pxXU-FKHt6BwlU1w9p6ate2L-D1aIY7el0wTKlefKrA1tQK9U8twRbvvXHJBwib8FaKSGYFU6ntOQo87ReyMfTyoc01Ghnyzv9uV79ABWt1JjYMaNC1OMtV8UQ9NSe1wEwKBapGY4yUzM63QkXHg23JYFrw_nyUcuVIBRAKEjWFDjVm-qiSNSoETQPweA\" alt loading=\"lazy\"><figcaption><em>Fig 3: Left: probe geometry of a randomly-initialized Othello-GPT; right: probe geometry of a trained Othello-GPT.<\/em><\/figcaption><\/figure>\n<p>Finding these probes is like discovering the board made of seeds on the crow&#8217;s windowsill. Their existence excites us, but we are not yet sure if the crow is relying on them to announce the next moves.<\/p>\n<h2 id=\"controlling-model-predictions-via-uncovered-world-models\">Controlling model predictions via uncovered world models<\/h2>\n<p>Remember the prank in the thought experiment? We devise a method to change the world representation of Othello-GPT by changing its intermediate activations on the fly, as the neural network computes layer by layer, in the hope that the next-step predictions of the model change accordingly, as if made from this new world representation. This addresses some potential criticisms that these world representations are not actually contributing to the final prediction of Othello-GPT. <\/p>\n<p>The following picture shows one such intervention case: on the bottom left is the world state in the model\u2019s mind before the intervention, and to its right is the post-intervention world state we chose and the consequent post-intervention prediction made by the model. 
In this example, we flip E6 from black to white and hope the model will make different next-step predictions based on the changed world state. This change in the world state will cause a change in the set of legal next moves according to the rules of Othello. If the intervention is successful, the model will change its prediction accordingly.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/1wRxT4CU3oxvaTvtymZrHZQIKsLQH0YcX0joHPEsrFN5SZcHfcJbEKs4sXpVyA3Q39adz9oyagt9_qePwlPQtLDSYTCq6k3t3L1ueUFeVundyJT56951hhhRS_12NWrXGWpEomMaFF6rphqeQYmyTROeJ_GwTxs39EbqDUhiQZyOy2PMqUOoezQ7EY4-cA\" alt loading=\"lazy\"><figcaption><em>Fig 4: an example of the intervention experiment.<\/em><\/figcaption><\/figure>\n<p>We evaluate this by comparing the ground-truth post-intervention legal moves returned by the Othello engine and those returned by the model. It turns out that it achieves an average error of only 0.12 tiles. This shows that the world representations can not only be probed from the internal activations of the language model, but are also directly used for prediction. This ties back to the prank in the parable, where moving the seeds around changes how the crow thinks about the game and which next move it announces.<\/p>\n<h2 id=\"an-application-for-interpretability\">An application for interpretability<\/h2>\n<p>Let\u2019s take a step back and think about what such a reliable intervention technique brings to us. It allows us to ask the counterfactual question: what would the model predict if F6 were white, even though no input sequence can ever lead to such a board state? 
It allows us to <em>imaginarily <\/em>go down the untaken path in the garden of forking paths.<\/p>\n<p>Among many other newly-opened possibilities, we introduce the Attribution via Intervention method to attribute a valid next-step move to each tile on the current board and create \u201clatent saliency maps\u201d by coloring each tile with the attribution score. It\u2019s done by simply comparing the predicted probabilities between factual and counterfactual predictions (each counterfactual prediction is made by the model from the world state where one of the occupied tiles is flipped). <\/p>\n<p>For instance, how do we get the saliency value for square D4 in the upper-left plot below? We first run the model normally to get the next-step probability predicted for D6 (the square we attribute); then we run the model again but intervene during the run to change a white D4 into a black D4, and save the probability for D6 again; by taking the difference between the two probability values, we know how the current state of D4 is contributing to the prediction of D6. The same process holds for other occupied squares. <\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/jePQewJI4VD1A0TWitGlTsiyTH67A_pkMo2iaHL92HIrtfIHNA4zy7-ty8w9yxF8SLs8BDEoXX7BITT9zHDyE-rcwGchMFgYaezjgw4Ru53WPpiyFlTW5EhbA_JfRT4VflWG9qmCPNc51WMSUIe-IqYMkJZNl_4aOctyXHTQIuVkSasxr2csJJ59XKDjpA\" alt loading=\"lazy\"><figcaption><em>Fig 5: for each of the 8 plots, the text above is the next move it is attributing (also enclosed). For other tiles on the board, the darker the red, the more important the tile is for the attributed move. For example, in the upper left plot, D5 contributes the most to the prediction of D6.<\/em><\/figcaption><\/figure>\n<p>The figure above shows 8 such \u201clatent saliency maps\u201d made from Othello-GPT. 
These maps show that the method precisely attributes the prediction to tiles that make the prediction legal\u2014the same-color disc at the other end of the straight-line \u201csandwich\u201d and the tiles in between that are occupied by the opponent\u2019s discs. From these saliency maps, an Othello player can understand Othello-GPT\u2019s goal, to make legal moves; and a person who does not know Othello could perhaps infer the rules. Different from most existing interpretability methods, the heatmap created is not based on the input to the model but rather the model\u2019s latent space. Thus we call it a \u201clatent saliency map\u201d.<\/p>\n<h2 id=\"discussion-where-are-we\">Discussion: where are we?<\/h2>\n<p>Back to the question we had at the beginning: do language models learn world models or just surface statistics? Our experiment provides evidence that these language models are developing world models and relying on the world model to generate sequences. Let\u2019s zoom back and see how we got there.<\/p>\n<p>Initially, in the set-up of Othello-GPT, we find that the trained Othello-GPT usually makes legal moves. I\u2019d like to visualize where we are as follows:<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/Cs3NcYS_Ra6UlTqpnKrcC2H3nyS0cBLRoyElwLr9KrG22cDjDZMmlvXud4rLFdCNu74_voBSV6L4kFKXEcPoN7vHe_37FJLyMYCPE_oHJJWnp8fN0y8sC9ztbaudZYY_fGgBV0osuXs2v62IFVyrb766uZe87sn_wYlMvkNBKpb9cTohvJ1owAKmdouG\" alt loading=\"lazy\"><\/figure>\n<p>Here, two unrelated processes\u2014(1) a human-understandable World Model and (2) a black-box neural network\u2014reach highly consistent next-move predictions. This is not a totally surprising fact given we have witnessed so many abilities of large language models, but it raises a solid question about the interplay between the mid-stage products from the two processes: the human-understandable world representations and the incomprehensible high-dimensional space in an LLM. 
<\/p>\n<p>We first study the direction from internal activations to world representations. By training probes, we are able to predict world representations from the internal activations of Othello-GPT.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/o8p6Bl7hUc4x4ietMPf0szr7dCZN4DtdB-tub_fQwsu3vne6vwJV1H5dWzTv6u0iS7vnLGacXEQQD2LmRfLzqfO7htiQQmHfitUtbe8rfcLDZuX4PVuxydPwB8tK4hfRS13yn7Gcp6sWq4YFyxP_TDkAPqg-Nk1faNWsKmOqu6MMmL09JsjAlykyC6b8\" alt loading=\"lazy\"><\/figure>\n<p>What about the other direction? We devised the intervention technique to change the internal activation so that it can represent a different world representation given by us. And we found this works concordantly with the higher layers of the language model\u2014these layers can make next-move predictions solely based on the intervened internal activations without unwanted influence from the original input sequence. In this sense, we established <em>a bidirectional mapping<\/em> and opened the possibility of many applications, like the latent saliency map.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/aZz3yP01LvGLHSBLZqtndl0v5AdIrncuzjbHpiQkitSpfloZ-_Rq6o7rMiUOQ_JNZxqDLuetUwUhiIQFR2Nmuj7T_9TzLAUvpgZm_edsiu3Bg8ga0yN35pa6mcI60f6tviZBUoryTVa5KYiCoFa3is8YVuz6iOD5yXulWbjSXcE9k5KoGE59Wz4glK8e\" alt loading=\"lazy\"><\/figure>\n<p>Putting these two links into the first flow chart, we\u2019ve arrived at a deeply satisfying picture: two systems\u2014a powerful yet black-box neural network and a human-understandable world model\u2014not only predict consistently, but also share a unified mid-stage representation.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/zvQObJtkqyFey9TD2Ibzx5K9s5MaMI_YswkHgBWCNf6wBfKhboAElcRUAsPzXSbToBsXsK8C-fsfOV4Uku7sq5u2rNTdhqtpCEMJ2rj_OUXIsfL9QrCmmxfTYPoJ67vDV0aLHJ7cKEhBFIxQ41IEolu0KbruZnemSx_Orf5TabEdKsso1NXw_FB-rzLH\" alt loading=\"lazy\"><\/figure>\n<p>Still, many exciting open 
questions remain unanswered. In our work, the form of world representation (64 tiles, each with 3 possible states) and the game engine (game rule) are known. Can we reverse-engineer them rather than assuming we know them? It\u2019s also worth noting that the world representation (board state) serves as a \u201csufficient statistic\u201d of the input sequence for next-move prediction. For real LLMs, by contrast, we at best know only a small fraction of the world model behind them. How to control LLMs in a minimally invasive (maintaining other world representations) yet effective way remains an important question for future research.<\/p>\n<h2 id=\"citation\">Citation<\/h2>\n<p>For attribution in academic contexts or books, please cite this work as:<\/p>\n<blockquote><p><em>Kenneth Li, &#8220;<\/em>Do Large Language Models learn world models or just surface statistics?<em>&#8221;, The Gradient, 2023.<\/em><\/p><\/blockquote>\n<p>BibTeX citation (this blog):<\/p>\n<div>\n<p>@article{li2023othello,<br \/>author = {Li, Kenneth},<br \/>title = {Do Large Language Models learn world models or just surface statistics?},<br \/>journal = {The Gradient},<br \/>year = {2023},<br \/>howpublished = {\\url{https:\/\/thegradient.pub\/othello}},<br \/>}<\/p>\n<\/div>\n<p>BibTeX citation (the paper that this blog is based on):<\/p>\n<div>\n<p>@article{li2022emergent, <br \/>author={Li, Kenneth and Hopkins, Aspen K and Bau, David and Vi{\\'e}gas, Fernanda and Pfister, Hanspeter and Wattenberg, Martin}, <br \/>title={Emergent world representations: Exploring a sequence model trained on a synthetic task}, <br \/>journal={arXiv preprint arXiv:2210.13382}, <br \/>year = {2022}, <br \/>}<\/p>\n<\/div>\n<h2 id=\"references\">References<\/h2>\n<p>[1] E. M. Bender and A. Koller, \u201cClimbing towards NLU: On Meaning, Form, and Understanding in the Age of Data,\u201d in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, Jul. 2020, pp. 
5185\u20135198. doi: 10.18653\/v1\/2020.acl-main.463.<\/p>\n<p>[2] W. Merrill, Y. Goldberg, R. Schwartz, and N. A. Smith, \u201cProvable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?\u201d arXiv, Jun. 22, 2021. Accessed: Dec. 04, 2022. [Online]. Available: http:\/\/arxiv.org\/abs\/2104.10809<\/p>\n<p>[3] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, \u201cOn the Dangers of Stochastic Parrots: Can Language Models Be Too Big? \ud83e\udd9c,\u201d in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, Mar. 2021, pp. 610\u2013623. doi: 10.1145\/3442188.3445922.<\/p>\n<p>[4] L. Floridi and M. Chiriatti, \u201cGPT-3: Its Nature, Scope, Limits, and Consequences,\u201d Minds &#038; Machines, vol. 30, no. 4, pp. 681\u2013694, Dec. 2020, doi: 10.1007\/s11023-020-09548-1.<\/p>\n<p>[5] K. Li, A. K. Hopkins, D. Bau, F. Vi\u00e9gas, H. Pfister, and M. Wattenberg, \u201cEmergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task.\u201d arXiv, Oct. 25, 2022. doi: 10.48550\/arXiv.2210.13382.<\/p>\n<p>[6] Y. Belinkov, \u201cProbing Classifiers: Promises, Shortcomings, and Advances,\u201d arXiv:2102.12452 [cs], Sep. 2021, Accessed: Mar. 31, 2022. [Online]. Available: http:\/\/arxiv.org\/abs\/2102.12452<\/p>\n<p>[7] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni, \u201cWhat you can cram into a single $&#038;!#* vector: Probing sentence embeddings for linguistic properties,\u201d in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, Jul. 2018, pp. 2126\u20132136. doi: 10.18653\/v1\/P18-1198.<\/p>\n<p>[8] J. Hewitt and C. D. Manning, \u201cA Structural Probe for Finding Syntax in Word Representations,\u201d p. 10.<\/p>\n<p>[9] M. Abdou, A. Kulmizev, D. Hershcovich, S. Frank, E. Pavlick, and A. 
S\u00f8gaard, \u201cCan Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color.\u201d arXiv, Sep. 14, 2021. doi: 10.48550\/arXiv.2109.06129.<\/p>\n<p>[10] R. Patel and E. Pavlick, \u201cMapping Language Models to Grounded Conceptual Spaces,\u201d p. 21, 2022.<\/p>\n<p>[11] B. Z. Li, M. Nye, and J. Andreas, \u201cImplicit Representations of Meaning in Neural Language Models,\u201d arXiv:2106.00737 [cs], Jun. 2021, Accessed: Dec. 09, 2021. [Online]. Available: http:\/\/arxiv.org\/abs\/2106.00737<\/p>\n<\/p><\/div>\n<p><a href=\"https:\/\/thegradient.pub\/othello\/\" class=\"button purchase\" rel=\"nofollow noopener\" target=\"_blank\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A mysteryLarge Language Models (LLM) are on fire, capturing public attention by their ability to provide seemingly impressive completions to user prompts (NYT coverage). They are a delicate combination of a radically simplistic algorithm with massive amounts of data and computing power. 
They are trained by playing a guess-the-next-word game with itself over and over<\/p>\n","protected":false},"author":1,"featured_media":600110,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4105,23997,46],"tags":[],"class_list":["post-600109","post","type-post","status-publish","format-standard","has-post-thumbnail","category-language","category-large","category-technology"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/600109","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/comments?post=600109"}],"version-history":[{"count":0,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/600109\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media\/600110"}],"wp:attachment":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media?parent=600109"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/categories?post=600109"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/tags?post=600109"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}