Two weeks ago, OpenAI released their new GPT version, “o1.” It is different from previous versions in that it works with an “inference” phase. In this phase, the model analyzes the input and generates potential responses, which enables it to produce more accurate and contextually appropriate answers. While users are presented with a live insight into the “thinking” process, the precise mechanisms are intentionally veiled, as seen from the often-obscure elements in that process. Raphael Wimmer has drawn a parallel to the “parallel construction” process in legal contexts and I think that is actually a good and justified analogy. It certainly doesn’t help the interpretability of OpenAI’s model.
But anyway, the question of course arises whether this truly is a first step on the second stage (of five stages) of the company’s plan to achieve AGI. Some early examples shared by users indeed seem to suggest that we are witnessing a new phase of LLM development. For example, the data scientist Kyle Kabasares showed in a viral clip how the o1-preview and o1-mini models managed to produce code that solved a problem that he had been working on for 10 months as a PhD student. This seems impressive indeed. And it raises the question of whether GPT-o1 might achieve similar results when confronted with tasks of other academic disciplines – perhaps in the humanities and social sciences.
Personally – working with both ancient and modern narratives – I wanted to know how well it could handle text of all kinds. I won’t get into the details of all the tasks it did not perform well on here. Suffice it to say that even GPT-o1 is not yet capable of reliably producing a sonnet with a German rhyme scheme, my own personal benchmark for testing LLMs. It does better than GPT-4o in that the inference phase almost always makes it recognize that it cannot simply follow the most common scheme in English, made known through poets like Shakespeare. But it still falls short of implementing a composition process that leads to a good result.
Also, my clear impression is that for many tasks surrounding creative writing, Anthropic’s Claude 3.5 Sonnet is still clearly superior – which is actually quite peculiar. Shouldn’t more planning result in better texts? The only reasons I can think of why it would not are: 1. Creative writing cannot be improved through planning, 2. The inferential processes used are not helpful for these kinds of tasks, 3. There is a deeper problem with the LLM when it comes to imitating deep structures of language that becomes visible through this failure. But this too is a topic for another occasion.
Ethan Mollick has rightly stressed on different occasions that what is crucial in uncovering the potential of GPT-o1 is finding the right task for it, a task where it can excel. And that is not that easy. Because you need to have a PhD-level problem lying around in the first place! And then it needs to be a problem of the kind that GPT-o1 can actually solve. And at this point it is hard to predict (for me at least) where it will do great and where it will fail. Over the last two weeks, I’ve tested both publicly available options of GPT-o1 on various tasks from my own research (with annoying pauses in between because I had reached my limit of messages). There were some cases where I thought the model would do well but the results were mixed. One such application of a “reasoning” LLM would be helping researchers develop a plan for their research project that follows Bayesian reasoning – which would be immensely helpful for so many disciplines. I will probably share my impressions on that in a separate post. For now I want to concentrate on the one task that I gave GPT-o1 where it really blew me away, far exceeding my expectations.
Playing around with GPT-o1 I was reminded of a point that Gašper Beguš had made, namely that metalinguistic analyses are great benchmarks for LLMs. As he and colleagues showed, it was only with GPT-4 that we got promising results in that direction. Personally, I have tasked various models with a specific analytical method that I have often employed in my own research. Its goal is to visualize the deep semantic structure of texts, showing how their propositions connect and form hierarchical structures of meaning.
Here is a small example from my new book on narratives in the letters of Paul, an analysis of the propositional structure of a single verse, Galatians 2:12, which reads in English translation (NIV) as follows:
“For before certain men came from James, he used to eat with the Gentiles. But when they arrived, he began to draw back and separate himself from the Gentiles because he was afraid of those who belonged to the circumcision group.”
Even if read in isolation, this short passage constitutes a miniature narrative – a text that consists of several events that are connected in temporal and logical ways. These sense-relations (“connections”) among propositions are often marked by so-called connectors (mostly conjunctions, adverbs, and prepositions), such as “for” (connecting the whole nexus of propositions to what preceded), “before,” “when,” and “because.” These connectors combine two propositional structures and assign specific semantic roles to them that can be described on an abstract level, such as “earlier event” (chronological) or “reason” (logical). These pairs of propositions can then enter themselves into new connections with other propositional structures – so that a hierarchical semantic structure emerges, as seen in the following figure (from p. 78 of my new book Paul the Storyteller; semantic labels in capital letters imply greater communicative weight, i.e., that they are focal):
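For readers who think more easily in data structures, here is a minimal Python sketch of how such a hierarchy could be modelled. It is my own illustration, not the format used in the book or in any particular software: the `Node` class, the helper functions, the segmentation into six propositions, and the role labels are all assumptions made for this example and do not reproduce the labels of the figure.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """A proposition (leaf) or a connection of propositions (inner node)."""
    role: str                      # semantic role; capitals = focal member
    text: str = ""                 # wording, only set for leaf propositions
    connector: str = ""            # connector introducing this member, if any
    children: list[Node] = field(default_factory=list)


def leaves(node: Node) -> list[Node]:
    """Return the leaf propositions in textual order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def depth(node: Node) -> int:
    """Number of levels in the hierarchy below and including this node."""
    return 1 + max((depth(child) for child in node.children), default=0)


def render(node: Node, indent: int = 0) -> list[str]:
    """Render the tree as an indented outline, one line per node."""
    line = "  " * indent + node.role
    if node.connector:
        line += f" [{node.connector}]"
    if node.text:
        line += f": {node.text}"
    lines = [line]
    for child in node.children:
        lines.extend(render(child, indent + 1))
    return lines


# Galatians 2:12 (NIV wording); all role labels below are illustrative only.
p1 = Node("time reference", "certain men came from James", "before")
p2 = Node("SITUATION", "he used to eat with the Gentiles")
p3 = Node("time reference", "they arrived", "when")
p4 = Node("ACTION 1", "he began to draw back")
p5 = Node("ACTION 2", "separate himself from the Gentiles", "and")
p6 = Node("reason", "he was afraid of the circumcision group", "because")

actions = Node("RESULT", children=[p4, p5])
core = Node("EVENT", children=[actions, p6])
block_a = Node("contrast: earlier conduct", children=[p1, p2])
block_b = Node("CONTRAST: later conduct", connector="but", children=[p3, core])
tree = Node("NARRATIVE", connector="for", children=[block_a, block_b])

if __name__ == "__main__":
    print("\n".join(render(tree)))
```

The point of the sketch is simply that pairs of propositions become nodes that can themselves enter into new connections, so that the two contrasting blocks only appear at the top of the tree.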
Personally, I find these kinds of visualizations incredibly helpful, especially when it comes to analyzing (small) narratives. For example, you can see at first sight that this miniature story consists of two larger blocks of events that stand in contrast with each other. For my field of biblical studies, I am of the opinion that this is one of the most important tools that scholars should learn to use. Let me just give two examples where it could clearly improve the present stage of research:
First, the secondary literature abounds with claims about the supposed “structure” of biblical texts. But the criteria used to identify these structural elements are often misleading or simply non-existent.
Second, especially in the genre of the biblical commentary we often encounter a verse-by-verse analysis of the text, which does not do justice to how the text as a whole functions. Many interpretations can be shown to be untenable if scrutinized through this lens, i.e., they presuppose propositional structures that cannot possibly be derived from the text in question.
Now, there are two reasons why I think this analysis of propositional macro-structures is an excellent benchmark for testing the metalinguistic capabilities of LLMs.
First, it is not a widely used method. It goes back to linguistic research known as “semantic structural analysis,” which was done with Bible translation as the ultimate goal in mind. (If you want to translate a book into basically all human languages, it is helpful if you can break down the meaning of these texts in a way that is relevant for all those target languages.) It was adopted and refined by Heinrich von Siebenthal, who taught me Greek, Hebrew, and Aramaic and who also incorporated lots of research into connectors, with this method now constituting the backbone for his “text grammar” (i.e., his grammar goes beyond the sentence level). My PhD thesis, published in 2020 (open access here), is, to my knowledge, the first work that systematically applies these categories to texts. This means that when LLMs try to carry out this method there is virtually no possibility of “contamination,” i.e., the LLM simply reproducing previous analyses that it has encountered as part of the training process. This is as close as it gets to a zero-shot metalinguistic task that is still relevant, i.e., not totally obscure or nonsensical!
Second, the task is not just potentially very significant for my field, and other fields working with texts, but it is also a quite complex task. It involves identifying textual structures that have propositional values, connecting the right propositions in the right way, and then even building up the hierarchical tree of propositions. I have taught this method to students and it takes them a whole semester to produce acceptable results.
So far, every LLM that I had tested failed miserably at carrying out such an analysis. Even Claude-3.5-Opus produces only results that are not really usable. Compare that to the following zero-shot attempt by GPT-o1!
Do you see the resemblance with my own visualization of the propositional structure above?
Again, this is the result after a single prompt, as you can see here! Initially, I had not expected GPT-o1 to be capable of implementing the method with so little instruction, which is why in my long experimental chat with it – which you can access here in its entirety – I began much more slowly, carefully nudging it toward the kind of analysis that I had in mind.
I continue to be impressed with the ease with which it incorporated my requirements, which became more and more specific. As you can see, I rarely even had to press the regenerate button. The following representation only has minor issues and I’d be satisfied with any PhD student who came up with something along these lines.
Note that during our chat GPT-o1 did have an interesting problem with φοβούμενος τοὺς ἐκ περιτομῆς, “because he feared those from circumcision.” Sometimes, the proposition popped up not on the last line but above the actions, even though I wanted the language model to stick to the textual order, not the chronology of events. When it was represented in the right place, the model seemed to shy away from labelling it as something like “later situation.” And this reluctance makes sense! For it is not at all a matter of course that the event of Peter entering a state of fear occurred between the arrival of certain people from James (P1 and P3) and his subsequent actions (P4 and P5). This depends, among other things, on the question – highly debated in scholarship – whether the “circumcision party” is identical with this group or not.
It is observations like this that I find so crucial. Because they best illustrate the role that GPT-o1 can currently play for such analyses. Exegesis – the interpretation of biblical texts – requires much more than just determining the propositional skeleton of these passages, even though it is an important fundamental step. And simply having such a depiction at one’s disposal, without the researcher understanding it and without them being able to critically compare it to the secondary literature, won’t really lead to anything. But if a user is competent and capable of producing such analyses themselves, they might become aware of potential issues in the text that deserve closer scrutiny.
Another instance like that in this analysis is the decision of the LLM to treat P4 and P5 as subsequent (overlapping?) actions. This might indeed be a case of contamination and not genuinely new realization, i.e., the LLM might simply have worked here with a common translation of the imperfect ὑπέστελλεν as signaling the commencement of a behavior. Still, seeing it laid out so explicitly before one’s eyes can be really helpful for researchers in reflecting upon what presuppositions about the text we take for granted.
I realized that too when I fed the chatbot the next example, Rom 7:9[-10a], another miniature story:
“Once I was alive apart from the law; but when the commandment came, sin sprang to life and I died” (NIV).
Note that we again have temporal (“when”) and logical (“but”) connectors. I was intrigued by the fact that on the lowest semantic level P3 (the revival of sin) and P4 (the death of the “I”) were analyzed as standing in a “cause-effect”-relationship, with the connector δέ (which I take to be adversative) being translated simply as “and.” It is by first having done your own analysis and then observing potential discrepancies that one might be directed toward relevant questions, such as where exactly the assumption of a causal relationship comes from (from the preceding verse 8!). Thereby, the collaborative work on the passage can sensitize the researcher to ask relevant questions about how the text functions in its broader context.
I then wanted to try a longer text, the famous “Christ hymn” in Philippians 2, verses 6-11, one of the most central Christological texts of the New Testament and a more elaborate Christ story. You can access different translations here – I hesitate to reproduce any specific one because the precise configuration of the very semantic deep structure that we are looking at here influences how one ends up translating many elements of this passage! Just when I wanted to hit “send” and was excited to see what GPT-o1-preview would do with these more than 70 words representing (in my estimation) 11 propositions – I hit my weekly limit and had to wait. The GPT-o1-mini model failed completely at replicating the analysis with this longer text. Here is the final result that I got once I had access to the better model again (showing here only the semantic labels):
This is actually quite close to my own representation, produced with the software HermeneutiX as part of my doctoral research, which you can take a look at here. There are things that I would quibble with. For example, the concessive relationship between P1 and P2 was overlooked. But, again, it was during the interaction with the chatbot that an informed and competent user would have had the chance to be alerted to certain key issues for interpretation, such as the question of whether P6 belongs to the action of P3 (coordinated with P4 and P5), as GPT had decided originally, or serves as background information for P7-P8 (as I think is correct and as it is depicted above). In other words, the competent researcher will be prompted to think about the key issues determining the structure of the text. Also, I think there is a good chance that such a researcher would be alerted by a label such as “ultimate” purpose in P11 to reflect upon the plot of the whole (predictive) story, i.e., whether the events of v. 10 and v. 11 constitute a sequence. Note that I had not given the chatbot a system of categories to choose from when assigning semantic labels. And at least some of the heuristic value of the conversation comes precisely from this, because it shows where the LLM struggles to find good designations or comes up with creative solutions that come as a surprise.
Something like that also happened to me when I gave it the text of Galatians 4:3-7, a rather complex story that Paul tells the Galatians about their own conversion, helping them to re-interpret the events in a very specific way – so that this new story can guide them in their present situation where some people exert pressure on them to become circumcised. Here is the (slightly tidied up) visualization of the deep semantic structure of this passage according to GPT-o1 – I really like “Conclusion” for P12 and “Implication” for P13-14 (these are not the standard labels for that relationship).
And again, we witness the talk about an “ultimate” purpose for P8. That’s an interesting interpretive choice because a lot of disagreement in the secondary literature on this passage actually boils down to the question of whether P8 is indeed a purpose that is implemented by employing P7 as a means or whether, alternatively, P7 and P8 are actually two independent purposes associated with P4.
Another aspect that struck me in this chat is how such conversations can help a researcher focus on key points where the direction of interpretation is defined. The language model did a great job in correctly identifying larger blocks of propositions whose relationship is relatively easy to determine (due to syntactical subordination, in particular). However, it hesitated (i.e., did not manage) to, in turn, determine the connection among these larger blocks:
· [P1 - P2]
· [P3 - [P4 - [P5-P8]]]
· [P9 - [P10 - P11]]
· [P12 - [P13 - P14]]
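To make the bracket notation of these four blocks a bit more tangible, here is a small Python sketch that parses it into nested lists and checks which propositions each block covers. Everything in it is my own illustration: the function names are hypothetical, and I assume that an innermost pair such as `[P5-P8]` stands for the whole span from P5 to P8, while outer brackets simply group a proposition with a nested block.

```python
import re


def parse(expr: str) -> list:
    """Parse e.g. '[P3 - [P4 - [P5-P8]]]' into nested lists: [3, [4, [5, 8]]]."""
    tokens = re.findall(r"\[|\]|P\d+", expr)   # hyphens only separate members
    if not tokens or tokens[0] != "[":
        raise ValueError(f"expected a bracketed block, got {expr!r}")
    pos = 0

    def group() -> list:
        nonlocal pos
        pos += 1                               # skip the opening bracket
        members = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                members.append(group())        # nested block
            else:
                members.append(int(tokens[pos][1:]))
                pos += 1
        pos += 1                               # skip the closing bracket
        return members

    return group()


def covered(node) -> set:
    """Proposition numbers covered by a block (assumption: a pair is a span)."""
    if isinstance(node, int):
        return {node}
    if len(node) == 2 and all(isinstance(m, int) for m in node):
        return set(range(node[0], node[1] + 1))
    return set().union(*(covered(m) for m in node))


def depth(node) -> int:
    """How many levels of bracketing a block contains."""
    if isinstance(node, int):
        return 0
    return 1 + max((depth(m) for m in node), default=0)


BLOCKS = [
    "[P1 - P2]",
    "[P3 - [P4 - [P5-P8]]]",
    "[P9 - [P10 - P11]]",
    "[P12 - [P13 - P14]]",
]

parsed = [parse(block) for block in BLOCKS]
all_covered = set().union(*(covered(p) for p in parsed))

if __name__ == "__main__":
    for block, tree in zip(BLOCKS, parsed):
        print(block, "covers", sorted(covered(tree)), "at depth", depth(tree))
```

Under these assumptions the four blocks together cover P1 through P14 without gaps. What the notation does not say is how the four top-level blocks relate to each other, and that is exactly the question the model left open.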
And it is indeed often (and also here) the case that larger interpretive disagreements do not stem from different opinions about the meaning of a single word – which is why such disputes cannot fruitfully be addressed by studies neglecting the larger semantic structure of the text! – but from different views on how these larger building blocks of the text interrelate. In this case, for example, one can ask what it is exactly that constitutes the basis for Paul’s conclusion in [P12 - [P13 - P14]]. Does he draw the inference from the new situation, the situation “transformation realized,” as the above depiction suggests? Or is it perhaps the content of the cry alone (P11) that allows for such a deduction (in which case [P12 - [P13 - P14]] would receive a place more to the right, deeper within the hierarchical structure)? Or is it perhaps the story as a whole that, in Paul’s view, allows for such a conclusion (P1-P11)? It is from first seeing the larger nexuses of propositions as building blocks that researchers can realize that they need to consider these fundamental structural questions – that’s not nothing! And I even thought the LLM’s own decision on this matter, when nudged to reconsider it, was quite interesting – and it was capable of revising the visual representation of the propositional structure on the basis of the change in interpretation:
Again, all these analyses and visualizations are far from perfect. And without the help of a competent user these results would be worse. They would also remain quite useless, since they do not yet constitute a full-fledged analysis of the “meaning” of these stories, a proper interpretation. (Though I want to test in the future to what extent GPT-o1 is capable of relating such analyses to existing interpretations in the secondary literature.)
Still, I find GPT-o1’s performance mind-blowing, to be honest. I still encounter colleagues who tell me that LLMs are basically inconsequential for our field because they “don’t know Greek.” And here you see a large language model performing a highly complex linguistic analysis of great relevance for one of the most basic tasks of our discipline, interpretation – on a level of accuracy and sophistication that most doctoral students, even those who were taught the method well, would struggle to compete with. That is, in my opinion, quite remarkable.