A recent preprint study by two researchers at the University of California San Diego tested four conversational AI systems, with OpenAI’s GPT-4.5 deemed indistinguishable from a human more than 70% of the time.
The validity of the Turing test as the ultimate indicator of machine intelligence is questioned, given its contentious history and long-standing doubts about whether it actually measures machine intelligence at all.
The study, by Cameron Jones and Benjamin Bergen, covered four systems: three LLMs (GPT-4o, LLaMa-3.1-405B, and GPT-4.5) plus ELIZA, a 1960s rule-based chatbot included as a baseline.
Participants interacted with two witnesses simultaneously, one human and one AI system, across eight rounds of conversation, and had to judge which was the human and which was the chatbot.
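To make the format concrete, here is a minimal Python sketch of one such round; the class and function names, the canned questions, and the random verdict are illustrative assumptions, not details taken from the paper.

```python
import random

class Witness:
    """A witness in the three-party test: either a human or an AI system."""
    def __init__(self, name: str, is_human: bool):
        self.name = name
        self.is_human = is_human

    def reply(self, message: str) -> str:
        # Stand-in for a human typing or a model API call.
        return f"{self.name}: reply to {message!r}"

def run_round(human: Witness, model: Witness, turns: int = 5) -> bool:
    """One round: the interrogator quizzes both witnesses, then guesses.
    Returns True if the guess correctly identifies the human."""
    witnesses = [human, model]
    random.shuffle(witnesses)  # hide which chat window is which
    for turn in range(turns):
        question = f"question {turn}"
        for w in witnesses:
            _ = w.reply(question)  # transcript shown to the interrogator
    # Placeholder verdict: a real interrogator judges from the transcript;
    # here we guess at random just to make the loop runnable.
    guess = random.choice(witnesses)
    return guess.is_human

# Each participant completed eight such rounds; the headline figure is the
# share of rounds in which a model was judged to be the human.
correct = sum(run_round(Witness("person", True), Witness("model", False))
              for _ in range(8))
print(f"human correctly identified in {correct}/8 rounds")
```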
GPT-4.5 was judged to be human 73% of the time, while LLaMa-3.1-405B was considered human 56% of the time.
The Turing test, introduced by Alan Turing in his 1950 paper “Computing Machinery and Intelligence,” aims to determine whether a machine can exhibit behavior indistinguishable from a human’s through an imitation game.
The Turing test faces several long-standing objections: that it probes behavior rather than thinking, that brains are not comparable to machines, that a computer’s internal operations differ fundamentally from human cognition, and that testing behavior alone is too narrow a measure.
On this view, the Turing test measures substitutability, how convincingly a machine can stand in for a person, rather than intelligence itself: it rewards the imitation of human intelligence, not its possession.
While GPT-4.5 passed the Turing test in the study, passing does not necessarily mean the model is as intelligent as a human.
The study’s conditions, such as the five-minute conversation window and the fact that the models performed best when prompted to adopt a humanlike persona, raise questions about how much the result reveals about intelligence.
In conclusion, while GPT-4.5 can convincingly mimic human conversation, the study’s limitations mean its result should not be read as evidence that the model is as intelligent as a human.