While artificial intelligence excels at tasks like coding and podcast generation, it struggles to accurately answer high-level historical questions, according to a study.
The researchers tested OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini using a newly developed benchmark called Hist-LLM.
The benchmark relies on the Seshat Global History Databank, a comprehensive database of historical knowledge.
The study, which was presented at the NeurIPS AI conference last month, found disappointing results, according to TechCrunch.
GPT-4 Turbo performed best but achieved only about 46% accuracy, barely above random guessing.
“LLMs, while impressive, still lack the depth required for advanced history,” said Maria del Rio-Chanona, a co-author of the paper and associate professor at University College London.
“They are great at basic facts, but fail at PhD-level historical research.”
The researchers found that LLMs often extrapolate from salient historical data but struggle with more obscure details.
For example, GPT-4 incorrectly stated that scale armor was present in ancient Egypt during a specific time period, when in reality, the technology only appeared 1,500 years later.
Similarly, the model falsely claimed that ancient Egypt had a professional standing army during a certain period, likely because information about standing armies in other ancient empires, such as Persia, is far more widely available.
“If you’re told A and B 100 times, and C only once, you’re more likely to remember A and B,” del Rio-Chanona explained.
Another concern was potential bias.
OpenAI’s GPT-4 and Meta’s Llama models performed worse when answering questions about regions such as Sub-Saharan Africa, indicating the limitations of the training data.
“These biases suggest that LLMs reflect gaps in the historical record rather than an unbiased representation of history,” said Peter Turchin, the study’s lead researcher.
Despite these limitations, researchers remain hopeful that AI can help historians in the future.
They plan to improve the Hist-LLM benchmark by including more diverse data sources and increasing the complexity of the questions.
“Our findings highlight areas where LLMs need improvement, but they also demonstrate their potential to support historical research,” the paper concluded.
As AI continues to evolve, experts say it’s clear that human historians remain irreplaceable in interpreting complex historical narratives and ensuring accuracy in academic research.