In the early nineties, when millions of computers were connected to form the internet, the moment ushered in a digital revolution promising the democratization of access to information, and metaphors such as the 'information superhighway' became synonyms for the newly connected interwebs. I was a young graduate student at the time, and what excited me about the internet was not in fact the analogy to driving, or rapid information transfer, but the experience of it as a messy, unregulated, anarchic door-opening, almost like the multiverse or a piece of untidy knotted string, which you could follow in your own idiosyncratic way. I know, very GenX of me.
Even though we did not know exactly how it would play out long term, I think we all sensed at the time that our way of storing, accessing and searching for information had changed irrevocably from that moment on. I think we are facing a similar informational tipping point now, but perhaps not in the way that many folks are imagining. Nobody wants to go back to index cards in physical libraries, but the question is whether our methods for searching for information are about to make another leap of improvement, made faster and more efficient by posing questions to an interactive chatbot. This is precisely what Big Tech is telling us we need, and the companies are currently fighting each other to be first to roll out the next generation of search-technology chatbots (Bing vs. Bard). These applications are fed by massive natural language models (such as OpenAI's), which, because of the trillions of words they are trained on, can generate plausible grammatical responses to our queries and potentially summarize information from that huge pool of networked text. When it comes to pure search functionality, though, there are good reasons to believe that the ways in which people actually search, and the way information-search interacts dynamically and dialectically with other cognitive goals such as learning and exploration, will not all be equally well served by the 'ask a bot' model of communicative interaction. (See this article by Emily Bender and Chirag Shah for a discussion of the issue: https://dl.acm.org/doi/pdf/10.1145/3498366.3505816 .)
But I darkly suspect that helping people to search for information is not the 'selfless' goal in the name of progress that these tech companies claim to be pursuing. In other words, Bing and Bard are not the only uses to which OpenAI's technology is going to be put. The developers of the natural language models that make ChatGPT possible will sell that technology to others, and it will be modifiable beyond the constrained, prettily guardrailed version that underpins ChatGPT itself.
There is no doubt that ChatGPT, the interactive content generator, has taken the world by storm since its launch in November 2022, and its ability to produce plausible and seemingly helpful text has been massively impressive. There's been hype; there's been backlash. There have been hard questions asked and performance flaws exposed, leading to fixes and improved guardrails. In this blog post I will summarize some of the major worries that have been aired and then go on to emphasize what I take to be the most serious threat the technology poses if it is not regulated now. Some of these worries have already appeared on social media and in published sources, which I will try to indicate as I proceed. But the bottom line is going to be a version of my very own dystopian worry, and it involves the experiment of thinking consequentially about what will happen to information itself when more and more content-carrying text is handed over to artificial intelligence and dissociated from actual minds. Call it the Semanticist's Take.
The Robots Are Coming Worry
So maybe you think I am going to go for the-chatbots-will-become-sentient-and-try-to-destroy-us worry (think HAL, or the latest behaviour of Bing). Or the gentler sci-fi version where we potentially embrace new forms of sentience and come to understand and welcome them in our shared cognitive future. But both these scenarios are just a form of the HYPE. No! These chatbots understand nothing. They scrape content produced by actual minds and regurgitate it in statistically acceptable-sounding forms. They have no notion of truth or 'aboutness', let alone emotion. The fact that they seem to is due to echoes from all the human texts they have consumed, and testimony to our own human response mechanisms, which impute content and feeling and make the assumption of 'another mind' when faced with language produced for us.
The March of Capitalism Worry
There is an actual, real worry here: that genuine Bad Actors (humans working for capitalist organizations trying to earn money for their shareholders) will use this technology to continue taking over the world, controlling and curating creative content (text, images, tunes) and relegating actual humans to poorly paid monitors and editors with no job security or health insurance. But that is more a continuation of our present capitalist dystopian political reality than science-fiction woo hoo.
The Bias and Toxic Content Worry
Here's another concern that has rightly made the rounds. Because it is cannibalized from human content, chatbot content will repeat and recycle all the racist, misogynistic, toxic and otherwise questionable biases of the humans who created it. Huge amounts of resources will have to be spent regulating these AI technologies if they are to come equipped with 'guardrails'. Even scarier is the thought that many purchasers of this technology will not equip their use of it with guardrails at all. The fact is that this technology has not been created by public funds, or non-profit universities, or even governments answerable in principle to an electorate. No, these applications have been created by, and are owned by, private companies whose only aim is to make money with them.
Here is Timnit Gebru on why big tech cannot be trusted to regulate itself.
Also Gary Marcus recommending the pause button: Is It Time to Hit the Pause Button on AI?
ChatGPT Inherently Does Not Know What Information Is
As a semanticist, I regularly have to think about meaning and what it means for something to have meaning. Formal semanticists ground their theories of meaning in some kind of model of reality: to give a theory of meaning in language, you cannot simply redescribe it in terms of language itself; there needs to be a reckoning, a final 'reality' check in terms of, well, Reality (or at least the thing(s) we humans apprehend as Reality). Actually, the way I like to think about it is more along the lines of Mark Dingemanse's statement that language is the best mind-to-mind interface we have. The important next step is to realise that language does this by anchoring itself in consequences for the shared reality in which the two human minds are embedded. There is an aboutness to language that is a crucial design feature, and theory of mind is one of the cognitive abilities humans need to decode it. You need to know/assume/understand/trust that there is another mind there, apprehending the same common reality as you are and labeling it in similar ways.
Take ChatGPT now. ChatGPT has no Theory of Mind (Gary Marcus again, on testing ChatGPT), and it has no notion of any kind of reality or 'aboutness' in what it is generating. This means that it does not actually understand anything. It has no connection to truth. All it is doing is scraping content in the form of text and generating plausible natural language sentences from its training material. It repeats and recycles but does not genuinely infer (it is bad at math and reasoning). It also cannot, as a matter of principle, distinguish a fact from a non-fact. It produces false citations and false data unnecessarily and gratuitously, although it most often repeats correct things if that is where its most statistically likely sentence comes from.
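The 'statistically likely sentence' point can be illustrated in miniature. Here is a toy bigram model (a hypothetical sketch; real large language models use neural networks trained on trillions of words, not word-pair counts, but the point carries over): it learns only which word tends to follow which, and generates fluent-looking strings with no representation of truth or aboutness whatsoever.

```python
import random
from collections import defaultdict

# A tiny 'training corpus' standing in for the internet's text.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog .").split()

# Record, for each word, which words have followed it.
following = defaultdict(list)
for w1, w2 in zip(corpus, corpus[1:]):
    following[w1].append(w2)

def generate(start, n=8, seed=0):
    """Extend `start` by repeatedly picking a statistically attested
    next word. Nothing here checks whether the result is true."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(n):
        options = following.get(words[-1])
        if not options:
            break
        words.append(rng.choice(options))
    return " ".join(words)

print(generate("the"))  # grammatical-sounding, but 'about' nothing
```

The output is locally plausible because every transition was seen in the training text; whether the resulting sentence describes anything real is simply not a quantity the model computes.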
Emily Bender, also a linguist, has been a tireless campaigner against the breathless hype over large language models, even before the launch of ChatGPT in November. Read her viral article about Stochastic Parrots here.
Ok, so one could imagine building an interactive search engine that was instructed only to summarize, and where, in addition, all the sources were vetted and verified. However, the technology as we see it now seems to hallucinate content even when it could not possibly have grabbed it from somewhere unreliable. It is unclear to me why the technology does this, or whether it can be fixed. Is it due to a built-in feature that tells it not to repeat verbatim because of plagiarism risk, or to the kinds of information compression and decompression algorithms being used? Hallucinated content from chatbots means that even if you tell the search engine to search only a particular list of reputable sources, it could still give you erroneous information.
It is apparent to me, and to every serious scientist, that we would never use ChatGPT as our search engine for anything we need to find out in our academic field. Nor is it clear to me, at any rate, that I need my search interface to come in this impressively plausible linguistic form at all. I do not think, in other words, that universities and libraries should be racing to use, modify, or invent their own versions of Bing or Bard to search scientifically curated content. We know that developing a natural language model on this scale is extremely expensive. The reality is more likely to be that once Microsoft has developed it, it will sell the technology to everyone else, and we will feel that we need it so much that we will rush to buy it.
Who is going to buy the technology? And what are they going to use it for in the future? It is already being used by some companies to generate content for articles in online magazines (leading famously to retractions when the content was not sufficiently overseen by a human), and by all kinds of folks to write summaries for meetings and presentations. It will also no doubt be used to produce advertising texts and disinformation texts, which will run rampant over the internet. We already have a problem with disinformation and unverifiability on the internet, and these problems will increase exponentially, since the present technology is much more believable and also, crucially, automatizable. Not only will the content so produced not be verified, it will also be increasingly non-verifiable, since these very helpful chatbots will be the ones you turn to to find out whether the sources check out. As we have seen, ChatGPT regularly and authoritatively spits out totally made-up citations.
One can fondly imagine that some other tech bros will invent software to detect whether something has been written by AI, but it will be a moving target: so many different versions out there, and next-generation versions that can cleverly outwit the automated checkers in a spiralling arms race of ever-increasing subtlety. That way lies madness.
As more and more people use this technology to generate content, whether with the best of intentions or the worst (and we would be naïve to assume that Microsoft is not going to sell its new toy to anyone willing to pay for it), I predict that in the next few years the information highway is going to be more and more littered with content created by artificial intelligence (think of plastic as a tempting environmental analogy).
The problem is that this is simply not information any more.
It is faux-information.
It is content which bears some sort of causal relationship to information, but where the relationship is indirect and untrustworthy.
What is going to happen when the information superhighway is contaminated with about 5 percent faux-information? What about 10 percent? 50 percent? What is going to happen when half of the content ChatGPT is scraping its 'information' from is itself AI-generated scraped content? Will the hallucinations start to snowball?
Here's my prediction. We will lose the small window we currently have for governments to regulate, and in five years' time (maybe more, maybe less) the internet superhighway will be more like something out of a Mad Max movie than a place where you can find information about how to fix your fridge yourself.
AI will have consumed itself and destroyed the very notion of information.
(Well, at least the idea that you can find information on the internet.)
So the problem is NOT: how can we get this great new thing for ourselves and adapt it so that it does the good stuff and not the bad stuff? The problem is what happens when this thing is let out of the box. In five or ten years' time, how will we be able to distinguish the content from the faux-content from the faux-faux-content, using search applications that also have no idea?
For those of us who watched the dumpster fire that consumed Twitter a couple of months ago, this is going to be similar, and for similar reasons: wilful lack of regulation, but now exacerbated by automated plagiarism generators. Bigger. Maybe slower to unfold. And we are sleepwalking into it.
There is hope for the preservation and advance of human knowledge, at least if publicly funded universities and research institutions band together now to safeguard the content they currently house (physically and digitally) in the form of libraries. There are two aspects to this: (i) we need to keep making principled decisions about what we allow to be searchable, and (ii) we need to create our own search engines for searching that content. We should not make the mistake of trying to use OpenAI technology to do this, because plausible linguistic interaction or essay-writing ability is not what we need here. We just need slightly better functionality than current indexing systems, otherwise we will lose out to the bots. No need for plausible human interactive language, just a much simpler ability whereby the search interface repeats core findings verbatim and shows us the actual citation. Creating this kind of search engine (owned publicly, not by Google or Microsoft) would be far less resource-intensive than employing large language models. And arguably more scientifically useful.
We need to build good search engines that are NOT Artificial Intelligence machines, but computer data sifters and organizers designed to aid intelligent human agents. These search applications need to be publicly owned and maintained and open access.
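The data-sifter idea above can be sketched very simply. Here is a minimal, hypothetical inverted index over a vetted corpus (the documents and citations are invented for illustration): it never generates text, it only returns matching sentences word-for-word together with their source, which is exactly the property a chatbot cannot guarantee.

```python
from collections import defaultdict

# A vetted corpus: each verbatim sentence is keyed by its citation.
# (Sources are made up for the sake of the example.)
documents = {
    "Smith 2021, p. 4": "The fridge compressor cycles on a thermostat relay.",
    "Jones 2019, p. 12": "Relay failure is the most common compressor fault.",
}

# Inverted index: each word maps to the set of citations containing it.
index = defaultdict(set)
for citation, sentence in documents.items():
    for token in sentence.lower().strip(".").split():
        index[token].add(citation)

def search(query):
    """Return (verbatim sentence, citation) pairs matching every query word.
    No paraphrase, no generation: only exact stored text comes back."""
    hits = [index.get(t.lower(), set()) for t in query.split()]
    matched = set.intersection(*hits) if hits else set()
    return [(documents[c], c) for c in sorted(matched)]

for sentence, citation in search("compressor relay"):
    print(f'"{sentence}" ({citation})')
```

Because the only operations are lookup and intersection, the system can never emit a sentence or a citation that is not already in the curated collection; hallucination is ruled out by construction rather than by guardrails.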
The only people who are going to have the will or motivation to do this are the public universities, and we may need to work together. Everyone else is compromised (or drinking the Kool-Aid), including many of the world's governments.
Now I know you are all probably thinking I am a paranoid, overreacting GenXer who is just yearning for a return to the internet of the nineties. Like every other middle-aged person before me, I am being negative about change, and the past was always golden. ChatGPT is great! We can use it for good.
I really really really hope you guys are right.