The spoken word
AI blurs the line between the written and the spoken word. Anything written can be spoken, and anything spoken can be written. In what follows, I want to outline what this means, with a particular focus on the consequences of being able to convert audio to text at scale.
Thamus and Theuth
In Plato's Phaedrus, Socrates tells the story of the Egyptian god Theuth, who presented writing to King Thamus as one of his most important inventions, "the magic key to memory and wisdom". Thamus disagreed and said that writing would destroy memory and proper learning.
Automatic Speech Recognition (ASR) has existed for decades, but the advent of modern AI brought immense improvements. OpenAI's launch of ChatGPT received wide attention, but less attention was paid to the release of its speech recognition model, Whisper. Whisper was trained on 680,000 hours of multilingual and multitask supervised data and is released under an open source license. The results are excellent and often exceed the abilities of previous models to accurately transcribe audio to text across diverse datasets.
Anyone with access to Whisper can now take an audio recording, whether from a video, an audio file or a live recording, and quickly turn it into text to print, read and modify at will. The implications of this are enormous. We have had stenographers for a long time, but we never had the ability to convert the spoken word to written text so quickly, cost-efficiently and at scale.
At least since Gutenberg printed his Bible, the written word has been paramount in transferring knowledge. Granted, for a while now we have been able to distribute audio and video as a means of disseminating knowledge. However, written text remains the primary way we produce, share and consume knowledge.
Use cases for transcription
Audio transcription offers numerous benefits across various domains and industries. One of the primary advantages is increased accessibility, as transcribing audio content makes it available to a wider audience, including people with hearing impairments. By providing a textual representation of the audio, transcripts allow people to consume the information without relying solely on their ability to hear. Moreover, having a transcript alongside the audio can significantly enhance comprehension, particularly when dealing with complex or technical content. This combination lets the audience reinforce their understanding by reading along while listening.
Transcripts also play a vital role in content indexing and searchability. They offer a searchable version of the audio, making it easier to index and find relevant information within the content. This is particularly valuable for managing large audio archives, podcasts, webinars, and other media where users need to quickly locate specific sections.
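To make the indexing idea concrete, here is a minimal Python sketch of a keyword index over a transcript. It assumes the transcript has already been split into timestamped segments, roughly the shape that speech recognition tools such as Whisper can return; the segment data and the helper names `tokenize`, `build_index` and `search` are hypothetical and only for illustration. A search returns the start times of the matching segments, so a listener could jump straight to that point in the recording.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(segments: list[dict]) -> dict[str, list[int]]:
    """Build an inverted index: each word maps to the indices of the segments containing it."""
    index: dict[str, list[int]] = defaultdict(list)
    for i, segment in enumerate(segments):
        # Use a set so a word repeated within one segment is recorded only once.
        for word in set(tokenize(segment["text"])):
            index[word].append(i)
    return dict(index)


def search(index: dict[str, list[int]], segments: list[dict], query: str) -> list[float]:
    """Return the start times (in seconds) of segments that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return [segments[i]["start"] for i in sorted(hits)]


# Hypothetical transcript segments; real ones would come from a speech recognition tool.
segments = [
    {"start": 0.0, "end": 4.2, "text": "Welcome to the podcast."},
    {"start": 4.2, "end": 9.8, "text": "Today we talk about speech recognition."},
    {"start": 9.8, "end": 15.1, "text": "Speech recognition turns audio into text."},
]
index = build_index(segments)
print(search(index, segments, "speech recognition"))  # [4.2, 9.8]
```

A real archive would more likely rely on a full-text search engine with stemming and ranking; the plain inverted index here just keeps the idea visible.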
For creators and content producers, transcriptions are invaluable tools for reviewing and editing their audio content. They can identify errors, make necessary improvements, and maintain accuracy in their work. Transcripts also cater to language learners, offering a helpful resource for following along with the audio, improving pronunciation, vocabulary, and language comprehension. The question arises whether language learning will still be important in the future as new AI models don’t just transcribe, but also translate.
In professional settings, audio transcriptions have various practical applications. For legal purposes, transcriptions create official records of meetings, interviews, court proceedings, and depositions, aiding in legal documentation and compliance. Moreover, in market research and qualitative studies, transcriptions of interviews or focus groups facilitate data analysis. Researchers can easily code and categorize the content for further investigation, saving time and effort.
Transcriptions also support multilingual audiences, as translated versions of the transcripts allow content to transcend language barriers and reach global audiences. Overall, audio transcription serves as a valuable tool, offering a wide array of benefits that enhance accessibility, comprehension, information retrieval, and content distribution across numerous industries and applications.
The possibilities
We can take this further and consider all the conversations that can now be turned into text: therapy and coaching sessions, lectures, voice messages and more. Sometimes it is easier to talk and let the machine turn our speech into text, for example when dictating a chat message while multitasking. Other times the spoken word is all we have. Imagine a therapy session in which the therapist and patient talk for up to an hour. Details often get lost, and even if the patient wishes to revisit certain points, they can only do so to the extent their memory permits. If we had an audio recording of such a conversation, we could easily transcribe it into text. Not only could we re-read what the therapist said, but we could dig deeper, look up related literature, read summaries and review it with a third party.
There are, of course, times when we would rather just forget a conversation, and reading its transcript is the last thing we want.
Dave Chappelle already had a skit about this 20 years ago.
Newer language models such as ChatGPT and Bard excel at processing and generating text. So once audio has been converted, we can post-process the content further with AI, which invalidates one of Socrates' concerns.
“Writing is a crutch”
"I cannot help feeling, Phaedrus, that writing is unfortunately like painting; for the creations of the painter have the attitude of life, and yet if you ask them a question they preserve a solemn silence." (source) Today, we can feed large volumes of text, such as books or long PDFs, into LLMs and question the text through these interfaces. Text is more alive, moldable and accessible than ever before.
Thamus raised another concern about writing. “For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory.” (source)
Conversation is a key organisational tool for humans. By thinking together and sharing our wisdom, we can unlock a collective force. Conversation is necessary for building relationships and for leadership; it can help solve problems, improve decision making, generate ideas, support coaching and mentoring, and improve understanding and sensemaking. Consider how opponents of remote work always bring up the lack of "hallway chats" that spark new ideas and help coordinate projects in the office. They bemoan the lack of informal conversation in remote work environments and argue that it cannot be replaced by chat or other online collaboration tools. On the other hand, writing is recommended as a reliable method to organise thoughts and reflect on ideas.
Looking back more than 2,000 years, it is safe to say that the ability to write has helped form and share incredible ideas and thoughts that improved the state of mankind. Our ability to transcribe the spoken word into text will only strengthen this.
If you want to test converting the spoken word into written text, give Transcription Tom a try.