Background

Accessibility is now an important aspect of digital learning. We need to take accessibility seriously both to satisfy the needs of an increasingly diverse student body and to meet the requirements recently brought into law. Of course, digital learning often encompasses a wide variety of resources in a range of media. The challenge of bringing all these resources in line with regulations is considerable, on both a technical and an organisational level. Fortunately, technology can help to ease the burden, with a number of integrations available to help examine and fix content-related accessibility issues.

One particularly large challenge, and one where technology is especially helpful, is video. While it is possible to watch and transcribe a single video manually, when faced with a library of nearly 8000 hours of video, the challenge becomes insurmountable! This is where technology can step in: it can automate the process and reduce the number of person-hours required.

For quite some time, YouTube has been able to automatically caption videos. In the past, however, the transcriptions produced by the algorithms have often been the subject of ridicule for the sometimes bizarre and hilarious interpretations. Thankfully things have moved on considerably, with increasingly advanced AI and machine learning helping to increase the reliability of computer transcription.

For the majority of our video content, we rely upon a home-spun system composed of a Wowza Streaming Media server and a custom-built front-end to manage content and metadata. While this system has the facility to allow subtitles to be added, it does not feature any way to automate the process of creating transcriptions. For this reason, we are currently investigating our options, with a view to either hosting our video content elsewhere or improving our current provision by implementing auto-transcription facilities.

The contenders

We have been investigating a few services to judge the accuracy of the transcription. We have tried each service with the same videos to see how accurately they can transcribe a variety of media content. Below are some details of three services we are currently examining.

Mozilla Deepspeech

An open-source option that can be run on-premises, Deepspeech requires a certain amount of technical skill in deploying and managing Linux servers. As it is open-source and community-driven, the more effort you put in, the better the output will be. It allows you to train your own neural network, so it is theoretically possible to improve transcription accuracy, although this may require a large investment of time and effort. As we are simply testing the out-of-the-box abilities, we have used the default models provided by the developers.

Google Speech to Text Engine

This is an API made available through the Google Cloud Platform. The service itself is used by YouTube to provide auto-transcriptions of uploaded videos. While using it through YouTube is free at the point of upload, utilising the API in your own projects can cause costs to rack up quickly (and remember that we have 8000 hours of video sitting on our servers, waiting to be transcribed). The pricing options are transparent, however, so we can easily calculate the cost of transcribing all of our existing content.

Amazon Transcribe

This cloud service is utilised by Amazon’s virtual assistant “Alexa” and works in a similar way to Google’s offering, with transcription charged based upon the number of seconds of content transcribed. The service is used by the content capture service Echo 360 to transcribe material. By our rough calculations, transcribing our 8000 hours of content through Amazon would be a little cheaper than through Google.
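A rough estimate of this kind comes down to simple arithmetic. The Python sketch below shows the shape of the calculation only: the service names and per-minute rates are made-up placeholders, not published prices, and would need to be replaced with figures from each provider's current pricing page.

```python
# Size of the video library quoted in this post.
LIBRARY_HOURS = 8000

# ASSUMPTION: placeholder per-minute rates for illustration only.
# These are NOT real prices for any provider.
HYPOTHETICAL_RATES_PER_MINUTE = {
    "service_a": 0.020,
    "service_b": 0.025,
}


def estimate_cost(hours, rate_per_minute):
    """Return the estimated cost of transcribing `hours` of audio."""
    minutes = hours * 60
    return minutes * rate_per_minute


# Estimated cost of transcribing the whole library with each placeholder rate.
estimates = {
    name: estimate_cost(LIBRARY_HOURS, rate)
    for name, rate in HYPOTHETICAL_RATES_PER_MINUTE.items()
}

for name, cost in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name}: {cost:,.2f}")
```

Because both services charge by duration, the comparison scales linearly: halving the amount of content to be transcribed halves every estimate.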

The results

Here are some example transcriptions of one short piece of video content:

Mozilla Deepspeech

so wee al seend apisode of the dragon tf dend where the ontroprenel holks in with a really great idea good looking numbers the dragons e recing out their hands and then one of the dragons pipes up let see your contract and os soddenly ontrepenelox exposed because they thought they had a contra they don’t what they have iser some verbal understanding your colercial contracts are really important to you business mey should be kept clear concise so the point to add value when seeking in bestment wor in ed if you come to sellin a business also commercial contracts areningportant to the void conslote because both sides of the contract should now wot their obligations are a more their rights are

Google Speech to Text (through YouTube)

so we’ve all seen episodes of the Dragons Den where the entrepreneur walks in with a really great idea good-looking numbers the Dragons are eating out their hands and then one of the Dragons pipes up let’s see your contract and all the sudden the entrepreneur looks exposed because they thought they had a contract they don’t what they have is a some verbal understanding your commercial contracts are really important to your business they should be kept clear concise to the point to add value when seeking investment or indeed if you come to sell the business also commercial contracts are really important to avoid conflict because both sides of the contract should know what their obligations are and what their rights are

Amazon Transcribe

So we’ve all seen episodes of the Dragon’s Den, where the entrepreneur walks in with a really great idea, good looking numbers that dragons reaching out their hands. And then one of the dragons pipes up. Let’s see your contract over something. The entrepreneur let’s exposed because they thought they had a contract. They don’t. What they have is a some verbal understanding your commercial contracts of really important to your business. They should be kept clear, concise to the point. Add value when seeking investment, or indeed, if you come to sell the business. Also, commercial contracts are really important to avoid conflict because both sides of the contract should know what their obligations are, what their rights on.

Conclusion

As you can see from the output above, while the Mozilla software makes a good guess at a lot of the content, it also gets confused in other parts, inventing new words along the way and joining others together to form a rather useless text that does not represent what has been said at all well. I’m sure its abilities will improve as more time is spent by the community training the neural network. However, Google and Amazon clearly have the upper hand – which is not surprising, given their extensive user base and resources. 

While Amazon Transcribe makes a very good attempt, even adding punctuation where it predicts it should appear, it is not 100% accurate in this case. Some words are misinterpreted and others are missing. However, in the main, the words that are confused are not essential to the understanding of the video.

Google Speech to Text makes the best attempt at transcribing the video, getting all words 100% correct, and even adding capital letters for proper nouns that it clearly recognises. There are options to insert punctuation when using the API, but this feature is not available in the YouTube conversion process.
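The comparison here was made by eye. One common way to put a number on transcription accuracy is word error rate (WER): the word-level edit distance between a reference transcript and the machine output, divided by the number of words in the reference. The Python sketch below is an illustration rather than the method used for this test. It takes one phrase from each output above and, as an assumption, treats the Google output as the reference, since no hand-checked transcript is included in this post.

```python
import string


def normalise(text):
    """Lower-case the text, strip ASCII punctuation and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# ASSUMPTION: the Google output stands in for a verified reference.
reference = "also commercial contracts are really important to avoid conflict"
outputs = {
    "Mozilla Deepspeech": "also commercial contracts areningportant to the void conslote",
    "Amazon Transcribe": "Also, commercial contracts are really important to avoid conflict",
}

for service, text in outputs.items():
    print(f"{service}: WER = {word_error_rate(reference, text):.2f}")
```

A lower score is better, with 0 meaning the output matches the reference word for word. Punctuation and capitalisation are ignored here, so the sketch says nothing about those features.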

From this (preliminary and admittedly small) test, it seems you get what you pay for: the most expensive service is the most accurate and the cheapest is the least accurate. Also, the headline cost of using Google Speech to Text on 8000 hours of video is not necessarily accurate. We need to remember that not all of this content is actively used: this is an accumulation of 8 years of content, and it’s possible that only a small fraction of it is still actually being watched. We now need to spend some time interrogating our video statistics to determine how much of the old content really needs to be transcribed. 

The best-value compromise, if we choose to continue to host video ourselves, may be to transcribe all future videos and any that have been watched at some point in the last year. In addition, it should be possible to provide an ‘on-demand’ service, whereby users can flag a video as requiring a transcription at the click of a button. Once flagged, the video is queued for transcription; a few minutes later the transcription is made available and the user is alerted.
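The selection rule and the on-demand queue described above could be sketched along the following lines. This is an illustration under stated assumptions, not a description of our system: the `Video` record, its field names, the `TranscriptionQueue` class and the `transcribe` callback are all hypothetical, and a real implementation would call one of the transcription services and notify the user once the job completes.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Video:
    """Hypothetical record for one hosted video."""
    video_id: str
    last_watched: Optional[datetime]  # None if the video has never been watched
    has_transcript: bool = False


def needs_batch_transcription(video, now, window=timedelta(days=365)):
    """True if the video lacks a transcript and was watched within the window."""
    if video.has_transcript or video.last_watched is None:
        return False
    return now - video.last_watched <= window


class TranscriptionQueue:
    """First-in, first-out queue of videos flagged by users."""

    def __init__(self):
        self._pending = deque()
        self._queued_ids = set()

    def flag(self, video):
        """Queue a video; return False if it is already queued or transcribed."""
        if video.has_transcript or video.video_id in self._queued_ids:
            return False
        self._pending.append(video)
        self._queued_ids.add(video.video_id)
        return True

    def process_next(self, transcribe):
        """Transcribe the oldest flagged video using the supplied callback."""
        if not self._pending:
            return None
        video = self._pending.popleft()
        self._queued_ids.discard(video.video_id)
        transcript = transcribe(video)
        video.has_transcript = True
        return video.video_id, transcript


# Usage with a stand-in callback instead of a real transcription service.
queue = TranscriptionQueue()
lecture = Video("lecture-001", last_watched=None)
queue.flag(lecture)
print(queue.process_next(lambda video: f"transcript of {video.video_id}"))
```

Keeping the batch rule and the queue separate means the one-year window can be tuned from the viewing statistics without touching the on-demand path.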

Video title: Warner Goodman Commercial Contracts.
Copyright: Lynda Povey (Enterprise Adviser) nest, The University of Portsmouth.

Image Credit: Photo by Jason Rosewell on Unsplash