How to use Google Speech to Text API to transcribe long audio files?

Sundar Krishnan, Feb 27

Speech recognition is a fun task.

Many speech recognition APIs are available on the market today, which makes it easy for users to choose one.

However, when it comes to long audio files, especially call center data, the task becomes a little more challenging.

Let’s make an assumption that a call center conversation takes roughly 10 minutes.

For this scenario, only a few of the APIs available on the market can handle this type of data (Google, Amazon, IBM, Microsoft, Nuance, Rev.ai, open source WaveNet, open source CMU Sphinx).

In this article, we will talk about Google speech to text API in detail.

Google Speech to text API

Google Speech to text has three types of API requests based on audio content.

Synchronous Request

The audio content should be no longer than approximately 1 minute to make a synchronous request.

In this type of request, the user does not have to upload the data to Google cloud.

This gives users the flexibility to keep the audio file on their local computer or server and call the API to get the text.

Asynchronous Request

The audio content can be up to approximately 480 minutes (8 hours).

In this type of request, the user has to upload their data to Google cloud.

This is exactly what we will cover in this article.

Streaming Request

It is suitable for streaming data where the user is talking into a microphone directly and needs to get the speech transcribed.

This type of request is apt for chatbots.

Again, the streaming data should be no longer than approximately a minute for this type of request.
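As a rough illustration of the three request types above, the sketch below picks one from the length of the audio. The function name and the two constants are hypothetical, and the limits are just the approximate figures quoted in this article, so check them against the current quotas before relying on them.

```python
# Approximate limits quoted in the text above (assumptions, not official quotas).
SYNC_LIMIT_SECONDS = 60           # roughly 1 minute
ASYNC_LIMIT_SECONDS = 480 * 60    # roughly 480 minutes (8 hours)


def choose_request_type(duration_seconds, live_stream=False):
    """Return the request type that fits an audio clip of the given length."""
    if live_stream:
        return "streaming"
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    raise ValueError("audio is longer than the asynchronous limit; split it first")


# A 10-minute call center recording falls in the asynchronous range.
print(choose_request_type(10 * 60))
```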

Initial Setup

Before you begin, you need to do some initial setup.

Please follow the link below to complete the setup.

Quickstart: Using client libraries | Cloud Speech-to-Text API | Google Cloud (cloud.google.com)

I also wrote an article which explains step 1 in detail.

Create your own Voice based application using Python (medium.com)

Once you create the API client, the next step is to create a storage bucket.

You can use the link below to create a storage bucket.

For this project, I named the bucket ‘callsaudiofiles’.

Google Cloud Platform (console.cloud.google.com)

Let’s convert some speech to text

Step 1: Import necessary packages

Here the ‘filepath’ variable contains the location of the audio files on your local computer.

So you can store multiple audio files in the path and it will still work.

The ‘output_filepath’ is where all the transcripts created by Google cloud will later be stored on your local computer.

In addition, provide the name of the bucket created in the previous step in the ‘bucketname’ variable.

You do not need to upload your files to Google storage yourself at this point.

We will discuss how to upload to Google storage in a later section.
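A minimal sketch of this setup could look like the following. The variable names ‘filepath’, ‘output_filepath’ and ‘bucketname’ come from the description above; the two folder paths and the `list_audio_files` helper are hypothetical and only there for illustration.

```python
import os

# Third-party packages used in the later steps (assumed to be installed):
#   pydub, google-cloud-speech, google-cloud-storage

filepath = os.path.expanduser("~/audio_wav/")            # hypothetical input folder
output_filepath = os.path.expanduser("~/Transcripts/")   # hypothetical output folder
bucketname = "callsaudiofiles"                           # bucket created in the setup


def list_audio_files(folder):
    """Return the audio file names found in folder, sorted for a stable order."""
    if not os.path.isdir(folder):
        return []
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith((".wav", ".mp3"))
    )
```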

Step 2: Audio file encoding

Google Speech to text handles some specific types of audio encodings.

You can read about them in detail at the link below.

Introduction to audio encoding | Cloud Speech-to-Text API | Google Cloud (cloud.google.com)

This means audio files in a different format have to be converted before using the Google Speech to text API.

I provided sample code for converting mp3 files to wav files below.
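One way to sketch such a conversion is with the pydub package, which is an assumption here (it also needs ffmpeg available on the system). The function names are illustrative; `wav_name_for` is a small hypothetical helper that only computes the output path.

```python
import os


def wav_name_for(audio_file_name):
    """Return the .wav path that corresponds to an audio file path."""
    root, _ext = os.path.splitext(audio_file_name)
    return root + ".wav"


def mp3_to_wav(audio_file_name):
    """Convert an mp3 file to wav and return the path of the file to use."""
    if not audio_file_name.lower().endswith(".mp3"):
        return audio_file_name  # not an mp3; leave the file untouched
    # Imported here so the helper above works even without pydub installed.
    from pydub import AudioSegment

    wav_file_name = wav_name_for(audio_file_name)
    sound = AudioSegment.from_mp3(audio_file_name)
    sound.export(wav_file_name, format="wav")
    return wav_file_name
```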

Step 3: Audio file specs

One other limitation is that, with its default settings, the API expects mono rather than stereo audio.

So the user needs to convert a stereo file to a mono file before using the API.

In addition, the user has to provide the audio frame rate for the file.

The code below helps you figure it out for any ‘.wav’ audio file.
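As a sketch of both checks, the frame rate and channel count can be read from the ‘.wav’ header with Python's standard `wave` module, and a stereo file can be down-mixed with pydub (an assumed dependency). The function names are illustrative.

```python
import wave


def frame_rate_channel(audio_file_name):
    """Return (frame_rate, channels) read from a .wav file header."""
    with wave.open(audio_file_name, "rb") as wave_file:
        return wave_file.getframerate(), wave_file.getnchannels()


def stereo_to_mono(audio_file_name):
    """Overwrite a .wav file with a single-channel version of itself."""
    from pydub import AudioSegment  # third-party; imported only when needed

    sound = AudioSegment.from_wav(audio_file_name)
    sound = sound.set_channels(1)
    sound.export(audio_file_name, format="wav")
```

A typical use would be to call `frame_rate_channel` first and run `stereo_to_mono` only when the channel count is greater than 1.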

Step 4: Upload files to Google storage

As we discussed before, in order to perform an asynchronous request, the file should be uploaded to Google cloud.

The code below will accomplish the same.
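A minimal upload sketch, assuming the google-cloud-storage package and credentials that are already configured. The optional `client` argument and the `gcs_uri` helper are hypothetical additions that make the function easy to try without touching a real bucket.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that the Speech API uses to locate a file."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def upload_blob(bucket_name, source_file_name, destination_blob_name, client=None):
    """Upload a local file to the bucket and return its gs:// URI."""
    if client is None:
        from google.cloud import storage  # needs google-cloud-storage
        client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    return gcs_uri(bucket_name, destination_blob_name)
```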

Step 5: Delete files in Google storage

Once the speech to text operation is completed, the file can be deleted from Google cloud.

The code below can be used to delete files from Google cloud.
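The cleanup can be sketched in the same way, again assuming the google-cloud-storage package. The `client` parameter is a hypothetical convenience, not part of the library.

```python
def delete_blob(bucket_name, blob_name, client=None):
    """Delete one object from a Cloud Storage bucket."""
    if client is None:
        from google.cloud import storage  # needs google-cloud-storage
        client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).delete()
```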

Step 6: Transcribe

Finally, the transcribe function below performs all the operations necessary to get the final transcripts.

It calls the other functions described in the previous steps and stores the transcripts in the ‘transcript’ variable.

One thing to note here is the timeout option.

It is the maximum number of seconds the code waits for Google to finish transcribing the current audio file.

Increase it if your audio files are long enough that transcription does not finish within the number provided here.
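The core of such a function might be sketched as below. It covers only the API call, not the conversion, upload and cleanup steps, and it assumes version 2.x of the google-cloud-speech package, a LINEAR16 wav file already in the bucket, and US English audio. The default timeout is an arbitrary placeholder, and `join_transcript` is a hypothetical helper.

```python
def join_transcript(response):
    """Concatenate the top alternative of every result in a response."""
    return "".join(
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives
    )


def transcribe_gcs(gcs_uri, frame_rate, timeout=3600, language_code="en-US"):
    """Run an asynchronous request on a file in Cloud Storage and return its text."""
    from google.cloud import speech  # google-cloud-speech 2.x assumed

    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=frame_rate,
        language_code=language_code,
    )
    operation = client.long_running_recognize(config=config, audio=audio)
    # Block until the operation finishes or the timeout (in seconds) expires.
    response = operation.result(timeout=timeout)
    return join_transcript(response)
```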

Step 7: Write transcripts

Once the speech to text operation is completed, the code below can be used to store the final transcripts in a file.
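A small sketch of this step using only the standard library. The function name and its signature are illustrative.

```python
import os


def write_transcripts(output_folder, transcript_filename, transcript):
    """Write one transcript into output_folder and return the file path."""
    os.makedirs(output_folder, exist_ok=True)
    path = os.path.join(output_folder, transcript_filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(transcript)
    return path
```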

Step 8: Execute your code. Wait and watch the transcripts

The code below starts the execution.

You can have multiple audio files in the filepath.

It executes each file sequentially.
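The sequential loop can be sketched as follows. To keep the example self-contained, the transcribe and write steps are passed in as functions; `run_all` and its parameters are hypothetical names.

```python
import os


def run_all(filepath, transcribe, write):
    """Transcribe every audio file in filepath, one after another."""
    done = []
    for audio_file_name in sorted(os.listdir(filepath)):
        if not audio_file_name.lower().endswith((".wav", ".mp3")):
            continue  # skip anything that is not an audio file
        transcript = transcribe(os.path.join(filepath, audio_file_name))
        transcript_filename = os.path.splitext(audio_file_name)[0] + ".txt"
        write(transcript_filename, transcript)
        done.append(transcript_filename)
    return done
```

Here `transcribe` takes the path of one audio file and returns its text, and `write` takes a file name and the text to store.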

The final transcripts look like the following.

I was on the other roommate had to leave before I got half of them by him.

I guess no way to get a hold back to you.

Alright.

Yeah, kinda I mean like what? What are you I have to I have to play with the other guys.

So yeah, go ahead and I with me I can let my people like the one that you were calling me, but I go ahead and do it cuz I'm sure he's not work for lunch, but they just had them or is it 10 o'clock? I want to go ahead and get out with me.

Call me.

I understand.

They probably to talk about Mom and I need to call back or maybe I can just figured after taxes advertises.

It's 110 feet in so I guess alright.

Well shoot let me know and then maybe I'll just minus the weather.

Okay.

Well so much for your help.

Like I said, no problem.

Alright, you have a good day.

Okay.

Bye.

What if I need to separate speakers in my transcripts?

Speaker diarization is the process of distinguishing speakers in an audio file.

It turns out you can use the Google Speech to text API to perform speaker diarization.

The final transcripts generated by Google after speaker diarization look like the following.

speaker 1: I was on the other roommate had to leave before I got half of them by him I guess no way to get a hold back to you alright

speaker 2: yeah kinda I mean like what what are you I

speaker 1: have to I have to play with the other guys so yeah go ahead and I with me I can let my people like the one that you were calling me but I go ahead and do it cuz I'm sure he's not work for lunch but they just had them or is it 10 o'clock I want to go ahead and get out with me

speaker 2: call me I understand

speaker 1: they probably to talk about Mom and I need to call back or maybe I can just figured after taxes

speaker 2: advertises it's 110 feet in so I

speaker 1: guess alright well shoot let me know and then maybe I'll just minus the weather okay well so much for your help like I said no problem alright you have a good day okay bye

To perform this, you need to make some changes to the code described before.

Let us start with the package imports.

Now, let us talk about the changes in the transcribe part.
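As a hedged sketch of what those changes can look like with version 2.x of the google-cloud-speech package (an assumption): diarization is switched on through a `SpeakerDiarizationConfig`, and each recognized word then carries a `speaker_tag`. The two function names and the default of two speakers are illustrative choices.

```python
def diarization_config(speaker_count=2):
    """Build a diarization config for a known number of speakers."""
    from google.cloud import speech  # google-cloud-speech 2.x assumed

    return speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=speaker_count,
        max_speaker_count=speaker_count,
    )


def format_speakers(words):
    """Group consecutive words by speaker_tag into 'speaker N: ...' lines."""
    lines = []
    current_tag = None
    for word_info in words:
        if word_info.speaker_tag != current_tag:
            # The speaker changed, so start a new line for this speaker.
            current_tag = word_info.speaker_tag
            lines.append("speaker {}: {}".format(current_tag, word_info.word))
        else:
            lines[-1] += " " + word_info.word
    return "\n".join(lines)
```

In the transcribe step, the config would be passed as `diarization_config=diarization_config()` when building `speech.RecognitionConfig`, and the word list to format would be taken from the last result, `response.results[-1].alternatives[0].words`, which is where the API places the speaker-tagged words.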

The entire code for both projects can be found at the GitHub link.

Have fun!
