Inspiration

The inspiration comes from two places:

1) Poor UX for Text to Speech applications: Text to Speech (TTS) has become quite useful for reading websites aloud, both for convenience and for accessibility. However, TTS is not available on all websites, and some make it a premium feature (e.g. Medium). Where TTS is available, the speech quality is only OK, far from the high-quality speech of OpenAI TTS or Google Vertex TTS. Speechify, which claims to be the most used TTS app, has questionable quality when narrating websites.

2) Few tools for consuming written content on the go: Quite often I find myself wanting to read something while my hands are busy: commuting, on the bike, running, at the gym, or even cooking. Today, I can save a website to a reading list or Evernote, but I have no way to listen to any website on my phone.

What it does

VoiceMate reads the content of the website you are on and downloads a high-quality audio version of it that you can take with you while you commute, run, or drive.

How we built it

The proof of concept is a Chrome extension that reads the DOM and extracts relevant HTML nodes such as article, section, p, and h1, among others.

Every node gets mapped to Markdown, which is used in a Gemini prompt to generate a Speech Synthesis Markup Language (SSML) document. The SSML is then passed to the Vertex TTS service to generate a long audio file that is stored in Google Cloud Storage.
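The node-to-Markdown mapping can be illustrated with a minimal sketch. This is not the extension's code (which runs in the browser against the live DOM); it is a Python stand-in using only the standard library, and the tag lists, heading prefixes, and the names MarkdownExtractor and html_to_markdown are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed mapping from block tags to Markdown prefixes (illustrative only).
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}
# Assumed list of subtrees that should never be read aloud.
SKIPPED = {"script", "style", "nav", "footer", "aside"}


class MarkdownExtractor(HTMLParser):
    """Collects the text of selected block tags as Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished Markdown blocks, in document order
        self._tag = None      # block tag currently being collected
        self._text = []       # text pieces of the current block
        self._skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in BLOCK_PREFIX and self._skip_depth == 0:
            self._tag, self._text = tag, []

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._tag:
            # Collapse runs of whitespace before emitting the block.
            text = " ".join("".join(self._text).split())
            if text:
                self.blocks.append(BLOCK_PREFIX[tag] + text)
            self._tag = None

    def handle_data(self, data):
        if self._tag is not None and self._skip_depth == 0:
            self._text.append(data)


def html_to_markdown(html: str) -> str:
    """Return the readable blocks of an HTML string as Markdown."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

The resulting Markdown keeps headings and paragraph breaks, which is the structure the prompt needs to place emphasis and pauses, while navigation and script content is dropped.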

VoiceMate generates a signed URL so the user can download the audio file in their browser.

Challenges we ran into

1) Not all SSML tags are supported by Vertex, so the Gemini prompt must list the allowed tags.

2) Vertex allows long audio files, but Gemini's token limit does not match that, so I need to generate the audio in batches and merge them later. This is not implemented in the proof of concept, though.

3) I tried feeding three inputs to the prompt: the inner text of the page, the HTML content of the body, and Markdown. Ultimately, Markdown worked best. The inner text on its own was good enough, but it lacked emphasis on titles and pauses between paragraphs. The HTML content addressed this, but it contained a lot of extra content that wasn't meant to be read aloud.

4) Sometimes, the tone of the narrator fluctuates within the same sentence and sounds unnatural.
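Challenges 1 and 2 lend themselves to a small sketch: checking generated SSML against an allow-list, and splitting Markdown into batches on paragraph boundaries. Everything here is an assumption for illustration, not the project's code: the tag set in ALLOWED_SSML_TAGS is a placeholder (the real list should come from the TTS service's documentation), and the character budget is only a rough stand-in for a token limit. Merging the resulting audio files is not shown.

```python
import xml.etree.ElementTree as ET

# Placeholder allow-list; replace with the tags the TTS service documents.
ALLOWED_SSML_TAGS = {"speak", "p", "s", "break", "emphasis", "prosody", "say-as", "sub"}


def unsupported_tags(ssml: str, allowed=ALLOWED_SSML_TAGS) -> set:
    """Return tag names used in the SSML that are not in the allow-list.

    Raises xml.etree.ElementTree.ParseError if the SSML is not well-formed XML.
    """
    root = ET.fromstring(ssml)
    return {el.tag for el in root.iter() if el.tag not in allowed}


def split_into_batches(markdown: str, max_chars: int = 4000) -> list:
    """Group Markdown blocks into batches of at most max_chars characters.

    Splits only on blank lines, so a paragraph is never cut mid-sentence.
    A single paragraph longer than max_chars is kept whole in its own batch.
    """
    batches, current = [], ""
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            batches.append(current)
            current = block
    if current:
        batches.append(current)
    return batches
```

A non-empty result from unsupported_tags could be used to reject or regenerate a response before it is sent for synthesis, and each batch from split_into_batches would get its own prompt and audio file.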

Accomplishments that we're proud of

1) Building the extension took a couple of hours/days. Its infrastructure was easy to deploy because the backend is a Google Cloud Function that stores data in Google Cloud Storage.

2) Great feedback from friends, who are surprised by the quality of the narration.

3) Its quality is outstanding with news websites (e.g. BBC) and blogs (Medium, Substack). The quality is still great on other websites whose content may be poorly organized.

What we learned

1) SSML and how to tag content for great quality TTS

2) The fact that Vertex is well integrated with Cloud Functions and Cloud Storage saved me hours of development

3) VoiceMate is a game changer for people with special accessibility needs

4) So far, most TTS solutions have targeted newspapers, professional bloggers, etc. Apart from Speechify, very little has cut through the noise for consumers.

What's next for VoiceMate

1) Prompt engineering: I will add a few more options to test out different prompts. As a user, I want to choose the type of narration I get: the full article, a paraphrase, the TLDR, etc.

2) Introduce user login to enable an audio library for users to browse and download previous audio files

3) Publish to the Chrome Web Store

4) Make audio generation async/event based

5) Build a mobile app for users on the go: As a user, I receive a link from a friend to read, but I'm driving, so I want to open it with the VoiceMate app and listen to it instead.
