Inspiration

I love languages and language learning, and since I'm on exchange from the US right now, I've been trying to learn French. I've noticed how big a difference in experience there is between current language-learning tools, including ones that are AI-powered, and real-life conversation. Essentially, I'm trying to close this gap and ultimately surpass it, because I think AI interactions have the capacity to be even more educationally rich than human interactions.

What it does

Speakeasy is a bilingual AI language tutor with perfect knowledge of both your target and native languages. You can ask it any question you would ask a regular tutor, in either language, and it retains memory of your progress from call to call via summarization, customizing its lessons based on your past mistakes. Because it's bilingual and artificial, Speakeasy isn't as intimidating as a real-life scenario, but it mimics the latency and memory recall demanded by a real-life conversation in your target language.
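
To give a concrete sense of how call-to-call memory via summarization could work, here's a minimal sketch assuming the OpenAI Python SDK. The storage scheme (a JSON file keyed by phone number), the model choice, and the prompt wording are illustrative assumptions, not Speakeasy's exact implementation.

```python
# Minimal sketch of call-to-call memory via summarization (assumptions noted inline).
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
MEMORY_FILE = Path("learner_memory.json")  # hypothetical store, keyed by phone number


def summarize_call(phone_number: str, transcript: str) -> None:
    """After a call ends, condense the transcript into notes on the learner's progress."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {
                "role": "system",
                "content": "Summarize this tutoring call: note the learner's recurring "
                "mistakes, new vocabulary covered, and overall progress.",
            },
            {"role": "user", "content": transcript},
        ],
    )
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory[phone_number] = response.choices[0].message.content
    MEMORY_FILE.write_text(json.dumps(memory))


def load_memory(phone_number: str) -> str:
    """At the start of the next call, pull the saved summary into the system prompt."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    return memory.get(phone_number, "This is the learner's first call.")
```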

How we built it

I used a collection of conversational infrastructure APIs, including OpenAI, Deepgram, and ElevenLabs, along with the Twilio API, which lets me place and receive phone calls. Since Speakeasy is a voice-based product, I open a websocket when the user calls. This is done with FastAPI, a useful Python library for REST APIs that also supports websockets. I tested it locally by running the server on my machine, exposing it publicly with ngrok, and pointing Twilio's websocket at it. I also hosted the code on Render, so it's publicly available too. It's difficult to demo in a short video, but I also implemented a custom LLM wrapper that handles the different states of the conversation. Web calls are actually even faster than phone calls, but I thought phone calls were a more interesting and convenient form factor for this software.
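
For concreteness, here's a minimal sketch of what the call entry point could look like, assuming FastAPI and Twilio Media Streams. The route name and the handle_audio_chunk hook standing in for the Deepgram → LLM → ElevenLabs pipeline are hypothetical placeholders, not the exact code.

```python
# Minimal sketch of the Twilio Media Streams entry point with FastAPI.
import json

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


@app.websocket("/media")
async def media_stream(ws: WebSocket):
    # Twilio opens this websocket when someone dials the Speakeasy number.
    await ws.accept()
    stream_sid = None
    try:
        while True:
            message = json.loads(await ws.receive_text())
            event = message.get("event")
            if event == "start":
                # Twilio sends a stream SID that is needed when sending audio back.
                stream_sid = message["start"]["streamSid"]
            elif event == "media":
                # Base64-encoded audio from the caller; hand it to the
                # STT -> LLM -> TTS pipeline (hypothetical hook below).
                audio_b64 = message["media"]["payload"]
                # await handle_audio_chunk(ws, stream_sid, audio_b64)
            elif event == "stop":
                break
    except WebSocketDisconnect:
        pass
```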

Challenges we ran into

Interpreting user inputs consistently and reliably: when users request setting changes mid-call, it's difficult to detect that reliably every time, which is why I implemented a structured JSON parser for user input. It adds a bit of latency, but it lets users change settings dynamically during the call. In other words, I convert each user input into JSON to classify whether a setting change is being requested; if it is, I can detect it from the type of the parsed JSON object and change the settings accordingly, and since the LLM is constrained to JSON output, it's extremely consistent (see the sketch below). Latency is another challenge: I would definitely like the bot to be a lot faster, and in fact it would be faster if I didn't have to classify every input message.
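
Here's a minimal sketch of that kind of structured classification, assuming the OpenAI Python SDK with JSON mode. The schema (type/setting/value fields) and the model choice are illustrative assumptions, not Speakeasy's exact parser.

```python
# Minimal sketch of classifying user input as conversation vs. a setting change.
import json

from openai import OpenAI

client = OpenAI()

CLASSIFIER_PROMPT = (
    "Classify the user's message. Respond with JSON only, either "
    '{"type": "setting_change", "setting": "<name>", "value": "<value>"} '
    'or {"type": "conversation"}.'
)


def classify(user_message: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        # JSON mode constrains the output to valid JSON, which is what makes
        # the classification consistent enough to act on mid-call.
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(response.choices[0].message.content)


# e.g. classify("could you speak a bit slower?") might return
# {"type": "setting_change", "setting": "speed", "value": "slower"}
```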

Accomplishments that we're proud of

I got realtime voice calling deployed publicly on Render, whose free tier supports websockets to a limited capacity (I used a local deployment for the demo in the interest of observability). I also found a way to consistently parse user inputs in realtime as either regular conversational messages or requests for setting changes (like speed).

What we learned

I learned a lot about the value of realtime communication and also about how much these systems are still missing. We have a long way to go, but I'm now pretty convinced this is the future of language learning, and I think realtime, conversation-focused AIs will soon be better than humans at educational use cases like language learning.

What's next for Speakeasy

My roadmap includes support for more languages, including French and Spanish, as well as realtime pronunciation feedback in addition to the grammatical feedback already built into the experience. I also want to add post-call analysis reports that identify patterns in mistakes, so users can examine their error tendencies in grammar and pronunciation, and more structure to the lessons themselves, such as introducing new vocabulary and simulating different real-life scenarios.

Built With

OpenAI, Deepgram, ElevenLabs, Twilio, FastAPI, Render, ngrok