Integrated Diagnostic System Based on AI and Signal Processing for Identifying Articulation Disorders and Retrieval Difficulties in Children Transitioning to First Grade
This project is an AI- and signal-processing-based system designed to identify articulation disorders and retrieval difficulties in children transitioning to first grade. The goal of the project is to provide an initial diagnostic tool intended for home use, reducing the need for initial consultations with a speech therapist and saving time and money. Additionally, the system puts children at ease by operating in a familiar environment, and it offers speech therapists detailed preliminary information about the child's speech, enabling comparison between manual assessments and system-generated results.
- Automatic Speech Analysis: The system utilizes AI and signal processing to analyze voice recordings of children saying specific words and identifies articulation errors.
- User-Friendly Interface: The interface allows users (parents, speech therapists) to easily interact with the system and view the analysis results.
- Spectrogram Display: Each analyzed word is displayed with a spectrogram showing the identified phonemes, the retrieval time, and each phoneme's position within the word (a minimal plotting sketch follows this list).
- Preliminary Diagnostic Tool: Speech therapists can use the system's data as a first-stage diagnostic, helping them prepare more efficiently for future therapy sessions.
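To illustrate the spectrogram display, here is a minimal plotting sketch, not the project's actual rendering code. It assumes a mono 16-bit WAV recording named `sample.wav` and a retrieval time already measured elsewhere (both hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile

# Load a mono recording (hypothetical file name; assumes 16-bit PCM).
sr, y = wavfile.read("sample.wav")
y = y.astype(np.float32) / np.iinfo(np.int16).max  # normalize to [-1, 1]

retrieval_time_s = 1.2  # placeholder: measured by the VAD stage elsewhere

fig, ax = plt.subplots(figsize=(8, 4))
ax.specgram(y, NFFT=512, Fs=sr, noverlap=256, cmap="magma")
ax.axvline(retrieval_time_s, color="cyan", linestyle="--",
           label=f"retrieval time = {retrieval_time_s:.2f} s")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Frequency (Hz)")
ax.legend()
plt.tight_layout()
plt.show()
```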
As part of the project, we developed a web application using Flask, a lightweight web framework for Python. The application serves as an interactive interface that allows users (both children and parents) to engage with the system in a simple, accessible way. Here's what the application does (a minimal sketch of the flow appears after the list below):
- Interactive User Flow:
  - The application presents images from a standardized set of objects, which the child is prompted to name.
  - Once the image is displayed, the app automatically starts recording the child's speech and measures the response time from when the recording begins until the child speaks.
  - After each session, the system processes the recording, identifies the phonemes, and displays a spectrogram with detailed information about the errors detected, if any.
  - Users can replay the recordings, review the results, and analyze the articulation errors using the spectrogram provided.
- Result Saving and Management:
  - Users can choose where the results (audio recordings and analysis images) are saved on their computer, making it easier to organize the session data for later use by speech therapists or parents.
- Standalone Application: The Flask app is designed to be packaged as a standalone executable (.exe), ensuring that it can run on any computer without installing additional software dependencies or a Python environment (a hedged packaging sketch follows below).
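The project text does not name the packaging tool; one common choice is PyInstaller, which can be invoked from Python as well as from the command line. A minimal sketch, assuming the Flask entry point lives in a hypothetical `app.py`:

```python
# build_exe.py -- package the Flask app into a single executable.
# Assumes PyInstaller is installed (pip install pyinstaller) and that
# the entry point is app.py (hypothetical file name).
import PyInstaller.__main__

PyInstaller.__main__.run([
    "app.py",
    "--onefile",                          # bundle everything into one .exe
    "--name", "speech_diagnostics",       # hypothetical output name
    "--add-data", "templates;templates",  # include Flask templates (';' separator on Windows)
])
```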
The application’s simple and intuitive interface makes it user-friendly for parents, children, and professionals alike, allowing speech therapists to easily monitor and track progress over time.
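Putting the flow together, here is a minimal sketch of how such an app could be wired up; it is not the project's actual code. A route serves the prompt image and timestamps it, the browser records and posts the audio, and an upload endpoint stores the recording with a coarse response time. The file names, the `static/images` layout, and the field name `audio` are all assumptions:

```python
import time
from pathlib import Path
from flask import Flask, request, send_from_directory

app = Flask(__name__)
RESULTS_DIR = Path("results")   # user-selectable in the real app
RESULTS_DIR.mkdir(exist_ok=True)
prompt_shown_at = {}            # image name -> time it was displayed

@app.route("/prompt/<name>")
def prompt(name):
    # Serve a picture from the standardized object set (assumed layout).
    prompt_shown_at[name] = time.monotonic()
    return send_from_directory("static/images", f"{name}.png")

@app.route("/upload/<name>", methods=["POST"])
def upload(name):
    # The browser posts the recorded audio once the child has spoken.
    elapsed = time.monotonic() - prompt_shown_at.get(name, time.monotonic())
    wav_path = RESULTS_DIR / f"{name}.wav"
    request.files["audio"].save(str(wav_path))
    # This timing is coarse; the VAD stage refines the true speech onset.
    return {"word": name, "response_time_s": round(elapsed, 2)}

if __name__ == "__main__":
    app.run(debug=True)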
The methodology developed in this project incorporates advanced machine learning techniques and signal processing. The system is built on top of the Allosaurus library, a multilingual allophone-based acoustic model for phoneme recognition, which has been adapted specifically for Hebrew phoneme identification. Allosaurus is known for its versatility and high accuracy across different languages, making it a suitable choice for this project. You can find more about Allosaurus on its GitHub page or in the original research paper: "Universal Phone Recognition with a Multilingual Allophone System" by Xinjian Li et al. (2020).
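For reference, Allosaurus exposes a compact Python API. A minimal usage sketch based on the library's documented interface; the file name is an assumption, and passing Hebrew's ISO 639-3 code assumes its phone inventory is available under `heb`:

```python
from allosaurus.app import read_recognizer

# Load the pretrained universal phone recognizer
# (downloads the model on first use).
model = read_recognizer()

# Recognize phones in a recording; an ISO 639-3 code such as "heb"
# restricts output to that language's phone inventory.
phones = model.recognize("sample.wav", "heb")
print(phones)  # a space-separated string of IPA phones
```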
- Data Collection: Voice recordings of children were collected and used to train and validate the model.
- Signal Processing: The recorded speech is processed to identify phonemes, and dedicated algorithms (detailed below) handle varying speech conditions and background noise.
Several key algorithms were employed in the system to ensure accurate recognition of phonemes and efficient handling of audio data:
- Voice Activity Detection (VAD): This algorithm detects segments of speech within an audio recording by analyzing the short-term energy (STE) and zero-crossing rate (ZCR) of the signal, allowing the system to filter out non-speech parts and focus on the relevant audio (a sketch follows this list).
- Mel-Frequency Cepstral Coefficients (MFCCs): This technique extracts features from the speech signal by representing the sound spectrum in a way that aligns with human auditory perception. MFCCs capture the frequency components that matter most for phoneme recognition (a sketch follows this list).
- Connectionist Temporal Classification (CTC): CTC maps the variable-length input audio features to the corresponding phoneme sequence without requiring an exact alignment between the audio and the phonemes. This is essential for handling the variable timing of spoken language.
- Beam Search Decoding: The system uses beam search to efficiently decode the most likely sequence of phonemes from the model's probabilistic predictions, improving the overall accuracy of the recognition process (a combined CTC/beam-search sketch follows this list).
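A minimal sketch of STE/ZCR voice activity detection as described above, assuming a normalized mono signal `y` at sample rate `sr`; the frame size and thresholds are illustrative defaults, not the project's tuned values:

```python
import numpy as np

def simple_vad(y, sr, frame_ms=25, energy_thresh=0.01, zcr_thresh=0.25):
    """Return a boolean mask marking frames that look like speech."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(y) // frame_len
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-term energy: mean squared amplitude per frame.
    ste = np.mean(frames ** 2, axis=1)
    # Zero-crossing rate: fraction of sign changes per frame.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    # Voiced speech tends to have high energy and low ZCR, while broadband
    # noise often shows high ZCR with low energy, so require both conditions.
    return (ste > energy_thresh) & (zcr < zcr_thresh)

# Hypothetical usage: retrieval time = time of the first speech frame.
# speech = simple_vad(y, sr)
# onset_s = np.argmax(speech) * 0.025
```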
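MFCC extraction is typically a few lines with librosa; a sketch assuming a recording named `sample.wav` (13 coefficients is a common speech-recognition default, not necessarily the project's setting):

```python
import librosa

# Load the recording as a mono float signal at its native sample rate.
y, sr = librosa.load("sample.wav", sr=None, mono=True)

# Extract 13 MFCCs per analysis frame.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, n_frames): one coefficient vector per frame
```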
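To show how CTC decoding with beam search works conceptually, here is a compact toy sketch over a small per-frame phoneme probability matrix. It is a simplified beam search over CTC alignments, not Allosaurus's internal decoder; a full CTC prefix beam search would merge equivalent prefixes at every step rather than only at the end:

```python
import numpy as np

BLANK = 0  # CTC blank symbol index

def ctc_collapse(path):
    """Collapse a frame-level CTC path: merge repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)

def beam_search_ctc(log_probs, beam_width=3):
    """Beam search over frame-level alignments of a (T, V) log-prob matrix,
    returning the best collapsed phoneme sequence."""
    beams = [((), 0.0)]  # (alignment so far, log probability)
    for t in range(log_probs.shape[0]):
        candidates = [(path + (v,), lp + log_probs[t, v])
                      for path, lp in beams
                      for v in range(log_probs.shape[1])]
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    # Merge surviving alignments that collapse to the same phoneme sequence.
    merged = {}
    for path, lp in beams:
        seq = ctc_collapse(path)
        merged[seq] = np.logaddexp(merged.get(seq, -np.inf), lp)
    return max(merged, key=merged.get)

# Toy example: 4 frames, vocabulary {blank, /b/, /a/}.
logits = np.log(np.array([[0.6, 0.3, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.5, 0.1, 0.4],
                          [0.2, 0.1, 0.7]]))
print(beam_search_ctc(logits))  # e.g. (1, 2), i.e. the phoneme sequence /b a/
```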
The main objectives of the project were to:
- Develop a robust, AI-based diagnostic tool for early detection of articulation disorders.
- Reduce the need for repeated visits to speech therapists by providing initial assessments that can be reviewed in a home setting.
- Present a graphical representation of the identified phonemes, including retrieval time and error categorization (semantic or phonological).
The system successfully identifies phonemes with a high degree of accuracy, making it a useful tool for both parents and speech therapists to detect potential speech impairments. However, challenges with the quality of recordings affected the overall performance, indicating room for improvement in future iterations.
The system can be expanded to support additional languages, improve recording quality by integrating advanced equipment, and add features like expression and intonation analysis. Future work may also focus on deploying the system in educational or mobile app formats to increase accessibility.