"Artificial Intelligence (AI) is the new electricity," according to Andrew Ng, a professor
at Stanford University. Today, AI is transforming every sector of industry,
academia, and business. From smartphones to supercomputers, AI has become the main
medium through which machines operate and communicate with humans. In particular,
voice-based access on digital devices such as smartphones is provided by
AI-enabled personal assistants, e.g., OK Google, Apple Siri, Microsoft Cortana,
and Amazon Echo. Alongside AI, there have been rapid advances in the Internet of
Things (IoT), where many electronic devices are connected through the Internet.
Since an AI-enabled IoT framework must operate in real-life noisy conditions,
knowledge of both signal processing and machine learning is needed to develop
a noise-robust voice-based access system for the IoT. Hence,
there is great demand for speech and audio technology in the near future, since
voice is the most important form of human communication and, increasingly, of
human-machine interaction.
One of the important aspects of speech and audio processing applications is
to find a better representation of the sounds, one that largely reduces the
variability in the signals. For example, in the speech recognition task, we need
a feature representation that reduces variability across speakers, dialects, and
microphones, and provides robustness to background noise [
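As a minimal sketch of this idea, the NumPy snippet below computes a log-magnitude spectrogram of a signal and then subtracts the per-utterance mean of each coefficient (a simple mean-normalization step, in the spirit of cepstral mean normalization, which helps cancel slowly varying channel and microphone effects). The frame length, hop size, and test signal here are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    # Slice the signal into overlapping frames and apply a Hann window
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Log-magnitude spectrum of each frame (small floor avoids log(0))
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def mean_normalize(features):
    # Subtract the per-utterance mean of each coefficient: a simple way
    # to reduce constant channel/microphone variability in the features
    return features - features.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal(4000)   # stand-in for a short utterance
feats = mean_normalize(log_spectrogram(x))
print(feats.shape)              # (n_frames, frame_len // 2 + 1)
```

More elaborate front ends (e.g., mel filterbanks or MFCCs) build on the same framing and normalization pattern.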