Unsupervised Spoken Term Discovery for Zero Resource Speech Processing

Bhati, Saurabh Chand and Kodukula, Sri Rama Murty (2017) Unsupervised Spoken Term Discovery for Zero Resource Speech Processing. Masters thesis, Indian Institute of Technology Hyderabad.

EE12B16M100001.pdf - Submitted Version
Restricted to Registered users only until 23 July 2020.

Zero resource speech processing refers to a scenario where little or no transcribed data is available. Unsupervised Acoustic Segment Modelling (ASM) is a technique for discovering acoustic segments from speech and building the corresponding acoustic models without any prior knowledge or manual transcriptions. ASM comprises three main steps: (a) segmentation of speech utterances into acoustic segments, (b) segment labelling, and (c) segment modelling. This work focuses on improving the initial segmentation and acoustic segment labelling (ASL). In the first step, we segment the speech signal into acoustically homogeneous regions, resulting in a large number of varying-length segments. We propose a new kernel Gram matrix-based approach for segmentation. It determines the number of segments automatically and delivers performance comparable to state-of-the-art algorithms. The second step involves clustering the varying-length segments into a finite number of clusters so that each segment can be labelled with a cluster index. To improve labelling, a new graph clustering-based framework is proposed. A major problem in ASM is estimating the number of ASM units that should be used to model the speech data. This is often left unaddressed, or an empirically chosen number of ASM units is adopted. Our algorithm estimates the number of ASM units reasonably well. Performance comparison with baseline approaches demonstrates the ability of our algorithm to model ASM with minimal supervision. In the third step, a deep neural network (DNN) classifier is trained to map each feature vector extracted from the signal to its corresponding virtual phone label. The virtual phone posteriors extracted from the DNN are used as features for zero resource speech processing.
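As a rough illustration of steps (a) and (b), the following Python sketch segments a toy feature sequence and then labels the segments. It is not the algorithm from the thesis: the boundary rule (thresholding the Gaussian kernel similarity between consecutive frames, i.e. the first off-diagonal of the Gram matrix) and the labelling rule (connected components of a thresholded similarity graph over segment means) are simple stand-ins chosen for this example. All function names, the `threshold` and `gamma` parameters, and the toy data are assumptions.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def segment(frames, threshold=0.5, gamma=1.0):
    """Step (a), illustrative: return (start, end) segments, end exclusive.

    A boundary is placed wherever the kernel similarity between
    consecutive frames falls below `threshold`, so the number of
    segments is not fixed in advance.
    """
    if not frames:
        return []
    bounds = [0]
    for t in range(1, len(frames)):
        if rbf(frames[t - 1], frames[t], gamma) < threshold:
            bounds.append(t)
    bounds.append(len(frames))
    return list(zip(bounds[:-1], bounds[1:]))

def mean_vector(frames, start, end):
    """Fixed-length summary of a varying-length segment: its mean frame."""
    dim = len(frames[0])
    n = end - start
    return [sum(frames[t][d] for t in range(start, end)) / n for d in range(dim)]

def label_segments(frames, segments, threshold=0.5, gamma=1.0):
    """Step (b), illustrative: connected components of a similarity graph.

    Two segments share an edge when the kernel similarity of their means
    is at least `threshold`; each component becomes one cluster label,
    so the number of labels is a by-product of the graph.
    """
    means = [mean_vector(frames, s, e) for s, e in segments]
    labels = [-1] * len(segments)
    next_label = 0
    for i in range(len(segments)):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        stack = [i]
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            for v in range(len(segments)):
                if labels[v] == -1 and rbf(means[u], means[v], gamma) >= threshold:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels

# Toy 2-D "features": two similar regions separated by a different one.
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1], [0.1, 0.1]]
segments = segment(frames)
labels = label_segments(frames, segments)
```

On this toy input the first and third segments receive the same label because their mean vectors are close, while the middle segment forms its own cluster. Real speech features would need more robust boundary detection and clustering than this threshold-based stand-in.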
The effectiveness of the proposed approach is evaluated on both the ABX and spoken term discovery (STD) tasks using the spontaneous American English and Tsonga datasets provided as part of the Zero Resource Speech Challenge 2015. The proposed system outperforms the baselines supplied with the datasets on both tasks, without any task-specific modifications.
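For readers unfamiliar with ABX evaluation, the sketch below shows the general idea in Python: given triples (A, B, X) where X belongs to the same category as A, an error is counted whenever X is closer to B than to A. It is only an illustration under stated assumptions. The Euclidean frame distance, the unnormalised dynamic time warping (DTW) cost, the half-error rule for ties, and all function names are choices made for this example, and the challenge's official evaluation tooling may differ in each of these respects.

```python
import math

def frame_dist(x, y):
    """Euclidean distance between two frames (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(seq_a, seq_b):
    """Unnormalised DTW alignment cost between two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def abx_error_rate(triples):
    """Fraction of (A, B, X) triples where X is closer to B than to A.

    X is assumed to share its category with A. Ties count as half an error.
    """
    if not triples:
        raise ValueError("at least one triple is required")
    errors = 0.0
    for a, b, x in triples:
        d_ax, d_bx = dtw(a, x), dtw(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triples)

# Toy 1-D sequences: x is a time-stretched version of a, and far from b.
a = [[0.0], [1.0]]
b = [[5.0], [6.0]]
x = [[0.0], [0.0], [1.0]]
rate_good = abx_error_rate([(a, b, x)])
rate_mixed = abx_error_rate([(a, b, x), (b, a, x)])
```

A lower error rate indicates that the features separate the categories better. In the second call, the deliberately mislabelled triple contributes one error out of two.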

IITH Creators: Kodukula, Sri Rama Murty (ORCiD: https://orcid.org/0000-0002-6355-5287)
Item Type: Thesis (Masters)
Uncontrolled Keywords: spoken term discovery, unsupervised speech segmentation, zero resource speech processing, acoustic segment modelling, TD947
Subjects: Electrical Engineering
Divisions: Department of Electrical Engineering
Depositing User: Team Library
Date Deposited: 24 Jul 2017 04:49
Last Modified: 28 May 2019 08:57
URI: http://raiith.iith.ac.in/id/eprint/3420