5.1 Introduction
All previous and current research into the historical phonology of Indo-European words is based almost exclusively upon written texts, including phonological transcriptions of modern languages, historical texts, and hypothesized reconstructions of the phonological make-up of words in ages past. In contrast, my research is based upon audio recordings of words in modern languages, as well as software-generated simulations of how ancestral pronunciations of modern words may have sounded. Closely guided by the published results and proposals of philologists, I attempt to add audio “flesh” to the “bones” of textual transcriptions. Over the past ten years, my collaborators and I have made slow but now substantial progress in modelling change over time of forms that can be described mathematically using continuous functions, i.e. curves and surfaces (The Functional Phylogenies Group 2012, Jones and Moriarty 2013, Hadjipantelis et al. 2013). As the acoustic parameters of spoken words are one instance of such functions, it has become possible to model phonetic history and prehistory by such methods; that is, to construct audible sound files instantiating simulations of hypothesized possible spoken forms from the past, including distant ancestral pronunciations and the intermediate pronunciations at each generation. My earlier collaborators and I developed quantitative, computational methods to (1) factor the acoustics of spoken words into the average or typical pronunciation of each word, the language-specific variation in those pronunciations, and the residue of variation that is characteristic of individual speakers (Pigoli et al. 2018); (2) synthesize sound changes by audio morphing (Moore and Coleman 2005): interpolation between audio recordings serving as proxies for present-day and historical pronunciations (Coleman et al. 2015); (3) simulate “ancestral” recordings of word pronunciations in proto-languages by “hybridization” of pairs of modern recordings, each of which partly records some aspect(s) of the postulated proto-word; and (4) model geographical variation that arises from and thus “fossilizes” historical-linguistic change (Tavakoli et al. 2019).
As sound recordings of ancestral forms from the distant past do not exist, I base my models of sound change on proxies of several kinds. We may use modern recordings that resemble how ancestral pronunciations are thought to have sounded, e.g. the [un-] portion of Spanish or Italian uno is a proxy for the corresponding part of Latin unus, or we may combine two modern recordings into various kinds of hybrids using signal processing. For example, I generate the sound of Proto-Indo-European *dwoh (“two”) from a continuum of sound files interpolated between a recording of Elfdalian twå [two] and a recording of Russian два [dva] that happens to have a laryngealized offset; because this operation is carried out across the whole extent of two words in contrast, I call it “paradigmatic” hybridization. As another example of recombining parts of two words, the first syllable of Italian quindici [kwín-] spliced with the second syllable of cinque [-kwe] simulates Latin quinque [kwínkwe]. Because this operation is carried out on the time axis, I call it “syntagmatic” hybridization. In related work, Hadjipantelis (2013) used statistical regression over a phylogenetic tree to extrapolate back from a set of modern recordings to a hypothetical ancestral form. Admittedly, this method does not yet yield very realistic-sounding reconstructions of ancestral pronunciations (for example, the “ancestor” sound files reconstructed from recordings of words in modern Romance languages do not much resemble how we believe Latin to have sounded), though it has subsequently proved more successful in reconstructing vocalizations from the past in other species, in particular bat echolocation signals (Meagher et al. 2019); further work is needed.
Constructing simulations of audio word-histories is not rapid work. Although it is done using signal processing tools and algorithms, it is not merely a matter of throwing the recordings into some well-established machine learning or statistical process and waiting for the results to churn out. Rather, relevant recordings have to be tracked down, processed into a common audio format, possibly stretched, squashed or divided into parts (as in the hybridization technique mentioned above), paired up with other recordings relating to different times or languages, and only finally submitted to the algorithm for generating continua of changing sounds. Though that final step is quick, taking only seconds, the preparatory stages are mental and manual work, an aptitude that has taken me decades to acquire. The novelty and expertise required mean that it cannot be delegated to a research assistant; this is an individual effort, not (as in previous stages of this project) a collaboration. From start to finish, it takes the better part of a day's work to create each sound-change simulation. There are often dead ends: trials that do not work and therefore necessitate a reexamination of the materials or techniques employed.
5.2. A work flow for creating sound-change continua
[Notes from a class for graduate students, 23/3/24]
I use the following software tools, in a Linux computing environment: sox, ffmpeg, wavesurfer, Praat, ahocoder and ahodecoder, and GNU Octave (for scripts such as continua3.m and stretchwarp.m). It would not be too difficult to install and run many of these under Apple macOS, as that operating system is also based on Unix; it is also possible, though it takes a little more work, to get everything running under Windows.
5.2.1. Vowel and consonant changes in there
The etymology of there from Proto-Indo-European *to-r is:
there [ðɛ:] < [ðɛə] < [ðɛɚ] < Old English þæ̅r [θæ:r] < Proto-Germanic *þar [θɑr] < Proto-Indo-European *to-r [tʰor]
The modelling challenges involved in this sequence include:
• changes in vowel qualities, such as [ɛɚ] from [æ:r]
• reduction/loss of postvocalic [r], i.e. from Early Modern English [ðɛɚ] to Standard Southern British English [ðɛə]
• lenition of [tʰ] (Grimm’s Law), i.e. Proto-Indo-European [tʰ] > Proto-Germanic [θ]
Step 1: collect materials
For there I collected a combination of: my own recordings such as there-RP.wav for [ðɛ:]; for [ðɛɚ], there.wav (an American English male speaker from Glosbe.com); several other .mp3 recordings downloaded from forvo.com; two .ogg format recordings from Wikimedia Commons (https://en.wiktionary.org/wiki/there#Pronunciation), one a female speaker from the UK [ðɛ:], the other a rhotic US speaker [ðɛɚ]; and .mpeg files shared by two male speakers on dict.cc.
As proxy recordings for Old English þæ̅r I made a recording of that word as pronounced by an expert, an Old English scholar, and a second recording submitted by a contributor to forvo.com. As proxies for Proto-Germanic *þar we have examples of (modern) Icelandic þar from forvo.com and dict.cc. As a proxy for Proto-Indo-European *to-r I collected recordings of the English word tor and found the Lombard word torr on forvo.com.
Step 2: convert materials to a standard format
The audio files mentioned above are in a variety of formats, sampling rates, channel counts, etc. In order to be processed using ahocoder and my interpolation/morphing code, the audio files must be converted to 16-bit, 1-channel (i.e. monophonic) PCM .wav files at a sampling rate of 16,000 samples per second (16 kHz). This may be done with sox on the command line, e.g.
for i in *.ogg; do sox "$i" -c 1 -r 16000 -b 16 -e signed-integer "${i%.ogg}.wav"; done
A tip: if the output .wav files sound like loud random noise, but have the correct overall duration, the byte order is probably wrong. To fix that, add "-x" just before the output filename in the sox command above.
Since sox cannot read .mpeg files and is often built without .mp3 support, ffmpeg is a useful alternative, e.g.
ffmpeg -i there-F2.mp3 -ac 1 -ar 16000 there-F2.wav
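Whichever converter is used, it is worth confirming that the output really is 16-bit, monophonic and sampled at 16 kHz before going any further. The soxi utility (installed alongside sox) prints this information; the filename here is just the example from above:
soxi there-F2.wav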
Step 3: carefully examine the audio recordings
Using e.g. wavesurfer, visually inspect the .wav files. Listen to them to verify characteristics of the speaker's voice and the recording quality, such as a high vs. low pitched voice, a rapid talker vs. a slow talker, good quality, clean recordings vs. poorer quality recordings that may need work to reduce extraneous noise, and so on. Trim leading and trailing silence/background noise portions before and after the spoken word. Compare the overall durations of the words, or in short words compare the durations of the vowels or consonants that will need to be morphed in the simulation. Often, it is possible to manually choose a pair of recordings that are most similar to each other in overall timing.
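A quick way to compare overall durations before opening anything in wavesurfer is to ask soxi for each file's length in seconds (the wildcard here is illustrative; adjust it to match your own filenames):
for f in *.wav; do printf '%s\t' "$f"; soxi -D "$f"; done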
Step 4: time-align the paired audio files as well as possible
For monosyllabic words, it is necessary to time-align their vowels. If the initial consonants are of different durations (as in e.g. tor.wav vs. thar.wav), calculate how much additional silence to add at the beginning of the shorter consonant in order to bring the onsets of the two vowels to the same starting time.
For example, the waveforms of Lombard torr and Icelandic þar plotted below are of unequal durations and are not time-aligned. The recording of torr is 470 ms long, and that of þar is 744 ms long. The vowel of torr begins at around 62 ms into the recording, and that of þar at around 301 ms, because the initial fricative [θ] is much longer than the initial aspirated stop [tʰ]. By inserting a silence of (301 − 62) = 239 ms at the start of torr, the vowels of these two recordings can be made to begin at the same time. To manually insert such a silence, the Transform > Silence... menu option of wavesurfer may be used.
Above: waveform and spectrogram of Lombard torr, a candidate proxy for Proto-Indo-European *tor. Below: waveform and spectrogram of Icelandic þar, a candidate proxy for Proto-Germanic *þar.
If one word is longer than the other (even with the vowel onsets time-aligned), is that because the vowels are of very different durations? If so, it may be useful to calculate the ratio of those vowel durations and then use Praat to adjust the timescale of one word so that its vowel becomes the same duration as the vowel of the other word. This may be done either by shortening the longer vowel by a factor less than 1, or lengthening the shorter vowel by a factor greater than 1. It is advisable not to time-stretch recordings by a factor of less than 0.5 or more than 1.5. If the ratio of the two vowel durations lies outside those limits, it is wise to look for other recordings that are more similar in their durations/speaking rates.
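If Praat is not to hand, a comparable global time-scaling can be done with sox's tempo effect, which changes duration without altering pitch. This is only a sketch and the factor shown is illustrative; note that for tempo a factor below 1 lengthens the recording and a factor above 1 shortens it, so stretching thar.wav to roughly 105% of its original duration would be:
sox thar.wav thar-stretched.wav tempo 0.95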
With the vowel start-times and vowel durations well-aligned, it may then be necessary to either delete extraneous silence from the end of one file, or add additional silence to the end of one file, in order to ensure that the two sound files are of equal duration. This is easily done using wavesurfer.
The vowel of torr is about 195 ms long; it is hard to be precise because it fades into the final [r]. The vowel of þar is about 205 ms long, which is only very slightly different from that of torr. A difference of 10 ms (or even a bit longer e.g. 20 or 25 ms) is very slight, so it is not necessary to stretch or squash either of these vowels in order to time-align them quite well. Following the addition of initial silence to torr, that file is now 709 ms. To make it the same duration as þar, it is sufficient to add (744 − 709) = 35 ms of silence at the end of torr. Having added the needed silences, I saved the two files under different (simpler) names, tor.wav and thar.wav. Since some software tools do not like special characters or symbols, I recommend using only ASCII roman letters, hyphens, or numerals in such file names.
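The same silences can also be inserted from the command line with sox's pad effect, whose first argument is the number of seconds of silence to add at the start and whose second is the number of seconds to add at the end. Assuming the converted Lombard recording is in a file called torr.wav (the input name here is illustrative), the following produces the padded 744 ms file tor.wav in one step:
sox torr.wav tor.wav pad 0.239 0.035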
In more complex cases, such as polysyllabic words, it may be helpful to use stretchwarp.m to dynamically time-align one recording to more closely fit the other, but this is not always simple and is not guaranteed to work as desired, especially when the temporal mis-matches are great.
Finally, it is notable that tor.wav and thar.wav are of quite different amplitudes. The signal in thar.wav has a maximum of 30873 units and the signal in tor.wav has a maximum amplitude of 12352 units. When the disparity is that great (more than a 2:1 ratio), I recommend reducing the amplitude of the louder one to e.g. 20000 and increasing the amplitude of the quieter one to the same amount. In wavesurfer you can use Transform > Amplify to do that. 20000/30873 = 64.78%, and 20000/12352 = 161.92%. Alternatively you can use sox to do it; the syntax is:
sox -v <scaling_factor> input.wav scaled_version.wav
The <scaling_factor> is a number, not a percentage, so to quieten down thar.wav to a maximum level of 20000, you would use:
sox -v 0.6478 thar.wav thar-scaled.wav
(It is up to you what you call the output file; as long as it is different from the original filename and is a name that is meaningful to you, any name is fine.)
To increase the amplitude of tor.wav to match thar-scaled.wav, you could use:
sox -v 1.6192 tor.wav tor-scaled.wav
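To find these maximum sample values without opening the files in wavesurfer, sox's stat effect will report them, although it expresses the maximum amplitude as a fraction of full scale (±1) rather than in 16-bit sample units, so multiply by 32768 to compare with the figures above (30873/32768 ≈ 0.94):
sox thar.wav -n stat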
Step 5: use continua3.m to create the series of interpolants between one of the time-aligned files and the other.
continua3.m is an Octave/Matlab script, so you need to:
(a) start Octave,
(b) navigate to the directory or folder where the sound files of interest are, together with the continua3.m script,
(c) in the Octave "command" window, type continua3 (no need for the .m suffix)
(d) as prompted, type in the filename of the first word (without the .wav suffix - the script will add that)
(e) as prompted, type in the filename of the second word (without the .wav suffix)
The steps that continua3.m carries out, in brief, are:
(i) calling ahocoder to convert each specified .wav file into a pair of MFCC-encoded files, with the suffixes .f0 and .cc.
(ii) loading the .f0 and .cc files into Octave, and then calculating all the interpolants
(iii) writing the interpolated .f0 and .cc files to your working folder,
(iv) calling ahodecoder to convert each pair of interpolated .f0 and .cc files into an interpolated (morphed) .wav file.
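In outline, each interpolant in step (ii) is a frame-by-frame weighted average of the two encoded words. As a sketch (assuming simple linear interpolation, which may differ in detail from what continua3.m actually computes), if the parameter tracks of the two words are x_A(t) and x_B(t) and N interpolants are requested, the k-th interpolant is
x_k(t) = (1 − k/(N+1))·x_A(t) + (k/(N+1))·x_B(t), for k = 1, …, N,
applied to both the .f0 and .cc values at each frame time t.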
Troubleshooting: what could possibly go wrong?
The most common problems experienced by students and other researchers in using the continua3.m script are:
- Files are not in the directory where Octave expects to find them (check step 5b)
- You don't have the right permissions (read+write access) for the directory. This is less of a problem on personal laptops etc, but can arise in a lab setting where the file permissions may be set by local system administrators.
- Files are inadvertently mis-named (e.g. spelling mistakes, spurious punctuation etc); remember that you don't need to include .wav in the prompted file names, as the script adds them (5d and 5e)
- .wav files are not in the correct format for ahocoder; they must be 16-bit, 1 channel, at a sampling rate of 16,000 samples per second (16 kHz). Anything else won't work.
- .wav files differ in overall duration. The script checks for this, and if they are unequal it trims the longer sound to the same length as the shorter; this may result in some of the interpolant outputs sounding as if they end too soon (missing sounds at the end of the file).