YahooGroup Archive

From: John Burrows <burrows@...>
Date: 2009-02-14 15:11:02 #
Subject: Re: Naming dot

I know the problem from work preparing texts for transliteration and
translation. An ad hoc solution is to link all the words in a name or
title by underline characters, using a single namer dot to initialize
the string.
.mister_dzhon_barouz_v_youngsheila (Mr. John Burrows, Ljungskile)

From: "ed_shapard" <ed_shapard@...>
Date: 2009-02-23 00:02:09 #
Subject: Re: Simple transliterator

Thomas,

It's always good to see more people working on computerized
transliteration.

It looks like there's a problem in your transliterations relating to
stressed vowel sounds; especially ligatures.

Take, for example, your transliteration of "fire", which you
transliterate as Fee-Ice-ERR. In my opinion, this should be
Fee-Ice-ARray. ERR and ARray have the same sound, but ERR is used for
stressed vowel sounds. Similarly, Up and Ado have the same sound, but
Up is used for stressed vowel sounds.

Figuring out the "difference" between these sounds took me a very long
time when I first started learning, but through reading discussions in
this group and verifying that this is consistent with the spellings in
Androcles, this is the conclusion I've come to. Unfortunately, this
seems to be the worst documented feature of the alphabet.

Stress also creates complications with words that have one spelling,
but two pronunciations. e.g. convict (noun vs verb). Your
transliteration program gives me the noun form, but not the verb form.
In my opinion, this is the biggest hurdle to automated
transliteration. Words like this have to be identified as noun or verb
based on context before you can transliterate them.

That's beyond my ability to program, so in transliterating The Wizard
of Oz (in the files section), I just transliterated all such words by
hand.

Other web-based transliteration programs have gotten around this
problem by having multiple entries for these words. Here's an example:
http://www.saytheword.org.uk/shavian/phpghotifilleter/index.php

--- In shawalphabet@yahoogroups.com, "Thomas Thurman" <tthurman@...>
wrote:
>
> Not going public with this yet, but it's fun to play with:
>
> http://marnanel.org/shavian/transliterate
>
> Let me know if you find any glaring mistakes, or anything really.
>

From: Philip Newton <philip.newton@...>
Date: 2009-02-23 06:07:00 #
Subject: Re: [shawalphabet] Re: Simple transliterator

2009/2/23 ed_shapard <ed_shapard@...>:
> It looks like there's a problem in your transliterations relating to
> stressed vowel sounds; especially ligatures.

That's a result of the electronic pronouncing dictionary that's at the
core of the program -- unfortunately, there only seem to be two big,
freely available electronic pronouncing dictionaries for the English
language, and neither of them uses the accent used in "Androcles".

Compiling a pronouncing dictionary that's comprehensive enough to be
useful (more than just a couple of thousand entries, I'd say) takes a
lot of time and effort, which is why it's easiest to go with what's
already there.

If you (or anyone else in the group) want to cooperate and produce a
pronouncing dictionary that reflects "standard" Shavian pronunciation
(whatever we agree that to be), I'm sure many people would be very
happy. (The format could be as simple as a text file with English and
Shavian spellings on the same line, separated by a tab or something.)
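
A minimal sketch of how such a tab-separated file might be read, assuming one English spelling and one Shavian spelling per line. The function name `load_dictionary`, the `#` comment convention, and the sample entry are hypothetical; the Shavian side is written with letter names (as elsewhere in this thread) rather than actual Shavian characters.

```python
import io


def load_dictionary(lines):
    """Parse 'english<TAB>shavian' lines into {english: [shavian, ...]}."""
    entries = {}
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and comment lines (the '#' convention is an assumption).
        if not line.strip() or line.startswith("#"):
            continue
        english, sep, shavian = line.partition("\t")
        if not sep:
            continue  # no tab on this line, so it is not a valid entry
        # Keep a list per word so that a spelling with several
        # pronunciations can carry more than one entry.
        entries.setdefault(english.lower(), []).append(shavian)
    return entries


# Hypothetical one-line file, using the spelling of "fire" suggested earlier.
sample = io.StringIO("fire\tFee-Ice-ARray\n")
print(load_dictionary(sample))
```

Storing a list per word is one possible design; it leaves room for multiple entries for words like "convict" without changing the file format.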

> Stress also creates complications with words that have one spelling,
> but two pronunciations. e.g. convict (noun vs verb). Your
> transliteration program gives me the noun form, but not the verb form.
> In my opinion, this is the biggest hurdle to automated
> transliteration. Words like this have to be identified as noun or verb
> based on context before you can transliterate them.

This, of course, would still remain a problem. Or spellings with more
than one pronunciation in general -- see, for example, "read", which
is pronounced differently in the past and present tenses, yet both
would be verb forms, so even part-of-speech recognition wouldn't help
here.

Cheers,
--
Philip Newton <philip.newton@...>

From: Thomas Thurman <tthurman@...>
Date: 2009-02-23 12:02:37 #
Subject: Re: [shawalphabet] Re: Simple transliterator

Thank you for your comments, both of you.

2009/2/23 Philip Newton <philip.newton@...>:
> Compiling a pronouncing dictionary that's comprehensive enough to be
> useful (more than just a couple of thousand entries, I'd say) takes a
> lot of time and effort, which is why it's easiest to go with what's
> already there.

It has recently occurred to me that perhaps taking the RP
pronunciations out of en.wiktionary.org might be helpful, at least for
a basic core of words. I'll play with this and see what I come up
with.

>> Stress also creates complications with words that have one spelling,
>> but two pronunciations. e.g. convict (noun vs verb).
>
> This, of course, would still remain a problem. Or spellings with more
> than one pronunciation in general -- see, for example, "read", which
> is pronounced differently in the past and present tenses, yet both
> would be verb forms, so even part-of-speech recognition wouldn't help
> here.

To some extent I could deal with the convict/convict problem using a
simple POS tagger; there are at least a couple in CPAN already.
Lingua::EN::Tagger copes admirably with read/read:

'<prp>I</prp> <in>like</in> <to>to</to> <vb>read</vb> <nns>books</nns>';
'<prp>I</prp> <vbd>read</vbd> <det>a</det> <nn>book</nn>';

but doesn't know convict/convict:

'<prp>He</prp> <vbd>was</vbd> <det>a</det> <vb>convict</vb>';
'<det>The</det> <nn>judge</nn> <md>will</md> <vb>convict</vb> <prp>him</prp>';

although presumably if it can do read/read it wouldn't be terribly
hard to teach it.

The Brill tagger fares much better:

He/PRP was/VBD a/DT convict/NN
The/DT judge/NN will/MD convict/VB him/PRP
I/PRP like/VB to/TO read/VB books/NNS
I/PRP read/VBP a/DT book/NN

As to the pronunciation side, Moby knows the convict/convict distinction:

convict/n 'k/A/nv/I/kt
convict/v ',k/A/n'v/I/kt

although it doesn't give variations for read/read. Again, I think
we'd have to put them in separately.

So, I think a lot of the tools we need are already there.
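
As a rough illustration of how the two pieces could fit together, here is a sketch in Python (not the Perl tools named above) that reads Brill-style `word/TAG` output and picks a pronunciation by part of speech. The table holds only the two Moby entries quoted above; the names `HETERONYMS`, `coarse_pos` and `pronounce_tagged`, and the collapsing of Penn Treebank tags to n/v, are assumptions made for the example.

```python
# Pronunciations keyed by (word, coarse part of speech), copied from the
# Moby entries quoted above.
HETERONYMS = {
    ("convict", "n"): "'k/A/nv/I/kt",
    ("convict", "v"): "',k/A/n'v/I/kt",
}


def coarse_pos(tag):
    """Collapse a Penn Treebank tag to 'n', 'v', or None (assumed mapping)."""
    if tag.startswith("NN"):
        return "n"
    if tag.startswith("VB"):
        return "v"
    return None


def pronounce_tagged(tagged_sentence):
    """Return (word, pronunciation or None) pairs for 'word/TAG' tokens."""
    result = []
    for token in tagged_sentence.split():
        # Split on the last slash so the tag is always the final field.
        word, _, tag = token.rpartition("/")
        key = (word.lower(), coarse_pos(tag))
        result.append((word, HETERONYMS.get(key)))
    return result


print(pronounce_tagged("He/PRP was/VBD a/DT convict/NN"))
print(pronounce_tagged("The/DT judge/NN will/MD convict/VB him/PRP"))
```

Words not in the table come back with `None` and would fall through to the ordinary lexicon. A pair like read/read would need the finer tags (VB, VBP, VBD) kept apart rather than this coarse n/v split.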

Thomas

From: Thomas Thurman <tthurman@...>
Date: 2009-02-23 12:04:29 #
Subject: Re: [shawalphabet] Re: Simple transliterator

2009/2/22 ed_shapard <ed_shapard@...>:
> ERR and ARray have the same sound, but ERR is used for
> stressed vowel sounds. Similarly, Up and Ado have the same sound, but
> Up is used for stressed vowel sounds.

I'm interested to know whether this is a consistent and accepted rule,
and whether all the vowel sounds have stressed/unstressed versions.
If so I can just use stress information from the pronunciation
dictionary to make this distinction.

Thomas

From: Philip Newton <philip.newton@...>
Date: 2009-02-23 13:00:52 #
Subject: Re: [shawalphabet] Re: Simple transliterator

On Mon, Feb 23, 2009 at 13:04, Thomas Thurman <tthurman@...> wrote:
> 2009/2/22 ed_shapard <ed_shapard@...>:
>> ERR and ARray have the same sound, but ERR is used for
>> stressed vowel sounds. Similarly, Up and Ado have the same sound, but
>> Up is used for stressed vowel sounds.
>
> I'm interested to know whether this is a consistent and accepted rule,
> and whether all the vowel sounds have stressed/unstressed versions.

Those are the only two stressed/unstressed pairs in Shavian that come
to my mind.

Cheers,
--
Philip Newton <philip.newton@...>

From: "ed_shapard" <ed_shapard@...>
Date: 2009-02-26 07:54:31 #
Subject: Re: Simple transliterator

Thomas,

That's great news that you know of some tools to identify parts of
speech! If you can tie that in to your database, it should go a long
way toward automating transliteration.

I think I can find a backup copy of my MySQL database of Shaw
spellings if I look hard enough. I lost it when my hard drive crashed
about a year ago. I used CMUdict to get started, but then corrected by
hand the words that I actually used. What I ended up doing was
creating a list of all the unique words used in The Wizard of Oz, and
then going through them one by one, making corrections and deleting
words with multiple pronunciations. I think I found a list of words
with the same spelling on Wikipedia that was very useful.

Ah... Here it is: http://en.wikipedia.org/wiki/Heteronym_(linguistics)
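
The workflow described above (list the unique words, then set aside the ones with multiple pronunciations) could be sketched roughly as follows. This is not the original database code; the function names and the two-word heteronym set are hypothetical stand-ins for a full list such as the Wikipedia one.

```python
import re


def unique_words(text):
    """Return the sorted set of lowercase words found in text."""
    return sorted(set(re.findall(r"[a-z']+", text.lower())))


def split_heteronyms(words, heteronyms):
    """Separate words that can be looked up from those needing hand work."""
    safe = [w for w in words if w not in heteronyms]
    by_hand = [w for w in words if w in heteronyms]
    return safe, by_hand


HETERONYMS = {"read", "convict"}  # tiny illustrative set, not a real list
words = unique_words("I read that the convict will read.")
print(split_heteronyms(words, HETERONYMS))
```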

Lionel Ghoti's database is pretty good from what I remember. He's had
it for a long time. You might want to go to
http://www.saytheword.org.uk/shavian/phpghotifilleter/index.php and
ask him if you can have a dump of the database. I'll post a dump of
mine if I find it... and get around to it.

Good luck hacking.
-Ed


From: "Thomas Thurman" <tthurman@...>
Date: 2009-02-26 17:05:16 #
Subject: Re: Simple transliterator

--- In shawalphabet@yahoogroups.com, "ed_shapard" <ed_shapard@...> wrote:
> Ah... Here it is: http://en.wikipedia.org/wiki/Heteronym_(linguistics)

That's very, very useful: thank you.

If you still have a copy of the Shavian text which can be aligned with
the Latin-alphabet text of the same document, I can extract a mapping
from that. In fact, given a few such documents we could work up quite
a useful basic dictionary quite quickly.
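
A minimal sketch of that kind of extraction, assuming the two texts have already been split into corresponding lines and that words line up one for one within each line. Lines whose word counts differ are simply skipped. The function name `extract_mapping` is hypothetical, and the Shavian side in the example uses placeholder tokens rather than real Shavian text.

```python
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?\"'()"


def extract_mapping(latin_lines, shavian_lines):
    """Build {latin word: most frequent aligned shavian word}."""
    counts = defaultdict(Counter)
    for latin, shavian in zip(latin_lines, shavian_lines):
        latin_words, shavian_words = latin.split(), shavian.split()
        if len(latin_words) != len(shavian_words):
            continue  # this line pair does not align word for word
        for lw, sw in zip(latin_words, shavian_words):
            key = lw.lower().strip(PUNCTUATION)
            value = sw.strip(PUNCTUATION)
            if key and value:
                counts[key][value] += 1
    # Where a word was seen with several spellings, keep the commonest.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


print(extract_mapping(["The cat.", "the big dog"], ["DHA KAT.", "too-short"]))
```

Taking the most common spelling hides words that have several pronunciations, so words with more than one candidate may be better flagged for review than resolved automatically.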

Thomas

From: "Thomas Thurman" <tthurman@...>
Date: 2009-02-27 16:25:49 #
Subject: Vowels in automated transliteration

I spent some of a long aeroplane trip the other day comparing the
vowels allowed in the Moby and CMUDict lexicons with those used in
Shavian. In each case I looked up the name of the Shavian letter in
the given lexicon, and these are my results:

Shavian   Moby                CMUDict
          code  example       code  example
IF        I     /I/f          IH    IH1 F
EGG       E     /E/g          EH    EH1 G
ASH       &     /&/S/         AE    AE1 SH
ADO       @     /@/'d/u/      AH    AH0 D UW1
ON        A     /A/n          AA    AA1 N
WOOL      U     w/U/l         UH    W UH1 L
OUT       AU    /AU/t         AW    AW1 T
AH        A     /A/           AA    AA1
EAT       i     /i/t          IY    IY1 T
AGE       eI    /eI/dZ/       EY    EY1 JH
ICE       aI    /aI/s         AY    AY1 S
UP        @     /@/p          AH    AH1 P
OAK       oU    /oU/k         OW    OW1 K
OOZE      u     /u/z          UW    UW1 Z
OIL       Oi    /Oi/l         OY    OY1 L
AWE       O     /O/           AA    AA1

Moby merges: Ado/Up, On/Ah (this is the father/bother merger)
CMUDict merges: Ado/Up, On/Ah/Awe

So it would seem that Moby is generally a better choice for a
lexicon, although we still can't get away from the father/bother merger.
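
The merges can be checked mechanically from the table. The sketch below groups the Shavian letter names by the vowel code each lexicon assigns; the data is copied from the table above, while the names `VOWELS` and `merged` are made up for the example. Codes are compared case-sensitively, since Moby uses both `I` and `i`.

```python
from collections import defaultdict

# Shavian letter name -> (Moby code, CMUDict code), from the table above.
VOWELS = {
    "IF": ("I", "IH"), "EGG": ("E", "EH"), "ASH": ("&", "AE"),
    "ADO": ("@", "AH"), "ON": ("A", "AA"), "WOOL": ("U", "UH"),
    "OUT": ("AU", "AW"), "AH": ("A", "AA"), "EAT": ("i", "IY"),
    "AGE": ("eI", "EY"), "ICE": ("aI", "AY"), "UP": ("@", "AH"),
    "OAK": ("oU", "OW"), "OOZE": ("u", "UW"), "OIL": ("Oi", "OY"),
    "AWE": ("O", "AA"),
}


def merged(column):
    """Return groups of letters sharing one code (0 = Moby, 1 = CMUDict)."""
    groups = defaultdict(list)
    for letter, codes in VOWELS.items():
        groups[codes[column]].append(letter)
    return [letters for letters in groups.values() if len(letters) > 1]


print("Moby merges:   ", merged(0))
print("CMUDict merges:", merged(1))
```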

As noted in an earlier thread, Ado/Up can still be distinguished using
stress, which both lexicons mark.
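
A small sketch of that stress test, using CMUDict-style phones where the trailing digit is the stress mark (0 = unstressed). It also covers the Err/Array pair mentioned in the earlier thread, assuming the lexicon's ER phone corresponds to it. The function name is hypothetical, and treating secondary stress (2) as stressed is an assumption, not an agreed rule.

```python
def shavian_vowel_for(phone):
    """Map an AH or ER phone with a stress digit to a Shavian letter name.

    Returns None for any phone this sketch does not handle.
    """
    base, stress = phone[:-1], phone[-1]
    # Assumption: both primary (1) and secondary (2) stress count as stressed.
    stressed = stress in ("1", "2")
    if base == "AH":
        return "UP" if stressed else "ADO"
    if base == "ER":
        return "ERR" if stressed else "ARRAY"
    return None


# The two entries from the table: ADO = AH0 D UW1, UP = AH1 P.
print([shavian_vowel_for(p) for p in "AH0 D UW1".split()])
print([shavian_vowel_for(p) for p in "AH1 P".split()])
```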

Thomas

From: "ed_shapard" <ed_shapard@...>
Date: 2009-03-01 06:50:29 #
Subject: Re: Simple transliterator

--- In shawalphabet@yahoogroups.com, "Thomas Thurman" <tthurman@...>
wrote:
>
> --- In shawalphabet@yahoogroups.com, "ed_shapard" <ed_shapard@> wrote:
> > Ah... Here it is: http://en.wikipedia.org/wiki/Heteronym_(linguistics)
>
> That's very, very useful: thank you.
>
> If you still have a copy of the Shavian text which can be aligned with
> the Latin-alphabet text of the same document, I can extract a mapping
> from that. In fact, given a few such documents we could work up quite
> a useful basic dictionary quite quickly.
>
> Thomas
>

Drat! It looks like I lost the list of words and transliterations I
created. My transliteration is here:
http://f1.grp.yahoofs.com/v1/YCSqSVLmVEH4nOvA3XPHQCB9gF4sRT3cMRPo1ErgwuizZUbzd-P-Wk5UPqPEEHW0rfw19VUWw0dQRN-CRA9U9PI5EvLnKpRihQ/Texts/WizardofOz_ShawAlphabet.pdf

and the original is here: http://www.gutenberg.org/files/55/55.txt

Due to formatting, page numbers, and a couple of words that were
replaced with more modern versions, there isn't a perfect one-to-one
relationship between those two texts. Also, I swap ha-ha with hung and
air with err.
