Umlaut on top of letter is counted as two letters?

Dear,

I coded a reading experiment, in which words are presented one by one on a screen.
These words are drawn from .txt files, containing German words that are divided into syllables.
The division between syllables is done by means of ‘-’.
So the .txt file contains words like this:
zwei-ter
Die-ner
künst-li-che
Würst-chen

The aim of the program is to correctly show the word, so without the ‘-’:
zweiter Diener etc.

When a word does not contain an Umlaut everything goes fine, using the following code:

# This ensures proper placement of the syllable on the screen
padded_syllable = ' ' * letter_count + syllable + ' ' * ( len( word ) - letter_count - len( syllable ) )

However, the spacing between word parts goes wrong when a word contains a letter with an umlaut: that letter seems to be counted as two.
“künst-li-che” is then shown as “künst li che”, where it should be “künstliche”.

Is there a way to count a letter with Umlaut as one letter only?

Best regards,
Koen

Hi Koen,

PsychoPy is still based on Python 2, which doesn’t handle unicode text as well as Python 3 (which we’re in the slow process of upgrading to). What you should probably do here is declare your string variables (like word) to be unicode explicitly.

e.g.

a = 'u' # string of length 1 byte
b = 'ü' # string of length 2 bytes
c = u'ü' # unicode object, of length 1 character

Above, a and b are regular strings. len(a) returns 1 and len(b) returns 2, because the length of a regular string is the number of bytes required to store it, and in UTF-8 the character ü takes two bytes. len(c) returns 1, because prefixing the string with a u makes it a unicode string object. These are clever enough to know that you’re not really interested in how many bytes they require to be represented, but in how many characters they contain.

Also, you might find the .split('-') method useful to break up your strings into lists of syllables, which can be easier to work with. e.g.

word = u'künst-li-che'
syllables = word.split('-')

Hi Michael,

Thank you very much for your answer.
So, saving the .txt file where I draw the words from in UTF-8 format is not sufficient?

Best regards,
Koen

No. Python is clearly reading your UTF-8 text file correctly. The issue is with the old-fashioned string objects, which define their length in terms of bytes rather than characters. In UTF-8, each of the 128 ASCII characters takes 1 byte to represent. Characters outside that set require more than one byte.
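
To see the difference between characters and bytes for yourself, a small sketch like the following may help. It encodes a few sample characters (chosen here purely for illustration) to UTF-8 and compares the two lengths. The u'' literals are unicode text in both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Map each sample character to the number of bytes UTF-8 needs for it.
byte_lengths = {}
for char in [u'u', u'ü', u'ß', u'€']:
    encoded = char.encode('utf-8')  # the bytes as stored in a UTF-8 file
    byte_lengths[char] = len(encoded)
    print(repr(char), 'characters:', len(char), 'bytes:', len(encoded))
```

Every one of these is a single character, but only the plain ASCII u fits into a single byte.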

I’m not sure how you’re reading your data in originally, but if you need to convert the result from strings to unicode objects, you can do this:

s = 'künst-li-che' # let's say your initial file import gives you a string object
type(s) # string
len(s) # 13 bytes

t = unicode(s, 'utf-8')
type(t) # unicode
len(t) # 12 characters
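
If the words come straight from a UTF-8 .txt file, another option that may save the conversion step is to let Python decode the text while reading, using io.open with an explicit encoding. This is only a sketch: the file name words.txt and its contents are made up so that the example is self-contained, and your own import code may look quite different.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile

# Write a small UTF-8 word list so the sketch can run on its own.
# In the experiment, this would be the existing .txt file instead.
path = os.path.join(tempfile.mkdtemp(), 'words.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'zwei-ter\nkünst-li-che\nWürst-chen\n')

# io.open decodes while reading, so every line is already unicode text.
with io.open(path, 'r', encoding='utf-8') as f:
    words = [line.strip() for line in f if line.strip()]

# Character counts of the words once the hyphens are removed.
lengths = [len(word.replace(u'-', u'')) for word in words]
for word, n in zip(words, lengths):
    print(repr(word), n)
```

Because io.open behaves the same way in Python 2 and Python 3, code written like this should not need to change after the upgrade.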

In Python 3, this will all go away (all strings will be unicode), but for the time being, we need to do this sort of dance.