Umlaut on top of letter is counted as two letters?

Koen_Rummens · November 29, 2017, 8:33pm

Dear,

I coded a reading experiment, in which words are presented one by one on a screen.
These words are drawn from .txt files, containing German words that are divided into syllables.
The division between syllables is done by means of ‘-’.
So the .txt file contains words like this:
zwei-ter
Die-ner
künst-li-che
Würst-chen

The aim of the program is to correctly show the word, so without the ‘-’:
zweiter Diener etc.

When a word does not contain an Umlaut everything goes fine, using the following code:

# This ensures proper placement of the syllable on the screen
padded_syllable = ' ' * letter_count + syllable + ' ' * ( len( word ) - letter_count - len( syllable ) )

However, the spacing between word parts goes wrong when a word contains a letter with an umlaut: that letter seems to be counted as two.
“künst-li-che” is then shown as “künst li che”, when it should be “künstliche”.

Is there a way to count a letter with Umlaut as one letter only?

Best regards,
Koen

Michael · November 29, 2017, 8:55pm

Hi Koen,

PsychoPy is still based on Python 2, which doesn’t handle unicode text as well as Python 3 (which we’re in the slow process of upgrading to). What you should probably do here is declare your string variables (like word) to be unicode explicitly.

e.g.

a = 'u' # string of length 1 byte
b = 'ü' # string of length 2 bytes
c = u'ü' # unicode object, of length 1 character

Above, a and b are regular (byte) strings. len(a) returns 1 and len(b) returns 2, because this reflects the number of bytes required to store the UTF-8 representations of these characters. len(c) returns 1, because prefixing the string with a u makes it a unicode string object. These are clever enough to know that you’re not really interested in how many bytes they require to be represented, but in how many characters they contain.
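
To make the byte/character difference concrete, here is a minimal sketch. It is written so that it should behave the same under Python 2 and Python 3, which is an assumption that goes beyond the Python 2 setup discussed here: the unicode object is encoded to UTF-8 explicitly rather than relying on a plain string literal. The names text and raw are purely illustrative.

```python
# Illustrative names only; u'\u00fc' is the single character 'ü'.
text = u'\u00fc'             # unicode object holding one character
raw = text.encode('utf-8')   # the same character as UTF-8 bytes

char_count = len(text)       # counts characters
byte_count = len(raw)        # counts bytes

print(char_count, byte_count)
```

The second number is what len() reports for a regular Python 2 string holding 'ü', which is where the extra spaces in the padding come from.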

Also, you might find the .split('-') method useful to break up your strings into lists of syllables, which can be easier to work with. e.g.

word = u'künst-li-che'
syllables = word.split('-')
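
Putting the two suggestions together, a rough sketch of the padding could look like the following. The helper name pad_syllables is hypothetical (it is not part of PsychoPy or of Koen's script), and it assumes the input is already a unicode object. The padding formula is the one from the original question.

```python
def pad_syllables(hyphenated):
    """Return the full word and each syllable padded to the word's width.

    Hypothetical helper; assumes `hyphenated` is unicode text such as
    u'zwei-ter', with syllables separated by '-'.
    """
    syllables = hyphenated.split(u'-')
    word = u''.join(syllables)      # the word without hyphens
    padded = []
    letter_count = 0                # characters already placed to the left
    for syllable in syllables:
        trailing = len(word) - letter_count - len(syllable)
        padded.append(u' ' * letter_count + syllable + u' ' * trailing)
        letter_count += len(syllable)
    return word, padded


# u'k\u00fcnst-li-che' is 'künst-li-che'
word, padded = pad_syllables(u'k\u00fcnst-li-che')
for p in padded:
    print(repr(p))
```

Because len() counts characters here, every padded syllable has the same width as the whole word, with or without an umlaut.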

Koen_Rummens · November 29, 2017, 9:01pm

Hi Michael,

Thank you very much for your answer.
So, saving the .txt file where I draw the words from in UTF-8 format is not sufficient?

Best regards,
Koen

Michael · November 29, 2017, 9:31pm

No. Python is clearly reading your UTF-8 text file correctly. The issue is with the old-fashioned string objects that define their length in terms of bytes rather than characters. In UTF-8, ASCII characters take 1 byte each to represent. Characters outside that set of 128 (such as ü) require more than one byte.

I’m not sure how you’re reading your data in originally, but if you need to convert the result from strings to unicode objects, you can do this:

s = 'künst-li-che' # let's say your initial file import gives you a string object
type(s) # str (a byte string)
len(s) # 13 bytes

t = unicode(s, 'utf-8')
type(t) # unicode
len(t) # 12 characters

In Python 3, this will all go away (all strings will be unicode), but for the time being, we need to do this sort of dance.
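
An alternative that avoids converting afterwards is to decode the file while reading it. This is only a sketch: the function name read_words and the one-word-per-line layout are assumptions, since the thread does not show how the file is actually read. io.open accepts an encoding argument in both Python 2 and Python 3 and returns unicode text.

```python
import io


def read_words(path):
    """Read one hyphenated word per line, decoded from UTF-8.

    Hypothetical helper; assumes a UTF-8 text file with one word per
    line, e.g. 'zwei-ter'. Blank lines are skipped.
    """
    with io.open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]
```

Each returned word is then a unicode object, so len() counts characters rather than bytes.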
