Umlaut on top of letter is counted as two letters?

Michael · November 29, 2017, 9:31pm

No. Python is clearly reading your UTF-8 text file correctly. The issue is with the old-fashioned string objects in Python 2, which define their length in terms of bytes rather than characters. In UTF-8, each of the 128 ASCII characters takes 1 byte to represent. Characters outside that set, such as ü, require more than one byte.
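To make the byte counts concrete, here is a small sketch that lists how many UTF-8 bytes each character of the example word needs. It is written in Python 3 syntax (unlike the Python 2 snippet in this post), and the helper name `utf8_widths` is made up for illustration.

```python
# Illustrative only: count how many UTF-8 bytes each character needs.
# "\u00fc" is a precomposed ü, so the word is 'künst-li-che'.
word = "k\u00fcnst-li-che"


def utf8_widths(text):
    """Return (character, byte count) pairs for the UTF-8 encoding of text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


widths = utf8_widths(word)

# Characters that need more than one byte in UTF-8 (here only the ü).
multibyte = [ch for ch, n in widths if n > 1]

# Total size of the word in bytes once encoded.
total_bytes = sum(n for _, n in widths)
```

Every ASCII character contributes one byte and the ü contributes two, which is why the byte length is one more than the number of characters.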

I’m not sure how you’re reading your data in originally, but if you need to convert the result from strings to unicode objects, you can do this:

s = 'künst-li-che'  # let's say your initial file import gives you a byte string (str)
type(s)  # str
len(s)   # 13 (bytes)

t = unicode(s, 'utf-8')  # decode the bytes as UTF-8
type(t)  # unicode
len(t)   # 12 (characters)

In Python 3, this will all go away (all strings will be Unicode), but for the time being, we need to do this sort of dance.
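For comparison, a minimal sketch of how the same situation looks in Python 3, assuming the data arrives as UTF-8 encoded bytes. The variable names are illustrative only. There, `bytes` and `str` are separate types, and `len()` on a `str` counts characters rather than bytes.

```python
# Illustrative only: the raw bytes as they would sit in a UTF-8 file.
raw = "k\u00fcnst-li-che".encode("utf-8")  # 'künst-li-che' as bytes

# Python 3 counterpart of unicode(s, 'utf-8') in Python 2.
text = raw.decode("utf-8")

byte_count = len(raw)   # length in bytes
char_count = len(text)  # length in characters
```

If the text comes from a file, opening it in text mode with `open(path, encoding='utf-8')` should hand back `str` objects directly, so the explicit decode step is usually unnecessary.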
