Umlaut on top of letter is counted as two letters?

No. Python is reading your UTF-8 text file correctly. The issue is that the old-fashioned (Python 2) string objects define their length in bytes rather than characters. In UTF-8, each of the 128 ASCII characters takes 1 byte to represent. Characters outside that set require more than one byte, so ü is stored as 2 bytes.
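To make the byte-versus-character difference concrete, here is a small sketch that counts the UTF-8 bytes behind each character of the example word. The variable names are just for illustration, and the ü is written as the escape `\u00fc` on the assumption that your file uses the single precomposed character rather than a "u" followed by a combining mark.

```python
# -*- coding: utf-8 -*-
# The word from the question, with ü spelled as an escape so the
# code point is unambiguous (assumption: precomposed U+00FC).
word = u'k\u00fcnst-li-che'

# Pair each character with the number of bytes UTF-8 needs for it.
byte_widths = [(ch, len(ch.encode('utf-8'))) for ch in word]

char_count = len(word)                   # counts characters
byte_count = len(word.encode('utf-8'))   # counts encoded bytes

# Characters that need more than one byte in UTF-8.
multibyte = [ch for ch, width in byte_widths if width > 1]

print(char_count, byte_count, multibyte)
```

Only the ü lands in `multibyte`, which accounts for the one extra "letter" you are seeing.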

I’m not sure how you’re reading your data in originally, but if you need to convert the result from strings to unicode objects, you can do this:

s = 'künst-li-che' # let's say your initial file import gives you a string object
type(s) # str (a byte string in Python 2)
len(s) # 13 bytes

t = unicode(s, 'utf-8')
type(t) # unicode
len(t) # 12 characters

In Python 3, this will all go away (all strings will be unicode), but for the time being, we need to do this sort of dance.
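For comparison, a rough sketch of how the same thing looks under Python 3, assuming your data arrives as raw UTF-8 bytes (the `raw` literal below is a stand-in for whatever your file import gives you):

```python
# UTF-8 bytes as they would sit in the file; \xc3\xbc encodes ü.
raw = b'k\xc3\xbcnst-li-che'

# Decoding turns the bytes into an ordinary str, which is Unicode in Python 3.
text = raw.decode('utf-8')

byte_len = len(raw)    # length in bytes
char_len = len(text)   # length in characters

print(type(raw).__name__, byte_len)
print(type(text).__name__, char_len)
```

If you open the file in text mode with an explicit encoding, Python 3 does the decoding for you and you only ever see the `str` side.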
