Umlaute in input file

Ria12 · May 28, 2017, 3:08pm

Hi everyone,

I’m trying to code an experiment for German speakers, where stimuli are read in from a .csv file, which contains special characters like äöü and ß. However, despite reading extensively about inputting this type of text, I simply can’t figure out how to do it.

My input file is this one:
data.csv (82 Bytes)

And it looks something like this:

1	Auffahrunfall	1
2	Überholen	1
3	Balkon	1
4	Traktor	1

My current attempt at reading in this file is the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
f = codecs.open('data.csv', encoding='utf-8')
for line in f:
    print str(line)

However, the program gets only as far as outputting

1,Auffahrunfall,1

and then gives me the error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 2: ordinal not in range(128)

I don’t understand why an ascii codec is at work here. After all, I’m using a utf-8 encoded input file, and I’m asking PsychoPy to use utf-8 encoding when reading the file.

Moreover, I can’t seem to find a solution to the problem. I thought perhaps adding

errors='ignore'

to the codecs.open() command would help, but it doesn’t.

I would be grateful if you could push me in the right direction on this one…

Thank you!

jon · May 30, 2017, 10:08am

I think the problem is in the print statement. Just avoid doing that (or alternatively you could run the script from the terminal and it should work). The problem is caused by a bug in the library that we used to build the app (wx), specifically that the output window didn’t handle the unicode character correctly. It looks like that bug is fixed in the latest version of wx, at least on my mac, so I think this will be fixed for you in 1.85.2

Out of interest, what platform are you using?

thanks,
Jon

hubertv · June 8, 2017, 11:57am

The .csv file might not be encoded as utf-8.

If you are using Windows, try opening the .csv file in Notepad and saving a copy of it with UTF-8 encoding.
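One way to test that suspicion without leaving Python is to read the raw bytes and try to decode them. This is only a minimal sketch: the function name decodes_as_utf8 is made up for illustration, and a file that passes is not guaranteed to be UTF-8 (a plain ASCII file passes too), but a file that fails is definitely not UTF-8.

```python
import io


def decodes_as_utf8(path):
    """Return True if the bytes in the file can be decoded as UTF-8.

    Hypothetical helper, not part of PsychoPy or the standard library.
    """
    # Read the file as raw bytes, so no decoding happens yet.
    with io.open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # At least one byte sequence is not valid UTF-8,
        # e.g. a file saved as Latin-1 / Windows-1252.
        return False
    return True
```

Calling decodes_as_utf8('data.csv') before the experiment starts would show whether the file needs to be re-saved.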

Ria12 · June 11, 2017, 2:32pm

Jon, thank you so much!

Indeed I got completely stuck on the print command. Without that, everything runs fine.

I apologize for taking a while to answer. And I’m running 1.84.2 on my mac.

Thanks again, you’re a life saver!

daniel.riggs1 · June 13, 2017, 4:34pm

Sorry to butt in, but I think the problem is actually caused by calling str().

Python 2 has some design mistakes in how it treats strings and unicode characters. If I’m remembering correctly, a string in Python 2 is in fact a byte sequence in disguise. A “unicode” object is what you actually need when dealing with fun characters, not a string. If you were to change:

str(line)

to:

unicode(line)

You shouldn’t have a problem (unless that bug @jon was talking about comes up).
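To see where the “ascii” in the error message comes from: in Python 2, calling str() on a unicode object implicitly encodes it with the default ASCII codec. The sketch below makes that encoding step explicit, so it behaves the same way in Python 2 and Python 3. The sample line is taken from the output format shown in the question, and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-

# One line of the stimulus file, already decoded to text.
text = u"2,\u00dcberholen,1"  # "2,Überholen,1"

# Encoding to ASCII is what Python 2's str() does behind the scenes.
try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    # The capital U-umlaut at index 2 has no ASCII representation.
    error_message = str(exc)

# Encoding explicitly with UTF-8 works, because UTF-8 covers every character.
encoded = text.encode("utf-8")
```

Here error_message reports the same character and position as the traceback in the question, while the explicit UTF-8 encoding succeeds.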

Thinking about it further, I don’t think you need to do any conversion. When you call this:

f = codecs.open("data.csv", encoding="utf-8")

The objects in f are already unicode objects! So just do this:

print(line)

But as long as I’m not minding my own business, I might recommend a slightly different way of reading the file, for a couple of reasons. I would usually do this:

with codecs.open('data.csv', encoding='utf-8') as f:
    data = f.read().splitlines()
    # read() returns a unicode object,
    # splitlines() splits this object
    #    into a list of unicode objects 
    #    split by line ending characters

for dl in data:
    # print each line, not the whole list
    print(dl)

One reason is that the “with” statement will take care of closing the file for you: if you forget to call f.close() after your code above you could end up corrupting your files, I think … But as a general rule you want to close the files you open, and the with statement does this automatically for us.

The second reason is that if you iterate over the file object (here “f”), you won’t be able to iterate over it again (I think technically, because you’re working with an “iterator”, rather than an “iterable”). Run this:

f = codecs.open('data.csv', encoding='utf-8')
for line in f:
    print(line)
for line in f:
    # this won't print anything!
    print(line)

If you iterate twice over the list we created with .splitlines(), you won’t run into this problem.
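The difference can be checked without a file on disk. The sketch below uses io.StringIO as a stand-in for the opened file (an assumption made to keep the example self-contained). It also shows that seek(0) rewinds the stream if a second pass really is needed.

```python
import io

# In-memory stand-in for the decoded file.
f = io.StringIO(u"1,Auffahrunfall,1\n2,\u00dcberholen,1\n")

first_pass = [line for line in f]   # consumes the stream
second_pass = [line for line in f]  # nothing left to read

# Rewinding to the start makes the lines available again.
f.seek(0)
third_pass = [line for line in f]

# A list built with splitlines() can be looped over as often as needed.
rows = f.getvalue().splitlines()
count_a = sum(1 for row in rows)
count_b = sum(1 for row in rows)
```

Note that splitlines() drops the line endings, whereas iterating over the file object keeps them.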
