Ian Ozsvald: Cleaning Confused Collections of Characters

The world is a messy place, and trying to make sense of it can be quite demanding for a program, or for the programmer writing that program. If you’re trying to make sense of documents such as Word files or PDFs, it’s particularly difficult to extract useful meaning. Words that appear next to each other in the rendered output might not be adjacent in the file, character encodings might be missing or declared incorrectly, and poorly standardized things such as measurements and dates can also cause trouble. For this reason, any serious data analysis starts with cleaning up the data source and turning it into something that can be handled reliably. In this talk, Ian Ozsvald describes some of the Python programming techniques he uses in his job to clean data, so that he can then manipulate and work with it.
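To give a flavor of the problems mentioned above, here is a minimal sketch using only the Python standard library. It is not taken from the talk: the function names (`decode_bytes`, `normalize_text`, `parse_date`), the list of candidate encodings, and the list of date formats are all assumptions chosen for illustration, and real data would likely need a longer list of both.

```python
import re
import unicodedata
from datetime import datetime

# Assumed candidate encodings, tried in order; adjust for your own data.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")

# Assumed date layouts; day-first is a guess and may be wrong for US data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")


def decode_bytes(raw, encodings=CANDIDATE_ENCODINGS):
    """Decode bytes with the first candidate encoding that does not fail."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never raises,
    # although the result may not be what the author intended.
    return raw.decode("latin-1")


def normalize_text(text):
    """Fold compatibility characters (ligatures, non-breaking spaces)
    and collapse runs of whitespace into single spaces."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def parse_date(value):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

For example, `parse_date("03/02/2014")` is read day-first as 3 February 2014 under the assumed formats, while a string that matches none of them returns `None` so the caller can decide what to do with it rather than silently guessing.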
