Category Archives: Data science

[Video 168] Joris Van den Bossche: Introduction to Pandas

Pandas is a Python library for reading and manipulating structured data, and it is quickly becoming a standard among data scientists who need to clean and analyze data. Indeed, that’s what many data scientists (and other analysts) spend a great deal of time doing: they take dirty data sets and clean them. Then, after cleaning, the data has to be manipulated. Finally, the results need to be displayed for others to see and use. Pandas makes all of these tasks not only fairly easy but also efficient, thanks in no small part to its use of NumPy arrays.
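To make the read, clean, manipulate, and display cycle concrete, here is a minimal sketch. It is not taken from the talk: the sample data, the column names, and the helper functions load_and_clean and mean_temperature_by_city are all hypothetical, chosen only to illustrate the workflow.

```python
import io

import pandas as pd

# Hypothetical raw data with typical problems: stray whitespace,
# inconsistent capitalization, and a missing measurement.
RAW_CSV = """city,temperature_c
 Ghent ,12.5
ghent,13.5
Antwerp,
Antwerp,15.0
"""


def load_and_clean(csv_text):
    """Read CSV text into a DataFrame and tidy it up."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Normalize the city names so that " Ghent " and "ghent" match.
    df["city"] = df["city"].str.strip().str.title()
    # Drop rows that have no temperature reading.
    return df.dropna(subset=["temperature_c"])


def mean_temperature_by_city(df):
    """Manipulate the cleaned data: average temperature per city."""
    return df.groupby("city")["temperature_c"].mean()


cleaned = load_and_clean(RAW_CSV)
summary = mean_temperature_by_city(cleaned)
# Display the result for others to see.
print(summary)
```

Each step maps onto one stage from the paragraph above: read_csv loads the data, the string methods and dropna clean it, groupby manipulates it, and print displays it.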

In this talk by researcher Joris Van den Bossche, we’re introduced to Pandas, learning about its functionality but also where and how to use it. If you’re a data scientist, or experimenting with such manipulations, then this talk will help you to understand Pandas from the perspective of someone who uses it every day.

Slides for the talk are at

Ian Ozsvald: Cleaning Confused Collections of Characters

The world is a messy place, and trying to make sense of it can be quite demanding for a program, or for the programmer writing that program. If you’re trying to make sense of documents, such as Word files or PDFs, then it’s particularly difficult to extract useful meaning. Adjacent words in the final output might not really be adjacent in the file, character encodings might not be set correctly, and poorly standardized things such as measurements and dates can also cause trouble. For this reason, any sort of serious data analysis starts with cleaning up the data source, turning it into something that can be handled reasonably. In this talk, Ian Ozsvald describes some of the Python programming techniques he employs in his job to clean data, so that he can then manipulate and work with it.
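As a rough illustration of the kinds of problems listed above (encodings, stray characters, dates, and measurements), here is a small sketch that uses only the Python standard library. It does not come from the talk: the function names, the fallback encoding, the list of date formats, and the supported units are all assumptions made for the example.

```python
import re
import unicodedata
from datetime import date, datetime

# Assumed date layouts; real data will need its own list, and
# day/month order must be decided per source.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

# Assumed units, each mapped to its length in millimetres.
UNIT_TO_MM = {"mm": 1, "cm": 10, "m": 1000}


def decode_bytes(raw):
    """Try UTF-8 first, then fall back to Windows-1252 (an assumption)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def normalize_text(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def parse_date(text):
    """Return a date for the first matching format, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def to_millimetres(text):
    """Convert strings such as '2 m', '35mm' or '12 cm' to millimetres."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(mm|cm|m)\s*", text.lower())
    if match is None:
        return None
    return float(match.group(1)) * UNIT_TO_MM[match.group(2)]
```

For example, normalize_text turns a ligature and a non-breaking space into plain characters, and parse_date returns None rather than guessing when no format matches, so that unparsed values stay visible for later review.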

Adam Shook: Hadoop Basics for Big Data Rookies

Hadoop is an open-source framework for storing and analyzing big data. How does it work? And (perhaps most importantly) what are some of the tools that are now included in the Hadoop ecosystem, which allow us to analyze data in new and different ways? In this talk, Hadoop expert Adam Shook introduces the entire Hadoop ecosystem, demonstrating simple (but telling) examples of how and where to use Hadoop (and related tools) in your applications.

Reynold Xin and Aaron Davidson: Mining Big Data with Apache Spark

“Big data” is big, and tools to work with it are also big, in that they’re both numerous and growing in popularity and sophistication. One of the latest technologies aimed at making big data accessible and easier to analyze is Apache Spark, which operates in memory, is highly parallel, works with a number of programming languages (including Java, Scala, and Python), uses a variety of back ends, and can be queried using familiar tools such as SQL. In this talk, Spark developers Reynold Xin and Aaron Davidson introduce Spark, and describe what it can do for you and your organization.

Michael Stonebraker: Big Data is (at least) Three Different Problems

Michael Stonebraker is known for many advances in the world of databases: he created Ingres, Postgres (which was later used as the basis for PostgreSQL), and even more recently SciDB, VoltDB, and Tamr. Last week, he received the Turing Award from the ACM for his contributions to the world of databases. In this talk, Stonebraker examines the term “big data,” and what it really means for people implementing and using databases.


Ville Tuulos: A Billion Rows per Second — Metaprogramming Python for Big Data

Python is used in a wide variety of settings. Somewhat surprisingly, it’s becoming particularly popular for “big data” analysis, looking for trends and correlations in large data sets. In this talk, Ville Tuulos describes his company’s uses of Python in big-data analysis. He compares different techniques and technologies that can be applied to big data, and talks about how (and why) his company uses Python, which is often seen as a slow language that is not appropriate for such large data sets.