Category Archives: Data science

[Video 168] Joris Van Den Bossche: Introduction to Pandas

May 8, 2015 reuven

Pandas is a Python library for reading and manipulating structured data, and is quickly becoming a standard among data scientists who need to clean and analyze data. Indeed, that’s what many data scientists (and other analysts) spend a great deal of time doing: They take dirty data sets and clean them. Then, once the data is clean, it has to be manipulated. Finally, the results need to be displayed for others to see and use. Pandas makes all of these tasks not only fairly easy but also efficient, thanks in no small part to its use of NumPy arrays.
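To make that clean, manipulate, display cycle concrete, here is a minimal sketch. It is not taken from the talk: the city and temperature data, the column names, and the "n/a" marker for missing values are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical messy input, invented for this example.
raw = io.StringIO(
    "city,temp_c\n"
    "Paris,21.5\n"
    "paris ,n/a\n"
    "Lyon,19.0\n"
    "Lyon,23.0\n"
)

# Read: treat the string "n/a" as a missing value.
df = pd.read_csv(raw, na_values=["n/a"])

# Clean: normalize the city names, then drop rows with no temperature.
df["city"] = df["city"].str.strip().str.title()
df = df.dropna(subset=["temp_c"])

# Manipulate: average temperature per city.
summary = df.groupby("city")["temp_c"].mean()

# Display the result.
print(summary)
```

The temperature column is stored as a NumPy-backed float column, so the grouping and averaging run over arrays rather than through a Python-level loop.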

In this talk by researcher Joris Van den Bossche, we’re introduced to Pandas, learning not only about its functionality but also where and how to use it. If you’re a data scientist, or are experimenting with such manipulations, then this talk will help you to understand Pandas from the perspective of someone who uses it every day.

Slides for the talk are at http://www.slideshare.net/PoleSystematicParisRegion/track-13-joris-van-den-bossche.

Data science, Python

Ian Ozsvald: Cleaning Confused Collections of Characters

April 26, 2015 reuven

The world is a messy place, and trying to make sense of it can be quite demanding for a program, or for the programmer writing that program. If you’re trying to make sense of documents, such as Word files or PDFs, then it’s particularly difficult to extract useful meaning. Adjacent words in the final output might not really be adjacent in the file, character encodings might not be set correctly, and poorly standardized things such as measurements and dates can also cause trouble. For this reason, any sort of serious data analysis starts with cleaning up the data source, turning it into something that can be handled reasonably. In this talk, Ian Ozsvald describes some of the Python programming techniques he employs in his job to clean data, so that he can then manipulate and work with it.
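The kinds of trouble listed above (wrong encodings, stray whitespace, inconsistent dates) can be illustrated with a small standard-library sketch. These helpers are hypothetical and are not the techniques shown in the talk; the fallback encoding and the list of date formats are assumptions chosen for the example.

```python
import unicodedata
from datetime import date, datetime
from typing import Optional

# Assumed set of date formats; real data would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def decode_bytes(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to Latin-1, which accepts any byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def clean_text(text: str) -> str:
    """Normalize Unicode compatibility forms and collapse whitespace runs."""
    normalized = unicodedata.normalize("NFKC", text)
    return " ".join(normalized.split())


def parse_date(text: str) -> Optional[date]:
    """Return the first successful parse, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None from parse_date, rather than raising, lets the caller count and inspect the values that could not be cleaned instead of stopping at the first bad record.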

Data science, Functional programming, Open source

Adam Shook: Hadoop Basics for Big Data Rookies

April 21, 2015 reuven

This talk introduces Hadoop, the open-source system for storing and analyzing big data. How does it work? And (perhaps most importantly) what are some of the tools that are now included in the Hadoop ecosystem, which allow us to analyze data in new and different ways? In this talk, Hadoop expert Adam Shook introduces the entire Hadoop ecosystem, demonstrating simple (but telling) examples of how and where to use Hadoop (and related tools) in your applications.

Data science, Open source

Reynold Xin and Aaron Davidson: Mining Big Data with Apache Spark

April 18, 2015 reuven

“Big data” is big, and tools to work with it are also big, in that they’re numerous and are growing in popularity and sophistication. One of the latest technologies aimed at making big data accessible and easier to analyze is Apache Spark, which operates in memory, is highly parallel, works with a number of programming languages (including Java, Scala, and Python), uses a variety of back ends, and can be queried using familiar tools such as SQL. In this talk, Spark developers Reynold Xin and Aaron Davidson introduce Spark, and describe what it can do for you and your organization.

Data science, Databases

Michael Stonebraker: Big Data is (at least) Three Different Problems

March 29, 2015 reuven

Michael Stonebraker is known for many advances in the world of databases: He created Ingres, Postgres (which was later used as the basis for PostgreSQL), and even more recently SciDB, VoltDB, and Tamr. He was, last week, awarded the Turing Award by the ACM, for his contributions to the world of databases. In this talk, Stonebraker describes the term “big data,” and what it really means for people implementing and using databases.

Data science, Python

Ville Tuulos: A Billion Rows per Second — Metaprogramming Python for Big Data

March 15, 2015 reuven

Python is used in a wide variety of settings. Somewhat surprisingly, it’s becoming particularly popular for “big data” analysis, looking for trends and correlations in large data sets. In this talk, Ville Tuulos describes his company’s uses of Python in big-data analysis. He compares different techniques and technologies that can be applied to big data, and talks about how (and why) his company uses Python, which is often seen as a slow language that is not appropriate for such large data sets.

Data science, Python, Scientific computing

Taavi Burns on Pandas and logfile analysis

December 10, 2014 reuven

The previous video introduced Pandas. In this relatively short video, Taavi Burns shows us how we can use Pandas for a specific task, namely the analysis of logfiles.
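As a rough idea of what logfile analysis with Pandas can look like, here is a minimal sketch. It is not the code from the video: the log lines are invented, and the regular expression assumes a web-server log in the common access-log layout.

```python
import re

import pandas as pd

# Hypothetical log lines, invented for this example.
LOG_LINES = [
    '127.0.0.1 - - [10/Dec/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.7 - - [10/Dec/2014:13:56:02 +0000] "GET /missing HTTP/1.1" 404 -',
    '10.0.0.7 - - [10/Dec/2014:13:57:10 +0000] "POST /login HTTP/1.1" 500 512',
    '127.0.0.1 - - [10/Dec/2014:14:01:44 +0000] "GET /index.html HTTP/1.1" 200 2326',
]

# Assumed layout: ip, two ignored fields, [time], "method path protocol", status, size.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Keep only the lines that match the pattern; one dict per request.
matches = (PATTERN.match(line) for line in LOG_LINES)
records = [m.groupdict() for m in matches if m]

df = pd.DataFrame(records)
df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
df["status"] = df["status"].astype(int)
# A size of "-" means no body was sent; coerce it to a missing value.
df["size"] = pd.to_numeric(df["size"], errors="coerce")

# Two simple questions: which paths are requested most, and which requests failed?
hits_per_path = df["path"].value_counts()
errors = df[df["status"] >= 400]

print(hits_per_path)
print(errors[["time", "path", "status"]])
```

Once the lines are in a DataFrame, further questions (requests per hour, bytes per client, and so on) become one-line grouping operations rather than new parsing code.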

Data science, Python, Scientific computing

Wes McKinney on Pandas

December 9, 2014 reuven

Wes McKinney describes Pandas, a library for data analysis written in Python. Pandas sits on top of NumPy, and provides many of the same manipulation possibilities as the R language.

Daily Tech Video

Daily technology videos, curated by Reuven Lerner.