Category Archives: Data science

[Video 355] Dafna Shahaf: Inside Jokes

Can we use a computer to identify jokes? Or to identify what’s likely to be a good joke, rather than a bad one? In this talk, Dafna Shahaf discusses a data-science project she worked on with the New Yorker magazine, aimed at identifying potentially funny cartoon captions faster and more reliably than a human reader can. What aspects of a cartoon’s caption are likely to make it funnier? Is it possible to identify funnier captions automatically? The preliminary results are promising, and even if you’re not convinced by the algorithms, at least you’ll get a chance to read some cartoons and call it work.

[Video 346] Leah Hanson: How Julia Goes Fast

Julia is a relatively new programming language, designed for data analysis (aka “data science”). Julia aims to simultaneously provide a low threshold for entry (so that non-programmers can analyze their data without learning too much about programming) and high-performance execution. How does Julia manage to do this, and how do the results compare with Python libraries such as NumPy and SciPy? In this talk, Leah Hanson describes Julia’s aims, the differences between Julia and other languages (such as Python), and how these design decisions have affected the resulting language.
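The performance contrast the talk addresses can be felt even in a toy example. In Python, making numeric code fast usually means replacing explicit loops with vectorized NumPy calls; Julia’s pitch is that a plain loop can be fast on its own. A minimal Python sketch of that contrast (the function names are illustrative, not from the talk):

```python
import numpy as np

def mean_square_loop(values):
    """Plain-Python loop: simple to write, but slow on large inputs."""
    total = 0.0
    for v in values:
        total += v * v
    return total / len(values)

def mean_square_numpy(values):
    """Vectorized version: delegates the loop to compiled NumPy code."""
    arr = np.asarray(values, dtype=float)
    return float(np.mean(arr * arr))

data = [1.0, 2.0, 3.0, 4.0]
# Both compute the same quantity; on millions of elements,
# the NumPy version is typically orders of magnitude faster.
assert abs(mean_square_loop(data) - mean_square_numpy(data)) < 1e-9
```

Julia’s design goal, as described in the talk, is that the first style can match the speed of the second without rewriting.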

[Video 343] Eric Carlisle: Dazzling Data Depiction with D3.js

Having lots of data isn’t enough; you have to present it in ways that are interesting and compelling. D3 is a popular JavaScript library that makes it straightforward to turn data into complex visualizations, in a variety of formats. In this talk, Eric Carlisle introduces D3, describing and demonstrating many of its capabilities.

[Video 332] Ronny Kohavi: Online Controlled Experiments

Many of us know that in order for a Web application to improve, it’s often a good idea to run A/B experiments: Given enough visitors to a site, you can present different versions to users, and see which one is most effective. This technique is just one way in which you can conduct online experiments. How can we control these experiments, and thus learn more from the results? What sorts of experiments can we run? What sorts of experiments have companies successfully used in the last few years? In this talk, Ronny Kohavi describes the history of controlled experiments, and of online controlled experiments, and provides examples of how they have helped to improve a number of businesses. He also gives us hints for how to create our own experiments, and how to ensure that they provide powerful, useful results.
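For readers who want to try this themselves: a common way to decide whether an A/B experiment’s result is real rather than noise is a two-proportion z-test. This is a generic statistical sketch, not a method from the talk, and the numbers below are invented for illustration:

```python
import math

def ab_test_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test for a simple A/B experiment.

    Returns the z statistic; |z| > 1.96 roughly corresponds to
    significance at the 5% level (two-sided).
    """
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled conversion rate under the null hypothesis (no difference)
    p = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p * (1 - p) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / se

# Hypothetical experiment: variant B converts 2.6% vs. 2.0% for A
z = ab_test_z(conversions_a=200, visitors_a=10_000,
              conversions_b=260, visitors_b=10_000)
print(f"z = {z:.2f}")
```

With these made-up numbers, |z| exceeds 1.96, so the difference would usually be treated as significant; with fewer visitors, the same rates might not be.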

[Video 328] Stefan Behnel: Get Native with Cython

Python’s speed is good enough for many purposes. But in some cases, you really wish that you could run your Python code at C-language speed. You could always write a Python extension in C, but if you’re not a fluent C programmer, that’s not really an option; besides, you’d lose all of the Python expressiveness that you already enjoy. Enter Cython, which translates your Python into C and compiles it into an extension module that you can import from your Python code. In this talk, Stefan Behnel introduces Cython, demonstrates what it can do, and describes how it can fit into a Python development shop aiming to increase execution performance while continuing to work in a Python (and Python-like) environment.
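As a rough illustration of the workflow (this is a generic sketch, not code from the talk): Cython can compile an ordinary Python function as-is, and adding static type declarations in a `.pyx` file is what unlocks C-level loop speed.

```python
# A plain-Python function that Cython can compile unchanged
# (e.g. with Cython's cythonize tool on this .py file).
def integrate_f(a, b, n):
    """Approximate the integral of x**2 over [a, b] with n rectangles."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + i * dx
        total += x * x * dx
    return total

# In a .pyx file, Cython's static types make the loop run at C speed:
#   def integrate_f(double a, double b, int n):
#       cdef int i
#       cdef double x, dx, total = 0.0
#       ...  # same loop body as above
```

The appeal described in the talk is exactly this gradual path: the untyped version already works, and you add types only where profiling says they pay off.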

[Video 327] Bryan Van de Ven: Bokeh for Data Storytelling

Data science is all about finding insights in large data sets. In many cases, these insights are easier to find, or at least more persuasive, when they are visualized. A number of graphics libraries exist in the data-science world, some of which are written in Python. In this talk, Bryan Van de Ven introduces Bokeh, a visualization library for Python that works with other elements of the SciPy stack to create beautiful charts and graphs. Bokeh goes further than other libraries, in that it produces not only static images, but also dynamic visualizations in which someone can explore the data.

[Video 324] Pam Selle: Streams — The data structure we need

How big is the data you’re processing? If it’s large, you probably don’t want to load all of it into a single in-memory data structure. Instead, you can use a lazy list, aka a “stream,” which allows us to consume very small amounts of memory while working with very large, or even infinite, data sets. In this talk, Pam Selle describes streams, demonstrates why they are useful in general, and then talks about ways in which we can work with streams in JavaScript, including a summary of the standards and data structures that will be included in upcoming versions of JavaScript. If you’re planning to work with large amounts of data, regardless of the language you’ll be working in, this talk will be of interest to you.
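Python has the same idea built in as generators; the sketch below (not from the talk, which focuses on JavaScript) shows a lazy, effectively infinite stream being transformed and consumed a few items at a time:

```python
from itertools import islice

def naturals():
    """An infinite stream of natural numbers, produced lazily."""
    n = 0
    while True:
        yield n
        n += 1

def squares(stream):
    """Transform a stream one element at a time; nothing is materialized."""
    for x in stream:
        yield x * x

# Only the five consumed values ever exist in memory at once.
first_five = list(islice(squares(naturals()), 5))
print(first_five)  # [0, 1, 4, 9, 16]
```

The JavaScript equivalents discussed in the talk (generators and iterator protocols) follow the same pull-based pattern.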

[Video 311] Rodrigo Schmidt: Scaling Instagram Data Systems

Instagram is a huge Web application, and has needed to scale very rapidly. In this talk, Rodrigo Schmidt asks the questions: What do you scale, how do you scale, and why do you even need to scale? You can’t scale everything at once, so how do you prioritize which things should be scaled first? How can you grow from a Web application to a mix of Web and mobile users?

[Video 294] Britta Weber: Make Sense of your Logs

Nearly every server program produces logs. While they’re often useful for debugging, they can be even more useful in helping you understand, review, and gain insight into the nature of your software and its users. In this talk, Britta Weber describes the “ELK stack,” consisting of Elasticsearch, logstash, and kibana: three open-source tools that work together to provide useful insights from your logfiles.
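To give a flavor of how the three pieces connect: a logstash pipeline reads a logfile, parses each line into structured fields, and ships the result to Elasticsearch, which kibana then queries. The sketch below is a generic minimal configuration, not one from the talk; the file path is a placeholder you would adapt to your own application.

```
input {
  file {
    # Placeholder path; point this at your application's logfile
    path => "/var/log/myapp/access.log"
  }
}

filter {
  grok {
    # Parse standard web-server access-log lines into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```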

[Video 292] Anthony Goldbloom: Latest developments at the cutting edge of Data Science

Data science is a growing discipline; there are so many techniques, data sets, and people in the field that it’s increasingly hard to keep track of what is happening. In this talk, Anthony Goldbloom describes the sorts of data sets that Kaggle (his company) has been publishing as part of its data-science contests, aimed at crowdsourcing solutions. If you’re interested in how data-science techniques are being used to solve problems, or are trying to understand just where and how data-science methods differ from others, this talk should be of interest.