Can we use a computer to identify jokes? Or, better, to identify what's likely to be a good joke rather than a bad one? In this talk, Dafna Shahaf discusses a data-science project on which she worked with The New Yorker magazine, aiming to identify potentially funny cartoon captions faster and more reliably than a human reader can. What aspects of a cartoon's caption are likely to make it funnier? Is it possible to identify funnier captions automatically? The preliminary results are promising, and even if you're not convinced by the algorithms, at least you'll get a chance to read some cartoons and call it work.
Julia is a relatively new programming language, one meant for data analysis (aka "data science"). Julia aims to provide both a low threshold for entry (so that non-programmers can analyze their data without learning too much about programming) and high-performance execution. How does Julia manage to do this, and how do the results compare with Python libraries such as NumPy and SciPy? In this talk, Leah Hanson describes Julia's aims, the differences between Julia and other languages (such as Python), and how these design decisions have affected the resulting language.
Many of us know that in order for a Web application to improve, it’s often a good idea to run A/B experiments: Given enough visitors to a site, you can present different versions to users, and see which one is most effective. This technique is just one way in which you can conduct online experiments. How can we control these experiments, and thus learn more from the results? What sorts of experiments can we run? What sorts of experiments have companies successfully used in the last few years? In this talk, Ronny Kohavi describes the history of controlled experiments, and of online controlled experiments, and provides examples of how they have helped to improve a number of businesses. He also gives us hints for how to create our own experiments, and how to make those provide us with powerful and useful results.
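To make the statistics concrete, here is a minimal sketch (not from the talk) of how the results of a two-variant experiment are commonly compared, using a pooled two-proportion z-test; the function name and the example numbers are illustrative:

```python
import math

def ab_significance(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for a simple A/B experiment.

    conv_a, conv_b: number of conversions in each variant.
    n_a, n_b: number of visitors shown each variant.
    Returns a z-score; |z| > 1.96 suggests a real difference
    at roughly 95% confidence.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference between the two proportions
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

For example, `ab_significance(200, 10000, 260, 10000)` yields a z-score above 1.96, suggesting that the 2.6%-converting variant genuinely outperforms the 2.0% baseline rather than merely getting lucky.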
Python's speed is good enough for many purposes. But in some cases, you really wish that you could run your Python code at C-language speed. You could always write a Python extension in C, but if you're not a fluent C programmer, that's not really an option; besides, you'd lose all of the Python expressiveness that you already enjoy. Enter Cython, which translates your Python code into C and compiles it into an extension module that you can import into your Python code. In this talk, Stefan Behnel introduces Cython, demonstrates what it can do, and describes how it can fit into a Python development shop aiming to increase execution performance while working in a Python (and Python-like) environment.
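To give a flavor of the workflow, here is an illustrative sketch (not from the talk): a pure-Python numerical loop, with a comment showing how Cython's `cdef` type declarations would turn the same loop into fast C code in a `.pyx` file.

```python
# A pure-Python numerical loop: exactly the kind of code Cython speeds up.
def integrate_squares(n):
    """Sum of i*i for i in range(n), written as an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# A Cython version (placed in a .pyx file and compiled) adds C type
# declarations so the loop runs as native C arithmetic -- a hypothetical
# sketch, not code from the talk:
#
#   def integrate_squares(long n):
#       cdef long i, total = 0
#       for i in range(n):
#           total += i * i
#       return total
```

The key point is that the Cython source remains Python-like: you annotate the hot loop with C types and let Cython generate and compile the C for you.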
Data science is all about finding insights in large data sets. In many cases, these insights are easier to find, or at least more persuasive, when they are visualized. A number of graphics libraries exist in the data-science world, some of which are written in Python. In this talk, Bryan Van de Ven introduces Bokeh, a visualization library for Python that works with other elements of the SciPy stack to create beautiful charts and graphs. Bokeh goes further than many other libraries, in that it produces not only static images but also interactive visualizations in which viewers can explore the data.
Instagram is a huge Web application, and it has needed to scale very rapidly. In this talk, Rodrigo Schmidt asks: What do you scale, how do you scale, and why do you even need to scale? You can't scale everything at once, so how do you prioritize which parts should be scaled first? And how do you grow from a purely Web-based application to one serving a mix of Web and mobile users?
Nearly every server program produces logs. While logs are often useful for debugging, they can be even more useful in helping you understand, review, and gain insights into the nature of your software and its users. In this talk, Britta Weber describes the "ELK stack" of Elasticsearch, Logstash, and Kibana, three open-source tools that work together to extract useful insights from your logfiles.
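As an illustrative sketch (not from the talk), a minimal Logstash pipeline configuration might look like the following; the file path, grok pattern, and Elasticsearch address are all assumptions for the example:

```
# Hypothetical minimal Logstash pipeline: tail an application logfile,
# parse each line, and index the results into Elasticsearch so they can
# be explored in Kibana.
input {
  file { path => "/var/log/myapp/access.log" }
}
filter {
  # Parse Apache-style combined log lines into structured fields
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```

Once the events are indexed, Kibana dashboards can be built on top of the structured fields that the grok filter extracted.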
Data science is a growing discipline; there are so many techniques, data sets, and people in the field that it’s increasingly hard to keep track of what is happening. In this talk, Anthony Goldbloom describes the sorts of data sets that Kaggle (his company) has been publishing as part of its data-science contests, aimed at crowdsourcing solutions. If you’re interested in how data science techniques are being used to solve problems, or are trying to understand just where and how data science methods differ from others, this talk should be of interest.