Challenge Problem

Select one of the problems below that you will enjoy working on. Ideally, perform your analysis in an IPython notebook. Post the notebook on GitHub and submit your results.

Language Detection

The European Parliament Proceedings Parallel Corpus is a text dataset used for evaluating language detection engines. The 1.5 GB corpus includes 21 languages spoken in the EU.

Train a machine learning model on this dataset and use it to predict the language of each text in the following test set.
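
As a starting point, here is a minimal sketch of one possible approach: a multinomial naive Bayes classifier over character trigrams, written with the standard library only. The class name, the toy sentences, and the language codes are made up for illustration; a real solution would train on the corpus itself and would probably use a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, lang in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)
            self.docs[lang] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang in self.docs:
            # Log prior plus smoothed log likelihood of every n-gram in the text.
            score = math.log(self.docs[lang] / n_docs)
            for gram in grams:
                score += math.log(
                    (self.counts[lang][gram] + 1) / (self.totals[lang] + vocab_size)
                )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data (made up); the real corpus has 21 languages.
train_texts = [
    "the committee will vote on the proposal tomorrow",
    "this report is about the budget of the union",
    "le comité votera sur la proposition demain",
    "ce rapport concerne le budget de l'union",
    "der ausschuss wird morgen über den vorschlag abstimmen",
    "dieser bericht betrifft den haushalt der union",
]
train_labels = ["en", "en", "fr", "fr", "de", "de"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("the vote on the budget"))
```

Character n-grams are a common choice for this task because they need no tokenizer and capture spelling patterns that differ between languages.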

Stochastic Volatility 

In this test, you are given the daily values of the S&P 500 index. Calculate the stochastic volatility of the index and produce a rolling one-step-ahead forecast.
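
One way to get started is sketched below, under strong assumptions. It uses the linearized form of the basic stochastic volatility model, r_t = exp(h_t / 2) * eps_t with h_t following an AR(1) process around a mean mu, and runs a scalar Kalman filter on log(r_t^2) to obtain one-step-ahead forecasts. The parameters phi and sigma_eta are fixed, illustrative values rather than estimates, mu is taken from a burn-in window, and the prices are simulated; a full solution would estimate the parameters (for example by quasi-maximum likelihood or MCMC) on the real index.

```python
import math
import random

LOG_CHI2_MEAN = -1.2704          # E[log(eps^2)] for eps ~ N(0, 1)
LOG_CHI2_VAR = math.pi ** 2 / 2  # Var[log(eps^2)] for eps ~ N(0, 1)


def log_returns(prices):
    """Daily log returns from a list of index levels."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def sv_one_step_forecasts(returns, mu, phi=0.97, sigma_eta=0.15, offset=1e-8):
    """Rolling one-step-ahead volatility forecasts from a linearized SV model.

    forecasts[t] uses only returns[0..t-1]; the final entry is the forecast
    for the day after the sample. phi and sigma_eta are assumed, not estimated.
    """
    x_pred = 0.0                                # predicted value of h_t - mu
    p_pred = sigma_eta ** 2 / (1.0 - phi ** 2)  # stationary variance of h_t
    forecasts = []
    for r in returns:
        forecasts.append(math.exp((mu + x_pred) / 2.0))
        # Observation: log(r^2) corrected for the mean of log(eps^2).
        y = math.log(r * r + offset) - LOG_CHI2_MEAN
        innovation = y - (mu + x_pred)
        gain = p_pred / (p_pred + LOG_CHI2_VAR)
        x_filt = x_pred + gain * innovation
        p_filt = (1.0 - gain) * p_pred
        # Propagate the AR(1) state one day ahead.
        x_pred = phi * x_filt
        p_pred = phi ** 2 * p_filt + sigma_eta ** 2
    forecasts.append(math.exp((mu + x_pred) / 2.0))
    return forecasts


# Demo on simulated prices (made-up data, not the real index).
random.seed(0)
prices = [100.0]
for day in range(600):
    daily_vol = 0.005 if day < 300 else 0.03  # calm regime, then turbulent regime
    prices.append(prices[-1] * math.exp(random.gauss(0.0, daily_vol)))

rets = log_returns(prices)
burn_in = rets[:250]
mu_hat = sum(math.log(r * r + 1e-8) - LOG_CHI2_MEAN for r in burn_in) / len(burn_in)
forecasts = sv_one_step_forecasts(rets, mu_hat)
print(len(forecasts), forecasts[250], forecasts[-1])
```

Because mu_hat is computed from the first 250 returns, only the forecasts after that window are genuinely out of sample.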

Trajectories of Taxis


This dataset contains the trajectories of thousands of taxis operating in China. Your task is to read through the following paper and reproduce its first graphs (the distributions of distances and of sampling time intervals).
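
The two distributions can be computed from consecutive GPS fixes of each taxi. The sketch below assumes each record has already been parsed into a (timestamp, latitude, longitude) tuple and sorted by time; that layout, the function names, and the sample points are assumptions for illustration, not the dataset's actual format.

```python
import math
from collections import Counter
from datetime import datetime

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def consecutive_gaps(points):
    """Time gaps (seconds) and distances (meters) between consecutive fixes.

    points: list of (timestamp, lat, lon) for one taxi, sorted by time.
    """
    intervals, distances = [], []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        intervals.append((t1 - t0).total_seconds())
        distances.append(haversine_m(lat0, lon0, lat1, lon1))
    return intervals, distances


def histogram(values, bin_width):
    """Count values per bin; keys are the lower edges of the bins."""
    counts = Counter(int(v // bin_width) for v in values)
    return {b * bin_width: counts[b] for b in sorted(counts)}


# Made-up fixes for a single taxi.
points = [
    (datetime(2020, 1, 1, 13, 30, 0), 39.90, 116.40),
    (datetime(2020, 1, 1, 13, 31, 0), 39.91, 116.40),
    (datetime(2020, 1, 1, 13, 33, 0), 39.91, 116.41),
]
intervals, distances = consecutive_gaps(points)
print(histogram(intervals, 60))
print(histogram(distances, 500))
```

Pooling the intervals and distances over all taxis and plotting the two histograms gives the graphs in question.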

Next, pick the trajectory of a particular trip and compute a smoothed version of it (for example, using a Kalman filter or splines).
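
For the Kalman option, a minimal sketch is shown below. It treats latitude and longitude independently, models each as a random walk observed with noise (a local-level model), and applies a forward Kalman filter followed by a Rauch-Tung-Striebel backward pass. The variances and the sample track are illustrative placeholders, the samples are assumed to be roughly evenly spaced in time, and the input must be non-empty; a constant-velocity state would be a natural refinement.

```python
def kalman_smooth_1d(zs, process_var, meas_var):
    """Local-level Kalman filter plus Rauch-Tung-Striebel smoother for one coordinate."""
    n = len(zs)
    # Initialize the filter at the first observation.
    x_filt = [zs[0]] + [0.0] * (n - 1)
    p_filt = [meas_var] + [0.0] * (n - 1)
    x_pred, p_pred = [0.0] * n, [0.0] * n

    # Forward pass: predict with the random-walk model, then update.
    for t in range(1, n):
        x_pred[t] = x_filt[t - 1]
        p_pred[t] = p_filt[t - 1] + process_var
        gain = p_pred[t] / (p_pred[t] + meas_var)
        x_filt[t] = x_pred[t] + gain * (zs[t] - x_pred[t])
        p_filt[t] = (1.0 - gain) * p_pred[t]

    # Backward pass: refine each estimate using later observations.
    x_smooth = x_filt[:]
    for t in range(n - 2, -1, -1):
        c = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + c * (x_smooth[t + 1] - x_pred[t + 1])
    return x_smooth


def smooth_trajectory(points, process_var, meas_var):
    """Smooth a list of (lat, lon) pairs coordinate by coordinate."""
    lats = kalman_smooth_1d([p[0] for p in points], process_var, meas_var)
    lons = kalman_smooth_1d([p[1] for p in points], process_var, meas_var)
    return list(zip(lats, lons))


# Made-up noisy track; variances are in squared degrees and purely illustrative.
raw = [
    (39.900, 116.400), (39.902, 116.403), (39.901, 116.404),
    (39.904, 116.409), (39.903, 116.410), (39.906, 116.413),
]
smoothed = smooth_trajectory(raw, process_var=1e-6, meas_var=1e-6)
for point in smoothed:
    print(point)
```

A larger meas_var relative to process_var gives a smoother path; a smaller one keeps the result closer to the raw fixes.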

Airline On-Time Arrivals

Use the US Dept. of Transportation on-time arrival data for non-stop domestic flights by major air carriers to predict arrival delays.

Build either a binary classification model that predicts whether a flight will arrive late or a regression model that predicts the extent of the delay. Do not use departure delay as an input feature.
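
Before fitting any model, it helps to fix the label and remove the forbidden input. The sketch below assumes each flight is a dictionary keyed by column name, that arrival delay is in minutes, and that a flight counts as delayed at 15 minutes or more; the column names (ARR_DELAY, DEP_DELAY, DEP_DEL15, CARRIER) and the sample rows are assumptions to be checked against the actual data files. It also computes a per-carrier delay rate as a simple baseline to compare a model against.

```python
from collections import defaultdict

DELAY_THRESHOLD_MIN = 15                    # assumed cutoff for "delayed"
TARGET_COLUMN = "ARR_DELAY"                 # assumed column name
LEAKY_COLUMNS = {"DEP_DELAY", "DEP_DEL15"}  # assumed departure-delay columns


def make_example(row):
    """Split a raw record into (features, label), dropping leaky columns."""
    label = int(float(row[TARGET_COLUMN]) >= DELAY_THRESHOLD_MIN)
    features = {
        key: value
        for key, value in row.items()
        if key != TARGET_COLUMN and key not in LEAKY_COLUMNS
    }
    return features, label


def delay_rate_by(examples, key):
    """Fraction of delayed flights for each value of one feature."""
    totals, delayed = defaultdict(int), defaultdict(int)
    for features, label in examples:
        totals[features[key]] += 1
        delayed[features[key]] += label
    return {value: delayed[value] / totals[value] for value in totals}


# Made-up rows for illustration.
rows = [
    {"CARRIER": "AA", "ORIGIN": "JFK", "DEP_DELAY": "18", "ARR_DELAY": "20"},
    {"CARRIER": "AA", "ORIGIN": "LAX", "DEP_DELAY": "0", "ARR_DELAY": "-5"},
    {"CARRIER": "DL", "ORIGIN": "ATL", "DEP_DELAY": "2", "ARR_DELAY": "3"},
]
examples = [make_example(row) for row in rows]
print(delay_rate_by(examples, "CARRIER"))
```

For the regression variant, the same feature dictionaries can be reused with the raw arrival delay as the target instead of the thresholded label.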

Global Terrorist Attacks

The Global Terrorism Database (GTD) is an open-source database of terrorist events around the world from 1970 through 2014. Some of the attacks have not been attributed to a particular terrorist group.

Use the attack type, weapons used, description of the attack, and similar fields to build a model that predicts which group may have been responsible for an incident.
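
A natural setup is to train on the attributed incidents and predict for the unattributed ones. The sketch below shows that split together with a simple frequency baseline: predict the group seen most often with the same attack type and weapon type, falling back to the most frequent group overall. The column names (gname, attacktype1_txt, weaptype1_txt), the "Unknown" marker, and the sample records are assumptions to be verified against the database codebook.

```python
from collections import Counter, defaultdict

UNKNOWN = "Unknown"  # assumed marker for unattributed incidents


def split_by_attribution(incidents, group_key="gname"):
    """Separate incidents with a known group from those without one."""
    attributed = [r for r in incidents if r[group_key] != UNKNOWN]
    unattributed = [r for r in incidents if r[group_key] == UNKNOWN]
    return attributed, unattributed


def fit_baseline(attributed, feature_keys, group_key="gname"):
    """Count groups per feature combination, plus overall group counts."""
    table = defaultdict(Counter)
    for record in attributed:
        table[tuple(record[k] for k in feature_keys)][record[group_key]] += 1
    overall = Counter(record[group_key] for record in attributed)
    return table, overall


def predict_group(model, record, feature_keys):
    """Most frequent group for this feature combination, else the overall mode."""
    table, overall = model
    counts = table.get(tuple(record[k] for k in feature_keys)) or overall
    return counts.most_common(1)[0][0]


# Made-up records with placeholder group names.
incidents = [
    {"gname": "Group A", "attacktype1_txt": "Bombing/Explosion", "weaptype1_txt": "Explosives"},
    {"gname": "Group A", "attacktype1_txt": "Bombing/Explosion", "weaptype1_txt": "Explosives"},
    {"gname": "Group B", "attacktype1_txt": "Armed Assault", "weaptype1_txt": "Firearms"},
    {"gname": "Unknown", "attacktype1_txt": "Bombing/Explosion", "weaptype1_txt": "Explosives"},
    {"gname": "Unknown", "attacktype1_txt": "Hijacking", "weaptype1_txt": "Firearms"},
]
features = ["attacktype1_txt", "weaptype1_txt"]
attributed, unattributed = split_by_attribution(incidents)
baseline = fit_baseline(attributed, features)
predictions = [predict_group(baseline, r, features) for r in unattributed]
print(predictions)
```

Any learned model should beat this baseline on a held-out slice of the attributed incidents before its predictions for the unattributed ones are taken seriously.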