Challenge Problem

We require fellows to work on a small challenge problem to assess problem-solving and coding capabilities. Select a problem from the list below. Ideally, perform your analysis in a Jupyter notebook. Post the notebook on GitHub and submit your results.

Some hints for hacking our challenge:

  • Ask yourself: why would they have selected this problem for the challenge? What are some gotchas in this domain that I should know about?
  • What is the highest level of accuracy that others have achieved with this dataset or with similar problems and datasets?
  • What types of visualizations will help me grasp the nature of the problem / data?
  • What feature engineering might help improve the signal?
  • Which modeling techniques are good at capturing the types of relationships I see in this data?
  • Now that I have a model, how can I be sure that I didn't introduce a bug in the code? If results are too good to be true, they probably are!
  • What are some of the weaknesses of the model, and how can the model be improved with additional work?

Language Detection

The European Parliament Proceedings Parallel Corpus is a text dataset used for evaluating language detection engines. The 1.5 GB corpus includes 21 languages spoken in the EU.

Create a machine learning model trained on this dataset to predict the languages in the following test set.
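As a starting point, here is a minimal sketch of one common baseline for language detection: a naive Bayes classifier over character trigrams. It is not a prescribed solution. The class name `NgramLanguageDetector`, the helper `char_ngrams`, and the tiny training sentences are all made up for illustration; a real submission would train on the corpus itself and evaluate on held-out data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageDetector:
    """Multinomial naive Bayes over character trigrams with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.docs = Counter(labels)         # language -> number of training texts
        for text, lang in zip(texts, labels):
            self.counts[lang].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            # Log prior plus the sum of smoothed log likelihoods per trigram.
            score = math.log(self.docs[lang] / n_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / (self.totals[lang] + vocab_size))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data (illustrative only, not taken from the corpus).
train_texts = [
    "the committee will vote on the proposal tomorrow",
    "we welcome the report of the commission",
    "le parlement a adopté la résolution hier",
    "nous saluons le rapport de la commission",
    "das Parlament hat die Entschließung angenommen",
    "wir begrüßen den Bericht der Kommission",
]
train_labels = ["en", "en", "fr", "fr", "de", "de"]

detector = NgramLanguageDetector().fit(train_texts, train_labels)
print(detector.predict("the committee will vote"))
print(detector.predict("le rapport de la commission"))
print(detector.predict("der Bericht der Kommission"))
```

Character n-grams are a reasonable first choice here because they need no tokenizer and work across all the languages in the corpus, but closely related languages are likely to be confused and deserve a look in the error analysis.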


Stochastic Volatility 

In this test, you are given the daily values of the S&P 500 index. Calculate the stochastic volatility of the index and produce a rolling one-step-ahead forecast.
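To make the "rolling one-step-ahead" requirement concrete, here is a minimal sketch that uses an exponentially weighted moving average (EWMA) of squared log returns. Note that this is a simpler stand-in, not a stochastic volatility model; it only shows the forecasting loop, in which each forecast uses data available before the day it predicts. The function names, the decay `lam=0.94`, the warm-up length, and the price series are assumptions made for illustration.

```python
import math


def log_returns(prices):
    """Daily log returns r_t = ln(P_t / P_{t-1})."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def ewma_vol_forecasts(returns, lam=0.94, warmup=5):
    """Rolling one-step-ahead volatility forecasts from an EWMA recursion.

    forecasts[i] is the volatility predicted for returns[warmup + i],
    computed only from returns observed before that day.
    """
    if len(returns) <= warmup:
        raise ValueError("need more returns than the warm-up window")
    # Seed the variance with the mean squared return of the warm-up window.
    var = sum(r * r for r in returns[:warmup]) / warmup
    forecasts = []
    for r in returns[warmup:]:
        forecasts.append(math.sqrt(var))       # forecast made before seeing r
        var = lam * var + (1.0 - lam) * r * r  # update once r is observed
    return forecasts


# Made-up index values, for illustration only.
prices = [100.0, 101.0, 100.5, 102.0, 101.2, 103.0, 102.4, 104.1]
returns = log_returns(prices)
forecasts = ewma_vol_forecasts(returns)
print(forecasts)
```

The same loop structure carries over to a proper stochastic volatility model: refit or filter on data up to day t, forecast day t+1, then step forward. Keeping the forecast strictly ahead of the update is what guards against look-ahead bias.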


Airline On-Time Arrivals

Use the U.S. Department of Transportation on-time arrival data for non-stop domestic flights by major air carriers to predict arrival delays.

Build a binary classification model for predicting arrival delays or a regression model that predicts the extent of the delay.  Do not use departure delay as an input feature.
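The constraint on departure delay can be enforced when the training examples are built. The sketch below shows one way to do that for the binary classification variant. The column names (`dep_delay`, `arr_delay`, and the rest) are hypothetical and may not match the actual data files, and the 15-minute threshold is an assumed definition of "delayed", not one stated in this brief.

```python
# Hypothetical column names; the real data files may label these differently.
# Departure delay is excluded as required, and arrival delay is the target itself.
EXCLUDED_COLUMNS = {"dep_delay", "arr_delay"}


def make_example(row, threshold_minutes=15):
    """Split one flight record into (features, label).

    label is 1 when the flight arrived at least `threshold_minutes` late
    (an assumed definition of a delayed arrival), otherwise 0.
    """
    features = {k: v for k, v in row.items() if k not in EXCLUDED_COLUMNS}
    label = int(row["arr_delay"] >= threshold_minutes)
    return features, label


# A made-up flight record, for illustration only.
flight = {
    "carrier": "AA",
    "origin": "JFK",
    "dest": "LAX",
    "sched_dep_hour": 17,
    "dep_delay": 42,
    "arr_delay": 31,
}
features, label = make_example(flight)
print(features, label)
```

Other fields that are only known after the aircraft has left the gate can leak the same information as departure delay, so it is worth checking each input against what would be known at prediction time.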


Global Terrorist Attacks

The Global Terrorism Database (GTD) is an open-source database of terrorist events around the world from 1970 through 2014. Some of the attacks have not been attributed to a particular terrorist group.

Use the attack type, weapons used, description of the attack, and other fields to build a model that predicts which group may have been responsible for an incident.