Challenge Problem

We require fellows to work on a small challenge problem so that we can assess their problem-solving and coding skills. Select a problem from the list below. Ideally, perform your analysis in a Jupyter notebook. Post the notebook on GitHub and submit your results.

Some hints for hacking our challenge:

  • Ask yourself: why would they have selected this problem for the challenge? What are some gotchas in this domain that I should know about?
  • What is the highest level of accuracy that others have achieved with this dataset or with similar problems and datasets?
  • What types of visualizations will help me grasp the nature of the problem / data?
  • What feature engineering might help improve the signal?
  • Which modeling techniques are good at capturing the types of relationships I see in this data?
  • Now that I have a model, how can I be sure that I didn't introduce a bug in the code? If results are too good to be true, they probably are!
  • What are some of the weaknesses of the model, and how can the model be improved with additional work?
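
As one way to act on the "too good to be true" hint above, the sketch below compares against a trivial baseline and counts test examples that also appear in the training data. The function names and the toy data are hypothetical and only illustrate the idea; they are not part of the challenge materials.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


def count_leaked_examples(train_inputs, test_inputs):
    """Number of test inputs that also appear verbatim in the training set."""
    seen = set(train_inputs)
    return sum(1 for x in test_inputs if x in seen)


# Toy data, for illustration only.
train_inputs = ["a b", "c d", "e f", "g h"]
train_labels = ["x", "x", "x", "y"]
test_inputs = ["a b", "i j"]
test_labels = ["x", "y"]

baseline = majority_baseline_accuracy(train_labels, test_labels)
leaked = count_leaked_examples(train_inputs, test_inputs)
print(f"majority baseline accuracy: {baseline:.2f}")
print(f"test examples also present in training data: {leaked}")
```

A model that barely beats the majority baseline has learned little, and a non-zero leak count is a common reason for suspiciously high scores.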

Select Your Challenge Problem


Transfer Learning

Use a pre-trained deep learning vision model to classify images from the STL-10 dataset.

Image Segmentation

Apply an automatic portrait segmentation model (also known as image matting) to a celebrity face dataset.

Language Detection

The European Parliament Proceedings Parallel Corpus is a text dataset used for evaluating language detection engines. The 1.5 GB corpus includes 21 languages spoken in the EU.

Create a machine learning model trained on this dataset to predict the language of each text in the following test set.
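
A possible starting point is a character n-gram classifier. The sketch below is a minimal naive Bayes model with add-one smoothing and a uniform prior over languages, written in plain Python. The class name, the trigram size, and the six toy sentences are assumptions made for illustration; a real solution would train on the corpus itself.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lowercase the text, pad it with spaces, and return its character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageModel:
    """Naive Bayes over character n-grams (uniform prior, add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)

        def log_likelihood(label):
            counts = self.counts[label]
            denom = self.totals[label] + vocab_size
            return sum(math.log((counts[g] + 1) / denom) for g in grams)

        return max(self.counts, key=log_likelihood)


# Toy training sentences, for illustration only.
texts = [
    "the committee approved the report",
    "this is the position of the parliament",
    "le comité a approuvé le rapport",
    "c'est la position du parlement",
    "der Ausschuss hat den Bericht angenommen",
    "das ist die Position des Parlaments",
]
labels = ["en", "en", "fr", "fr", "de", "de"]

model = NgramLanguageModel(n=3).fit(texts, labels)
for sentence in ["the parliament approved this",
                 "le parlement a approuvé",
                 "der Bericht des Parlaments"]:
    print(sentence, "->", model.predict(sentence))
```

Character n-grams are a common choice here because they need no tokenizer and cope with short inputs, but closely related languages may still be confused.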

Global Terrorist Attacks

The Global Terrorism Database (GTD) is an open-source database that includes information on terrorist events around the world from 1970 through 2014. Some of the attacks have not been attributed to a particular terrorist group.

Use attack type, weapons used, description of the attack, etc. to build a model that can predict which group may have been responsible for an incident.
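
Before modeling, it may help to separate attributed from unattributed incidents and to turn each incident into features. The sketch below assumes the column names `gname`, `attacktype1_txt`, `weaptype1_txt`, and `summary`, and assumes that unattributed incidents carry the group name "Unknown"; verify both against the actual data. The incidents and group names shown are made up for illustration.

```python
from collections import Counter

UNKNOWN = "Unknown"  # assumed marker for unattributed incidents


def split_attributed(incidents, label_key="gname", min_incidents=2):
    """Return (labeled, unattributed) rows, keeping only groups seen often enough."""
    counts = Counter(r[label_key] for r in incidents if r[label_key] != UNKNOWN)
    frequent = {group for group, c in counts.items() if c >= min_incidents}
    labeled = [r for r in incidents if r[label_key] in frequent]
    unattributed = [r for r in incidents if r[label_key] == UNKNOWN]
    return labeled, unattributed


def to_features(row):
    """Encode categorical fields as indicator features and the summary as word counts."""
    features = {
        f"attack={row['attacktype1_txt']}": 1,
        f"weapon={row['weaptype1_txt']}": 1,
    }
    for word in row["summary"].lower().split():
        key = f"word={word}"
        features[key] = features.get(key, 0) + 1
    return features


# Made-up incidents, for illustration only.
incidents = [
    {"gname": "Group A", "attacktype1_txt": "Bombing/Explosion",
     "weaptype1_txt": "Explosives", "summary": "A bomb exploded near a market"},
    {"gname": "Group A", "attacktype1_txt": "Bombing/Explosion",
     "weaptype1_txt": "Explosives", "summary": "An explosive device detonated"},
    {"gname": "Group B", "attacktype1_txt": "Armed Assault",
     "weaptype1_txt": "Firearms", "summary": "Gunmen attacked a checkpoint"},
    {"gname": "Unknown", "attacktype1_txt": "Armed Assault",
     "weaptype1_txt": "Firearms", "summary": "Assailants opened fire"},
]

labeled, unattributed = split_attributed(incidents, min_incidents=2)
features = to_features(labeled[0])
print(len(labeled), "labeled rows;", len(unattributed), "unattributed rows")
print(features)
```

Dropping very rare groups is a design choice, not a requirement: a model cannot learn much from a group with a single incident, but the threshold also limits which groups can ever be predicted. Feature dictionaries of this shape can be passed to a vectorizer such as scikit-learn's `DictVectorizer`.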

Recommender System

Build a basic recommender system based on any of the Lab41 dataset references.
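
One simple baseline is item-based collaborative filtering. The sketch below computes cosine similarity between items from user ratings (treating missing ratings as zero) and ranks unseen items by a similarity-weighted sum of the user's own ratings. The user names, item names, and ratings are invented for illustration and do not come from any of the referenced datasets.

```python
import math
from collections import defaultdict


def item_vectors(ratings):
    """Invert {user: {item: rating}} into {item: {user: rating}}."""
    vectors = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors[item][user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity to items they have rated."""
    vectors = item_vectors(ratings)
    rated = ratings[user]
    scores = {}
    for candidate, vec in vectors.items():
        if candidate in rated:
            continue
        scores[candidate] = sum(
            cosine(vec, vectors[item]) * rating for item, rating in rated.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented ratings, for illustration only.
ratings = {
    "alice": {"item_a": 5, "item_b": 4},
    "bob": {"item_a": 5, "item_b": 5, "item_c": 4},
    "carol": {"item_a": 1, "item_c": 2, "item_d": 5},
    "dave": {"item_b": 4, "item_c": 5},
}

print(recommend(ratings, "alice"))
```

This pairwise loop is fine for a toy example; a dataset of realistic size would likely need sparse matrices and a held-out set to evaluate the rankings.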