We require fellows to work on a small challenge problem to assess their problem-solving and coding capabilities. Select a problem from the list below. Ideally, perform your analysis in a Jupyter notebook. Post the notebook on GitHub and submit your results.
Some hints for hacking our challenge:
- Ask yourself: why would they have selected this problem for the challenge? What gotchas in this domain should I know about?
- What is the highest level of accuracy that others have achieved with this dataset or with similar problems and datasets?
- What types of visualizations will help me grasp the nature of the problem / data?
- What feature engineering might help improve the signal?
- Which modeling techniques are good at capturing the types of relationships I see in this data?
- Now that I have a model, how can I be sure that I didn't introduce a bug in the code? If results are too good to be true, they probably are!
- What are some of the weaknesses of the model, and how can the model be improved with additional work?
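One way to act on the "too good to be true" hint is to compare the model against a trivial baseline that always predicts the most common class. The following is a minimal sketch using only the Python standard library. The function names, the toy labels, and the `suspicious_above` cutoff are illustrative assumptions, not part of the challenge.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sanity_report(y_true, y_pred, suspicious_above=0.99):
    """Compare model accuracy with the baseline and flag implausible scores.

    suspicious_above is an arbitrary, assumed cutoff; pick one that fits
    what others have achieved on the same problem.
    """
    baseline = majority_baseline_accuracy(y_true)
    model = accuracy(y_true, y_pred)
    return {
        "baseline": baseline,
        "model": model,
        "beats_baseline": model > baseline,
        "suspicious": model >= suspicious_above,
    }


# Toy held-out labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]
report = sanity_report(y_true, y_pred)
print(report)
```

A model that barely beats the baseline has learned little, while one flagged as suspicious deserves a check for leaked target information before the result is trusted.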
Airline On-Time Arrivals
Use the US Dept. of Transportation on-time arrival data for non-stop domestic flights by major air carriers to predict arrival delays.
Build a binary classification model for predicting arrival delays or a regression model that predicts the extent of the delay. Do not use departure delay as an input feature.
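As a rough sketch of the data preparation this implies, the snippet below derives a binary label from the arrival delay and removes departure delay from the inputs. The column names `ARR_DELAY` and `DEP_DELAY`, the 15-minute cutoff, and the toy rows are assumptions made for illustration; check them against the fields in the file you download.

```python
# Columns that must not reach the model: the target itself and the
# departure delay that the challenge forbids as an input (assumed names).
EXCLUDED_COLUMNS = {"ARR_DELAY", "DEP_DELAY"}


def make_dataset(rows, threshold_minutes=15.0):
    """Split raw flight records into feature dicts and binary delay labels.

    A flight is labeled 1 when its arrival delay is at least
    threshold_minutes, otherwise 0. Rows without an arrival delay
    (for example cancelled or diverted flights) are skipped.
    """
    features, labels = [], []
    for row in rows:
        arr_delay = row.get("ARR_DELAY")
        if arr_delay is None:
            continue
        labels.append(int(float(arr_delay) >= threshold_minutes))
        features.append(
            {k: v for k, v in row.items() if k not in EXCLUDED_COLUMNS}
        )
    return features, labels


# Made-up records, for illustration only.
rows = [
    {"CARRIER": "AA", "ORIGIN": "JFK", "DEST": "LAX", "DEP_DELAY": 30, "ARR_DELAY": 22},
    {"CARRIER": "DL", "ORIGIN": "ATL", "DEST": "ORD", "DEP_DELAY": 0, "ARR_DELAY": -5},
    {"CARRIER": "UA", "ORIGIN": "SFO", "DEST": "DEN", "DEP_DELAY": 5, "ARR_DELAY": None},
]
X, y = make_dataset(rows)
print(X)
print(y)
```

For the regression variant, the same function could return `float(arr_delay)` as the target instead of the thresholded label.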
Global Terrorist Attacks
The Global Terrorism Database (GTD) is an open-source database that includes information on terrorist events around the world from 1970 through 2014. Some of the attacks have not been attributed to a particular terrorist group.
Use attack type, weapons used, description of the attack, etc. to build a model that can predict what group may have been responsible for an incident.
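A simple starting point, sketched below, is to train on attributed incidents only and hold the unattributed ones aside for prediction. The example uses a lookup-table baseline (the most frequent group for each attack type and weapon pair), which is a stand-in for a real classifier and not a recommended final model. The field names `gname`, `attacktype1_txt`, and `weaptype1_txt`, the `"Unknown"` marker, and the placeholder group names are assumptions; verify them against the GTD codebook.

```python
from collections import Counter, defaultdict

UNKNOWN = "Unknown"  # assumed marker for unattributed incidents


def split_by_attribution(incidents):
    """Separate incidents with a known group from unattributed ones."""
    known = [r for r in incidents if r["gname"] != UNKNOWN]
    unattributed = [r for r in incidents if r["gname"] == UNKNOWN]
    return known, unattributed


def fit_lookup_baseline(known):
    """Map each (attack type, weapon type) pair to its most frequent group."""
    counts = defaultdict(Counter)
    for r in known:
        counts[(r["attacktype1_txt"], r["weaptype1_txt"])][r["gname"]] += 1
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    # Fall back to the overall most frequent group for unseen pairs.
    fallback = Counter(r["gname"] for r in known).most_common(1)[0][0]
    return table, fallback


def predict(table, fallback, incident):
    """Predict the responsible group for a single incident."""
    key = (incident["attacktype1_txt"], incident["weaptype1_txt"])
    return table.get(key, fallback)


def record(attack, weapon, group):
    return {"attacktype1_txt": attack, "weaptype1_txt": weapon, "gname": group}


# Made-up incidents with placeholder group names, for illustration only.
incidents = [
    record("Bombing", "Explosives", "Group A"),
    record("Bombing", "Explosives", "Group A"),
    record("Bombing", "Explosives", "Group B"),
    record("Armed Assault", "Firearms", "Group B"),
    record("Assassination", "Firearms", "Group A"),
    record("Bombing", "Explosives", UNKNOWN),
    record("Armed Assault", "Firearms", UNKNOWN),
    record("Hijacking", "Melee", UNKNOWN),
]

known, unattributed = split_by_attribution(incidents)
table, fallback = fit_lookup_baseline(known)
predictions = [predict(table, fallback, r) for r in unattributed]
print(predictions)
```

Any stronger model, for instance one that also uses the text description of the attack, should be evaluated on a held-out portion of the attributed incidents and compared against a baseline like this one.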