Ask better questions
If you are working on the wrong problem, even the best neural networks, probabilistic models, and gradient-boosted models are useless. Yet in data science, it is often assumed that the problem being worked on is always clear. It is not.
In his post "Tukey, Design Thinking, and Better Questions", Roger Peng points to the importance of asking better questions as a data scientist.
The best neural networks, probabilistic models, and gradient-boosted models are worthless if you are working on the wrong problem.
But in data science the problem should always be clear, right?
Quite the opposite. Let's consider the following scenario:
🛳️
A port needs you to predict how the water levels rise when the tide rolls in, to ensure that the planned loading dock won't get flooded. You get access to a ton of data on the water level during peak flood. You add more data on the daily weather, day of the year, and number of ships in the port, and use these as your features. You regress the current water level on yesterday's water level and these additional data. The model turns out great! It predicts the water levels very precisely in your test data. You are very confident in your results.
Within the first week after construction, the new loading dock gets flooded. Your model's prediction was way off. Where did it go wrong?
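To make the modeling step in this scenario concrete, here is a minimal sketch of such a regression in Python with NumPy. Everything in it is an assumption for illustration: the data are synthetic, and the function names, the coefficients used to generate the data, and the 80/20 chronological split are all made up, not taken from any real port.

```python
import numpy as np


def make_synthetic_port_data(n_days=400, seed=0):
    """Generate made-up daily data; every number here is hypothetical."""
    rng = np.random.default_rng(seed)
    day_of_year = np.arange(n_days) % 365
    weather = rng.normal(size=n_days)                  # stand-in weather index
    ships = rng.poisson(5, size=n_days).astype(float)  # ships in port per day
    level = np.empty(n_days)
    level[0] = 2.0
    for t in range(1, n_days):
        # Today's level depends on yesterday's level and today's weather.
        level[t] = (0.4 + 0.8 * level[t - 1] + 0.1 * weather[t]
                    + rng.normal(scale=0.05))
    return level, weather, day_of_year, ships


def build_design_matrix(level, weather, day_of_year, ships):
    """Pair each day's level (target) with yesterday's level and today's features."""
    angle = 2 * np.pi * day_of_year[1:] / 365
    X = np.column_stack([
        np.ones(len(level) - 1),  # intercept
        level[:-1],               # yesterday's water level
        weather[1:],              # daily weather
        np.sin(angle),            # day of the year, encoded as a cycle
        np.cos(angle),
        ships[1:],                # number of ships in the port
    ])
    y = level[1:]                 # current water level
    return X, y


def fit_and_score(X, y, train_fraction=0.8):
    """Fit ordinary least squares on the earlier days, score on the later days."""
    split = int(len(y) * train_fraction)
    coef, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    predictions = X[split:] @ coef
    rmse = float(np.sqrt(np.mean((predictions - y[split:]) ** 2)))
    return coef, rmse


level, weather, day_of_year, ships = make_synthetic_port_data()
X, y = build_design_matrix(level, weather, day_of_year, ships)
coef, rmse = fit_and_score(X, y)
print("coefficient on yesterday's level:", round(float(coef[1]), 3))
print("test RMSE:", round(rmse, 3))
```

The split is chronological rather than shuffled, so the model is scored on days that come after the ones it was fitted on. On data like this the test error comes out small, which is exactly the kind of result that makes you very confident.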
The theme of the post gives away what the answer to this riddle might be: Somehow, this is about the question you did (or did not) ask yourself when working on this project.
In fact, it is about a buried question. 🍃