I was recently introduced to a new term: "data drift". Data drift is, fundamentally, any change in data that creates more work for data scientists and engineers. Of course, from a learning standpoint, some variation in data is good: if we didn't have a variety of data, we wouldn't be able to learn anything, because we would just have a bunch of copies of a single observation. Data drift might refer to changes in how data are collected or stored (changing formats or schemas), or in what certain data mean (using a catchall category to signify 'missing' or 'other'). The kind of data drift that scares data scientists the most is the kind you cannot see: the kind that breaks your model in production and causes all kinds of damage. This unseen data drift occurs when the data-generating process itself changes. Change is constant. Pitiless and relentless. You will encounter it. You are encountering it now. And even though change is an inescapable, universal phenomenon, machine learning works only if it does not happen.
Why does data change?
Example: Population Genomics
How do we deal with change?
How should we deal with change?
- What is data drift? The data change over time.
- Why do the data change over time? Because everything changes over time.
- Sources of data drift -- relate to novelty.
- What does data drift do to ML? It makes it stop working.
- How do we deal with data drift today?
	- Choose 'stable' features (i.e., ignore the drift)
	- Constantly re-train
	- Lots of manual human decision-making
- How we should be dealing with it
- AI should be dealing with it
- Population genomics
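The "constantly re-train" approach in the notes above is usually triggered by a drift check: compare the data the model was trained on against the data it is seeing live. A minimal sketch of one common check, a two-sample Kolmogorov-Smirnov test on a single numeric feature, is below. The function names and the 0.2 decision threshold are illustrative assumptions, not anything from this draft; real pipelines would pick the threshold from a critical value or a permutation test, and would monitor many features at once.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The maximum CDF gap occurs at one of the observed points
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

def drift_detected(train_sample, live_sample, threshold=0.2):
    # The 0.2 threshold is an illustrative assumption; in practice it
    # would come from a critical value or a permutation test.
    return ks_statistic(train_sample, live_sample) > threshold

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(1000)]
live_same = [random.gauss(0.0, 1.0) for _ in range(1000)]     # same process
live_shifted = [random.gauss(1.5, 1.0) for _ in range(1000)]  # mean has drifted

print(drift_detected(train, live_same))     # no drift flagged
print(drift_detected(train, live_shifted))  # drift flagged
```

Note what this check can and cannot do: it catches shifts in a feature's marginal distribution, but the scariest drift described above, a silent change in the data-generating process that leaves the marginals looking familiar, can slip right past it.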