# The Redpoll Platform vs. Standard Machine Learning

Here we demonstrate some of the differences between standard machine learning approaches and the Redpoll discovery platform. This is not meant to be a rigorous analysis, it is meant to showcase the ease of question-asking the Redpoll platform enables. Because we do not wish to spend months on this analysis, when we are forced to make decisions about which model or method to use under the standard machine learning model we will choose what is easiest, not what is best. To put it differently: **we will be making nearly every bad, lazy decision to make using standard machine learning seem simpler.**

## The Dataset

This dataset comes to us from the Union of Concerned Scientists. It is a dataset of Earth satellites and contains information regarding the owner, operator, hardware characteristics, and orbital characteristics of 1167 satellites.

In the figure above the black cells represent present data, and the white cells represent missing data. There are several columns that have more than half their data missing and there are no rows that have zero missing cells. To further complicate matters, the data comprise a combination of continuous-valued columns like `Apogee_km`

(the highest point of an orbit), and categorical-valued columns like `Class_of_Orbit`

, which typically requires manual preprocessing and restrict modeling decisions.

## Data Preparation and Training

### Redpoll

We give the platform the raw data and ask it to learn, or `run`

. The platform decides what types of data are in each column and builds a model of the dataset. It has no problem with mixed data types or missing data.

>>> import redpoll >>> redpoll.run( ... input="data.csv", ... output="satellites.rp" ... )

### Random Forest

The data pose a number of problems for standard machine learning approaches.

First there are continuous columns like `period_min`

(which describes the time in minutes it takes for a satellite to complete and orbit), and there are categorical columns like `Country_of_Contractor`

. Standard machine learning models usually accept one or the other. Deep Learning and Random Forests work only for continuous inputs and outputs, so we'll have to figure out a way to transform the data to fit the model. You might think that we could simply convert the country into an integer and use that. But is `Switzerland + 1 = France`

? No. So what we have to do is to expand the `Country_of_Contractor`

variable into N - 1 binary variables, where N is the number of possible countries. This is called one-hot encoding. Using one-hot for this one column adds 53 variables. Using one-hot for all categorical columns increases our total variable count from 20 to 432. This is really, really bad because for each additional variable, we need exponentially more data to be able to learn. This is the curse of dimensionality.

The second problem is that there are columns with tons of missing data. Random Forests and Deep Learning require complete data. There are a few ways to handle this. We could drop all columns with missing values. This would leave us with **zero** data. No good. We could fill in the data. Using the mean or median is common for continuous data, but it is biasing because we are feeding the model data we made up; data that were not generated by the process we're trying to learn about. But what about the categorical columns? What is the mean or median country? We could fill in using the most common value (the mode), but this is extremely biasing. We can't just assume that the `Class_of_Orbit`

for every satellites is geosynchronous.

Another option is to create a imputation model that uses the other columns to infer the value of the missing column. So now we are predicting before we predict, and will still have to navigate all the data processing issues to make these imputation models.

Since we must make a decision, we will impute continuous columns with the median and we will impute categorical columns with the mode

>>> import pandas as pd >>> df = pd.read_csv("data.csv", index_col=0) >>> df.shape (1167, 20) >>> continuous_columns = [ ... 'Perigee_km', 'Apogee_km', 'Eccentricity', ... 'Period_minutes', 'Launch_Mass_kg', ... 'Dry_Mass_kg', 'Power_watts', ... 'Date_of_Launch', 'Expected_Lifetime', ... 'longitude_radians_of_geo', ... 'Inclination_radians' ... ] >>> >>> target_column = "Expected_Lifetime" >>> >>> for col in list(df.columns): ... # don't mess w/ the target ... if col == target_column: ... continue ... ... if col in continuous_columns: ... # fill in missing values with mean ... df[col].fillna(df[col].median(), inplace=True) ... ... else: ... # fill in missing values with mode ... df[col].fillna(df[col].mode()[0], inplace=True) ... ... # convert strings to categories ... df[col] = df[col].astype('category').cat.codes ... ... # create and lable one-hot features ... one_hot = pd.get_dummies(df[col]) ... one_hot.columns = ["%s_%d" % (col, label) for ... label in one_hot] ... ... # replace categorical column w/ one-hot dummies ... df.drop([col], axis=1) ... df = pd.concat([df, one_hot], axis=1)

We split the data into a training set and a test set based on which rows are missing the value of `Expected_Lifetime`

.

>>> df_el_in = df.loc[~df[target_column].isnull(), :] >>> df_el_out = df.loc[df[target_column].isnull(), :] >>> df_el_in.shape (878, 432)

We can see that we have increased the number of features from 20 to 432 — a 21.6x increase in dimensionality.

Now we can break the input/training dataset into inputs (x) and outputs (y), which we can feed to our models for training.

>>> from sklearn.ensemble import RandomForestRegressor >>> >>> el_predictor_columns = [col for col in df_el_in.columns if ... col != target_column] >>> el_x = df_el_in.drop(target_column, axis=1) >>> el_y = df_el_in[target_column] >>> >>> rf_el = RandomForestRegressor() >>> rf_el.fit(el_x, el_y)

## Parameter Selection, Feature Selection, and Validation

### Redpoll

The Redpoll platform does not require tuning or validating. This does not mean that tuning and validation occur under the hood, out of sight of the user; the mathematical foundations of the Redpoll platform make these procedures unnecessary. That said, it is easy to discover what the platform thinks is important. For example, we visualize which features are important to every possible prediction. This is called *structure learning* and is one of the most difficult problems in machine learning. The Redpoll platform does this automatically.

>>> rp.heatmap("importance")

Each cell tells us the probability that a statistical dependence exists between the two columns. A higher number (darker color) indicates higher probability.

### Random Forests

The above model was our first pass. It is virtually guaranteed to be suboptimal, especially because we have an average of only two data points per feature. We need to do a couple of things: 1) tweak the random forest parameters, and 2) reduce the number of features so we're not overburdening our data.

If we look at the documentation for the scikit-learn `RandomForestRegressor`

we see that there are 14 parameters to tweak. We're not going to tweak all of those. It's too much work for a questionable payoff. We're going to focus on `n_estimators`

(the number of trees in the forest), `criterion`

(the method for scoring the model), and `max_features`

(the number of features considered). To determine the best parameter set to use we must do a search. We will test a large number of parameter sets and determine which one produces the best performance. We measure performance with cross-validation, which breaks the training set up in to chunks, trains on part of the training set and tests on the left out set. There are many strategies for this, but cross-validation is a must because standard ML models are prone to fit to noise, or to overfit. Overfitting is a common reason that machine learning fails in production.

>>> from sklearn.ensemble import RandomForestRegressor >>> from sklearn.model_selection import cross_validate

The first thing we need to do is to define the space of parameters for our search. Rather than doing a random search, we're going to pre-define the values for each parameter and perform a grid search. We're going to optimize the sum of test R^{2} (an arbitrary decision).

>>> import itertools as it >>> >>> max_features = ['auto', 'sqrt', 'log2'] >>> criterion = ['mse', 'mae'] >>> n_estimators = [10, 100, 500] >>> >>> best_score = None >>> best_params = None >>> for params in it.product(max_features, criterion, n_estimators): ... ftrs, crit, n_est = params ... ... estimator = RandomForestRegressor( ... max_features=ftrs, ... criterion=crit, ... n_estimators=n_est ... ) ... ... scores = cross_validate( ... estimator, ... el_x, ... el_y, ... scoring=('r2',) ... ) ... score = sum(scores["test_r2"]) ... ... if best_score is None: ... best_score = score ... best_params = params ... ... if score > best_score: ... best_score = score ... best_params = params ... >>> best_params ('sqrt', 'mae', 100) >>> best_score 2.7996706065580783

This search over 18 parameter sets took about eight minutes to run. We now need to do feature selection. We will pull the feature importances from the random forest and choose the set of 20 features that are most important to the regression. Twenty features being another arbitrary, suboptimal choice made purely to save up work.

>>> el_model = RandomForestRegressor( ... max_features='sqrt', ... criterion='mae', ... n_estimators=100 ... ) >>> el_model.fit(el_x, el_y) >>> >>> el_importance = pd.DataFrame({ ... "Feature": el_x.columns, ... "Importance": el_model.feature_importances_ ... }) >>> >>> el_importance = el_importance\ ... .sort_values(by=['Importance'], ascending=False) >>> >>> el_features = el_importance.Feature.head(20).values >>> el_x2 = el_x[el_features] >>> >>> scores = cross_validate( ... el_model, ... el_x2, ... el_y, ... scoring=('r2',) ... ) >>> score = sum(scores["test_r2"]) >>> score 2.6586558550652244

## Prediction

### Redpoll

We've done our learning already. All we need to do is to spin up an interface to the platform, connect a client, and ask questions. Because we were able to include the data with the missing targets, we can ask to `impute`

those missing targets rather than treat them as predicting hypotheticals. This allows us to exploit the information that exist in those partial data.

>>> server = redpoll.launch(input="satellites.rp") >>> rp = redpoll.Client("localhost:8000") >>> el_imputations = rp.impute("Expected_Lifetime") >>> el_imputations.head() id uncertainty Expected_Lifetime -- ----------- ----------------- Advanced Orion 2 (NR... 0.0581857 8.38443 Advanced Orion 3 (NR... 0.0497906 8.54544 Advanced Orion 4 (NR... 0.033794 8.22257 Advanced Orion 5 (NR... 0.240803 13.8332 Advanced Orion 6 (NR... 0.0794021 8.10581

### Random Forests

This is pretty simple. We have trained and validated our model. Now all we need to do is feed the model the predictor data.

>>> predictors = [col for col in df_el_out.columns if ... col != target_column] >>> el_predictors = df_el_out.drop(target_column, axis=1) >>> el_predictions = el_model.predict(el_predictors)

## Classification

Let's try to predict the missing values for a categorical variable, `Type_of_Orbit`

. In standard machine learning predicting a categorical variable is called *classification*.

### Redpoll

Asking another question with the Redpoll platform is as simple as asking the question.

>>> ot_imputations = rp.impute("Type_of_Orbit") >>> ot_imputations.head() id uncertainty Type_of_Orbit -- ----------- ------------- AAUSat-3 0.177889 Sun-Synchronous ABS-1 (LMI-1, Lockhee... 0.518497 Sun-Synchronous ABS-1A (Koreasat 2, M... 0.664371 Sun-Synchronous ABS-2i (MBSat, Mobile... 0.575258 Sun-Synchronous ABS-7 (Koreasat 3, Mu... 0.523633 Sun-Synchronous

### Random Forests

We have to start from square one. We have to go back to the raw data because we filled in the missing values in this column. Then we have to fill in the missing values for the old target, so it can be used as a predictior; and we have to spit the dataset into one set were the target is present (training data), and the set for which the target is missing.

We also need to re-do the parameter search and validation.

First, we'll process the data using the same procedure as above.

>>> df = pd.read_csv("data.csv", index_col=0) >>> target_column = "Type_of_Orbit" >>> >>> for col in list(df.columns): ... # don't mess w/ the target ... if col == target_column: ... continue ... ... if col in continuous_columns: ... # fill in missing values with mean ... df[col].fillna(df[col].median(), inplace=True) ... ... else: ... # fill in missing values with mode ... df[col].fillna(df[col].mode()[0], inplace=True) ... ... # convert string categories to int codes ... df[col] = df[col].astype('category').cat.codes ... ... # create and lable one-hot features ... one_hot = pd.get_dummies(df[col]) ... one_hot.columns = ["%s_%d" % (col, label) for ... label in one_hot] ... ... # replace categorical column w/ one-hot dummies ... df.drop([col], axis=1) ... df = pd.concat([df, one_hot], axis=1) ... >>> df_co_in = df.loc[~df[target_column].isnull(), :] >>> df_co_out = df.loc[df[target_column].isnull(), :] >>> >>> df_co_in.shape (521, 425) >>> df_co_out.shape (646, 425)

Now we do parameter selection and cross-validation. The parameters and objective change a bit because we're doing classification rather than regression, but the procedure is the same.

>>> co_x = df_co_in[[col for col in df_co_in.columns if ... col != target_column]] >>> co_y = df_co_in[target_column] >>> from sklearn.ensemble import RandomForestClassifier >>> >>> max_features = ['auto', 'sqrt', 'log2'] >>> criterion = ['gini', 'entropy'] >>> n_estimators = [10, 100, 500] >>> >>> best_score = None >>> best_params = None >>> all_scores = [] >>> for params in it.product(max_features, criterion, ... n_estimators): ... ftrs, crit, n_est = params ... ... estimator = RandomForestClassifier( ... max_features=ftrs, ... criterion=crit, ... n_estimators=n_est, ... ) ... ... scores = cross_validate(estimator, co_x, co_y, ... scoring=('f1_micro',)) ... score = sum(scores["test_f1_micro"]) ... all_scores.append(score) ... if best_score is None: ... best_score = score ... best_params = params ... ... if score > best_score: ... best_score = score ... best_params = params UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning)

We got a lot of warnings. It seems that the random forest algorithm doesn't like when there is only one example of a class. In this case, the `Type_of_Orbit`

s `Cislunar`

and `Retrograde`

each appear only once in the training set.

In any case, we have our best parameter set (unless the warnings mean that the model is compromised) so we continue on to feature selection.

>>> (best_score, best_params) (4.510531135531136, ('auto', 'entropy', 10)) >>> co_model = RandomForestClassifier( ... max_features='auto', ... criterion='entropy', ... n_estimators=10 ... ) >>> co_model.fit(co_x, co_y) >>> co_importance = pd.DataFrame({ ... "Feature": co_x.columns, ... "Importance": co_model.feature_importances_ ... }) >>> co_importance = co_importance\ ... .sort_values(by=['Importance'], ascending=False) >>> co_features = co_importance.Feature.head(20).values >>> co_x2 = co_x[co_features] >>> scores = cross_validate(co_model, co_x2, co_y, ... scoring=('f1_micro',)) >>> score = sum(scores["test_f1_micro"]) >>> score py:667: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5. % (min_groups, self.n_splits)), UserWarning) 4.6069597069597075

And we get the same warning again. But it seems that reducing the number of features has produced a better-scoring model.

## Prediction certainty

How certain are we in our predictions?

### Redpoll

As a part of computing the imputation, we have already computed the imputation uncertainty. But what do those numbers mean?

>>> el_predictions.head(3) id uncertainty Expected_Lifetime -- ----------- ----------------- Advanced Orion 2 (NR... 0.0581857 8.38443 Advanced Orion 3 (NR... 0.0497906 8.54544 Advanced Orion 4 (NR... 0.033794 8.22257

Redpoll's prediction certainty is a measure of the platform's certainty in its ability to model a prediction. It has nothing to do with the variance of the distribution — the model might be very certain there is a high variance. We can visualize impute uncertainty to get a sense of what it means. For a high-uncertainty imputation:

>>> rp.impute_uncertainty( ... row="O3b PFM", ... column="Expected_Lifetime", ... render=True ... );

And for a low-uncertainty imputation:

>>> rp.impute_uncertainty( ... row="Advanced Orion 4 (NRO L-26, USA 202)", ... column="Expected_Lifetime", ... render=True ... );

We see that there is a substantial variance in the imputation distribution in both cases, but that the consensus for `Advanced Orion 4 (NRO L-26, USA 202)`

is much higher, thus its imputation has much less uncertainty.

### Random Forests

A Random Forest classifier provides a softmax function that produces "probabilities" for each class. These probabilities are not uncertainties — though they are often confused for them. We would like to know the uncertainty of each class over softmax functions.

There are few methods for quantifying uncertainty in random forests, but there is nothing in the machine learning library we're using, so we have to bring in another library, forestci.

>>> import forestci >>> forestci.random_forest_error( ... rf_el, ... el_x, ... el_predictors, ... ) RuntimeWarning: overflow encountered in exp g_eta_raw = np.exp(np.dot(XX, eta)) * mask RuntimeWarning: overflow encountered in exp g_eta_raw = np.exp(np.dot(XX, eta)) * mask array([15.77225733, 16.83583305, 16.14404858, ..., 24.89957166, 16.18654164, 15.23748171])

This gives us a "confidence interval" number for each prediction, but it does not give us a way to visualize what that number means. And there were some warnings. We will hope these warning do not mean our certainty information is compromised.

## Anomaly detection

### Redpoll

Again, we've done our learning already. We have created a model of satellites, which means we have an idea of how features of satellites affect other features. Because we have this type of intuitive model, it is easy to see when a satellite violates our intuitive model. We say the degree to which a data point violates our expectation is its novelty — or anomalousness.

First, we ask which satellites have unexpected lifetimes.

>>> rp.novelty("Expected_Lifetime")\ ... .sort_values(by=["novelty"], ascending=False)\ ... .head() id Expected_Lifetime novelty -------------------------- ----------------- ------- International Space Sta... 30 9.62866 Landsat 7 15 5.30769 Optus B3 0.5 4.4782 Intelsat 701 0.5 4.2953 Sicral 1A 0.5 4.28725

Redpoll finds that the ISS is an unusual satellite. Low-Earth-Orbit satellites are not meant to last long, and the ISS is has the longest expected lifetime of any satellite in the dataset.

Next, we ask which satellites have an unexpected orbital period.

>>> rp.novelty("Period_minutes")\ ... .sort_values(by=["novelty"], ascending=False)\ ... .head() id Period_minutes novelty -------------------------- -------------- ------- Wind (International Sol... 19700.5 10.9548 Spektr-R/RadioAstron 0.22 9.18558 Interstellar Boundary E... 0.22 9.16484 Chandra X-Ray Observato... 3808.92 9.14082 Tango (part of Cluster ... 3442 8.89269

The Redpoll platform finds some unusual satellites. Two are reported to orbit the Earth in less than 15 seconds (0.22 minutes). These are data entry errors. Spektr's orbital period is 12769.93 minutes according to Wikipedia. The Wind satellite is another oddity. It sits on a stable gravitational point called a *Lagrange point* so it doesn't move much, so it has a long orbital period. Using the platform, we have found both outliers and data entry errors.

### Random Forests

There is no way to do this with a Random Forest regressor. The standard way that we do this in statistics is to arbitrarily classify all data that are greater than three standard deviations away from the mean as outliers.

>>> pm = df.Period_minutes[~df.Period_minutes.isnull()] >>> pm_dist = ((pm-pm.mean())/pm.std()).abs() >>> outliers = pm_dist[pm_dist > 3] >>> pm = df.Period_minutes.loc[list(outliers.index)]\ ... .sort_values(ascending=False) >>> pm.head() id Period_minutes -- -------------- Wind (International Solar-Terres... 19700.45 Integral (INTErnational Gamma-Ra... 4032.86 Chandra X-Ray Observatory (CXO) 3808.92 Tango (part of Cluster quartet, ... 3442.00 Rumba (part of Cluster quartet, ... 3431.10

Since the data are skewed, this is giving us the satellites with the highest orbital period.

# Conclusion

We have seen that the Redpoll Discovery Platform makes discovery trivial. Standard machine learning is built to model questions. This requires an immense amount of effort from the user to find a valuable question and to frame that question in a way that is both compatible with a specific machine learning method, and will not bias the answer. On the other hand, the Redpoll Discovery Platform is built to model data, which means that it is general to data in real-world condition (messy, mixed, and full of holes), which eliminates most data preprocessing; and once the learning is done, the user may any number of questions with very little effort.