AutoML Engine

Rassket's AutoML engine automates the machine learning pipeline end to end, from data preprocessing to exporting deployment-ready model packages. This page explains what is automated, which decisions Rassket makes, and what you can control.

What is Automated

1. Data Preprocessing

  • Missing Value Handling: Automatic imputation strategies based on data type and domain
  • Duplicate Removal: Identifies and removes duplicate rows
  • Data Type Detection: Automatically identifies numeric, categorical, and datetime columns
  • Encoding: Converts categorical variables to numeric (one-hot, label, target encoding)
  • Scaling: Standardizes or normalizes numeric features for model compatibility
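
To make the first steps above concrete, here is a minimal sketch of duplicate removal and column type detection on raw CSV strings. The function names (`drop_duplicates`, `infer_column_type`) and the single accepted date format are assumptions for illustration, not Rassket's internals.

```python
from datetime import datetime

def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def _is_datetime(value, fmt="%Y-%m-%d"):
    # Assumption: dates arrive as YYYY-MM-DD; a real detector would try more formats.
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def infer_column_type(values):
    """Classify a column as 'numeric', 'datetime', or 'categorical'.

    Empty strings are treated as missing values and ignored.
    """
    present = [v for v in values if v != ""]
    if present and all(_is_number(v) for v in present):
        return "numeric"
    if present and all(_is_datetime(v) for v in present):
        return "datetime"
    return "categorical"

rows = [
    ("2024-01-01", "12.5", "north"),
    ("2024-01-02", "", "south"),
    ("2024-01-01", "12.5", "north"),  # exact duplicate of the first row
]
unique_rows = drop_duplicates(rows)
columns = list(zip(*unique_rows))  # transpose rows into columns
column_types = [infer_column_type(col) for col in columns]
```

Here the second column is still detected as numeric even though one value is missing, which is why missing-value handling and type detection are usually designed together.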

2. Feature Engineering

  • Domain-Specific Features: Creates features based on selected domain
  • Temporal Features: Extracts time-based features from datetime columns
  • Interaction Terms: Creates feature interactions when beneficial
  • Polynomial Features: Generates polynomial combinations (when appropriate)
  • Feature Selection: Removes redundant or low-importance features

3. Problem Type Detection

  • Automatic Detection: Identifies regression vs. classification
  • Binary vs. Multi-class: Distinguishes binary from multi-class classification
  • Target Analysis: Analyzes target variable characteristics
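
A minimal sketch of how such detection could work, assuming a simple heuristic: non-numeric targets, or integer-valued targets with few distinct values, are treated as classification, and everything else as regression. The function name and the `max_classes` threshold are hypothetical; the rule Rassket actually applies is not documented here.

```python
def detect_problem_type(target, max_classes=20):
    """Guess the problem type from a list of target values (None = missing)."""
    values = [v for v in target if v is not None]
    distinct = set(values)
    if len(distinct) < 2:
        raise ValueError("target needs at least two distinct values")

    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        integer_like = all(float(v).is_integer() for v in values)
        if not integer_like or len(distinct) > max_classes:
            return "regression"

    if len(distinct) == 2:
        return "binary_classification"
    return "multiclass_classification"
```

A heuristic like this can misfire, for example on an integer count target with few distinct values, which is one reason an override for the detected problem type is useful.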

4. Model Selection

  • Algorithm Library: Selects from XGBoost, LightGBM, CatBoost, Random Forest, Linear Models, and more
  • Context-Aware Selection: Chooses models appropriate for your problem type and domain
  • Ensemble Methods: Combines multiple models when beneficial

5. Hyperparameter Optimization

  • Automated Tuning: Uses Optuna for Bayesian optimization
  • Search Space: Automatically defines reasonable hyperparameter ranges
  • Cross-Validation: Uses k-fold cross-validation for robust evaluation
  • Early Stopping: Stops optimization when improvements plateau
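
The early-stopping idea can be sketched without any dependencies. Note the substitution: plain random search stands in here for Optuna's Bayesian samplers, and the `patience` rule, function names, and toy objective are all assumptions for illustration rather than Rassket's actual stopping criteria.

```python
import random

def tune_with_patience(objective, sample_params, max_trials=100, patience=10, seed=0):
    """Random search that stops after `patience` consecutive non-improving trials.

    Returns (best_params, best_score, trials_run). Higher scores are better.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    stale = 0        # consecutive trials without improvement
    trials_run = 0
    for _ in range(max_trials):
        params = sample_params(rng)
        score = objective(params)
        trials_run += 1
        if score > best_score:
            best_params, best_score = params, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # improvements have plateaued
    return best_params, best_score, trials_run

# Toy objective: pretend the validation score peaks at learning_rate = 0.1.
def objective(params):
    return -(params["learning_rate"] - 0.1) ** 2

def sample_params(rng):
    # Hypothetical search space with a single hyperparameter.
    return {"learning_rate": rng.uniform(0.001, 0.3)}

best_params, best_score, trials_run = tune_with_patience(objective, sample_params)
```

In a real run the objective would be a cross-validated model score, so each trial is expensive and stopping on a plateau saves meaningful time.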

6. Model Evaluation

  • Metric Calculation: Computes comprehensive evaluation metrics
  • Cross-Validation: Validates model performance robustly
  • Feature Importance: Calculates SHAP values for interpretability
  • Model Diagnostics: Performs statistical diagnostics
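
For reference, three metrics commonly reported for regression problems can be computed as in the sketch below. Which metrics Rassket reports is not listed on this page, so the selection (MAE, RMSE, R²) and the function name are assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for paired lists of true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot  # undefined (division by zero) for a constant target
    return {"mae": mae, "rmse": rmse, "r2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
```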

7. Model Comparison

  • Multi-Model Training: Trains multiple models when requested
  • Performance Comparison: Compares models side-by-side
  • Best Model Selection: Identifies top performer automatically

Full Automation: Rassket handles all of the above automatically. You can optionally override some decisions (like model selection), but defaults work well for most cases.

What Decisions Rassket Makes Automatically

Preprocessing Decisions

  • Imputation Strategy: Chooses median/mean/mode based on data distribution
  • Encoding Method: Selects one-hot vs. label encoding based on cardinality
  • Scaling Method: Chooses standardization vs. normalization based on model needs
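
The sketch below shows one way such rules could look. The skew test (distance between mean and median relative to the standard deviation), the 0.2 threshold, and the cardinality cutoff of 10 are assumed values chosen for illustration, not Rassket's documented rules.

```python
from collections import Counter
from statistics import mean, median, pstdev

def impute(values):
    """Fill None entries and return (filled_values, strategy_name).

    Assumes at least one value in the column is present.
    """
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        spread = pstdev(present)
        # Assumed rule: a mean far from the median suggests a skewed column,
        # where the median is the more robust fill value.
        skewed = spread > 0 and abs(mean(present) - median(present)) / spread > 0.2
        if skewed:
            strategy, fill = "median", median(present)
        else:
            strategy, fill = "mean", mean(present)
    else:
        # Categorical column: fill with the most frequent value.
        strategy, fill = "mode", Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values], strategy

def choose_encoding(values, max_one_hot=10):
    """Pick one-hot for low-cardinality columns and label encoding otherwise."""
    return "one-hot" if len(set(values)) <= max_one_hot else "label"
```

For example, `impute([1, 1, 2, None, 100])` falls back to the median because the outlier pulls the mean far from the center of the data.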

Feature Engineering Decisions

  • Feature Creation: Decides which domain-specific features to create
  • Interaction Terms: Identifies beneficial feature interactions
  • Feature Selection: Removes low-importance features automatically

Model Selection Decisions

  • Algorithm Choice: Selects best-performing algorithms for your problem
  • Model Complexity: Balances model complexity with performance
  • Ensemble Strategy: Decides when to use ensemble methods

Hyperparameter Decisions

  • Search Space: Defines reasonable ranges for each hyperparameter
  • Optimization Strategy: Chooses optimization algorithm (TPE, CMA-ES, etc.)
  • Stopping Criteria: Determines when optimization is complete

Evaluation Decisions

  • Metric Selection: Chooses appropriate metrics for your problem type
  • Validation Strategy: Selects train/validation/test split ratios
  • Cross-Validation Folds: Determines number of folds for robust evaluation
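
As an illustration of what k-fold cross-validation does with the rows of a dataset, the following sketch generates the train/validation index sets for each fold. The function name, the default of five folds, and the fixed shuffle seed are assumptions; the fold count Rassket picks may differ.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k near-equal, disjoint folds
    for i, validation in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(10, k=5))
```

Every row is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split. Shuffled folds like these are not appropriate for time-ordered data, where validation rows should come after training rows.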

How Domain Awareness Improves AutoML Results

Domain-Specific Feature Engineering

When you select a domain, Rassket applies domain-specific feature engineering:

  • Energy Domain: Temporal features, lag variables, peak indicators, seasonal patterns
  • Research Domain: Statistical transformations, interaction terms, experimental features

These features can improve model accuracy because they capture domain-specific patterns that generic AutoML might miss.
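
For the Energy domain, features of this kind might be derived as in the sketch below. The feature names, the 17:00 to 20:59 peak window, and the one-step lag are hypothetical choices for illustration, not the features Rassket actually generates.

```python
from datetime import datetime

def energy_features(timestamps, loads, lag=1, peak_hours=range(17, 21)):
    """Build one feature dict per reading from timestamps and load values."""
    features = []
    for i, ts in enumerate(timestamps):
        features.append({
            "hour": ts.hour,
            "day_of_week": ts.weekday(),   # Monday = 0
            "month": ts.month,             # coarse seasonal signal
            "is_peak_hour": ts.hour in peak_hours,
            # The lag is None for the first `lag` rows, where no earlier reading exists.
            f"load_lag_{lag}": loads[i - lag] if i >= lag else None,
        })
    return features

timestamps = [datetime(2024, 1, 1, h) for h in (16, 17, 18)]
features = energy_features(timestamps, [40.0, 55.0, 60.0])
```

The lag feature lets a model see the previous reading when predicting the current one, which is often the strongest single predictor in load forecasting.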

Domain-Aware Model Selection

Domain selection influences model recommendations:

  • Energy Forecasting: May prioritize time-series models
  • Research Analysis: May prioritize interpretable models

Domain-Contextual Explanations

Domain awareness enables explanations that make sense in your context:

  • Uses domain-appropriate terminology
  • Provides context-specific insights
  • Explains implications in your domain language

What Users Can Control

1. Domain Selection

You choose the domain (Energy or Research) and optionally a sub-domain. This is the primary way you influence AutoML behavior.

2. Target Column

You select which column to predict. This is required for training but optional for initial analysis.

3. Problem Type

You can override automatic problem type detection if needed (though automatic detection is usually accurate).

4. Model Selection

You can choose specific models to train:

  • Select from recommended models
  • Train multiple models for comparison
  • Let Rassket choose automatically (default)

5. Training Data

You control what data to upload and use for training. Rassket handles everything else automatically.

Control vs. Automation: Rassket provides sensible defaults for everything, but you can override key decisions when needed. Most users find the defaults work well.

What is Handled for You

Technical Complexity

  • No need to write Python code
  • No need to understand ML algorithms deeply
  • No need to tune hyperparameters manually
  • No need to implement cross-validation
  • No need to calculate metrics manually

Best Practices

  • Proper train/validation/test splits
  • Cross-validation for robust evaluation
  • Feature scaling and encoding
  • Hyperparameter optimization
  • Model selection and comparison

Production Readiness

  • Model serialization
  • Preprocessing pipeline preservation
  • Export-ready model packages
  • Comprehensive documentation
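
Preserving the preprocessing pipeline matters because a model is only valid on inputs transformed the same way as its training data. The sketch below bundles scaling parameters with model weights in one package. JSON and a linear model are stand-ins chosen to keep the example dependency-free; the package layout and function names are hypothetical and do not describe Rassket's export format.

```python
import json

def export_package(scaler, coefficients, intercept):
    """Serialize preprocessing parameters together with model weights."""
    return json.dumps({
        "format_version": 1,
        "scaler": scaler,              # per-feature mean and std from training
        "coefficients": coefficients,  # per-feature weights on scaled inputs
        "intercept": intercept,
    })

def predict_from_package(package, row):
    """Reapply the stored scaling, then evaluate the linear model."""
    bundle = json.loads(package)
    total = bundle["intercept"]
    for name, weight in bundle["coefficients"].items():
        stats = bundle["scaler"][name]
        total += weight * (row[name] - stats["mean"]) / stats["std"]
    return total

package = export_package(
    scaler={"temperature": {"mean": 20.0, "std": 5.0}},
    coefficients={"temperature": 2.0},
    intercept=10.0,
)
prediction = predict_from_package(package, {"temperature": 25.0})
```

Because the scaler travels with the weights, the consumer of the package never has to recompute training statistics or guess how inputs were transformed.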

AutoML Pipeline Flow

  1. Data Ingestion: Upload and validate CSV file
  2. Domain Selection: Choose domain for specialized processing
  3. Preprocessing: Handle missing values, duplicates, encoding
  4. Feature Engineering: Create domain-specific features
  5. Problem Detection: Identify regression vs. classification
  6. Model Selection: Choose algorithms to train
  7. Hyperparameter Tuning: Optimize model parameters
  8. Model Training: Train selected models
  9. Evaluation: Calculate comprehensive metrics
  10. Comparison: Compare models and select best
  11. Interpretability: Generate SHAP values and explanations
  12. Export: Create deployment-ready packages

Performance Considerations

Training Time

  • Small Datasets (<10K rows): Typically completes in minutes
  • Medium Datasets (10K-100K rows): May take 10-30 minutes
  • Large Datasets (>100K rows): Can take 30+ minutes
  • Multiple Models: Total training time grows roughly in proportion to the number of models, less when models can train in parallel

Optimization Strategies

  • Early Stopping: Stops optimization when improvements plateau
  • Parallel Training: Trains multiple models in parallel when possible
  • Efficient Search: Uses Bayesian optimization for faster convergence

Training Time: Complex models and large datasets require patience. Keep the browser tab open during training. Training progress is shown in real-time.

Model Library

Rassket's AutoML engine includes:

  • Gradient Boosting: XGBoost, LightGBM, CatBoost
  • Tree-Based: Random Forest, Extra Trees
  • Linear Models: Linear Regression, Logistic Regression, Ridge, Lasso
  • Ensemble Methods: Voting, Stacking
  • Neural Networks: Multi-layer Perceptrons (when appropriate)

The engine automatically selects the best models for your problem type and trains them with optimized hyperparameters.

Next Steps

Now that you understand the AutoML engine: