AutoML Engine

Rassket's AutoML engine automates the machine learning pipeline end to end, from data preprocessing to the export of deployment-ready model packages. This page explains what is automated, which decisions Rassket makes for you, and what you can control.

What is Automated

1. Data Preprocessing

Missing Value Handling: Automatic imputation strategies based on data type and domain
Duplicate Removal: Identifies and removes duplicate rows
Data Type Detection: Automatically identifies numeric, categorical, and datetime columns
Encoding: Converts categorical variables to numeric (one-hot, label, target encoding)
Scaling: Standardizes or normalizes features, depending on what the model needs
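
The preprocessing steps above can be approximated outside Rassket with scikit-learn. The following is a minimal sketch, not Rassket's internal code: the column names and toy data are hypothetical, and the median and most-frequent imputation choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, with
# missing values and one exact duplicate row (rows 0 and 1).
df = pd.DataFrame({
    "load_kw": [1.0, 1.0, 2.0, np.nan, 4.0],
    "site": ["a", "a", "b", "a", np.nan],
})

# Duplicate removal
df = df.drop_duplicates()

# Data type detection: split columns into numeric and non-numeric
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Imputation + scaling for numeric columns, imputation + one-hot for the rest
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)  # rows x (1 scaled numeric column + one column per category)
```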

2. Feature Engineering

Domain-Specific Features: Creates features based on selected domain
Temporal Features: Extracts time-based features from datetime columns
Interaction Terms: Creates feature interactions when beneficial
Polynomial Features: Generates polynomial combinations (when appropriate)
Feature Selection: Removes redundant or low-importance features
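
As an illustration of temporal, interaction, and polynomial features, here is a small pandas sketch. The column names (`timestamp`, `temperature`, `occupancy`) and the specific features are hypothetical examples; which features Rassket actually generates depends on the selected domain and is not specified here.

```python
import pandas as pd

# Hypothetical input with one datetime column and two numeric columns
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-06 18:00", "2024-07-15 12:00",
    ]),
    "temperature": [2.0, 5.0, 30.0],
    "occupancy": [10, 3, 20],
})

features = df.assign(
    # Temporal features extracted from the datetime column
    hour=df["timestamp"].dt.hour,
    day_of_week=df["timestamp"].dt.dayofweek,  # Monday=0 ... Sunday=6
    month=df["timestamp"].dt.month,
    is_weekend=df["timestamp"].dt.dayofweek >= 5,
    # Interaction term: product of two numeric columns
    temp_x_occupancy=df["temperature"] * df["occupancy"],
    # Degree-2 polynomial feature
    temperature_sq=df["temperature"] ** 2,
)

print(features.drop(columns="timestamp"))
```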

3. Problem Type Detection

Automatic Detection: Identifies regression vs. classification
Binary vs. Multi-class: Distinguishes binary from multi-class classification
Target Analysis: Analyzes target variable characteristics
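
One plausible way to detect the problem type is to inspect the target column's values. The heuristic below is a hedged sketch: the function name `detect_problem_type` and the `max_classes` threshold are hypothetical, and Rassket's actual rules may differ.

```python
from numbers import Real


def detect_problem_type(target, max_classes=20):
    """Guess the problem type from the target values (illustrative heuristic)."""
    values = [v for v in target if v is not None]
    distinct = set(values)

    # Booleans count as class labels, not as numbers
    numeric = all(isinstance(v, Real) and not isinstance(v, bool) for v in values)
    if numeric:
        has_fractions = any(not float(v).is_integer() for v in values)
        # Fractional values or many distinct values suggest a continuous target
        if has_fractions or len(distinct) > max_classes:
            return "regression"

    if len(distinct) == 2:
        return "binary_classification"
    return "multiclass_classification"


print(detect_problem_type([0, 1, 1, 0]))
print(detect_problem_type(["cat", "dog", "bird"]))
print(detect_problem_type([1.5, 2.7, 3.1]))
```

A heuristic like this can misjudge edge cases, such as integer-coded targets with many levels, which is one reason a manual override of the detected problem type is useful.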

4. Model Selection

Algorithm Library: Selects from XGBoost, LightGBM, CatBoost, Random Forest, Linear Models, and more
Context-Aware Selection: Chooses models appropriate for your problem type and domain
Ensemble Methods: Combines multiple models when beneficial

5. Hyperparameter Optimization

Automated Tuning: Uses Optuna for Bayesian optimization
Search Space: Automatically defines reasonable hyperparameter ranges
Cross-Validation: Uses k-fold cross-validation for robust evaluation
Early Stopping: Stops optimization when improvements plateau

6. Model Evaluation

Metric Calculation: Computes comprehensive evaluation metrics
Cross-Validation: Validates model performance robustly
Feature Importance: Calculates SHAP values for interpretability
Model Diagnostics: Performs statistical diagnostics
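
This page does not list the exact metrics Rassket reports. As an example of what metric calculation involves for a regression problem, the sketch below computes three common ones (MAE, RMSE, and R²) in plain Python; the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2}


print(regression_metrics([3, 5, 7, 9], [2, 5, 8, 9]))
```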

7. Model Comparison

Multi-Model Training: Trains multiple models when requested
Performance Comparison: Compares models side-by-side
Best Model Selection: Identifies top performer automatically
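
A side-by-side comparison can be sketched as training each candidate under the same cross-validation scheme and keeping the highest-scoring one. The example below uses two scikit-learn models on synthetic data; the candidate set and the R² scoring choice are assumptions, not Rassket's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an uploaded regression dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Score every candidate with the same 5-fold split and metric
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print("best:", best_name)
```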

Full Automation: Rassket handles all of the above automatically. You can optionally override some decisions (like model selection), but defaults work well for most cases.

What Decisions Rassket Makes Automatically

Preprocessing Decisions

Imputation Strategy: Chooses median, mean, or mode based on data type and distribution
Encoding Method: Selects one-hot vs. label encoding based on cardinality
Scaling Method: Chooses standardization vs. normalization based on model needs
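
The kind of rule behind these preprocessing decisions can be illustrated with two small helpers. Both function names and both thresholds (`max_one_hot_levels`, `skew_threshold`) are hypothetical; the sketch only shows how cardinality and distribution shape could drive the choice.

```python
from statistics import mean, median, pstdev


def choose_encoding(values, max_one_hot_levels=10):
    """Pick one-hot for low-cardinality columns, label encoding otherwise."""
    n_levels = len({v for v in values if v is not None})
    return "one-hot" if n_levels <= max_one_hot_levels else "label"


def choose_numeric_imputation(values, skew_threshold=0.5):
    """Pick the median for skewed columns, the mean for roughly symmetric ones."""
    present = [v for v in values if v is not None]
    spread = pstdev(present)
    if spread == 0:
        return "mean"  # constant column: mean and median coincide
    # Pearson's second skewness coefficient: 3 * (mean - median) / stdev
    skew = 3 * (mean(present) - median(present)) / spread
    return "median" if abs(skew) > skew_threshold else "mean"


print(choose_encoding(["a", "b", "a"]))
print(choose_encoding([f"id{i}" for i in range(50)]))
print(choose_numeric_imputation([1, 2, 3, 4, 5]))
print(choose_numeric_imputation([1, 1, 1, 2, 100]))
```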

Feature Engineering Decisions

Feature Creation: Decides which domain-specific features to create
Interaction Terms: Identifies beneficial feature interactions
Feature Selection: Removes low-importance features automatically

Model Selection Decisions

Algorithm Choice: Selects best-performing algorithms for your problem
Model Complexity: Balances model complexity with performance
Ensemble Strategy: Decides when to use ensemble methods

Hyperparameter Decisions

Search Space: Defines reasonable ranges for each hyperparameter
Optimization Strategy: Chooses optimization algorithm (TPE, CMA-ES, etc.)
Stopping Criteria: Determines when optimization is complete
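
A plateau-based stopping criterion can be expressed in a few lines. This is a sketch under assumptions: the function name `should_stop`, the `patience` window, and the `min_delta` tolerance are hypothetical and do not describe Rassket's actual criteria.

```python
def should_stop(scores, patience=5, min_delta=1e-3):
    """Return True when the last `patience` trials did not beat the earlier
    best score by at least `min_delta` (higher scores are better)."""
    if len(scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_delta


# Scores stall at 0.80, so optimization can stop
print(should_stop([0.70, 0.75, 0.80, 0.80, 0.80, 0.80], patience=3))
# Scores are still improving, so optimization should continue
print(should_stop([0.70, 0.75, 0.80, 0.82, 0.85, 0.90], patience=3))
```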

Evaluation Decisions

Metric Selection: Chooses appropriate metrics for your problem type
Validation Strategy: Selects train/validation/test split ratios
Cross-Validation Folds: Determines number of folds for robust evaluation
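
To show what the split and fold decisions amount to, here is a plain-Python sketch of a shuffled train/validation/test split and of k-fold index generation. The 70/15/15 ratios and `k=5` are illustrative defaults, not values documented for Rassket.

```python
import random


def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row indices and split them into train, validation, and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_rows, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, val
        start += size


train, val, test = split_indices(100)
print(len(train), len(val), len(test))
for fold_train, fold_val in k_fold_indices(10, k=5):
    print(fold_val)
```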

How Domain Awareness Improves AutoML Results

Domain-Specific Feature Engineering

When you select a domain, Rassket applies domain-specific feature engineering:

Energy Domain: Temporal features, lag variables, peak indicators, seasonal patterns
Research Domain: Statistical transformations, interaction terms, experimental features

These features improve model accuracy because they capture domain-specific patterns that generic AutoML might miss.
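
For the Energy domain, lag variables and peak indicators might look like the following pandas sketch. The column name `load_kw`, the 17:00 to 19:00 peak window, and the 3-hour rolling window are hypothetical choices made for illustration only.

```python
import pandas as pd

# Hypothetical hourly load readings starting at 14:00
load = pd.DataFrame(
    {"load_kw": [50.0, 45.0, 60.0, 120.0, 110.0, 70.0]},
    index=pd.date_range("2024-01-01 14:00", periods=6, freq=pd.Timedelta(hours=1)),
)

features = load.assign(
    # Lag variable: the load one hour earlier (first row has no history)
    lag_1h=load["load_kw"].shift(1),
    # Rolling mean over the current and two previous hours
    rolling_mean_3h=load["load_kw"].rolling(3).mean(),
    # Peak indicator: assumed evening peak window of 17:00 to 19:00
    is_peak_hour=load.index.hour.isin([17, 18, 19]),
)

print(features)
```

The first rows of the lag and rolling columns are empty because there is no earlier data to draw on; a real pipeline would need to drop or impute them.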

Domain-Aware Model Selection

Domain selection influences model recommendations:

Energy Forecasting: May prioritize time-series models
Research Analysis: May prioritize interpretable models

Domain-Contextual Explanations

Domain awareness enables explanations that make sense in your context:

Uses domain-appropriate terminology
Provides context-specific insights
Explains implications in your domain language

What Users Can Control

1. Domain Selection

You choose the domain (Energy or Research) and optionally a sub-domain. This is the primary way you influence AutoML behavior.

2. Target Column

You select which column to predict. This is required for training but optional for initial analysis.

3. Problem Type

You can override automatic problem type detection if needed (though automatic detection is usually accurate).

4. Model Selection

You can choose specific models to train:

Select from recommended models
Train multiple models for comparison
Let Rassket choose automatically (default)

5. Training Data

You control which data you upload and use for training. Once the data is uploaded, Rassket handles preprocessing and feature engineering automatically.

Control vs. Automation: Rassket provides sensible defaults for everything, but you can override key decisions when needed. Most users find the defaults work well.

What is Handled for You

Technical Complexity

No need to write Python code
No need to understand ML algorithms deeply
No need to tune hyperparameters manually
No need to implement cross-validation
No need to calculate metrics manually

Best Practices

Proper train/validation/test splits
Cross-validation for robust evaluation
Feature scaling and encoding
Hyperparameter optimization
Model selection and comparison

Production Readiness

Model serialization
Preprocessing pipeline preservation
Export-ready model packages
Comprehensive documentation
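
Preserving the preprocessing pipeline matters because a model trained on scaled or encoded inputs needs the same transformations at prediction time. The sketch below shows the general idea with a scikit-learn `Pipeline` and the standard-library `pickle` module. Rassket's actual export format is not described on this page, so treat the format and the toy data as assumptions.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 260.0]])
y = np.array([0, 0, 1, 1])

# Bundle preprocessing and model so they are saved and loaded together
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Serialize to bytes and restore; the scaler's fitted statistics travel along
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

print(restored.predict(X))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.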

AutoML Pipeline Flow

1. Data Ingestion: Upload and validate CSV file
2. Domain Selection: Choose domain for specialized processing
3. Preprocessing: Handle missing values, duplicates, encoding
4. Feature Engineering: Create domain-specific features
5. Problem Detection: Identify regression vs. classification
6. Model Selection: Choose algorithms to train
7. Hyperparameter Tuning: Optimize model parameters
8. Model Training: Train selected models
9. Evaluation: Calculate comprehensive metrics
10. Comparison: Compare models and select best
11. Interpretability: Generate SHAP values and explanations
12. Export: Create deployment-ready packages

Performance Considerations

Training Time

Small Datasets (<10K rows): Typically completes in minutes
Medium Datasets (10K-100K rows): May take 10-30 minutes
Large Datasets (>100K rows): Can take 30+ minutes
Multiple Models: Total training time grows roughly in proportion to the number of models, although parallel training can offset part of the increase

Optimization Strategies

Early Stopping: Stops optimization when improvements plateau
Parallel Training: Trains multiple models in parallel when possible
Efficient Search: Uses Bayesian optimization for faster convergence

Training Time: Complex models and large datasets take longer to train. Keep the browser tab open during training. Training progress is shown in real time.

Model Library

Rassket's AutoML engine includes:

Gradient Boosting: XGBoost, LightGBM, CatBoost
Tree-Based: Random Forest, Extra Trees
Linear Models: Linear Regression, Logistic Regression, Ridge, Lasso
Ensemble Methods: Voting, Stacking
Neural Networks: Multi-layer Perceptrons (when appropriate)

The engine automatically selects the best models for your problem type and trains them with optimized hyperparameters.
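
The two ensemble methods named above, voting and stacking, can be illustrated with scikit-learn. In this sketch a logistic regression and a random forest stand in for the library's models; the base models, the synthetic data, and the settings are assumptions and do not reflect how Rassket configures its ensembles.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an uploaded classification dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

ensembles = {
    # Voting: average the base models' predicted probabilities
    "voting": VotingClassifier(estimators=base_models, voting="soft"),
    # Stacking: a final model learns from the base models' predictions
    "stacking": StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in ensembles.items()
}
print(scores)
```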

Next Steps

Now that you understand the AutoML engine:

Follow the Dashboard Walkthrough to see the engine in action
Learn about Outputs & Artifacts that AutoML generates
Explore Use Cases to see real-world applications