AutoML Engine
Rassket's AutoML engine automates the machine learning pipeline, from data preprocessing through to the export of deployment-ready model packages. This page explains what is automated, which decisions Rassket makes for you, and what you can control.
What is Automated
1. Data Preprocessing
- Missing Value Handling: Automatic imputation strategies based on data type and domain
- Duplicate Removal: Identifies and removes duplicate rows
- Data Type Detection: Automatically identifies numeric, categorical, and datetime columns
- Encoding: Converts categorical variables to numeric (one-hot, label, target encoding)
- Scaling: Standardizes or normalizes numeric features for model compatibility
2. Feature Engineering
- Domain-Specific Features: Creates features based on selected domain
- Temporal Features: Extracts time-based features from datetime columns
- Interaction Terms: Creates feature interactions when beneficial
- Polynomial Features: Generates polynomial combinations (when appropriate)
- Feature Selection: Removes redundant or low-importance features
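As an illustration of the temporal features mentioned above, the following is a minimal sketch in plain Python of deriving calendar features from a timestamp. The function name `temporal_features` and the chosen feature names are assumptions made for this example; they are not taken from Rassket, which may extract a different set of features.

```python
from datetime import datetime

def temporal_features(timestamp: datetime) -> dict:
    """Derive simple calendar features from one timestamp (hypothetical helper)."""
    return {
        "hour": timestamp.hour,
        "day_of_week": timestamp.weekday(),      # Monday = 0, Sunday = 6
        "month": timestamp.month,
        "is_weekend": timestamp.weekday() >= 5,  # Saturday or Sunday
    }

# 16 March 2024 was a Saturday
features = temporal_features(datetime(2024, 3, 16, 18, 30))
print(features)
```

Features like these let a model pick up daily, weekly, and seasonal patterns that a raw timestamp does not expose directly.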
3. Problem Type Detection
- Automatic Detection: Identifies regression vs. classification
- Binary vs. Multi-class: Distinguishes binary from multi-class classification
- Target Analysis: Analyzes target variable characteristics
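Problem type detection of this kind can be approximated with a simple heuristic on the target column. The sketch below is an assumption about how such a check might look, not Rassket's actual logic; the function name `detect_problem_type` and the `max_classes` threshold are hypothetical.

```python
def detect_problem_type(target_values, max_classes=20):
    """Guess the problem type from target values (illustrative heuristic only)."""
    values = [v for v in target_values if v is not None]
    distinct = set(values)
    if len(distinct) < 2:
        raise ValueError("target needs at least two distinct values")
    # bool is a subclass of int in Python, so exclude it from the numeric check
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    # Assumed rule: many distinct numeric values suggest a continuous target
    if is_numeric and len(distinct) > max_classes:
        return "regression"
    if len(distinct) == 2:
        return "binary_classification"
    return "multiclass_classification"

print(detect_problem_type([0.1 * i for i in range(100)]))  # continuous target
print(detect_problem_type(["yes", "no", "yes"]))           # two labels
print(detect_problem_type(["low", "medium", "high"]))      # three labels
```

A heuristic like this can misjudge edge cases, such as an integer rating from 1 to 5 that you want to treat as a regression target, which is one reason a manual override is useful.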
4. Model Selection
- Algorithm Library: Selects from XGBoost, LightGBM, CatBoost, Random Forest, Linear Models, and more
- Context-Aware Selection: Chooses models appropriate for your problem type and domain
- Ensemble Methods: Combines multiple models when beneficial
5. Hyperparameter Optimization
- Automated Tuning: Uses Optuna for Bayesian optimization
- Search Space: Automatically defines reasonable hyperparameter ranges
- Cross-Validation: Uses k-fold cross-validation for robust evaluation
- Early Stopping: Stops optimization when improvements plateau
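To make the k-fold cross-validation step concrete, here is a minimal sketch in plain Python of how rows can be partitioned into folds. It is a generic illustration of the technique, not Rassket's implementation; `k_fold_indices` is a hypothetical helper, and it does not shuffle the rows, which a real setup would usually do first.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Each row is used for validation exactly once, so a hyperparameter setting is scored on the average over all folds rather than on one possibly lucky split.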
6. Model Evaluation
- Metric Calculation: Computes comprehensive evaluation metrics
- Cross-Validation: Validates model performance across multiple folds rather than on a single split
- Feature Importance: Calculates SHAP values for interpretability
- Model Diagnostics: Performs statistical diagnostics
7. Model Comparison
- Multi-Model Training: Trains multiple models when requested
- Performance Comparison: Compares models side-by-side
- Best Model Selection: Identifies top performer automatically
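Selecting the top performer can be pictured as comparing mean cross-validation scores. The sketch below assumes that approach; the function name `pick_best_model` and the score values are made up for illustration and do not come from Rassket.

```python
from statistics import mean

def pick_best_model(cv_scores, higher_is_better=True):
    """Return (model_name, mean_score) for the best mean cross-validation score."""
    means = {name: mean(scores) for name, scores in cv_scores.items()}
    chooser = max if higher_is_better else min
    best = chooser(means, key=means.get)
    return best, means[best]

# Made-up per-fold accuracy values, for illustration only
scores = {
    "xgboost": [0.91, 0.89, 0.90],
    "random_forest": [0.88, 0.90, 0.87],
    "logistic_regression": [0.84, 0.85, 0.83],
}
best_name, best_score = pick_best_model(scores)
print(best_name, round(best_score, 3))
```

The `higher_is_better` flag matters because some metrics (accuracy, R²) should be maximized while others (RMSE, log loss) should be minimized.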
What Decisions Rassket Makes Automatically
Preprocessing Decisions
- Imputation Strategy: Chooses median/mean/mode based on data distribution
- Encoding Method: Selects one-hot vs. label encoding based on cardinality
- Scaling Method: Chooses standardization vs. normalization based on model needs
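The preprocessing decisions above can be sketched as simple rules. The following plain-Python example is an assumption about what such rules might look like, not Rassket's actual criteria; the helper names and the thresholds (`0.1` for skew, `10` categories for one-hot encoding) are hypothetical.

```python
from statistics import mean, median, mode

def choose_imputation(values):
    """Pick a fill value: mode for non-numeric data, median when skewed, else mean."""
    present = [v for v in values if v is not None]
    if not all(isinstance(v, (int, float)) for v in present):
        return "mode", mode(present)
    avg, mid = mean(present), median(present)
    spread = max(present) - min(present)
    # Assumed rule of thumb: a mean far from the median suggests skew or outliers
    if spread and abs(avg - mid) / spread > 0.1:
        return "median", mid
    return "mean", avg

def choose_encoding(values, max_one_hot_categories=10):
    """One-hot for low-cardinality columns, label encoding otherwise."""
    cardinality = len({v for v in values if v is not None})
    return "one_hot" if cardinality <= max_one_hot_categories else "label"

print(choose_imputation([1, 2, 3, None, 4]))      # symmetric values
print(choose_imputation([1, 2, 3, 100, None]))    # one large outlier
print(choose_imputation(["a", "b", "a", None]))   # categorical column
print(choose_encoding(["red", "blue", "red"]))
```

The median is less sensitive to outliers than the mean, and one-hot encoding a high-cardinality column would create one new column per category, which is why cardinality drives the encoding choice.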
Feature Engineering Decisions
- Feature Creation: Decides which domain-specific features to create
- Interaction Terms: Identifies beneficial feature interactions
- Feature Selection: Removes low-importance features automatically
Model Selection Decisions
- Algorithm Choice: Selects best-performing algorithms for your problem
- Model Complexity: Balances model complexity with performance
- Ensemble Strategy: Decides when to use ensemble methods
Hyperparameter Decisions
- Search Space: Defines reasonable ranges for each hyperparameter
- Optimization Strategy: Chooses optimization algorithm (TPE, CMA-ES, etc.)
- Stopping Criteria: Determines when optimization is complete
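One common stopping criterion is a plateau check: stop when the best score has not improved meaningfully over the last few trials. The sketch below illustrates that idea under assumed values for `patience` and `min_delta`; it is not Rassket's or Optuna's implementation, and the trial scores are made up.

```python
def should_stop(best_scores, patience=5, min_delta=1e-3):
    """Return True if the best score improved by less than min_delta
    over the last `patience` trials.

    best_scores[i] is the best (highest) score seen up to trial i.
    """
    if len(best_scores) <= patience:
        return False
    return best_scores[-1] - best_scores[-1 - patience] < min_delta

history = []
best = float("-inf")
# Made-up trial scores, for illustration only
for trial_score in [0.70, 0.78, 0.81, 0.81, 0.805, 0.81, 0.809, 0.81, 0.95]:
    best = max(best, trial_score)
    history.append(best)
    if should_stop(history, patience=4):
        break

print(len(history), history[-1])
```

In this run the loop stops after seven trials, so the later 0.95 trial is never evaluated. That is the trade-off of early stopping: it saves time but can miss a late improvement, and `patience` controls how cautious the check is.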
Evaluation Decisions
- Metric Selection: Chooses appropriate metrics for your problem type
- Validation Strategy: Selects train/validation/test split ratios
- Cross-Validation Folds: Determines number of folds for robust evaluation
How Domain Awareness Improves AutoML Results
Domain-Specific Feature Engineering
When you select a domain, Rassket applies domain-specific feature engineering:
- Energy Domain: Temporal features, lag variables, peak indicators, seasonal patterns
- Research Domain: Statistical transformations, interaction terms, experimental features
These features can improve model accuracy because they capture domain-specific patterns that generic AutoML might miss.
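For the Energy domain, lag variables and peak indicators might look like the following minimal sketch. The function name, the column names, and the "within 10% of the maximum" peak rule are assumptions for illustration, and the load values are made up; Rassket's actual feature definitions may differ.

```python
def add_lag_and_peak_features(readings, lag=1, peak_fraction=0.9):
    """Build rows with the current value, a lagged value, and a peak flag."""
    # Assumed peak rule: a reading counts as a peak if it is within
    # (1 - peak_fraction) of the series maximum
    threshold = peak_fraction * max(readings)
    rows = []
    for i, value in enumerate(readings):
        rows.append({
            "load": value,
            # The first `lag` rows have no earlier reading to look back to
            f"load_lag_{lag}": readings[i - lag] if i >= lag else None,
            "is_peak": value >= threshold,
        })
    return rows

hourly_load = [30, 32, 31, 45, 60, 58, 40, 33, 31, 30]  # made-up kWh values
rows = add_lag_and_peak_features(hourly_load, lag=1)
for row in rows:
    print(row)
```

The lagged column gives the model access to the previous reading, which is often a strong predictor of the next one in load data.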
Domain-Aware Model Selection
Domain selection influences model recommendations:
- Energy Forecasting: May prioritize time-series models
- Research Analysis: May prioritize interpretable models
Domain-Contextual Explanations
Domain awareness enables explanations that make sense in your context:
- Uses domain-appropriate terminology
- Provides context-specific insights
- Explains implications in your domain language
What Users Can Control
1. Domain Selection
You choose the domain (Energy or Research) and optionally a sub-domain. This is the primary way you influence AutoML behavior.
2. Target Column
You select which column to predict. This is required for training but optional for initial analysis.
3. Problem Type
You can override automatic problem type detection if needed (though automatic detection is usually accurate).
4. Model Selection
You can choose specific models to train:
- Select from recommended models
- Train multiple models for comparison
- Let Rassket choose automatically (default)
5. Training Data
You control what data to upload and use for training. Beyond the choices listed above, Rassket handles the rest of the pipeline automatically.
What is Handled for You
Technical Complexity
- No need to write Python code
- No need to understand ML algorithms deeply
- No need to tune hyperparameters manually
- No need to implement cross-validation
- No need to calculate metrics manually
Best Practices
- Proper train/validation/test splits
- Cross-validation for robust evaluation
- Feature scaling and encoding
- Hyperparameter optimization
- Model selection and comparison
Production Readiness
- Model serialization
- Preprocessing pipeline preservation
- Export-ready model packages
- Comprehensive documentation
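Preserving the preprocessing pipeline alongside the model matters because new data must be transformed exactly as the training data was. The sketch below shows the general idea with Python's built-in `pickle` module. The package layout, parameter values, and the `predict` helper are hypothetical and do not describe Rassket's actual export format.

```python
import pickle

def standardize(value, mean, std):
    """Apply the same scaling that was fitted on the training data."""
    return (value - mean) / std

# Hypothetical package: fitted preprocessing parameters plus model parameters
package = {
    "preprocessing": {"feature": "temperature", "mean": 20.0, "std": 5.0},
    "model": {"type": "linear", "coef": 2.0, "intercept": 1.0},
}

blob = pickle.dumps(package)   # serialize preprocessing and model together
restored = pickle.loads(blob)  # what a consumer of the package would load

def predict(bundle, raw_value):
    """Scale the raw input with the stored parameters, then apply the model."""
    prep, model = bundle["preprocessing"], bundle["model"]
    scaled = standardize(raw_value, prep["mean"], prep["std"])
    return model["coef"] * scaled + model["intercept"]

print(predict(restored, 25.0))
```

If only the model were saved, a consumer would have to guess the scaling parameters, and predictions on raw values would be silently wrong. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.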
AutoML Pipeline Flow
- Data Ingestion: Upload and validate a CSV file
- Domain Selection: Choose domain for specialized processing
- Preprocessing: Handle missing values, duplicates, encoding
- Feature Engineering: Create domain-specific features
- Problem Detection: Identify regression vs. classification
- Model Selection: Choose algorithms to train
- Hyperparameter Tuning: Optimize model parameters
- Model Training: Train selected models
- Evaluation: Calculate comprehensive metrics
- Comparison: Compare models and select best
- Interpretability: Generate SHAP values and explanations
- Export: Create deployment-ready packages
Performance Considerations
Training Time
- Small Datasets (<10K rows): Typically completes in minutes
- Medium Datasets (10K-100K rows): May take 10-30 minutes
- Large Datasets (>100K rows): Can take 30+ minutes
- Multiple Models: Training time grows roughly in proportion to the number of models, unless they can be trained in parallel
Optimization Strategies
- Early Stopping: Stops optimization when improvements plateau
- Parallel Training: Trains multiple models in parallel when possible
- Efficient Search: Uses Bayesian optimization for faster convergence
Model Library
Rassket's AutoML engine includes:
- Gradient Boosting: XGBoost, LightGBM, CatBoost
- Tree-Based: Random Forest, Extra Trees
- Linear Models: Linear Regression, Logistic Regression, Ridge, Lasso
- Ensemble Methods: Voting, Stacking
- Neural Networks: Multi-layer Perceptrons (when appropriate)
The engine automatically selects the best models for your problem type and trains them with optimized hyperparameters.
Next Steps
Now that you understand the AutoML engine:
- Follow the Dashboard Walkthrough to see the engine in action
- Learn about Outputs & Artifacts that AutoML generates
- Explore Use Cases to see real-world applications