Source: Medium

The Preprocessing Survival Guide for New Data Scientists

  • New data scientists often struggle with messy real-world data, which looks nothing like the clean toy datasets used in training.
  • Understanding the true context of missing values is crucial for accurate preprocessing.
  • In specific cases, such as when a null means the value is genuinely absent, imputing zeros is more meaningful than using means or medians.
  • Choosing between mean and median imputation depends on data distribution, favoring median for skewed data.
  • Category-wise null imputation can provide more accurate results than overall imputation.
  • The pandas drop_duplicates function may overlook rows that differ only subtly, so its parameters (such as subset and keep) require thoughtful selection.
  • Scaling data using StandardScaler or MinMaxScaler is essential for models sensitive to feature magnitudes.
  • Feature engineering helps control the explosion of columns in categorical data, improving model performance and explainability.
  • Dimensionality-reduction algorithms such as PCA can effectively manage an excessive number of one-hot encoded columns.
  • Careful consideration is needed when removing outliers to avoid discarding valuable insights within the data.
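
Several of the points above (near-duplicates that drop_duplicates misses, zero imputation, and category-wise median imputation) can be illustrated with a minimal pandas sketch. The DataFrame, its column names, and the assumption that a missing discount means "no discount" are all hypothetical, invented only for this example; they do not come from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
df = pd.DataFrame({
    "customer": ["Ann", "Bob", "bob ", "Cara", "Dan", "Eve", "Fay"],
    "segment": ["retail", "retail", "retail",
                "corporate", "corporate", "corporate", "retail"],
    "discount": [5.0, np.nan, np.nan, 10.0, np.nan, 0.0, 2.0],
    "income": [40.0, 50.0, 50.0, 900.0, np.nan, 100.0, np.nan],
})

# 1. A plain drop_duplicates() compares rows exactly, so "Bob" and "bob "
#    survive as two separate rows.
n_exact = len(df.drop_duplicates())

# Normalising the key and passing it as `subset` catches the near-duplicate.
df["customer_key"] = df["customer"].str.strip().str.lower()
deduped = df.drop_duplicates(subset=["customer_key"], keep="first").copy()

# 2. Assumption: a missing discount means no discount was given,
#    so zero is more meaningful here than a mean or median.
deduped["discount"] = deduped["discount"].fillna(0)

# 3. Income is skewed by one large value, so the median is the safer
#    central value; compare it with the mean.
overall_mean = deduped["income"].mean()
overall_median = deduped["income"].median()

# 4. Category-wise imputation: fill each null with its segment's median
#    instead of the overall median.
segment_median = deduped.groupby("segment")["income"].transform("median")
deduped["income_filled"] = deduped["income"].fillna(segment_median)

print(n_exact, len(deduped))
print(overall_mean, overall_median)
print(deduped[["customer", "segment", "discount", "income_filled"]])
```

In this toy data the overall median would assign the same income to a retail and a corporate customer, while the segment-wise median keeps the two groups apart. Whether that is appropriate depends on what the categories mean in the real dataset.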
