Making a predictive model more robust to outliers is crucial for improving its accuracy and generalization to real-world data. Outliers are extreme data points that significantly differ from the rest of the data and can negatively impact the model's performance. Here are some methods to enhance the robustness of a predictive model to outliers:
1. Data Preprocessing:
- Outlier Detection: Use outlier detection techniques to identify and handle outliers separately. Common methods include Z-score, modified Z-score, and Interquartile Range (IQR).
- Truncate or Winsorize: Truncate or cap extreme values at a predefined threshold, or winsorize (replace outliers with the nearest non-outlier value) to minimize their impact on the model; a short sketch follows this answer.
2. Data Transformation:
- Log Transformation: Apply log transformation to skewed features, which can reduce the effect of extreme values.
- Box-Cox Transformation: The Box-Cox transformation can stabilize the variance and handle outliers by applying a power transformation.
3. Robust Algorithms:
- Use algorithms that are inherently more robust to outliers, such as decision trees, random forests, and gradient boosting, as they do not heavily rely on the mean or standard deviation.
4. Model Regularization:
- Incorporate regularization techniques like L1 (Lasso) and L2 (Ridge) regularization, which penalize large coefficients, making the model less sensitive to extreme values.
5. Ensemble Methods:
- Employ ensemble methods like bagging and boosting, which combine multiple models to reduce the impact of individual outliers on the final prediction.
6. Weighted Loss Function:
- Use weighted loss functions during training, where the model assigns lower weight to the samples with extreme values, minimizing their influence on the model's optimization.
7. Robust Statistics:
- Utilize robust statistical measures like median and percentile instead of the mean and standard deviation, which are sensitive to outliers.
8. Cross-Validation:
- Employ robust cross-validation techniques like stratified k-fold or leave-one-out cross-validation to ensure that the model generalizes well to different data subsets, including those containing outliers.
9. Data Augmentation:
- For smaller datasets, use data augmentation techniques to increase the sample size artificially, reducing the impact of individual outliers.
10. Domain Knowledge:
- Leverage domain knowledge to understand the data and the potential reasons for outliers. In some cases, outliers might be legitimate data points that need special consideration.
It's essential to choose the appropriate combination of methods based on the specific characteristics of the data and the problem at hand. Evaluating the model's performance using relevant metrics and comparing it against baseline models without outlier handling will help determine the effectiveness of these techniques.
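To make the truncate/winsorize step concrete, here is a minimal Python sketch (NumPy and SciPy assumed); the toy data, the 1.5 × IQR fences, and the 10% winsorization limits are illustrative choices, not requirements:

import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 150.0])   # 150 is an obvious outlier

# Detect with IQR fences, then cap (truncate) anything outside them
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
x_capped = np.clip(x, fence_low, fence_high)

# Winsorize instead: pull the most extreme 10% at each end in to the nearest remaining value
x_winsorized = np.asarray(winsorize(x, limits=[0.1, 0.1]))

print(x_capped)
print(x_winsorized)

Capping keeps every row but limits how far any single value can pull the fit; winsorizing does the same by percentile rank rather than by a fixed fence.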
Outliers can throw a curveball at any predictive model. How will you make sure that your model can handle outliers, or data that does not contribute to your model in a healthy way?
I think the short and practical answer is to think “Tree”.
A tree is well rooted and its branches have a limit. The same is true of a tree-based model.
Whether it is a decision tree, a bagged tree ensemble, or a random forest, all of these models can handle outliers very effectively.
Your model can be an effective learner by following these approaches.
And the good part about Random Forest is that you can use it for classification as well as regression problems. In either case, it can reduce the importance of outliers or of a long tail.
I hope it helps!

Making a predictive model more robust to outliers involves several strategies that can be employed during data preprocessing, model selection, and evaluation. Here are some effective methods:
1. Data Preprocessing
- Outlier Detection and Removal: Use statistical methods (e.g., Z-scores, IQR) to identify and remove outliers from the dataset before training the model.
- Transformation: Apply transformations (e.g., logarithmic, square root) to reduce the impact of outliers. This can compress the range of data and make the distribution more normal.
- Winsorization: Replace extreme values with the nearest values within a specified percentile range, reducing their influence on the model.
2. Robust Models
- Use Robust Algorithms: Some algorithms are inherently more robust to outliers. For example:
- Decision Trees: They are less sensitive to outliers compared to linear models.
- Random Forests: An ensemble of decision trees can average out the influence of outliers.
- Support Vector Machines (SVM): Using a robust kernel can help reduce sensitivity to outliers.
- Regularization: Techniques such as Ridge or Lasso regression can help reduce the influence of outliers by penalizing large coefficients.
3. Evaluation Metrics
- Use Robust Loss Functions: Instead of using mean squared error (MSE), which is sensitive to outliers, consider using robust loss functions like Huber loss or quantile loss (see the sketch after this answer).
- Cross-Validation: Conduct cross-validation to ensure that the model's performance is consistent across different subsets of data, which can help identify the influence of outliers.
4. Ensemble Methods
- Bagging: Bootstrap aggregating (bagging) can help reduce the variance caused by outliers by training multiple models on different subsets of the data.
- Boosting: Use boosting methods that focus on correcting errors made by previous models, which can be less affected by outliers.
5. Feature Engineering
- Create Robust Features: Develop features that are less sensitive to outliers, such as using median or trimmed mean instead of mean for aggregating data.
- Binning: Convert continuous variables into categorical bins to reduce sensitivity to extreme values.
6. Sensitivity Analysis
- Robustness Checks: Perform sensitivity analysis to understand how changes in input data affect model predictions. This can help identify the impact of outliers.
By combining these methods, you can create a predictive model that maintains performance and accuracy, even in the presence of outliers.
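To illustrate the robust-loss point in section 3 (and the robust-algorithm idea more generally), here is a minimal scikit-learn sketch comparing ordinary least squares with Huber loss; the synthetic data and the epsilon value are assumptions made for the example:

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, size=100)
y[X.ravel() > 9] += 80          # a handful of extreme target outliers at high x

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)   # epsilon sets where the loss switches from quadratic to linear

print("OLS slope:  ", ols.coef_[0])    # strongly pulled upward by the outliers
print("Huber slope:", huber.coef_[0])  # much closer to the true slope of 2

Huber loss is quadratic for small residuals and linear for large ones, so the injected outliers stop dominating the fit.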
You should have a look at robust estimators. Robust estimators minimize the sum of the absolute values of the errors instead of the sum of squares. This makes them more resistant to outliers. The reason sum of squares is preferred is that it has an analytic derivative and so is easier to minimize. Contrast the median as an example of a robust estimator, as opposed to the arithmetic mean.
Making a predictive model more robust to outliers is important to ensure the model's stability and accuracy when dealing with extreme data points. Outliers can significantly impact the model's performance, leading to less reliable predictions. Here are some methods to enhance the robustness of a predictive model to outliers:
- Data Preprocessing: Carefully preprocess the data by identifying and handling outliers. Some common techniques include trimming (removing extreme values beyond a certain threshold), winsorizing (capping the extreme values at a specified percentile), and imputation (replacing outliers with more representative values, such as the mean or median).
- Feature Scaling: Apply feature scaling techniques such as normalization or standardization to bring all features to a similar scale. Scaling can reduce the impact of extreme values on the model's performance.
- Robust Algorithms: Use algorithms that are inherently less sensitive to outliers, such as robust regression techniques (e.g., RANSAC, Theil-Sen regression) and tree-based models (e.g., Random Forest, Gradient Boosting); a short sketch follows this list.
- Weighted Loss Functions: Modify the loss function of the predictive model to assign higher weights to the errors associated with regular data points and lower weights to the errors from outliers.
- Ensemble Methods: Employ ensemble techniques like bagging or boosting that combine multiple models to reduce the impact of individual outliers on the final prediction.
- Transformations: Apply data transformations to make the data less sensitive to outliers. For example, using logarithmic transformations can compress the range of extreme values.
- Cross-Validation: Use robust cross-validation methods, such as k-fold cross-validation or stratified cross-validation, to evaluate the model's performance more accurately and minimize the effect of outliers on the validation process.
- Remove Outliers: In some cases, it may be appropriate to remove extreme outliers if they are likely due to data entry errors or anomalies and not representative of the underlying pattern.
- Data Augmentation: Consider data augmentation techniques that generate additional training samples based on existing data to make the model more robust.
- Train on Robust Subsets: If possible, create subsets of the data with reduced outliers or remove outliers from the training set entirely when building the model.
It's essential to strike a balance between making the model robust to outliers while retaining the ability to capture valuable information from the data. Careful experimentation and evaluation of different techniques on a validation set will help identify the most effective approach for the specific predictive modeling task at hand.
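As a sketch of the robust-algorithms bullet above, the RANSAC and Theil-Sen estimators mentioned there are both available in scikit-learn; the toy data and the residual threshold below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import RANSACRegressor, TheilSenRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=200)
y[:20] = 100 + rng.normal(0, 5, size=20)          # 10% gross outliers

ransac = RANSACRegressor(residual_threshold=2.0).fit(X, y)   # fits a plain linear model on consensus inliers
theil = TheilSenRegressor().fit(X, y)                        # median-of-slopes style estimator

print("RANSAC slope:   ", ransac.estimator_.coef_[0])
print("Theil-Sen slope:", theil.coef_[0])
print("points RANSAC kept as inliers:", ransac.inlier_mask_.sum())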
Any outlier needs to be understood. You may have to talk to your stakeholders to understand the reason for it. If a reason is available, add that reason as a “factor” in your regression so that it will absorb the shock in the regression.
However, if no info is available, then do not include it in the model.
If you choose not to include the outlier, the statsmodels package in Python has a way of querying a fitted model for outliers. I learnt this from ChatGPT and it works like a charm. So first build a model, query it for outliers, and then rebuild the model without them.
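A minimal sketch of that workflow (fit, query for outliers, refit without them), assuming statsmodels and a made-up dataset; the Bonferroni cutoff of 0.05 is just a conventional choice:

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 50)
y = 1.5 * x + rng.normal(0, 1, 50)
y[0] += 30                                   # one gross outlier

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# outlier_test() returns studentized residuals with Bonferroni-adjusted p-values
test = model.outlier_test()
keep = (test['bonf(p)'] > 0.05).to_numpy()   # rows NOT flagged as outliers

model_clean = sm.OLS(y[keep], X[keep]).fit()
print("with outlier:   ", model.params)
print("without outlier:", model_clean.params)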
Boosted regression is a good choice, as boosting is designed to fit the next iteration's model to the error term of the previous model. This means that outliers in the original model are given priority for fit in the next iteration. For quick-and-easy predictive modeling, this is one of the first I consider for that reason.
Topological data analysis methods, particularly Morse-Smale-based methods, are a good choice, as well. Initial clustering will separate out univariate and multivariate outliers, and subsequent models will fit to each partition, such that the majority of models will be fit to non-outliers. One of my adapted frameworks was accepted by the Casualty Actuarial Society as a new risk model, particularly focused on outlier subgroups (see here for a short overview: https://www.slideshare.net/ColleenFarrelly/morsesmale-regression-for-risk-modeling).
There is a glut of statistical methods to identify and remove outliers, and the other answers cover these quite well. They are good options prior to modeling, particularly if you aren't going to use methods that are fairly robust to outliers.
Another way to transform your data to be robust to outliers is to do a spatial sign transformation, which works as follows:
[math]x^*_{ij} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{P}x^2_{ij}}}[/math]
As shown in this website below, after the transformation, the predictors are projected to a unit circle, which is evidently robust to outliers.
You can do that easily using the 'caret' package in R. Before doing that, as pointed out by the author, you'll typically need to center and scale the predictors first and, since it's a group transformation, it's better not to remove any predictors afterwards.
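For a rough Python equivalent of that caret preprocessing (scikit-learn assumed): standardize the columns, then scale each row to unit Euclidean norm, which is the spatial sign projection. The toy matrix is made up.

import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [2.0, 1.0], [100.0, 3.0]])   # last row is an outlier

# Center/scale each column, then rescale each row to unit length (the spatial sign)
spatial_sign = make_pipeline(StandardScaler(), Normalizer(norm='l2'))
X_ss = spatial_sign.fit_transform(X)

print(X_ss)
print(np.linalg.norm(X_ss, axis=1))   # every row now has norm 1

After the transform every sample sits on the unit circle, so no single extreme row can dominate distance-based calculations.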
Most people deal with outliers by choosing L1 over L2 regularization, as noted in other answers.
You can also deal with this by using a weighted SVM, if you have an estimate of the confidence of the labels (the outliers would presumably have lower confidence).
In regression models you can account for outliers by using an error distribution with fatter tails than the Normal distribution (for instance a t-distribution with low degrees of freedom). This can for instance be done using Bayesian approaches. See here for an example:
This world is far from Normal(ly distributed): Bayesian Robust Regression in PyMC3
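A minimal sketch of that idea, assuming PyMC3 (3.8 or later); the priors, the made-up data, and the exponential prior on the degrees of freedom are illustrative choices rather than a recommended specification:

import numpy as np
import pymc3 as pm

rng = np.random.RandomState(1)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)
y[:3] += 50                                  # a few heavy outliers

with pm.Model():
    intercept = pm.Normal('intercept', mu=0, sigma=20)
    slope = pm.Normal('slope', mu=0, sigma=20)
    sigma = pm.HalfNormal('sigma', sigma=5)
    nu = pm.Exponential('nu', 1 / 30)        # low degrees of freedom allow fat tails
    pm.StudentT('y', nu=nu, mu=intercept + slope * x, sigma=sigma, observed=y)
    trace = pm.sample(1000, tune=1000)       # the slope posterior stays near 2 despite the outliers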
To make a predictive model more robust to outliers:
- Use special algorithms that are not affected by extreme values.
- Transform the data or limit extreme values to reduce their impact.
- Detect and handle outliers before building the model.
- Combine multiple models to improve accuracy and reduce outlier influence.
If you are dealing with a classification problem, you can use SVMs with ramp loss. Ramp loss is effective when the training data contains outliers.
You need to decide whether you want to learn from the outliers or only focus on the main phenomenon.
There are many algorithms for learning from outliers, especially in mobility-pattern analysis.
I can't list the methods here.
If you search Google for "outlier" together with the field you are interested in, you will find plenty of methods.
You should use the following approaches to make your predictive model more resilient to outliers: use a model that is immune to outliers. Tree-based models are usually not affected by outliers, whereas regression-based models are. If you are doing a predictive evaluation, try a non-parametric test instead of a parametric test.
Yes, and not just XGBoost: any decision-tree-based classifier, or ensemble of classifiers, is not influenced by outliers, if you mean outliers that are exceptionally large or exceptionally small values (the most common meaning of outliers). So single trees (CART, C5, CHAID, etc.) and ensembles of trees (Random Forests, Gradient Boosting Machines, Bagging, Boosting) are all robust to outliers. Outliers literally don't matter to trees, because what makes an outlier an outlier (using the common meaning) is distance from other data points: outliers are far away, most often measured in Euclidean distance. Trees do not care about distance. They only care whether the data is on one side or the other of a split. It doesn't matter at all how far from the split the data point is.
As an aside, it’s possible have an outlier (an unusual or atypical value) that is in the middle of a distribution. I’ve never seen this written up in the literature (let me know if you have a good reference) but I’ve dealt with this phenomenon in the past. These outliers are harder to detect because you have to know more about the multi-modal distribution of the variable to detect the unusual value. For example, if you have a bi-modal distribution, there may be a few values and only a few between the modes (think of a composition of two Normal distributions separated from one another). Trees would still be robust to these outliers.
Well, outliers can be identified just by plotting the data; that is the simplest way. An observation that doesn't match the rest of the observations is an outlier.
(In the accompanying scatter plot, the red dot is the outlier.)
The more technical way is a boxplot. Several software packages can draw a boxplot of the data; points beyond the whiskers may be treated as outliers.
How to handle an outlier? Well, this is a separate science and there is no unique solution. Sometimes a log transformation works, sometimes you need to trim the data, and sometimes the outlier needs no treatment at all.
The question isn’t entirely pinned down, so it’s hard to know how to answer it rigorously. That said, they are roughly inverse to one another: a model that is not robust to outliers at all is very likely overfitting the data.
Consider a common example: estimating central tendency by the median, which is far more robust to outliers than the mean. One might argue that this underfits the data, since a true, extreme outlier can help invalidate your model (e.g., you have assumed the data are normal, and therefore have thin tails, making such an outlier nearly impossible).
But it’s not quite so simple. Underparameterized models — like assuming a parsimoniously parameterized distribution for complex univariate data — can also be non-robust while not overfitting the data. Adding an extreme outlier to a large set of otherwise normally-distributed data could have an enormous impact on both the mean and variance, but not really “overfit” the data, since it is now explaining all the data, including the outlier, badly. It’s basically misspecified.
Nonparametric models can be specifically designed to “detect” outliers and put them, so to speak, in a separate bucket, effectively making the model for the rest of the data robust. They would only grow in complexity (i.e., number of parameters) if the data warrant it, but the analyst could still say something like “the vast majority of the data are captured well by a normal (or Poisson, etc.) model, but we also have detected some anomalies.” I don’t think anyone would view that as “overfitting” the data, in the way that, say, polynomial regression might.
In machine learning, regularization is a standard way to help avoid overfitting, and basically enacts a penalty for doing so. It’s a big topic, but hopefully the description above helps sketch a broad relationship between the two.
I guess one big reason is that they do a slice on the data, and then after that slice, it doesn't matter how big of a value you have. If you had five data points, and one of their features looked like [math] \{ 1, 2, 3, 4, 1000000\} [/math], you might choose a split point at x = 2.5. At that point, 3,4, and a million all go into the same bucket, and their values are treated the same way. You could replace one million with something orders of magnitude bigger and it wouldn't matter, or you could change its value to 5 and it wouldn't matter. This restricts how much influence the outlying point can have. Contrast with linear regression, where the bigger that point gets, the more influence it will have on the entire model.
I don't know if this is a common terminology, but the tree based methods are kind of like [math]L^0 [/math] flavored, which is the most robust you can get.
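A small scikit-learn sketch of that point (toy data made up for the illustration): however extreme the last feature value becomes, the stump's split threshold and its predictions on the ordinary points do not move, while the least-squares slope keeps shrinking.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

for extreme in (1e6, 1e9):
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, extreme]).reshape(-1, 1)
    stump = DecisionTreeRegressor(max_depth=1).fit(x, y)
    lin = LinearRegression().fit(x, y)
    # The stump's threshold and its predictions on the ordinary points are identical for both
    # extremes; the least-squares slope keeps shrinking as the outlier grows.
    print(extreme, stump.tree_.threshold[0], stump.predict(x[:5]), lin.coef_[0])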
The best approach is to take a course on data cleansing.
Instead of asking people, most of whom have no clue, you'll actually learn how to handle them.
I know, some novel shit I tell you.
Here’s some insight for you.
Here’s the graph of some popular loss functions:
Here, the blue curve is hinge loss, red curve is logistic loss, and green curve is least squares loss.
The x-axis corresponds to [math]y f(x)[/math], that is, the product of the true label and the predicted label. Ideally, we want these to be both +1 or both -1, so that when the product is 1, there is no penalty. As you deviate from 1, there are penalties. Two things to observe here w.r.t logistic loss and square loss:
- Square loss diverges to infinity much faster as [math]y f(x)[/math] goes below zero. This is the reason it is less robust to outliers compared to the logistic loss. As you can guess, hinge loss is even better. (More details here — Prasoon Goyal's answer to When does Logistic Regression perform poorly and Support Vector Machine (SVM) should be preferred?)
- Square loss penalizes points even if they are correctly classified. So if the true label [math]y[/math] is 1 and the prediction [math]f(x)[/math] is 2, you still pay a price (although this does not directly contribute to sensitivity to outliers).
(Image source: What are the impacts of choosing different loss functions in classification to approximate 0-1 loss)
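For reference, the three curves described above can be reproduced with a few lines of NumPy/matplotlib, writing the square loss in its margin form (1 - yf(x))² for labels in {-1, +1}:

import numpy as np
import matplotlib.pyplot as plt

yf = np.linspace(-3, 3, 400)             # the margin y * f(x)

hinge = np.maximum(0, 1 - yf)            # hinge loss (SVM)
logistic = np.log(1 + np.exp(-yf))       # logistic loss
square = (1 - yf) ** 2                   # least squares loss in margin form

plt.plot(yf, hinge, label='hinge')
plt.plot(yf, logistic, label='logistic')
plt.plot(yf, square, label='square')
plt.xlabel('y f(x)')
plt.ylabel('loss')
plt.legend()
plt.show()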
Usually, a supervised learning algorithm finds an estimate that minimizes a cost function. Linear regression uses the square loss function, and logistic regression uses the logistic loss function (the cost function of logistic regression).
[math]yf(x)[/math] on the x-axis is simply the product of the actual label [math]y[/math] and the predicted score [math]f(x)[/math].
For example, consider the decision boundary produced by linear regression (square loss):
Due to some outlier observations in the second graph, linear regression gives a decision boundary that classifies the labels poorly. In order to reduce the square loss, it chooses an estimate at the cost of predicting some labels incorrectly. The logistic cost function, on the other hand, doesn't penalise the estimate for the outlier observations.
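A small sketch of that effect (scikit-learn assumed, data made up): threshold a least-squares fit at 0.5 and compare it with logistic regression after a few extreme, correctly labelled positive points are added far to the right.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Two well-separated classes on a line, plus three extreme (but correctly labelled) positives
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 110, 120], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

lin = LinearRegression().fit(x, y)
logit = LogisticRegression().fit(x, y)

# Decision boundary = the x value where the prediction crosses 0.5
lin_boundary = (0.5 - lin.intercept_) / lin.coef_[0]
logit_boundary = -logit.intercept_[0] / logit.coef_[0][0]
print("least-squares boundary:", lin_boundary)   # dragged to the right by the far-away points
print("logistic boundary:     ", logit_boundary) # stays near the gap between the classes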
Overfitting happens when a model trains too long, failing to generalize to new data. Also, when the model becomes complex, it can learn from ‘noise’ in data and overfit, leading to poor performance.
My take on the question with that background: Models robust to outliers can be affected by outliers in the dataset when they get too complex.
For example, a decision tree is robust to outliers, but if it grows deep enough it can end up creating splits just to fit the outliers and overfit. It's recommended to prune it to lessen any outlier effect.
Always test approaches to find out what gives you the best results, and I'd suggest you use deepchecks, since it offers a faster way to validate your data and models.
For example, deepchecks' segment performance check (validation_ds and model are assumed to be defined already):
from deepchecks.tabular.checks.performance import SegmentPerformance
SegmentPerformance(feature_1='workclass', feature_2='hours-per-week').run(validation_ds, model)
I hope this helps.
Outlier detection is a crucial step in data analysis and machine learning to identify observations that deviate significantly from the rest of the data. Once outliers are detected, several techniques can be employed for their treatment. Here's an overview:
Outlier Detection Techniques:
- Z-Score or Standard Score: Detection Method: Calculate the Z-score for each data point and identify those with a Z-score beyond a certain threshold. Treatment: Remove or adjust data points with high Z-scores.
- Interquartile Range (IQR): Detection Method: Define a range based on the interquartile range and identify outliers outside this range. Treatment: Trim or winsorize (capping extreme values) the outliers.
- Density-Based Methods (e.g., DBSCAN): Detection Method: Identify regions with lower data point density as potential outliers. Treatment: Remove or adjust data points in low-density regions.
- Isolation Forest: Detection Method: Construct trees isolating instances; outliers are expected to have shorter paths. Treatment: Remove instances isolated in the forest. (A scikit-learn sketch of Isolation Forest and LOF follows this overview.)
- Local Outlier Factor (LOF): Detection Method: Compares the local density of a data point with that of its neighbors. Treatment: Adjust or remove points with significantly lower density.
Outlier Treatment Techniques:
- Removal: Approach: Simply exclude the outlier from the dataset. Consideration: May lead to data loss and impact statistical analysis.
- Transformation: Approach: Apply mathematical transformations (e.g., log transformations) to reduce the impact of outliers. Consideration: The transformed data may better adhere to assumptions of statistical methods.
- Imputation: Approach: Replace outliers with estimated values based on the rest of the data. Consideration: Imputation methods should be chosen carefully to maintain data integrity.
- Capping/Clipping: Approach: Set a threshold beyond which values are capped or clipped. Consideration: Reduces the impact of extreme values without complete removal.
- Winsorizing: Approach: Replace extreme values with values at a specified percentile. Consideration: A less harsh alternative to trimming.
- Model-Based Correction: Approach: Use statistical or machine learning models to predict values for outliers. Consideration: Requires a model that is not influenced by outliers.
- Separate Analysis: Approach: Analyze the data both with and without outliers to understand their impact. Consideration: Provides insights into the sensitivity of results to outliers.
The choice of outlier detection and treatment techniques depends on the characteristics of the data and the goals of the analysis. It's often valuable to explore multiple methods and assess their impact on the overall analysis.
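As a sketch of the Isolation Forest and LOF checks listed above, both are available in scikit-learn; the synthetic blob, the contamination rate, and the neighbour count are illustrative assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(6, 8, size=(5, 2))])     # 5 planted multivariate outliers

iso_labels = IsolationForest(contamination=0.03, random_state=0).fit_predict(X)   # -1 = outlier
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.03).fit_predict(X)

print("IsolationForest flags:", np.where(iso_labels == -1)[0])
print("LOF flags:            ", np.where(lof_labels == -1)[0])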
Q: How do outliers affect regression?
It depends on the type of regression. In the most widely taught form, you are minimizing the sum of squared errors, but this is equivalent to minimizing the average squared error, and like all averages, a single huge value can dominate the calculation. So the regression can be overwhelmed by one or two outliers.
There is an alternative measure of central tendency, the median, that is immune to outliers. It makes no difference how small the smallest value is; or how large the largest value, the median is just the middle value when everything is sorted.
Rousseeuw introduced Least Median of Squares Regression in 1984; you can read his paper here: http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/LeastMedianOfSquares.pdf
It takes longer, but it also can find the line that fits the majority of points, and excludes outliers.
In general this is part of what is called “robust” statistics, a technical term meaning statistics with some immunity to outliers. It is quite useful if you analyze data that is often subject to error, or occasionally has natural real outliers. For example, if you are in economics and trying to relate some behaviors with income levels, sometimes people have extreme income that you don’t necessarily want to overwhelm your data.
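A simple member of that robust family is median (least-absolute-deviation) regression. Here is a sketch with statsmodels' QuantReg on made-up data; this is LAD rather than Rousseeuw's least median of squares, but it shows the same resistance to extreme values:

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + rng.normal(0, 1, 100)
y[x > 9] += 60                                 # a handful of gross outliers at high x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
lad = sm.QuantReg(y, X).fit(q=0.5)             # median regression: minimizes absolute errors

print("OLS slope:   ", ols.params[1])          # pulled upward by the outliers
print("median slope:", lad.params[1])          # much less affected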
Data without outliers:
X = 4, 6, 8, 10
Mean = 28/4 = 7
Σ(X − mean)² = 3² + 1² + 1² + 3² = 9 + 1 + 1 + 9 = 20
(SD)² = 20/3 ≈ 6.67, so SD ≈ 2.58
Coefficient of variation = SD/Mean × 100 = 2.58/7 × 100 ≈ 36.9%
Data with outliers:
X = 4, 12, 8, 20
Mean = 44/4 = 11
Σ(X − mean)² = 7² + 1² + 3² + 9² = 49 + 1 + 9 + 81 = 140
(SD)² = 140/3 ≈ 46.67, so SD ≈ 6.83
Coefficient of variation = 6.83/11 × 100 ≈ 62.1%
Looking at the calculations, we can see that the mean and SD are both inflated by the outliers:
X1: mean 7, SD 2.58, CV 36.9%
X2 (with outliers): mean 11, SD 6.83, CV 62.1%
Outliers affect the mean and the standard deviation and cause skewness.
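The same comparison can be checked in a couple of lines of NumPy, using the sample standard deviation (ddof=1) as in the hand calculation above:

import numpy as np

for data in ([4, 6, 8, 10], [4, 12, 8, 20]):
    x = np.array(data, dtype=float)
    mean, sd = x.mean(), x.std(ddof=1)
    print(data, "mean:", mean, "sd:", round(sd, 2), "CV:", round(100 * sd / mean, 1), "%")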
XGBoost, and boosting in general, is very sensitive to outliers. This is because boosting builds each tree on the previous trees' residuals/errors. Outliers will have much larger residuals than non-outliers, so boosting will focus a disproportionate amount of its attention on those points.
The answer to this question lies in the functionality of least square method.
The trouble with outliers in the least-squares method is that the Least-squares method only knows data in terms of their mean and their squared differences from the mean. Outliers will distort (either amplify or radically diminish) means in the first place. Then, in the second place, squaring these differences will only accentuate the distortion.
So presence of outliers will make a huge impact on LS method.
Now, for logistic regression, the decision boundary is influenced mainly by the points that are closer to it, hence the effect of outliers on the decision boundary is much smaller. This doesn't mean an outlier cannot have a significant effect on the boundary.
Hope this helps.
Outliers are unusual values that one can see in a dataset. They can be caused by measurement or execution errors. Usually, when a data point varies significantly from the other data points in a dataset, it is called an outlier. That said, deciding what counts as an outlier can be subjective and depends on how well you understand the collected data.
E.g., one can consider different age groups; here I have taken ages between 0-11, 18-28, 65-74, or maybe 90-100.
But a person's age will never be 298, as that is not a valid value; a proper value would be something like 32.
This is just one example, but there are many more one could choose. It is a univariate case, which means only one variable, i.e. age, is considered here.
Now let’s dive deeper and understand the multiple variables or multivariate data.
Detecting outliers and anomalies is very important as they are very useful while applying statistical techniques or for training the Machine Learning (ML) algorithms.
There are quite a few ways or techniques to detect and remove outliers. Below I have mentioned a few. Hope it will be useful for you.
Ways for outlier detection and removal
The analysis used for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removal is carried out on the data frame: to remove an outlier, you follow the same process as removing any entry from the dataset, using its exact position, because all of the techniques mentioned below end up returning the list of data items that satisfy the outlier definition of the method you choose to use.
1) Python library – NumPy
If you're a Python user, you'd probably be comfortable using this method; it can make outlier detection a piece of cake. The original answer showed a code screenshot here (image source: GitHub), which is not reproduced. The idea is to install NumPy and run it on a dataset that has already been imported into Python; the values it returns are your outliers, as in the sketch below.
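A minimal NumPy-only sketch in that spirit, using the usual 1.5 × IQR fences (the data is made up):

import numpy as np

data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 95])   # 95 is the obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)                                    # -> [95]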
2) Interquartile Range (IQR)
The Interquartile Range (IQR) method is mainly a measure of variability, obtained by dividing the data into quartiles: the data are sorted in ascending order and split into four equal parts. Q1, Q2, and Q3, also called the first, second, and third quartiles, are the values that separate the four equal parts.
Now IQR is the range that falls between the first and the third quartile i.e. Q1 and Q3.
So, IQR=Q3-Q1.
Now any data point that falls above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR is detected as an outlier.
3) Numeric Outlier
Numeric Outlier detects and treats the outliers for each of the selected columns individually by means of the interquartile range (IQR).
To detect the outliers for a given column, the first and third quartiles (Q1, Q3) are computed. An observation is flagged as an outlier if it lies outside the range R = [Q1 - k(IQR), Q3 + k(IQR)], with IQR = Q3 - Q1 and k >= 0. Setting k = 1.5, the smallest value in R corresponds roughly to the lower end of a boxplot's whisker and the largest value to its upper end.
If grouping information is provided, outliers can be detected only within their respective groups. If an observation is flagged as an outlier, one can either replace it with some other value or remove/retain the corresponding row. Missing values contained in the data are ignored, i.e., they are neither used for the outlier computation nor flagged as outliers.
4) Z-Score treatment
Z- Score is also called a standard score. It is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.
This value/score helps to understand how far is the data point from the mean. And after setting up a threshold value one can utilize z-score values of data points to define the outliers.
Z-score = (data point - mean) / standard deviation
The intuition behind Z-score is to describe any data point by finding its relationship with the Standard Deviation and Mean of the group of data points. Z-score is finding the distribution of data where the mean is 0 and the standard deviation is 1 i.e. normal distribution.
While calculating the Z-score we re-scale and center the data and look for data points that are too far from zero. These data points which are way too far from zero will be treated as outliers. In most of the cases, a threshold of 3 or -3 is used i.e., if the Z-score value is greater than or less than 3 or -3 respectively, that data point will be identified as an outlier. One can also use the Z-score function defined in the SciPy library to detect the outliers.
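A short sketch of the SciPy route (the sample data is made up, and the ±3 threshold is the conventional illustrative choice):

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
data = np.append(rng.normal(50, 5, size=100), 120.0)   # 100 ordinary values plus one extreme one

z = stats.zscore(data)                                 # (x - mean) / standard deviation
print("flagged as outliers:", data[np.abs(z) > 3])     # only the extreme value exceeds |z| = 3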
According to my observations, the easiest way to remove outliers is to simply delete the observations, either when the value entered is due to a data entry or data processing error or when the number of outlying observations is small. Other methods include imputing, transforming, and binning values.
I hope this information will help you. All the best!
Outliers are anomalous values in the data. Outliers tend to increase the estimate of sample variance, thus decreasing the calculated F statistic for the ANOVA and lowering the chance of rejecting the null hypothesis. Run ANOVA on the entire data. Remove outlier(s) and rerun the ANOVA. If the results are the same then you can report the analysis on the full data and report that the outliers did not influence the results. This study finds evidence that the estimates in ANOVA are sensitive to outliers, i.e. that the procedure is not robust. Samples with a larger portion of extreme outliers have a higher type-I error probability than the expected level. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. There is, of course, a degree of ambiguity. Qualifying a data point as an anomaly leaves it up to the analyst or model to determine what is abnormal—and what to do with such data points. An ANOVA is quite robust against violations of the normality assumption, which means the Type 1 error rate remains close to the alpha level specified in the test. Violations of the homogeneity of variances assumption can be more impactful, especially when sample sizes are unequal between conditions.
Tree-based methods divide the predictor space, that is, the set of possible values for X1, X2, ..., Xp, into J distinct and non-overlapping regions, R1, R2, ..., RJ. In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.
The goal is to find boxes R1, R2, ..., RJ that minimize the residual sum of squares (RSS), given by [math]\sum_{j=1}^{J}\sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2[/math], where [math]\hat{y}_{R_j}[/math] is the mean response of the training observations within the j-th box.
Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. The approach is top-down because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.
It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
We first select the predictor Xj and the cutpoint s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest possible reduction in RSS.
Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions.
However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further,so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
For example, since extreme values, i.e. outliers, never cause much reduction in RSS on their own, they are never involved in a split.
Hence, tree-based methods are insensitive to outliers.
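An illustrative (not textbook) sketch of one greedy split search: scan the candidate cutpoints, keep the one with the lowest two-region RSS, and note that making the extreme x value even more extreme does not change the chosen cutpoint. The toy data is made up.

import numpy as np

def best_split(x, y):
    # Try the midpoint between every pair of consecutive sorted x values and
    # return the cutpoint s with the smallest two-region RSS.
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_s, best_rss = None, np.inf
    for i in range(1, len(xs)):
        s = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
for extreme in (1e6, 1e9):
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, extreme])
    print(extreme, best_split(x, y))    # the chosen cutpoint (3.5) never changes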
No, no ML model is robust to outliers. Outliers are a property of the data, and regardless of which model you use, you need to remove them. However, when you are using XGBoost, you don't need to standardize unless it's a very big dataset.
The essential problem with outliers is their “undue” (whatever that means), influence on summary statistics you are calculating.
One approach to dealing with this is to calculate your summary statistic and eliminate a fixed percentage of the most extreme data-points as judged by the statistic and then recalculate the statistic. If the same extreme datapoints would be excluded using the new version of the statistic, you’re done. If not, modify the excluded data points and repeat.
Here are two articles I wrote years ago doing this for correlation analysis:
If you are not familiar with covariance ellipses, read this:
I am not advocating this, because a datapoint is either wrong (mis-measured, instrument failure, etc.) or it is real, so why are you excluding it?
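For what it's worth, here is a rough sketch of that iterate-until-stable idea for a correlation; the 5% trimming fraction and the use of residuals from a straight-line fit as the "extremeness" measure are my assumptions, not the author's articles:

import numpy as np

def trimmed_corr(x, y, trim_frac=0.05, max_iter=20):
    # Repeatedly: fit a line on the currently kept points, rank every point by its
    # absolute residual under that fit, exclude the worst trim_frac, and stop once
    # the excluded set no longer changes.
    n_drop = max(1, int(trim_frac * len(x)))
    excluded = set()
    for _ in range(max_iter):
        keep = np.array([i not in excluded for i in range(len(x))])
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        resid = np.abs(y - (slope * x + intercept))
        worst = set(np.argsort(resid)[-n_drop:].tolist())
        if worst == excluded:
            break
        excluded = worst
    keep = np.array([i not in excluded for i in range(len(x))])
    return np.corrcoef(x[keep], y[keep])[0, 1], sorted(excluded)

rng = np.random.RandomState(0)
x = rng.normal(0, 1, 100)
y = 0.8 * x + rng.normal(0, 0.3, 100)
y[:3] += 10                                        # three gross outliers
print("plain correlation:  ", np.corrcoef(x, y)[0, 1])
print("trimmed correlation:", trimmed_corr(x, y))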
A point has to lie out from something.
If you did a study of IQ among Nobel Prize winners, you might say that all Nobel Prize winners are outliers among the general population. But you can’t use in-sample evidence to identify them, you have to use exogenous evidence.
Another case is you may have data that seems so idiosyncratic that you can’t do any useful analysis. For example, if you wanted to predict auction prices of art that sells for over $10 million, you might decide that each piece was unique, with its own story, that there were no common elements that drove simple price relations.
The best way isn’t taught much in university courses. Namely to “teach” what we already know, before we try to “learn” from our data. We need hybrid methods.
Hundreds of examples could be offered here. Two should make the point:
- In a live Big Data demonstration of predictions of lung diseases, clustering and correlation showed cancer caused people to smoke. The speaker calmly swapped the cause/effect arrow because “we all know” that’s right. But he let the data drive all the other causal indicators. Of course, what was needed was some expert advice on any mistakes or missing realities.
- A retailer’s data science team, looking at seasonal stocking patterns, saw a big demand for plywood in a short seasonal window, in one particular city, and recommended positioning a large stock for the next season’s sales. The inventory managers patiently explained the city in question had been hit by hurricanes on the same week two years in a row. The odds of it repeating were very low. What the retailer did (already) was to stock plywood in regional distribution centers for hurricane season, and move the inventory into the predicted path of named storms.
So the lessons are:
- Ask the experts what rules, relationships and limitations are known. Build those in, or at least check for them. This will save you the embarrassment of thinking you have noticed faster than speed of light travel times, or other silly things.
- When a pattern or prediction seems to appear, ask an expert why you are seeing it BEFORE you offer formal recommendations. This will save you the experience of helping the experts “prove” your analysis was a waste of time and money.
In the real world we never have data as clean as a Kaggle competition or your homework assignment, and there are always factors at play that aren’t fully explained by your data set. “Never” and “always” are strong words, but they seem to be true in this case.
The mean of the data changes drastically in the presence of outliers. Since the standard deviation is calculated using the mean, the standard deviation is also badly affected by outliers.
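A quick illustration with made-up numbers:

import numpy as np

data = np.array([10, 11, 9, 10, 12, 11, 10])
with_outlier = np.append(data, 100)                  # one extreme value

print(data.mean(), data.std())                       # roughly 10.4 and 0.9
print(with_outlier.mean(), with_outlier.std())       # mean jumps to about 21.6, std to about 29.6
print(np.median(data), np.median(with_outlier))      # the median barely moves: 10.0 vs 10.5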
The effects of outliers in a dataset can result in longer training times, skewed results, and ultimately less accurate models.
Overall, outliers cause bias, reduce statistical test power, and negatively influence predictions.
Some models, such as regression-based models, are sensitive to outliers, while others, such as tree-based models, are robust to them. Robust models usually handle outliers automatically, but overfitting can occur when a tree fits every sample in the training set. For decision trees, pruning helps overcome overfitting.
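For the pruning point, here is a minimal scikit-learn sketch using cost-complexity pruning; the ccp_alpha value is an arbitrary placeholder that you would normally tune:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())                # the pruned tree is smaller
print(full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))  # and often generalizes better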
There’s usually no single right way to handle outliers in a dataset. Seasoned data scientists and ML engineers study the dataset and understand its features during data preparation, before building models. If outliers are likely to affect your training, it’s best to remove them. Sometimes you can ignore them if your model selection is robust to outliers.
The best way is to identify and test your data for potential outliers, then decide how to handle them. I use the deepchecks outlier detection check:
import pandas as pd
from sklearn.datasets import load_iris
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks.integrity.outlier_sample_detection import OutlierSampleDetection
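Continuing from those imports, a hedged sketch of how the check might be run on the iris data; the exact import path and arguments vary between deepchecks versions, so treat the details as assumptions to verify against your installed version:

iris_df = load_iris(as_frame=True).frame        # pandas DataFrame including the 'target' column
dataset = Dataset(iris_df, label='target')      # wrap the frame so deepchecks knows the label
result = OutlierSampleDetection().run(dataset)  # scores samples and reports the most outlying ones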
I hope this helps.
I have thrown out outliers when doing regressions on rare occasions. I have also used binary (dummy) variables in MLS to drop the outliers, again very rarely. Another option is to use the median in place of the mean when normalizing the distribution in probability analysis.
It is generally not a good idea to screw around with the data. Strange events occur naturally. “Bad data” sometimes occurs, such as an instrument failure or a data transcription error. Those are the ones to put in jail with a footnote.
The real trick is to understand WHY the values are outliers. If you can’t write an explanation in a footnote, don’t ignore the data point. It is also a good idea to quantify just how much the outliers are affecting your analysis.
In doing electric peak load forecasting in Florida I ran across one winter out of 20 that was especially cold. I agonized over that data point every year. The data was accurate, but that one winter had enough impact on the forecast to call for new generation five years earlier than without the outlier.
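One simple way to quantify how much a single outlier is moving your results, sketched here with made-up load data: fit the model with and without the suspect point and compare the estimates.

import numpy as np

years = np.arange(2000, 2020)
rng = np.random.default_rng(0)
peak_load = 50 + 1.5 * (years - 2000) + rng.normal(0, 1, len(years))
peak_load[5] += 15                                  # one unusually cold winter

slope_all, _ = np.polyfit(years, peak_load, 1)
keep = np.arange(len(years)) != 5
slope_trimmed, _ = np.polyfit(years[keep], peak_load[keep], 1)
print(slope_all, slope_trimmed)                     # how much does the growth estimate change?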
MISSING VALUES:
There is no golden solution for dealing with missing values. It all depends on the situation. I’ll mention a few.
- If a feature has too many missing values, then drop the whole feature.
- If the feature is too important to drop, then introduce another binary “is null” feature and impute the null values of the existing feature with the median/mean.
- If there are very few missing values in a feature and removing those rows doesn't hurt the sample size, then remove the rows.
- If removing rows with missing values in any of the features reduces the sample size drastically, then go for imputation. There are multiple ways to do that (a short sketch follows this list):
- Impute with the mean/median of the column.
- Impute with the mean/median of that column over the N nearest neighbors.
- If it's a time series data set, then use a Markov chain to predict the missing values.
- If each row is a time series and your algorithm doesn't require the rows to be the same length, then leave it as is. One example would be the dynamic time warping distance between time series.
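A minimal sketch of the indicator-feature, median, and nearest-neighbour imputation options using scikit-learn (the tiny DataFrame is made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [10.0, np.nan, 30.0, 40.0, 50.0]})

df["x_isnull"] = df["x"].isna().astype(int)                            # binary "is null" feature
median_filled = SimpleImputer(strategy="median").fit_transform(df[["x", "y"]])
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df[["x", "y"]])   # mean of the 2 nearest neighbours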
There are many other ways to deal with this. I listed a few from my own small experience of data pre-processing. Come up with your own creative solution once you are stuck with missing values.
OUTLIERS:
Straight away, drop those; I hate them! Now it depends on how you decide whether a data point is an outlier or not. The most common way is to use the standard deviation. Never try to impute outliers with some other values; that will harm your analysis. Also, recalculate the statistics after you remove the outliers.
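A minimal sketch of the standard-deviation approach on made-up data, using the common (but arbitrary) three-sigma cutoff:

import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.normal(10, 1, 50), 95.0)     # 50 ordinary points plus one extreme value

z = np.abs(x - x.mean()) / x.std()
cleaned = x[z < 3]                             # drop points more than 3 standard deviations out
print(len(x) - len(cleaned))                   # the extreme value is dropped
print(cleaned.mean(), cleaned.std())           # recalculate the statistics after removal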
If you’re dealing with observations that should be independent and identically distributed, then there are three good general methods:
In every case, the method assigns a score to each observation measuring how anomalous it is, and it’s up to the user to decide which points are outliers. That’s not usually a completely trivial problem.
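The three methods aren't named here, but as one example of a method that assigns a per-observation anomaly score (an illustration, not necessarily one of the three the author had in mind), scikit-learn's Isolation Forest works exactly this way:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),       # ordinary observations
               [[8.0, 8.0], [-9.0, 7.0]]])            # two obvious anomalies

scores = IsolationForest(random_state=0).fit(X).score_samples(X)
print(np.argsort(scores)[:5])    # lower scores are more anomalous; the cutoff is still up to you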
If your data has time series structure, then the problem gets more complicated. You have to worry about trends and seasonality and nonstationarity in general. There are a few methods out there, but I’m not aware of any small number of methods that generally work well.
If your data has a spatial dependence structure, you’re on your own here.
Another answer, in addition to the ones given, is the “trimmed mean”. This discards the N most extreme values and calculates the mean on the remaining data.
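A quick sketch using SciPy, which trims a fixed proportion from each tail rather than a count N, but the idea is the same:

import numpy as np
from scipy.stats import trim_mean

x = np.array([10, 11, 9, 10, 12, 11, 10, 100, -50], dtype=float)
print(x.mean())                # about 13.7, pulled around by the two extreme values
print(trim_mean(x, 0.2))       # about 10.4, after cutting 20% from each tail (here, the top and bottom value)
print(np.median(x))            # 10.0, another robust alternative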
In any case, if you have outliers, it’s a really good idea to see what the heck is going on…
It could be something dumb like using 9999 to indicate missing value. Or the value was recorded incorrectly. Or any number of other things.
Or it could be real, in which case you really want to find out why. For example, there was a case with a lot of outliers in the magnetic-ink number readings on checks. It turned out the cause was forged checks.
How outliers are dealt with depends on what you need the data for.
The presence of outliers is often a sign that there are several distinct populations in the dataset… so you need to reflect that in the conclusions.
This question seems broad; it depends on what problem you are trying to solve and the current state of your model. Something like transfer learning could give you high accuracy from training on only 100 examples, and the improvement might not be significant if you fine-tune it on a million examples. Also, you need to check whether your model is getting stuck in local minima; changing hyper-parameters and tuning them with grid search is one way to improve the performance of your predictive model. Lastly, depending on the problem you are trying to solve, accuracy need not be the only performance metric you track to analyze model performance.
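For the grid-search point, a minimal scikit-learn sketch; the model, parameter grid, and scoring metric are arbitrary placeholders chosen for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)    # and track metrics beyond plain accuracy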