Contents

# Outlier|Definition & Meaning

## Definition

In the field of **statistics**, some **data** **points** may vary a lot from the rest of the **data** **points**, and supposedly they might hold some important or useful information. Such **points** differ from the overall **observational** **points** and are known as **outliers**. The reason for an **outlier** to differ can be the drift in measurement or an indication that such a point holds some **valuable** **information**.

Most of the time, the **data** **points** that are farther away from the rest are omitted from the set. Another reason can be because of an **experimental** or human **error** which may result in a significant drift.

Not every **outlier** is supposed to be causing a serious problem in your **experimental** **analysis**, but neither does every **outlier** can be a hint for an exhilarating **opportunity**.

## Defining an Outlier

In any **distribution**, outliers might happen by chance. However, they can also signal measurement errors, abnormal conditions or structures in the **data** set, or a heavily loaded propagation in the **population**.

Far more often, in **extremely** large **data** collections, certain **data** **points** will be significantly distant from the mean of the **distribution** compared to what is regarded as fair.

It can be because some observations are distant from the middle of the **spectrum**, unintentional periodic **faults**, or problems with the concept that formed the **hypothesized** group of **stochastic** **processes**.

Consequently, **outlier** values may hint at inaccurate **information**, improper **methodology**, or regions in which a **particular** **concept** may not necessarily apply. Nevertheless, it is common to anticipate a few outliers in big **datasets**.

## What Is a Data Set?

An assortment of **data** is referred to as a **dataset**. Usually, this collection is shown in a **tabulated** **form**. A specific parameter is described in each section. Moreover, depending on the proposed issue, each array refers to a certain element of the information set. **Data** **management** includes this.

The values of every parameter for uncertain things, such as the size, **mass**, **warmth**, **density**, etc., of an item or the values of **different** numbers, are described in **data** **sets**.

The **data** set comprises information from one or perhaps more participants for each array. Now let us discover the concept of a **dataset**, multiple **dataset** kinds, **characteristics**, and more.

Depending upon the nature of the environment, **data** **sets** can have several types, but let’s look into some of the most common ones used in **machine** **learning** and **AI**.

### Numerical Sets

As the name suggests, **data** represented in the form of numbers instead of alphabetical letters is known as a numerical **dataset**. The weight of a human body or the length of a boat are examples of numerical **datasets** on which arithmetic operations can be performed.

### Bivariate Set

A bivariate set consists of two parameters in a **data** set. This group of **data** addresses the connection between those two **parameters**. Usually, two types of related **data** are typically present.

For instance, to determine a soccer team’s player **age** and **goal** **percentage**. Age and goals scored are two criteria that can be addressed.

### Categorical Sets

Value or attribute **sets** for people or things are represented by **categorical** **sets** of **data**. Polytomous values are categorical parameters or variables that have more than two **possible** **quantities**. The gender status of a **human** being is a solid example of categorical **sets**.

## How Can an Outlier Be Recognized?

There are certain ways to recognize or identify an **outlier**, but we will look into the **mathematical** methods to find an **outlier** from a given numerical **data** set.

An **outlier** can be of two types, a lower **outlier**, and a higher **outlier**, and both of these outliers can be found if they fulfill the following requirements:

outlier < First Quartile – 1.5 x (Inter-Quartile Range)

outlier > Third Quartile + 1.5 x (Inter-Quartile Range)

The first requirement is for the lower **outlier **as the **data **point has to fulfill the **requirement **of being less than 1.5 times the interquartile range from the first quartile.

Similarly, the second requirement is for the higher **outlier** as the **data** point has to fulfill the **requirement** of being greater than 1.5 times the **interquartile** **range** from the third quartile.

But first, we have to find the Q_{1}, Q_{2}, and IQR, which will be explained in the example.

## How To Handle the Outlier?

Depending on the circumstance, one should decide how to cater to an **outlier**. Here are a few methods that will help you to retain or eliminate your **outlier** from the **data** set.

### Preservation

Outliers really shouldn’t be immediately ignored even though a **parametric** **statistical** model would be suitable for the **variables** getting examined because they are predicted for large amounts of **data**. The system should characterize **data** with circulating **outlier** values using machine learning **techniques** that are resilient to misfits.

### Dismissal

Dismissal of an **outlier** is not considered a good practice among the **scientific** community as well as the educators who teach such subjects. From the mathematical **perspective**, there are certain **quantitative** ways in which flawed **data** can be dismissed.

Dismissal or exclusion of an **outlier** is better practiced in areas where there is a known chance of an error in the measurement or if there lies an instrumental fault causing the **exclusion** of an **outlier**.

### Irregular Distributions

Take into account that perhaps the **data**‘s regression line or **distribution** might not have normal behavior and could **potentially** have “ifs and buts.” A good example is a Cauchy **distribution** in which the variability of **data** **points** increases with the increase in sample size and thus throws the outliers as far as they can from the standard **distribution**.

## Solved Example

Let’s take an odd **distribution** and find out the lower **outlier** and the upper **outlier** using the equations described above:

15, 16, 3, 5, 5, 34, 10, 11, 12, 4, 1

### Solution

Arranging in **ascending** **order:**

1, 3, 4, 5, 5, 10, 11, 12, 15, 16, 34

**Finding the Median**

Next, we will calculate the **median,** which is 10, as it is in the middle. So Q_{2} = 10.

**Calculating **Q_{1}

Q_{1} can be found by figuring out the **median** of the lower half of the **data** set. Thus the **median** of 1, 3, 4, 5, 5 is 4, which is the value of Q_{1}.

**Calculating **Q_{1}

Similarly, Q_{1} can be found by figuring out the median of the **upper** **half** of the **data** set. Thus the **median** of 11, 12, 15, 16, 34 is 15, which is the value of Q_{1}.

**Finding the Value of IQR**

The **interquartile** **range** can be easily found by taking the difference between the **upper** **quartile** and the **lower** **quartile:**

IQR = Q_{1} – Q_{1}

IQR = 15 – 4

IQR = 11

**Finding an Outlier in the Dataset**

As of now, we have the following information about the **dataset:**

Q_{1} = 4

Q_{2} = Med = 10

Q_{1} = 15

IQR = 11

Min = 2

Max = 34

So the **outlier** can be calculated as:

outlier < Q_{1} – 1.5 x (Inter-Quartile Range)

outlier < 4 – 1.5 x (11)

outlier < -12.5

Since there is no value less than -12.5, the lower **outlier** of the **dataset** doesn’t exist.

Now let’s find the other one:

outlier > Q_{1} + 1.5 x (IQR)

outlier > 15 + 1.5 x (11)

outlier > 31.5

If we peek inside the given **dataset**, we can find a number that is larger than 31.5, which is 34, so the upper **outlier** of the current **dataset** is 34.

By **applying** the same **methods** as above, you can find an **outlier** of an even **dataset. **The **difference** lies in the **process** of the **median** hunt, whereas the rest remains the same.

*All images/mathematical drawings were created with GeoGebra.*