7. Automatic check for duplicates

Instruction

Good! Duplicate rows are not easy to spot at first glance. Luckily, just as with NaN values, we can check whether a given DataFrame or a specific column contains duplicates. To look for duplicate rows in a DataFrame named cars, we can write:

cars.duplicated().values.any()

To check whether a specific column named vin contains any duplicate values, we can write:

cars['vin'].duplicated().values.any()

Both of these expressions return True if there is at least one duplicate, and False otherwise.

Usually, it makes more sense to check for duplicates in the whole DataFrame rather than in single columns. For instance, in our case, two states could report exactly the same sales figures, and that would be perfectly legitimate data rather than a duplicate entry.
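To see the difference, here is a minimal sketch with hypothetical data: two states share the same sales figure, so the column check reports a duplicate, but no entire row repeats. (The sales DataFrame below is invented for illustration; it is not the exercise dataset.)

```python
import pandas as pd

# Hypothetical data: TX and NY report the same sales figure,
# but no entire row is duplicated.
sales = pd.DataFrame({
    'state': ['CA', 'TX', 'NY'],
    'sales': [100, 250, 250],
})

print(sales.duplicated().values.any())           # False – no full-row duplicates
print(sales['sales'].duplicated().values.any())  # True – the value 250 repeats
```

This is why a repeated value in one column is not, by itself, a sign of bad data: only a repeated whole row usually indicates a true duplicate record.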

Exercise

Check if there are duplicate rows in the states_sales DataFrame.

Stuck? Here's a hint!

Append

.duplicated().values.any()

to the DataFrame name.