how to delete duplicates in google sheets
In this post, you'll see how to remove duplicates in Google Sheets using the Remove Duplicates tool, the UNIQUE function, Pivot Tables, Conditional Formatting, Apps Script or an Add-On.
Firstly though, what are duplicates? And why are they a problem?
Duplicates (or duplicate entries, or duplicate rows) are instances of the same record appearing in your data more than once.
They're a huge problem and the bane of any data analyst's life. It's critical that they are found and removed before any data analysis is performed.
Imagine you had two instances of the same client transaction for $5,000 in your database. When you summarize your data, you might think you have $10,000 in revenue from that client when in fact you only have $5,000. You'll make decisions based on the wrong data.
And we all know that's bad for business.
So it's best that you deal with them head on.
Thankfully, it's not too painful or time consuming to fix them. In this guide I'll show you six different techniques on how to search for duplicates in Google Sheets and how to remove duplicates in Google Sheets.
Contents
- Method 1: How to remove duplicates with the Remove Duplicates tool
- Method 2: How to remove duplicates with formulas
- Method 3: How to remove duplicates with Pivot Tables
- Method 4: How to remove duplicates with Conditional Formatting
- Method 5: How to remove duplicates with Apps Script
- Method 6: How to remove duplicates with an Add-On
How to remove duplicates in Google Sheets? Recommendations
In most cases, I would advocate using the Remove Duplicates tool, because it's quicker and easier than the other methods. However, here's a rundown of when it might make sense to use the different methods:
Method 1: Remove Duplicates tool is the most robust, practical method of removing duplicates.
Method 2: Formulas The UNIQUE formula is great for small, simple datasets or when you need to remove duplicates inside a nested formula.
The IF formulas are a robust but rather old-school approach. They have the advantage of getting your hands dirty with the data so you really see the duplicate entries up close.
Method 3: Pivot Tables are a great way to search for duplicates in Google Sheets. Pivot Tables are extremely flexible and fast to use, so it's great tool to use when you're unsure if you have duplicates and you're checking your data.
Method 4: Conditional Formatting is a great way to highlight duplicates in Google Sheets.
Method 5: Apps Script is useful for developers who want to remove duplicates from Sheets as part of their apps, or someone who needs to repeatedly and automatically de-duplicate their data.
Method 6: Add-on gives you more options for how you remove the duplicates (e.g. making fresh copy of the data, highlighting the duplicate entries first for review etc.) so is a good option if you need to understand the duplicates. However, it's a paid add-on which seems unnecessary given the other free tools at your disposal.
I do use all of these different methods myself, for the different scenarios described above.
Video: how to remove duplicates in Google Sheets using the UNIQUE formula and Add-on methods
Method 1: How to remove duplicates in Google Sheets with the Remove Duplicates tool
The new feature is super easy to use. You find this feature under the menu: Data > Remove Duplicates
When you click Remove Duplicates, you'll be prompted to choose which columns you want to check for duplicates.
You may want to remove duplicates where the rows entirely match, or you may wish to choose a specific column, such as an invoice number, regardless of what data is in the other columns.
The duplicates will then be removed and you'll be presented with a summary report, telling you how many duplicates were removed:
For more information and more detail about the tool, you can check out Google's G Suite announcement (includes details of other new features too!).
Method 2: How to remove duplicates in Google Sheets using formulas
2.1 Use the UNIQUE formula
This method deletes duplicates in the range of data you select.
It considers all the columns of your data range when determining the duplicates. In other words, it compares each row of data and removes any rows that are duplicates (identical to any others across the whole row).
It's very easy to implement as it involves a single formula with a single argument — the range you want to de-duplicate (remove duplicates from).
=UNIQUE(A1:D11)
Here's an example of the UNIQUE function in action. The function is in cell F1 and looks for duplicates in the data range in A1:D11
:
And this is the result:
You can see the table on the right has fewer rows, because the duplicate rows have been removed.
2.2 Highlight duplicate values with COUNTIF
This method first highlights duplicates in your dataset.
First, create a new column next to the data column you want to check for duplicates (e.g. invoice number).
Then use this formula in cell B2 to highlight the duplicates in column A:
=COUNTIF(A$2:A2,A2)>1
You'll notice the range is A$2:A2
The $ sign is key here because it locks the range to the top of the column, even as you copy the formula down column B. So this formula checks for duplicates in the current row back up to the top.
When a value shows up for the first time, the count will be 1, so the formula result will be false. But when the value shows up a second time, the count will be 2, so the formula result will be TRUE.
The final step is to highlight the TRUE values (the duplicates) and delete them.
Note: If you have a large dataset, with a lot of duplicates, then it's best to turn the Duplicate column into values (Copy > Paste Special), sort by this column so all the duplicates (TRUEs) are in a block at the bottom of your dataset, and then delete them in one big group. It's much quicker.
2.2 How to remove duplicates in Google Sheets using IF formulas
This is how I first learnt to find duplicates and how to remove duplicates.
It works by sorting your data and comparing adjacent rows. It's rather cumbersome at first, because the formulas are a little tricky. However, it's easy once you get used to it and has the benefit of getting you into the weeds of your data. You will see the duplicates up close and can investigate why they're happening.
It involves using an IF formula to check whether a value on your current row is equal to the value on the previous row. If they're equal, then it's a duplicate entry.
Duplicate entries are assigned a value of 1, and unique (non-duplicate) entries are assigned a value of 0. That way you can easily count how many duplicates you have (by summing the column).
Here's an image showing this method in action:
Only the first column, the invoice number, is considered for duplication purposes in this example.
That means if two rows have the same invoice number then they'll be tagged as duplicates, regardless of whether any other datapoints, like date, are different.
You can see that invoice number "196-X" on row 4 has been marked with a 1 to indicate it's a duplicate row.
If you want to consider more columns or the whole row use one of other methods described in this article.
The next step is to highlight the duplicate pairs:
This is done so that you can compare the unique entry and the duplicate entries side-by-side and investigate what's happening before you blindly delete them.
The final step is to then delete the duplicate rows.
Here's the whole process explained in more detail:
Step 1: Add a unique order column
This step is important. It lets you sort the data by whichever column you want and then easily return to the original order. (Note: This was not shown in the above screenshot.)
Step 2: Sort A-Z on whichever column is your "unique" column
Do you have an invoice number you can check for duplicates? Or a transaction reference number? Or a student ID?
Identify the column that should be unique. Then sort it from A-Z.
This sorts all matching entries next to each other (so duplicate entries will be next to their unique counterparts).
Step 3: Insert two new columns to the right of this "unique" column
Apply a color coding to inform users that these columns have been added and were not part of the original dataset.
Step 4: Identify duplicate entries with an IF formula
The basic formula to check for duplicates, in column B, is this:
=IF(B3=B2,1,0)
Step 5: Identify duplicate groupings with an IF and OR formula
And then the formula in the adjacent column, which identifies the duplicate groups (i.e. the first entry and subsequent duplicates):
=IF(OR(D3=1,D2=1),1,0)
Step 6: Review duplicate entries
This step is always worth doing. You may identify the source of your duplicate entries. Perhaps all the duplicates are specific to a certain date or customer, which you can then resolve.
Step 7: Keep copies of your duplicate formulas for future reference (optional)
If you think you may want to use these formulas again, you can keep copies in the rows above your Sheet so you have them "live".
Step 8: Turn duplicate columns into values
Highlight the data (consisting of 0's and 1's) in your two duplicate columns.
Copy > Paste special > Paste values only
Step 9: Sort by the duplicate column to move duplicates
Move all of the duplicates to the bottom of your dataset, so they can easily be removed as a single contiguous range. This is much faster than simply filtering on the duplicates and deleting them, especially if your dataset is large.
Step 10: Sort your dataset by the original order column
Sort your data back to it's original order, but now with the duplicate rows removed.
To recap, this method:
- Allows you to highlight duplicates in Google Sheets
- Allows you to identify duplicate groups in Google Sheets
- Keeps copies of all your data until you are ready to delete rows
- Is rather labor intensive however, involving tricky formulas
- Modifies the original dataset (unless you make a copy)
2.3 Use IF formulas on multiple columns
This is the same as method 3 above, but applied to multiple columns.
In this case, you want to search for duplicates in Google Sheets based on two (or more) columns. Maybe an invoice number and a product ID.
First thing to do is to identify the columns you want to include in the duplicate consideration.
This time, insert three new columns to the right of your final "unique" column.
Use a formula to concatenate all of the "unique" columns you're considering to highlight duplicates:
=B2&"/"&C2
or you can also use the CONCATENATE formula (including more columns this time):
=CONCATENATE(B2,"/",C2,"/",D2,"/",E2)
This new column consists of all of the "unique" columns squashed together.
Now, use Method 3 to de-duplicate on this new "unique" column of joined values. You proceed through the same 10 steps from Method 3 to identify duplicates and duplicate groups.
Method 3: How to remove duplicates in Google Sheets using Pivot Tables
If you're new to Pivot Tables, check out my article Pivot Tables 101: A Beginner's Guide.
Pivot Tables are extremely useful for exploratory data analysis.
Pivot Tables are a great tool to use to search for duplicates in Google Sheets. They're extremely flexible and fast to use, so they're often a great place to start if you're unsure whether you have any duplicates in your data.
Highlight your dataset and create a Pivot Table (under the Data menu).
A new tab opens with the Pivot Table editor.
Under ROWS, choose the column you want to check for duplicates (e.g. invoice number). Then in VALUES, choose another column (I often use the same one) and make sure it's set to summarize by COUNT or COUNTA (if your column contains text), like this:
The Pivot Table will then look like this:
You can see that duplicates values (for example 196-X) will have a count greater than 1.
From here you can lookup these duplicate values in your original dataset and decide how to proceed.
As you can see, this method is most suitable when you're checking for duplicates and want to go and investigate them.
Method 4: How to remove duplicates in Google Sheets using Conditional Formatting
This method has two steps: 1) highlight duplicates in Google Sheets, and 2) remove those highlighted duplicates (optional).
Step 1
Select your dataset and open the conditional formatting sidebar (under the Format menu).
Under the "Format cells if…" option, choose custom formula is (the last option) and enter the following formula:
=COUNTIF($A$1:$A1,A1)>1
This formula checks for duplicates in column A.
The output is highlighting applied to the duplicate values:
What if you want to apply the highlight to the whole row?
You need to make one small tweak to the formula (highlighted in red) by adding a $ sign in front of the final A:
=COUNTIF($A$1:$A1,$A1)>1
Now your output will look like this, with the whole row highlighted:
Check out this article for a more detailed look at how to highlight a whole row using conditional formatting.
Step 2
This is an optional step to delete the duplicate rows.
Add filters to your dataset and then sort by color to bring the duplicate rows together.
Highlight the block of duplicate rows and delete.
Method 5: How to remove duplicates in Google Sheets using Apps Script
(New to Apps Script? Read my starter guide to Apps Script for a primer.)
It's relatively straightforward to create a small script file that can remove duplicate rows from your datasets.
The advantage of writing an Apps Script program is that you can run it over and over, for example each time you add new data.
Sample Apps script program: How to remove duplicates in Google Sheets
This program removes duplicates from a dataset in Sheet 1. It's very specific to the Sheet and data range, but it's easy to create and modify.
It works as follows:
- Get the values from the data range in Sheet1, using Apps Script
- Turn the array rows into strings (blocks of text) for comparison
- Filter out any duplicate rows
- Check whether a de-duplicate sheet exists
- If it does, clear out the old data and paste in the new de-duplicated data
- If it does not exist, create a new sheet and paste in the new de-duplicated data
- Add a custom menu to run from the Google Sheet
So it's very specific to this use case, but it could be easily adapted if necessary for different datasets. Here it is in action:
And here's the Apps Script code for this program:
/** * remove duplicate rows from Google Sheets data range */ function removeDupRows() { var ss = SpreadsheetApp.getActiveSpreadsheet(); var sheet = ss.getSheetByName('Sheet1'); // change the row number of your header row var startRow = 7; // get the data var range = sheet.getRange(startRow,1,sheet.getLastRow(),sheet.getLastColumn()).getValues(); // remove duplicates with helper function var dedupRange = arrayUnique(range); Logger.log(dedupRange); // check if duplicate sheet exists already, if not create new one if (ss.getSheetByName('Sheet1 Duplicates Removed')) { // case when dedup sheet already exists var dedupSheet = ss.getSheetByName('Sheet1 Duplicates Removed'); var lastRow = Math.max(dedupSheet.getLastRow(),1); var lastColumn = Math.max(dedupSheet.getLastColumn(),1); // clear out any previous de-duplicate data dedupSheet.getRange(1,1,dedupSheet.getLastRow(),dedupSheet.getLastColumn()).clear(); // replace with new de-duplicated data dedupSheet.getRange(1,1,dedupRange.length,sheet.getLastColumn()).setValues(dedupRange); } else { // case when there is no dedup sheet var dedupSheet = ss.insertSheet('Sheet1 Duplicates Removed',0); dedupSheet.getRange(1,1,dedupRange.length,dedupRange[0].length).setValues(dedupRange); } // make the de-duplicate sheet the active one dedupSheet.activate(); } /** * helper function returns unique array */ function arrayUnique(arr) { var tmp = []; // filter out duplicates return arr.filter(function(item, index){ // convert row arrays to strings for comparison var stringItem = item.toString(); // push string items into temporary arrays tmp.push(stringItem); // only return the first occurrence of the strings return tmp.indexOf(stringItem) >= index; }); }
And you can also add a custom menu to run it from your Google Sheet rather than the script editor window:
/** * add menu to run function from Sheet */ function onOpen() { var ui = SpreadsheetApp.getUi(); ui.createMenu('Remove duplicates') .addItem('Highlight duplicate rows','highlightDupRows') .addItem('Remove duplicate rows','removeDupRows') .addToUi(); }
The code for this simple duplicate program is also here on GitHub.
Suggestions for improvement
- Set triggers to run the duplicate remover on certain conditions (e.g. once a day, when new data is added)
- Better control over selecting the data (i.e. which Sheet, what range etc.)
- Whether to consider all of the columns or not for duplicates
- Better control over the output
I started coding something along these lines, but it gets complicated as you start to pile on more edge-cases and user options. I realized pretty quickly that all I was doing was reinventing the wheel, since a perfectly fine add-on exists (see method 2).?
The best thing about Apps Script is that it lets you build minimal viable products specific to your situation very quickly.
Once you're familiar with Apps Script, it only takes 15 – 30 minutes to build custom scripts, like this one to remove duplicates in Google Sheets.
Method 6: How to remove duplicates in Google Sheets using an Add-on
Add-ons are specialized programs that you add to your Google Sheets, to give you extra functionality. They're developed by third-party developers but are vetted by Google before they end up in the Add-On directory.
You add Add-ons to your Sheet through the Add-ons menu:
As you'll notice there a lot of options. The best one I've found is Remove Duplicates Add-On from Ablebits, which is the first one in this image:
This is a paid Add-On, although you can use the full functionality for free during a 30-day trial period. (Full disclosure: this is an affiliate link, meaning I'll earn a small commission at no extra cost to you.)
Click here to test out Remove Duplicates Add-On from the Add-On store.
This add-on highlights duplicates in Google Sheets, deletes duplicates in Google Sheets and can even compare two columns for the same value.
It's very easy to use and super handy if you deal with duplicates a lot, or if you're not sure where or why duplicates are in your data. Before you delete them, you can investigate which rows of data are duplicates and see if you can identify anything that's a problem.
It's very easy to use.
It has a lot of options, such as whether you want to highlight duplicates, delete them, create a new copy of your data etc., so I'll leave it to you to explore.
Here's a GIF image showing the steps to highlight duplicates in your dataset:
Now you know how to remove duplicates in Google Sheets with six different techniques, go forth and banish those duplicates from your datasets!
how to delete duplicates in google sheets
Source: https://www.benlcollins.com/spreadsheets/how-to-remove-duplicates-in-google-sheets/
Posted by: nashpitand.blogspot.com
0 Response to "how to delete duplicates in google sheets"
Post a Comment