Remove Duplicate Records in CSV Files with PowerShell

In a recent comment on my post about combining CSV files using PowerShell, a reader named Oliver asked how duplicate values should be handled when combining multiple CSV records. I added a suggestion to that post but figured, “Why not elaborate in a post of its own?”

Question

Oliver commented on the initial post, asking the following:

How would you remove duplicate entries? When, for example, the files you import could contain the same rows?

I don’t know what the “best” way would be. You could probably come up with a way to filter out duplicates as you’re combining files, along the lines of the sketch below. I replied with the approach that came to mind first.
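For the “filter while combining” idea, here is a minimal sketch. The folder path, output file name, and column names are my assumptions based on the example data later in this post:

# Hypothetical example: combine every CSV in a folder and keep only unique rows in one pass
Get-ChildItem -Path C:\Users\Chad\Desktop\Combine -Filter *.csv |
    ForEach-Object { Import-Csv -Path $_.FullName } |
    Sort-Object Publisher, Identity, Name, Hometown -Unique |
    Export-Csv -Path C:\Users\Chad\Desktop\Combined_Unique.csv -NoTypeInformation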

Answer

My suggestion was to import the CSV, use Sort-Object with its -Unique parameter to keep only unique rows, and export the result as a new CSV file. Let’s say we already combined our CSV files and now have this as a result:

Publisher,Identity,Name,Hometown
"DC Comics","Clark Kent","Superman","Metropolis"
"Marvel Comics","Peter Parker","Spider-Man","New York"
"Dark Horse Comics","Hellboy","Anung Un Rama","East Bromwich"
"Marvel Comics","Peter Parker","Spider-Man","New York"
"Image Comics","Mark Grayson","Invincible","Reginaldville"
"IDW Publishing","April O'Neil","The Reporter","New York"
"Dark Horse Comics","Hellboy","Anung Un Rama","East Bromwich"

Note the duplicates for Spider-Man and Hellboy. Now we’ll run this PowerShell script to clean up our duplicates:

# Sort on all four columns with -Unique so duplicate rows collapse to one, then export without the type header
Import-Csv C:\Users\Chad\Desktop\Test.csv |
    Sort-Object Publisher, Identity, Name, Hometown -Unique |
    Export-Csv -Path C:\Users\Chad\Desktop\Unique_Only.csv -NoTypeInformation

Our resulting Unique_Only.csv file has only one record each for Spider-Man and Hellboy:

"Publisher","Identity","Name","Hometown"
"Dark Horse Comics","Hellboy","Anung Un Rama","East Bromwich"
"DC Comics","Clark Kent","Superman","Metropolis"
"IDW Publishing","April O'Neil","The Reporter","New York"
"Image Comics","Mark Grayson","Invincible","Reginaldville"
"Marvel Comics","Peter Parker","Spider-Man","New York"

Just What I Needed

We used -Unique to select only unique records. We also included -NoTypeInformation to remove the #TYPE information from our new CSV file. If we didn’t include -NoTypeInformation, our result would have some extra data at the beginning that we don’t need:

#TYPE System.Management.Automation.PSCustomObject
"Publisher","Identity","Name","Hometown"
"Dark Horse Comics","Hellboy","Anung Un Rama","East Bromwich"
"DC Comics","Clark Kent","Superman","Metropolis"
"IDW Publishing","April O'Neil","The Reporter","New York"
"Image Comics","Mark Grayson","Invincible","Reginaldville"
"Marvel Comics","Peter Parker","Spider-Man","New York"

I’ll add that you could delete the old file and consider renaming the “unique” file to match the original name. That would add some cleanup to the process, but I suppose it depends on what you’re doing with the files.
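If you wanted that cleanup step, one way to do it (reusing the same paths from the example above) could be:

# Remove the original file, then give the de-duplicated file the original name
Remove-Item -Path C:\Users\Chad\Desktop\Test.csv
Rename-Item -Path C:\Users\Chad\Desktop\Unique_Only.csv -NewName Test.csv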

Thanks again to Oliver for the question and thank you for reading!

5 thoughts on “Remove Duplicate Records in CSV Files with PowerShell”

  1. Thanks for taking the time to write this up.

    My question would be: how is performance on, say, a million-element or million-row table where each row is a single element?


    1. Thanks for the comment!

      I created a file with elements 1-100,000 (Element_1..Element_100000) and copy/pasted those values 10 times to total a million. The script to remove duplicates took 12 minutes on my laptop.
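
      For reference, a minimal sketch of how a test file like that could be generated. The element values come from the description above; the output path and the Element column name are my assumptions:

      # Build 100,000 unique elements, repeat the set 10 times for 1,000,000 rows, and export
      $elements = 1..100000 | ForEach-Object { [pscustomobject]@{ Element = "Element_$_" } }
      $rows = foreach ($i in 1..10) { $elements }
      $rows | Export-Csv -Path C:\Users\Chad\Desktop\Million_Rows.csv -NoTypeInformation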


      1. Awesome feedback. Thanks for the test! Just to be sure… was the first set of 100,000 all unique? If not, what was the ratio of duplicates, please? The reason I ask is that I’m going to try a similar experiment using a different method. This whole thing is something I’ve wanted to know for quite a while, so I have to say thank you for this great opportunity; I’m terrible with PowerShell and will rely on your good testing.


      2. Yes, I started out with 100,000 unique. To get to 1,000,000 records, I copied all 100,000 and pasted until I had 10 of Element_1, 10 of Element_2, up to 10 of Element_100000. After the script ran to clean up duplicates, I was back to 100,000 total.

