Recently, a challenge came across my desk which included comparing very large sets of data against one another. Specifically, a list of all computers in our domain compared to a list of all computers registered with a specific application. This posed an interesting question to me, “What would be the fastest way to accomplish this?”
I set out to look for different ways of comparing lists, and I can think of three. The first two are to load all of the items into an array and then search that array for each value using either the -match operator or the -contains operator. The third is to load all the items into a hash table with placeholder values and then check the keys to see if each value exists. Since I know that loading a hash table should take more time than loading an array, I want to time the entire process, not just the searches.
To actually do the timing, I will use the Measure-Command cmdlet. If you haven't ever used it, you should really play with it. It's a great tool for figuring out how long any given code block takes to run. That can be useful for things like filling in the time on your Write-Progress display, or reporting the execution time back to a user. Really, you can look at it as a way to avoid setting a variable to Get-Date and then creating a New-TimeSpan after the command completes; it essentially rolls it all into one.
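For example, both of these snippets time the same one-second chunk of work:

# The manual way: capture a start time, run the work, then build a timespan
$Start = Get-Date
Start-Sleep -Seconds 1
$Elapsed = New-TimeSpan -Start $Start -End (Get-Date)

# Measure-Command rolls the same bookkeeping into a single call
$Elapsed = Measure-Command { Start-Sleep -Seconds 1 }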
So, it's a race between searching a hash table and searching an array with both -match and -contains. Here is the code I used:
# Load the list of names to search for; the same file is used for both sides,
# so every lookup should come back as found
$Checks = @(Get-Content u:\script\workstations.txt)

$ArrayContainsTime = Measure-Command {
    # Method 1: load an array and test each name with -contains
    $Array = @(Get-Content u:\script\workstations.txt)
    $Found = 0
    foreach ($Name in $Checks){
        If ($Array -contains $Name){$Found++}
    }
}
"Array Contains Count:`t$($Array.Count)"
"Array Contains Found:`t$($Found)"

$ArrayMatchTime = Measure-Command {
    # Method 2: load an array and test each name with -match
    $Array = @(Get-Content u:\script\workstations.txt)
    $Found = 0
    foreach ($Name in $Checks){
        If ($Array -match $Name){$Found++}
    }
}
"Array Matches Count:`t$($Array.Count)"
"Array Matches Found:`t$($Found)"

$HashTime = Measure-Command {
    # Method 3: load a hash table keyed on the names, then test with ContainsKey
    $HashTable = @{}
    ForEach ($Line in Get-Content u:\script\workstations.txt){
        $HashTable.Add($Line,"1")
    }
    $Found = 0
    foreach ($Name in $Checks){
        If ($HashTable.ContainsKey($Name)){$Found++}
    }
}
"Hash Table Count:`t$($HashTable.Count)"
"Hash Table Found:`t$($Found)"

"Milliseconds for Array Contains:`t$($ArrayContainsTime.TotalMilliseconds)"
"Milliseconds for Array Matches:`t$($ArrayMatchTime.TotalMilliseconds)"
"Milliseconds for Hash Table Contains:`t$($HashTime.TotalMilliseconds)"
I have loaded up the text file with 2,000 entries, so we are basically comparing 2,000 items to 2,000 items. Every single one will be a match, so we can see that it’s working by making sure that the found and count values are the same. If you wanted to take this code and load two different lists, then you would see a difference there. So, without further delay, it’s off to the races!
Array Contains Count: 2000
Array Contains Found: 2000
Array Matches Count: 2000
Array Matches Found: 2000
Hash Table Count: 2000
Hash Table Found: 2000
Milliseconds for Array Contains: 532.6136
Milliseconds for Array Matches: 9839.4498
Milliseconds for Hash Table Contains: 51.2049
So we have a winner! As you can see, all three methods work, but the hash table is substantially faster than the two array-based methods, searching through all 2,000 items in under one-tenth of a second! I think the array with the -contains operator still posts a very reasonable time, and it is probably the easier and more comfortable option for the average scripter. It should also be said that the array with the -match operator isn't insanely slow by any means, and it is by far the most robust method for searching, since it can match any portion of a name in the array. Use it with caution, though, because it can produce false positives. Say you are looking for “Computer” in a list that contains “Computer1”: you may not expect that to be a match, but it will be.
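Here is a quick sketch of that difference with a couple of made-up names:

$Array = @('Computer1','Computer2')
$Array -contains 'Computer'   # False: -contains requires an exact element match
$Array -match 'Computer'      # Computer1, Computer2: -match hits on any portion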
So, there you have it. If you need to search a massive list for some reason and speed is foremost on your mind, use a hash table!
Thank you for the great post.
There is one approach you did not try, which I have found to be the fastest.
For the simple “exists” case, using IndexOf on a hashtable with variable-length string keys is roughly twice as fast as ContainsKey.
That is of course counter-intuitive: one returns a simple Boolean, while the other may return a long integer.
Another advantage is that you get the index back for direct access to the data.
The stats below are for a hashtable containing 5.5 million AWS S3 keys, which are longish strings.
Measure-Command({$script:keyhash.Contains($key)})
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 244
Ticks : 2440138
TotalDays : 2.8242337962963E-06
TotalHours : 6.77816111111111E-05
TotalMinutes : 0.00406689666666667
TotalSeconds : 0.2440138
TotalMilliseconds : 244.0138
Measure-Command({$script:keyhash.IndexOf($key)})
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 115
Ticks : 1155392
TotalDays : 1.33725925925926E-06
TotalHours : 3.20942222222222E-05
TotalMinutes : 0.00192565333333333
TotalSeconds : 0.1155392
TotalMilliseconds : 115.5392
Thank you!
-match uses regular expressions, not simple string matching. Your warning about false positives doesn't mention this detail, and it's important. Regular expressions are powerful. If you want to match exactly “Computer” and no other string, then you can use -match '^Computer$'; but that's not where regex shines. Regular expressions allow you to match complicated patterns. Consider a case where phone numbers are encoded with the country code and may or may not have punctuation; then you could use this regular expression to match only U.S. phone numbers in the 816 area code:
-match '^\+?1-?816-?[0-9-]+'
False positives with -match just mean you didn’t construct the regular expression you wanted!
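For instance, here is a quick sketch with made-up numbers (only the first two are in the 816 area code):

$Numbers = '+1-816-555-0123', '18165550123', '+1-913-555-0199'
$Numbers -match '^\+?1-?816-?[0-9-]+'   # returns only the two 816 numbers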
This is a very interesting post, but somewhat limited in scope. While it's interesting to see that the hash table beats the array on a list this size, that's not really where the huge benefits of hash tables are seen. The hash table truly stomps all over the array when you are running comparisons against large datasets. I often have a dataset of 50,000 rows and 27 columns in which I have to locate 2,000+ things. For example, I'll have to search through those 50,000 rows to see if I can find the 2,000 host names, and if I find a matching host name, grab 5 fields from that row and output them to a CSV file.
I had a script that used simple arrays, a ForEach loop, and Where-Object to cycle through and find where an object’s host name matched and then grab the data I was interested in. It took 4 to 6 hours to complete! That’s because I had to search 50,000 rows 2,000 times to get what I wanted; it was insane but it worked.
Now, using hash tables to hold these large datasets, using ForEach to loop through, and comparing to find what I need… the script takes 2 minutes.
My biggest apprehension about using hash tables in the past was the limitation of only a Key and a Value. As I mentioned, I want 5 or more properties in my result, not just a single Value. Then someone enlightened me: you can put an OBJECT in as the Value of a hash table Key-Value pair! Amazing! So now that object in the Value has all 27 properties, and when I find a match I simply grab the 5 or so properties I want and output them to my CSV file.
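For anyone who wants to try it, here is a minimal sketch of that pattern; the file paths and property names are hypothetical:

# Hypothetical input files and property names, just to illustrate the pattern
$Checks = Get-Content u:\script\hostnames.txt

# Index the big dataset by host name, storing the whole row object as the value
$Index = @{}
foreach ($Row in Import-Csv u:\script\bigdataset.csv) {
    $Index[$Row.HostName] = $Row
}

# Lookups are now hash-table fast; grab just the properties you want
$Results = foreach ($Name in $Checks) {
    if ($Index.ContainsKey($Name)) {
        $Index[$Name] | Select-Object HostName, IPAddress, OS, Owner, Site
    }
}
$Results | Export-Csv u:\script\matches.csv -NoTypeInformation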
MorganM,
I am having the exact same issues with a set of data that you did. Can I contact you so I can talk to you about the code you are using to get your results? Thanks in advance!
Thank You! Thank You! Thank You! I have been looking for this information for a while. I integrated hashtables into a script that combines 4 large CSVs (imagine an Excel vlookup). The runtime of the script dropped from 40 minutes to about 80 seconds (most of the time now is downloading the CSV files and importing them).