Making MD5 Fuzzy

Hash functions like MD5 are terribly handy when you want to verify that one file is exactly the same as another very quickly, but what if you want to just check that two files are pretty close to the same?

There are probably lots of approaches, but one that I found that works fairly well for catching small spelling changes and the additions and removals of single words is a rounded average distance algorithm. Each word is converted into X and Y coordinates: the sum of the vowels makes the X coordinate and the sum of the consonants makes the Y coordinate. These values are rounded to the nearest 17 (a completely arbitrary number). The cartesian distance from the previous word is calculated, and every 11 words (also completely arbitrary) the average cartesian distance between the words is calculated, rounded to the nearest 17 and added to a list of distances. Finally each distance is averaged with the following distance and used to update an MD5 digest.

One can probably play with the arbitrary constants in this one to make the algorithm more or less selective. 11 and 17 were the values I needed to make my test pass.

Here's the code. It's in an early stage of development, so it's fairly sloppy I'm afraid. I'm sure there's a way to revise things so that the list of sums isn't needed, but it's late and I need to eat and sleep.

package net.spatula.utils;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class FuzzySum {

private static final int ROUND_TO = 17;
private static final int AVG_SIZE = 11; // our amp goes to 11.

public static String checkSum(String text) {
text = text.toLowerCase();
text = text.replaceAll("<[^>]*>", ""); // strip HTML tags
Pattern pattern = Pattern.compile("\\s+", Pattern.DOTALL);
String[] wordlist = pattern.split(text);

long x = -1;
long y = -1;
double distanceSum = 0;
long sumCount = 0;
ArrayList sums = new ArrayList(wordlist.length / AVG_SIZE);

for (int i = 0; i < wordlist.length; i++) {
String word = wordlist[i].replaceAll("\\W", "");
String vowels = word.replaceAll("[^aeiouy]", "");
String consonants = word.replaceAll("[aeiouy]", "");
long thisx = round(sumChars(vowels) * 3);
long thisy = round(sumChars(consonants));
if (thisx == 0 && thisy == 0)
continue;
if (x > 0 && y > 0) {
distanceSum += Math.sqrt(Math.pow(thisx - x, 2) + Math.pow(thisy - y, 2));
sumCount++;
}
if (sumCount > 0 && sumCount % AVG_SIZE == 0) {
distanceSum = 0;
}
x = thisx;
y = thisy;
}

// just drop in the last value if there is one
if (distanceSum != 0 && sumCount % AVG_SIZE != 0) {
}

sumCount = sums.size();

MessageDigest md5 = null;
try {
md5 = MessageDigest.getInstance("md5");
} catch (NoSuchAlgorithmException e) {
return null;
}

for (int i = 0; i < sumCount; i++) {
if (i + 1 < sumCount) {
md5.update(String.valueOf((int) ((sums.get(i) + sums.get(i + 1)) / 2F)).getBytes());
}
}

byte[] data = md5.digest();
StringBuffer output = new StringBuffer(data.length * 2);
for (int i = 0; i < data.length; i++) {
if ((data[i] & 0xFF) < 16) {
output.append("0");
}
output.append(Integer.toHexString((int) data[i] & 0xFF));
}

return output.toString();

}

private static final long round(double value) {
return ROUND_TO * (long) (value / (double)ROUND_TO);
}

private static final long sumChars(String word) {
long sum = 0;
for (int i = 0; i < word.length(); i++) {
sum += word.charAt(i) - 'a';
}
return sum;
}

}

(One obvious problem is that crossing an 11-word boundary by even 1 word will cause two different sums to be produced... there is probably a way to cope with that, but I haven't given it much thought yet. More to come.)

Labels:

Why would you want to do this? I'm not sure I can think of a situation where I'd need to know if a file was "similar" to another.

It happens all the time with CMSes, especially those open to public posting. Someone hits post, hits back, changes one word, hits post again... it's nice to catch that and say "wouldn't you rather update your existing post?"

Also happens in message boards where someone floods or spams similar crap to many different topics with a word or two different.

I'm surprised by the commenter's lack of imagination. This is a very useful type of algorithmn. I'd like to see my feed reader combine duplicate news items with something like this. I wish TailRank and Memeorandom used this to prevent duplicate stories registering as new results. Thanks for sharing your thoughts.

Thanks for your comments. There are still some pretty significant problems to work out with the algorithm (especially the boundary problem) but I hope to get discussion going on this type of algorithm and potential means for improving it.

LHC

Okay, I can see the use for that... :) in that case though, why use a cryptographic hash function at all? Why not just generate a similarity score?

You could take each of your distances for the original and the new files, and apply a distance metric (Euclidean, Manhattan, etc), to generate a similarity score.

Then you can set a threshold value where you call documents similar or different... this has an advantage in that you can adjust the threshold for different purposes.

Mainly for speed. There are thousands of articles and hundreds of thousands of messages in my database. I can calculate a hash at the time of insertion very quickly, and select potential matches based on that hash very quickly as well when a new message is posted or a new article is created. I don't get to know the similarity beforehand-- the use case is for new objects being added to the system.

Calculating distances relative to all of the articles or messages in the system (or even a small subset of them based on temporal proximity) would be much slower and require a full table scan with comparison vs a relatively quick index scan (albeit with a slow comparison to verify similarity from those rows returned).

Good questions nonetheless.

NICK -

It is the FUNNIEST thing ever!!Tried to email you personally but it was undeliverable.

would be interested in a link exchange with us www.thefrap.com.

Check out site and let us know
aimee@thefrap.com

This could be an interesting anti-spam tactic. I receive spam emails all the time (however, most are marked by spam by other reasons -- like having a non-fully quallified domain name) where the sender usually substitutes one single character in a word to make another spammy word: i.e. v1agra, v14gra, or ViagrA instead of viagra. It would be interesting to see a message rated as spam because it's very similar to another message konwn to be spam. This could potentially lead to an even more interesting web service where a "fuzzy MD5" checksum could be stored for others to check against. Interesting idea Nick. 