Tuesday, December 05, 2006
Making MD5 Fuzzy
There are probably lots of approaches, but one I found that works fairly well for catching small spelling changes and the addition or removal of single words is a rounded average distance algorithm. Each word is converted into X and Y coordinates: the sum of the vowels makes the X coordinate and the sum of the consonants makes the Y coordinate. These values are rounded down to the nearest multiple of 17 (a completely arbitrary number). The Cartesian distance from the previous word is calculated, and every 11 words (also completely arbitrary) the average Cartesian distance between the words is calculated, rounded down to the nearest multiple of 17 and added to a list of distances. Finally, each distance is averaged with the following distance and used to update an MD5 digest.
One can probably play with the arbitrary constants in this one to make the algorithm more or less selective. 11 and 17 were the values I needed to make my test pass.
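To make the coordinate step concrete, here's a minimal sketch of just that piece. The class and method names (`WordCoords`, `coords`) are my own, not part of the real code; note also that the full implementation below weights the vowel sum by 3, which the prose above doesn't mention.

```java
public class WordCoords {
    private static final int ROUND_TO = 17;

    // Sum of (letter - 'a') over the string, so 'a' contributes 0, 'z' contributes 25.
    static long sumChars(String s) {
        long sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i) - 'a';
        }
        return sum;
    }

    // Round down to the nearest multiple of ROUND_TO.
    static long round(double value) {
        return ROUND_TO * (long) (value / ROUND_TO);
    }

    // Map a word to its (x, y) coordinate: weighted vowel sum and consonant sum.
    static long[] coords(String word) {
        String vowels = word.replaceAll("[^aeiouy]", "");
        String consonants = word.replaceAll("[aeiouy]", "");
        return new long[] { round(sumChars(vowels) * 3), round(sumChars(consonants)) };
    }

    public static void main(String[] args) {
        // "fuzzy": vowels "uy" -> (20 + 24) * 3 = 132 -> 119; consonants "fzz" -> 55 -> 51
        long[] c = coords("fuzzy");
        System.out.println(c[0] + "," + c[1]); // prints "119,51"
    }
}
```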
Here's the code. It's in an early stage of development, so it's fairly sloppy I'm afraid. I'm sure there's a way to revise things so that the list of sums isn't needed, but it's late and I need to eat and sleep.
package net.spatula.utils;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class FuzzySum {

    private static final int ROUND_TO = 17;
    private static final int AVG_SIZE = 11; // our amp goes to 11.

    public static String checkSum(String text) {
        text = text.toLowerCase();
        text = text.replaceAll("<[^>]*>", ""); // strip HTML tags

        Pattern pattern = Pattern.compile("\\s+", Pattern.DOTALL);
        String[] wordlist = pattern.split(text);

        long x = -1; // coordinates of the previous word; -1 means "no previous word yet"
        long y = -1;
        double distanceSum = 0;
        long sumCount = 0;
        ArrayList<Double> sums = new ArrayList<Double>(wordlist.length / AVG_SIZE + 1);

        for (int i = 0; i < wordlist.length; i++) {
            String word = wordlist[i].replaceAll("\\W", "");
            String vowels = word.replaceAll("[^aeiouy]", "");
            String consonants = word.replaceAll("[aeiouy]", "");
            long thisx = round(sumChars(vowels) * 3);
            long thisy = round(sumChars(consonants));
            if (thisx == 0 && thisy == 0) {
                continue; // nothing left after rounding; skip this word
            }
            if (x > 0 && y > 0) {
                // Cartesian distance from the previous word
                distanceSum += Math.sqrt(Math.pow(thisx - x, 2) + Math.pow(thisy - y, 2));
                sumCount++;
            }
            if (sumCount > 0 && sumCount % AVG_SIZE == 0) {
                sums.add((double) round(distanceSum / (double) AVG_SIZE));
                distanceSum = 0;
            }
            x = thisx;
            y = thisy;
        }

        // just drop in the last value if there is one
        if (distanceSum != 0 && sumCount % AVG_SIZE != 0) {
            sums.add((double) round(distanceSum / (double) AVG_SIZE));
        }
        sumCount = sums.size();

        MessageDigest md5 = null;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            return null; // shouldn't happen; every JVM ships an MD5 provider
        }

        // average each window distance with the one that follows it and feed it to the digest
        for (int i = 0; i < sumCount; i++) {
            if (i + 1 < sumCount) {
                md5.update(String.valueOf((int) ((sums.get(i) + sums.get(i + 1)) / 2F)).getBytes());
            }
        }

        // hex-encode the digest
        byte[] data = md5.digest();
        StringBuffer output = new StringBuffer(data.length * 2);
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) < 16) {
                output.append("0");
            }
            output.append(Integer.toHexString(data[i] & 0xFF));
        }
        return output.toString();
    }

    private static long round(double value) {
        return ROUND_TO * (long) (value / (double) ROUND_TO);
    }

    private static long sumChars(String word) {
        long sum = 0;
        for (int i = 0; i < word.length(); i++) {
            sum += word.charAt(i) - 'a';
        }
        return sum;
    }
}
(One obvious problem is that shifting the text across an 11-word boundary by even one word changes where every window falls, so two nearly identical texts can produce two different sums... there is probably a way to cope with that, but I haven't given it much thought yet. More to come.)
Labels: software
This also happens on message boards, where someone floods or spams nearly identical content to many different topics with only a word or two changed.
You could take each of your distances for the original and the new files, and apply a distance metric (Euclidean, Manhattan, etc), to generate a similarity score.
Then you can set a threshold value where you call documents similar or different... this has an advantage in that you can adjust the threshold for different purposes.
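That suggestion could be sketched roughly as follows. This is a hypothetical illustration, not part of the original code: it assumes both documents produced the same number of window averages, and the class name, method name, and threshold value are all my own.

```java
public class SimilarityScore {
    // Euclidean distance between two equal-length vectors of window averages.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up window averages for an original document and a lightly edited copy.
        double[] original = { 119, 51, 68 };
        double[] edited   = { 119, 68, 68 };
        double threshold = 20.0; // tune per application
        double score = distance(original, edited); // sqrt(17^2) = 17.0
        System.out.println(score <= threshold ? "similar" : "different"); // prints "similar"
    }
}
```

The appeal over a single hash is exactly the adjustable threshold: a hash comparison is all-or-nothing, while a score lets you decide how close is close enough for each use case.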
Calculating distances relative to all of the articles or messages in the system (or even a small subset of them chosen by temporal proximity) would be much slower, though: it requires a full table scan with a comparison per row, versus a relatively quick index scan (albeit with a slower comparison to verify similarity on the rows returned).
Good questions nonetheless.