Tuesday, April 03, 2007
Making MD5 Fuzzy, Redux
I struggled with the solution to this for quite a while, and then it dawned on me: I was looking at the problem the wrong way. It's fine if an off-by-one changes the outcome, if we're prepared to handle it.
The answer is to produce two checksums, not one! In the first, we begin at the beginning and skip the last n/2 words, for an averaging length of n. In the second, we begin n/2 words from the beginning and work all the way to the end.
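To make the two ranges concrete, here is a minimal sketch of the index arithmetic. It assumes the offsets are counted in words, as in the code further down; the class and method names are hypothetical and not part of Tally-Ho.

```java
// Hypothetical helper, not part of the Tally-Ho code: shows which word
// indexes each of the two checksums covers for an averaging length n.
final class WindowRanges {

    /** Returns {start, end} (both inclusive) for the first checksum. */
    static int[] firstRange(int wordCount, int n) {
        return new int[] { 0, wordCount - 1 - n / 2 };
    }

    /** Returns {start, end} (both inclusive) for the second checksum. */
    static int[] secondRange(int wordCount, int n) {
        return new int[] { n / 2, wordCount - 1 };
    }

    public static void main(String[] args) {
        int[] a = firstRange(40, 10);
        int[] b = secondRange(40, 10);
        System.out.println(a[0] + ".." + a[1]); // prints 0..34
        System.out.println(b[0] + ".." + b[1]); // prints 5..39
    }
}
```

With 40 words and n = 10, the first sum covers words 0 through 34 and the second covers words 5 through 39, so the two overlap everywhere except for half a window at each end.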
Then instead of comparing one sum to another sum, we perform four comparisons:
object1.sum1 == object2.sum1
object1.sum2 == object2.sum1
object1.sum1 == object2.sum2
object1.sum2 == object2.sum2
If any of these statements returns true, we consider the objects to be "similar".
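The four comparisons can be folded into one helper. This is only a sketch: the class name and the `isSimilar` method are my own illustration, and it assumes each object's sums arrive as a two-element array of the kind `getSums` returns below.

```java
// Hypothetical helper, not part of the Tally-Ho code.
final class FuzzyCompare {

    /**
     * Returns true if any sum of the first object equals any sum of the
     * second, which covers all four pairings listed above.
     */
    static boolean isSimilar(String[] sums1, String[] sums2) {
        for (String s1 : sums1) {
            for (String s2 : sums2) {
                if (s1.equals(s2)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A call would look like `FuzzyCompare.isSimilar(fuzzy.getSums(textA), fuzzy.getSums(textB))`, where `fuzzy` is the `FuzzySum` instance.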
Here's the code. I've also simplified the way the distance between words is calculated and left room for non-English words to be handled at some point in the future (i.e., there's no longer any special significance given to vowels).
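package net.spatula.tally_ho.utils;

public class FuzzySum {

    // Divisor that coarsens word distances so small differences collapse to the same value.
    private static final int SLOP = 3;
    // Number of word-to-word distances averaged into one line of the text that gets hashed.
    private static final int SAMPLE_SIZE = 10;

    private static FuzzySum instance;

    private FuzzySum() {
    }

    // Lazily creates the shared instance; synchronized so only one is ever built.
    public static synchronized FuzzySum getInstance() {
        if (instance == null) {
            instance = new FuzzySum();
        }
        return instance;
    }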
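    public String[] getSums(String text) {
        // Normalize: strip markup, lowercase, drop punctuation, trim whitespace.
        text = TextUtils.stripTags(text).toLowerCase().replaceAll("[^\\w\\s]", "").trim();
        String[] words = text.split("\\s+");
        // With too few words the two offset ranges would be empty and every
        // short text would hash alike, so fall back to a plain MD5.
        if (words.length < SAMPLE_SIZE * 1.5) {
            String md5 = TextUtils.md5(text);
            return new String[] { md5, md5 };
        }
        // First sum skips the last half window; second sum skips the first half window.
        String md5_1 = calculateFuzzyMd5(words, 0, words.length - 1 - (SAMPLE_SIZE / 2));
        String md5_2 = calculateFuzzyMd5(words, SAMPLE_SIZE / 2, words.length - 1);
        return new String[] { md5_1, md5_2 };
    }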
    private String calculateFuzzyMd5(String[] input, int startIndex, int endIndex) {
        StringBuilder builder = new StringBuilder();
        int distanceSum = 0;
        int pending = 0; // distances accumulated since the last average was emitted
        for (int i = startIndex + 1; i <= endIndex; i++) {
            String thisWord = input[i];
            String lastWord = input[i - 1];
            distanceSum += calculateDistance(thisWord, lastWord);
            pending++;
            // Emit one averaged line at every SAMPLE_SIZE-th word index.
            if (i % SAMPLE_SIZE == 0) {
                if (builder.length() > 0) {
                    builder.append("\n");
                }
                builder.append(distanceSum / SAMPLE_SIZE);
                distanceSum = 0;
                pending = 0;
            }
        }
        // Average the trailing partial window over the number of distances it
        // actually holds. pending is at least 1 whenever distanceSum is nonzero.
        if (distanceSum != 0) {
            builder.append("\n");
            builder.append(distanceSum / pending);
        }
        return TextUtils.md5(builder.toString());
    }
    // Coarse distance between two words: the difference of their word sums, scaled down by SLOP.
    private int calculateDistance(String word1, String word2) {
        int word1Sum = calculateWordSum(word1);
        int word2Sum = calculateWordSum(word2);
        return Math.abs(word1Sum - word2Sum) / SLOP;
    }

    // Scaled average of the absolute steps between adjacent UTF-16 code units.
    private int calculateWordSum(String word) {
        // A single character has no neighbor to compare against; use its code unit directly.
        if (word.length() == 1) {
            return word.charAt(0);
        }
        int wordSum = 0;
        for (int i = 1; i < word.length(); i++) {
            int prevChar = word.charAt(i - 1);
            int thisChar = word.charAt(i);
            wordSum += Math.abs(thisChar - prevChar);
        }
        return SLOP * wordSum / word.length();
    }
}
This code has been committed as part of the Tally-Ho project: https://tally-ho.dev.java.net/
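To see what the word-distance arithmetic does to real words, here is a small standalone sketch. It copies the logic of calculateWordSum and calculateDistance into a hypothetical demo class (the methods are private in FuzzySum, so they can't be called directly); the class name and the sample words are mine.

```java
// Hypothetical demo class that mirrors the two private methods above.
final class WordDistanceDemo {
    private static final int SLOP = 3;

    // Scaled average of the absolute steps between adjacent characters.
    static int wordSum(String word) {
        if (word.length() == 1) {
            return word.charAt(0);
        }
        int sum = 0;
        for (int i = 1; i < word.length(); i++) {
            sum += Math.abs(word.charAt(i) - word.charAt(i - 1));
        }
        return SLOP * sum / word.length();
    }

    // Difference of the two word sums, coarsened by SLOP (integer division).
    static int distance(String a, String b) {
        return Math.abs(wordSum(a) - wordSum(b)) / SLOP;
    }

    public static void main(String[] args) {
        System.out.println(wordSum("cat"));           // prints 21
        System.out.println(wordSum("dog"));           // prints 19
        System.out.println(distance("cat", "dog"));   // prints 0
        System.out.println(distance("cat", "zebra")); // prints 4
    }
}
```

For "cat" the steps are |97 - 99| = 2 and |116 - 97| = 19, so the sum is 21 and the scaled result is 3 * 21 / 3 = 21. "dog" comes out at 19, and the difference of 2 disappears in the integer division by SLOP, which is exactly the kind of small variation the fuzzy sum is meant to absorb.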
Labels: checksum, fuzzy, java, md5, software, tally-ho