Tuesday, December 26, 2006

trashbrowns


Tuesday, December 05, 2006

Making MD5 Fuzzy

Hash functions like MD5 are terribly handy when you want to verify that one file is exactly the same as another very quickly, but what if you want to just check that two files are pretty close to the same?

There are probably lots of approaches, but one that I found that works fairly well for catching small spelling changes and the additions and removals of single words is a rounded average distance algorithm. Each word is converted into X and Y coordinates: the sum of the vowels makes the X coordinate and the sum of the consonants makes the Y coordinate. These values are rounded to the nearest 17 (a completely arbitrary number). The cartesian distance from the previous word is calculated, and every 11 words (also completely arbitrary) the average cartesian distance between the words is calculated, rounded to the nearest 17 and added to a list of distances. Finally each distance is averaged with the following distance and used to update an MD5 digest.

One can probably play with the arbitrary constants in this one to make the algorithm more or less selective. 11 and 17 were the values I needed to make my test pass.

Here's the code. It's in an early stage of development, so it's fairly sloppy I'm afraid. I'm sure there's a way to revise things so that the list of sums isn't needed, but it's late and I need to eat and sleep.


package net.spatula.utils;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class FuzzySum {

private static final int ROUND_TO = 17;
private static final int AVG_SIZE = 11; // our amp goes to 11.

public static String checkSum(String text) {
text = text.toLowerCase();
text = text.replaceAll("<[^>]*>", ""); // strip HTML tags
Pattern pattern = Pattern.compile("\\s+", Pattern.DOTALL);
String[] wordlist = pattern.split(text);

long x = -1;
long y = -1;
double distanceSum = 0;
long sumCount = 0;
ArrayList sums = new ArrayList(wordlist.length / AVG_SIZE);

for (int i = 0; i < wordlist.length; i++) {
String word = wordlist[i].replaceAll("\\W", "");
String vowels = word.replaceAll("[^aeiouy]", "");
String consonants = word.replaceAll("[aeiouy]", "");
long thisx = round(sumChars(vowels) * 3);
long thisy = round(sumChars(consonants));
if (thisx == 0 && thisy == 0)
continue;
if (x > 0 && y > 0) {
distanceSum += Math.sqrt(Math.pow(thisx - x, 2) + Math.pow(thisy - y, 2));
sumCount++;
}
if (sumCount > 0 && sumCount % AVG_SIZE == 0) {
sums.add((double) round(distanceSum / (double)AVG_SIZE));
distanceSum = 0;
}
x = thisx;
y = thisy;
}

// just drop in the last value if there is one
if (distanceSum != 0 && sumCount % AVG_SIZE != 0) {
sums.add((double) round(distanceSum / (double)AVG_SIZE));
}

sumCount = sums.size();

MessageDigest md5 = null;
try {
md5 = MessageDigest.getInstance("md5");
} catch (NoSuchAlgorithmException e) {
return null;
}

for (int i = 0; i < sumCount; i++) {
if (i + 1 < sumCount) {
md5.update(String.valueOf((int) ((sums.get(i) + sums.get(i + 1)) / 2F)).getBytes());
}
}

byte[] data = md5.digest();
StringBuffer output = new StringBuffer(data.length * 2);
for (int i = 0; i < data.length; i++) {
if ((data[i] & 0xFF) < 16) {
output.append("0");
}
output.append(Integer.toHexString((int) data[i] & 0xFF));
}

return output.toString();

}

private static final long round(double value) {
return ROUND_TO * (long) (value / (double)ROUND_TO);
}

private static final long sumChars(String word) {
long sum = 0;
for (int i = 0; i < word.length(); i++) {
sum += word.charAt(i) - 'a';
}
return sum;
}

}


(One obvious problem is that crossing an 11-word boundary by even 1 word will cause two different sums to be produced... there is probably a way to cope with that, but I haven't given it much thought yet. More to come.)

Labels:


Saturday, December 02, 2006

More Wicket Stupidity

It seems to be the weekend for stupidity out of Wicket.

God only knows how they managed to introduce this bug: component.setModelObject(foo) will truncate everything after the first semicolon. Yes, that's right: the first semicolon. Why? Your guess is as good as mine. Fortunately, component.getModel().setObject(component, foo) doesn't arbitrarily dislike semicolons and works fine.

More stupidity: ExternalLink is not a Link, though PageLink, BookmarkablePageLink, ResourceLink and so-on are all Links (that is, they extend Link). To Wicket's credit, ExternalLink at least appears in wicket.markup.html.link, although PagingNavigationIncrementLink and PagingNavigationLink, which do extend Link do not. I'm sensing some problems with the project's layout.

Another particular nuisance: there is no nice way of linking from one application to another. This may be because of limitations imposed by the Servlet spec, but it is annoying no less. This manifests itself if you want to move from one application that doesn't do anything special with Sessions to one that does. For example, I have an application for reading articles that doesn't need a session at all (but it implicitly gets a WebSession) with a link to another application for doing an article update. The latter uses an ArticleSession, but the moment you click the link to the homepage of the article update application, you get a class cast exception because the type of WebSession is different from one application to the other.

The solution is to loosely couple the two pages by directing the update link to the other application by its servlet path (e.g., /app/updateArticle) via an ExternalLink (which isn't a Link), but where does that path come from then? I had to settle for getting it out of a properties file for now, until some better way comes along.

I think Wicket could possibly figure this out in the Wicket Servlet by tracking which applications are mapped to which context paths, though this would probably require an additional init-param in the servlet config (since you don't get to know the context path in a servlet's init, though it's coming in the Servlet 2.5 spec... some day). In the near term there doesn't seem to be any other way than the ugly way.

Labels:


This page is powered by Blogger. Isn't yours?

Subscribe to Posts [Atom]