In PUMA, every entry (publication or bookmark) is stored with a hash, a unique string of letters and digits. This makes it possible to identify the entry unambiguously. Publications are even stored with two hashes (the inter hash and the intra hash). On this page, you can find information on the different types of hashes.
Literature references in particular pose the problem of detecting duplicate posts, because users vary considerably in how they enter fields such as the journal name or the authors. On the one hand, it is desirable to allow a user to have several posts which differ only slightly. On the other hand, one might want to find other users' posts which refer to the same paper or book even if they are not completely identical.
To fulfill both goals, we implemented two hashes to compare publication posts: one for comparing the posts of a single user (the intra hash) and one for comparing the posts of different users (the inter hash). A comparison works by normalizing and concatenating selected BibTeX fields, hashing the result with the MD5 message digest algorithm, and comparing the resulting hashes. MD5 hashing is done for efficiency reasons only, since it yields a fixed-length value that can be stored in the database. Storing the hashes along with the resources in the posts table enables fast comparison and search of posts.
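The general scheme can be illustrated with a few lines of plain Java. The helper below is only a sketch of the normalize-concatenate-hash idea using the JDK's MessageDigest; it does not use PUMA's own StringUtils.getMD5Hash, and the example input strings merely mimic the normalization rules described further down.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Sketch {

    /** Returns the MD5 hash of the given string as a 32-character hexadecimal string. */
    static String md5Hex(final String input) {
        try {
            final MessageDigest md = MessageDigest.getInstance("MD5");
            final byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            final StringBuilder hex = new StringBuilder();
            for (final byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (final NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // identical normalized field strings yield identical fixed-length hashes ...
        System.out.println(md5Hex("the anatomy of a search engine l.page:s.brin 1998"));
        System.out.println(md5Hex("the anatomy of a search engine l.page:s.brin 1998"));
        // ... while any difference in the concatenated fields changes the hash
        System.out.println(md5Hex("the anatomy of a search engine l.page:s.brin 1999"));
    }
}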
The intra hash is relatively strict and takes into account the fields title, author, editor, year, entrytype, journal, booktitle, volume, and number. This allows a user to have articles with the same title from the same authors in the same year, but in different venues or volumes (e.g. a technical report and the corresponding journal article).
In contrast, the inter hash is less specific and only includes title, year, and author or editor (depending on what the user has entered).
In both hashes, all fields which are taken into account are normalized: certain special characters are removed, and whitespace as well as author/editor names are normalized. Person names are normalized by concatenating the first letter of the first name, a dot, and the last name, all in lower case. The persons are then sorted alphabetically by this string and joined with colons.
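The following small helper illustrates this normalization rule. It is only a sketch written against plain Java strings; PUMA's actual implementation lives in PersonNameUtils (see getNormalizedPersons in the code below) and may handle edge cases such as missing first names differently.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PersonNormalizationSketch {

    /** Normalizes "First Last" to "f.last" in lower case. */
    static String normalizePerson(final String firstName, final String lastName) {
        return firstName.trim().toLowerCase().charAt(0) + "." + lastName.trim().toLowerCase();
    }

    /** Normalizes each person, sorts the results alphabetically, and joins them with colons. */
    static String normalizePersons(final List<String[]> persons) {
        final List<String> normalized = new ArrayList<>();
        for (final String[] person : persons) {
            normalized.add(normalizePerson(person[0], person[1]));
        }
        Collections.sort(normalized);
        return String.join(":", normalized);
    }

    public static void main(String[] args) {
        final List<String[]> persons = new ArrayList<>();
        persons.add(new String[] {"Sergey", "Brin"});
        persons.add(new String[] {"Lawrence", "Page"});
        // prints "l.page:s.brin"
        System.out.println(normalizePersons(persons));
    }
}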
To demonstrate the generation of inter and intra hashes, you can go to the hash example page and fill out the form displayed there. PUMA will then calculate both hashes.
The computation of the hashes is done in the class org.bibsonomy.model.util.SimHash.
It contains the following code to compute the intra hash:
// normalize the relevant fields, concatenate them with spaces, and hash the result with MD5
public static String getSimHash2(final BibTex bibtex) {
    return StringUtils.getMD5Hash(StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getTitle()) + " " +
            StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(PersonNameUtils.serializePersonNames(bibtex.getAuthor(), false)) + " " +
            StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(PersonNameUtils.serializePersonNames(bibtex.getEditor(), false)) + " " +
            StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getYear()) + " " +
            StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getEntrytype()) + " " +
            StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getJournal()) + " " +
            StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getBooktitle()) + " " +
            StringUtils.removeNonNumbersOrLetters(bibtex.getVolume()) + " " +
            StringUtils.removeNonNumbersOrLetters(bibtex.getNumber())
    );
}
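Judging from their names, removeNonNumbersOrLettersOrDotsOrSpace strips every character that is not a letter, a digit, a dot, or a space, while removeNonNumbersOrLetters (used for volume and number) is even stricter. The regex-based sketch below is only an assumption about their behaviour; the authoritative implementations are in the StringUtils class of the repository.

import java.util.regex.Pattern;

public class StringNormalizationSketch {

    private static final Pattern NOT_LETTER_DIGIT_DOT_SPACE = Pattern.compile("[^\\p{L}\\p{N}. ]");
    private static final Pattern NOT_LETTER_DIGIT = Pattern.compile("[^\\p{L}\\p{N}]");

    /** Keeps only letters, digits, dots, and spaces (assumed behaviour of removeNonNumbersOrLettersOrDotsOrSpace). */
    static String removeNonNumbersOrLettersOrDotsOrSpace(final String s) {
        return s == null ? "" : NOT_LETTER_DIGIT_DOT_SPACE.matcher(s).replaceAll("");
    }

    /** Keeps only letters and digits (assumed behaviour of removeNonNumbersOrLetters). */
    static String removeNonNumbersOrLetters(final String s) {
        return s == null ? "" : NOT_LETTER_DIGIT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "A Survey of Trust 2nd ed."
        System.out.println(removeNonNumbersOrLettersOrDotsOrSpace("A Survey of Trust (2nd ed.)"));
        // prints "12"
        System.out.println(removeNonNumbersOrLetters("1-2"));
    }
}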
The following code computes the inter hash:
public static String getSimHash1(final BibTex publication) {
    if (!present(StringUtils.removeNonNumbersOrLetters(PersonNameUtils.serializePersonNames(publication.getAuthor())))) {
        // no author set --> take editor
        return StringUtils.getMD5Hash(getNormalizedTitle(publication.getTitle()) + " " +
                PersonNameUtils.getNormalizedPersons(publication.getEditor()) + " " +
                getNormalizedYear(publication.getYear()));
    }
    // author set
    return StringUtils.getMD5Hash(getNormalizedTitle(publication.getTitle()) + " " +
            PersonNameUtils.getNormalizedPersons(publication.getAuthor()) + " " +
            getNormalizedYear(publication.getYear()));
}
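A rough usage sketch is shown below. It assumes the model classes offer JavaBean-style setters matching the getters used above and that author names are represented as PersonName objects with a (first name, last name) constructor; both assumptions should be checked against the version of the model classes in the repository.

import java.util.Arrays;

import org.bibsonomy.model.BibTex;
import org.bibsonomy.model.PersonName;
import org.bibsonomy.model.util.SimHash;

public class SimHashUsageSketch {
    public static void main(String[] args) {
        final BibTex publication = new BibTex();
        // setter names are assumed from the getters used in SimHash above
        publication.setTitle("The Anatomy of a Large-Scale Hypertextual Web Search Engine");
        publication.setAuthor(Arrays.asList(new PersonName("Sergey", "Brin"), new PersonName("Lawrence", "Page")));
        publication.setYear("1998");
        publication.setEntrytype("article");
        publication.setJournal("Computer Networks and ISDN Systems");
        publication.setVolume("30");
        publication.setNumber("1-7");

        // strict hash for comparing the posts of a single user
        final String intraHash = SimHash.getSimHash2(publication);
        // relaxed hash for comparing the posts of different users
        final String interHash = SimHash.getSimHash1(publication);
        System.out.println(intraHash);
        System.out.println(interHash);
    }
}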
To see how the other helper functions work, have a look at the Bitbucket repository.