Hashing

1.Hashing

Hashing类提供了一系列静态方法来生成HashFunction实例,以及其他的一些静态hash相关的工具.

1.1 背景

java源生的哈希值被限定为32位(因为他是使用的内存地址做hash),并且在算法和算法执行数据上面没有做划分(想想瓜哇的划分很精彩),因此不容易动态的改变算法的实现.并且hashCode的实现质量低劣,部分是因为大量的代码依赖于这个质量低劣的hashCode,包括jdk本身的类.

hashCode的实现效率是很高的,但是弱于碰撞预防和no expectation of bit dispersion,这使得他们非常适合在哈希表中使用,因为额外的碰撞仅仅导致轻微的性能损失,然而bit dispersion也可以通过使用二级哈希函数很容易进行修正.对于哈希函数在数据结构上的使用,hashCode总是显得不足,这就是为啥有了这个库.

1.2 Provided Hash Functions(提供的函数)

md5()
murmur3_128()
murmur3_32()
sha1()
sha256()
sha512()
goodFastHash(int bits)

1.3 用法

HashFunction hf = Hashing.md5();
HashCode hc = hf.newHasher()
       .putLong(id)
       .putString(name, Charsets.UTF_8)
       .putObject(person, personFunnel)
       .hash();

1.4 源码分析

/**
  * Returns a hash function implementing the MD5 hash algorithm (128 hash bits).
  *
  * @deprecated If you must interoperate with a system that requires MD5, then use this method,
  *     despite its deprecation. But if you can choose your hash function, avoid MD5, which is
  *     neither fast nor secure. As of January 2017, we suggest:
  *     <ul>
  *       <li>For security:
  *           {@link Hashing#sha256} or a higher-level API.
  *       <li>For speed: {@link Hashing#goodFastHash}, though see its docs for caveats.
  *     </ul>
  */
@Deprecated
public static HashFunction md5() {
  return Md5Holder.MD5;
}

private static class Md5Holder {
  static final HashFunction MD5 = new MessageDigestHashFunction("MD5", "Hashing.md5()");
}

/**
  * Returns a hash function implementing the <a
  * href="https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp">32-bit murmur3
  * algorithm, x86 variant</a> (little-endian variant), using a seed value of zero.
  *
  * <p>The exact C++ equivalent is the MurmurHash3_x86_32 function (Murmur3A).
  */
public static HashFunction murmur3_32() {
  return Murmur3_32HashFunction.MURMUR3_32;
}

2.BloomFilter

布隆过滤器(bloom filter)是非常经典的,以空间换时间的算法.它实际上是一个很长的二进制向量和一系列随机映射函数.布隆过滤器可以用于检索一个元素是否在一个集合中.它的优点是空间效率和查询时间都远远超过一般的算法,缺点是有一定的误识别率和删除困难.

2.1 原理

布隆过滤器的原理是,当一个元素被加入集合时,通过K个散列函数将这个元素映射成一个位数组中的K个点,把它们置为1.检索时,我们只要看看这些点是不是都是1就(大约)知道集合中有没有它了:如果这些点有任何一个0,则被检元素一定不在;如果都是1,则被检元素很可能在.这就是布隆过滤器的基本思想

2.2 设计思路

2.2.1 单个元素hash

Hashing过程

创建一个非常长的二进制数组
通过随机映射函数将数据分配在不同的数据位置上
将这个位置的字节从0变为1

这里面最关键的地方就是,随机映射函数的使用.我们首先希望数据可以平均分配在这个数组上,同时速度必须要快.这种情况,最合适的是hash算法,guava中使用的hash算法,是murmurhash,于2008年被发明.这个算法hbase,redis,kafka都在使用.

既然选用hash算法,必然就会存在碰撞的可能.两个不完全相同的值计算出来的hash值难免会一致.多次使用hash算法,为同一个值取不同的多个hash,取的越多.碰撞率的几率就越小.

2.2.2 多个元素hash

变量定义:

k : hash函数个数
fpp: 误报率
m: 二进制数组大小
n: 预估的数据量大小

Bit数组大小选择 根据预估数据量n以及误判率fpp,bit数组大小的m的计算方式

哈希函数选择 由预估数据量n以及bit数组长度m,可以得到一个hash函数的个数

2.3 用法

BloomFilter<Person> friends = BloomFilter.create(personFunnel, 500, 0.01);
for (Person friend : friendsList) {
  friends.put(friend);
}
// much later
if (friends.mightContain(dude)) {
  // the probability that dude reached this place if he isn't a friend is 1%
  // we might, for example, start asynchronously loading things for dude while we do a more expensive exact check
}

2.4 源码分析

看看Guava中BloomFilter中对于m和k值计算的实现

/**
 * 计算 Bloom Filter的bit位数m
 *
 * <p>See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the
 * formula.
 *
 * @param n 预期数据量
 * @param p 误判率 (must be 0 < p < 1)
 */
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
    p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

/**
 * 计算最佳k值,即在Bloom过滤器中插入的每个元素的哈希数
 *
 * <p>See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
 *
 * @param n 预期数据量
 * @param m bloom filter中总的bit位数 (must be positive)
 */
@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

put的过程,hash策略以MURMUR128_MITZ_64为例

public <T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
    long bitSize = bits.bitSize();

    //利用MurmurHash3得到数据的hash值对应的字节数组
    byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();

    //取低8个字节、高8个字节,转成long类型
    long hash1 = lowerEight(bytes);
    long hash2 = upperEight(bytes);

    boolean bitsChanged = false;

    //这里的combinedHash = hash1 + i * hash2
    long combinedHash = hash1;

    //根据combinedHash,得到放入的元素在bit数组中的k个位置,将其置1
    for (int i = 0; i < numHashFunctions; i++) {
    bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
    combinedHash += hash2;
    }
    return bitsChanged;
}

判断元素是否在bloom filter中

public <T> boolean mightContain(T object, Funnel<? super T> funnel, int numHashFunctions, BloomFilterStrategies.LockFreeBitArray bits) {
    long bitSize = bits.bitSize();
    long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
    int hash1 = (int)hash64;
    int hash2 = (int)(hash64 >>> 32);

    for(int i = 1; i <= numHashFunctions; ++i) {
        int combinedHash = hash1 + i * hash2;
        if (combinedHash < 0) {
            combinedHash = ~combinedHash;
        }

        if (!bits.get((long)combinedHash % bitSize)) {
            return false;
        }
    }

    return true;
}

2.5 应用

爬虫过滤已抓到的url就不再抓,可用bloom filter过滤
垃圾邮件过滤.如果用哈希表,每存储一亿个 email地址,就需要 1.6GB的内存(用哈希表实现的具体办法是将每一个 email地址对应成一个八字节的信息指纹,然后将这些信息指纹存入哈希表,由于哈希表的存储效率一般只有 50%,因此一个 email地址需要占用十六个字节.一亿个地址大约要 1.6GB,即十六亿字节的内存).因此存贮几十亿个邮件地址可能需要上百 GB的内存.而Bloom Filter只需要哈希表 1/8到 1/4 的大小就能解决同样的问题

11.Hashing