Collision (computer science)
Due to the possible applications of hash functions in data management and computer security (in particular, cryptographic hash functions), collision avoidance has become a fundamental topic in computer science.
Collisions are unavoidable whenever members of a very large set (such as all possible person names, or all possible computer files) are mapped to a relatively short bit string. This is merely an instance of the pigeonhole principle.
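The pigeonhole argument can be made concrete with a short sketch. Here a SHA-256 digest is deliberately truncated to a 16-bit fingerprint (the truncation length and the `short_fingerprint` helper are choices made for illustration, not part of any standard), so only 65,536 fingerprint values exist and a collision must appear quickly:

```python
import hashlib
from itertools import count

def short_fingerprint(data: bytes, nbytes: int = 2) -> bytes:
    """Truncate SHA-256 to a 2-byte (16-bit) fingerprint.  Only
    2**16 distinct fingerprints exist, so by the pigeonhole
    principle any set of more than 65,536 inputs must collide."""
    return hashlib.sha256(data).digest()[:nbytes]

def find_collision():
    """Hash successive inputs until two map to the same fingerprint."""
    seen = {}
    for i in count():
        msg = str(i).encode()
        fp = short_fingerprint(msg)
        if fp in seen:
            return seen[fp], msg  # two distinct inputs, same fingerprint
        seen[fp] = msg

a, b = find_collision()
assert a != b
assert short_fingerprint(a) == short_fingerprint(b)
```

By the birthday bound, a collision on a 16-bit fingerprint is expected after only a few hundred inputs; the same argument applies to any fixed-length hash, just at astronomically larger scales.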
The impact of collisions depends on the application. When hash functions and fingerprints are used to identify similar data, such as homologous DNA sequences or similar audio files, the functions are designed so as to maximize the probability of collision between distinct but similar data. Checksums, on the other hand, are designed to minimize the probability of collisions between similar inputs, without regard for collisions between very different inputs.
Although hashing is generally fast and efficient, it has limitations that can cause problems in computer security. Prevention methods have been devised to protect against hash collisions, but hashing is still not a perfect operation.
When hashing strings, multiple distinct strings can hash to the same value. Attackers can exploit this to bypass security measures such as username and password authentication. An example on the website "Learn Cryptography" shows the two strings "hello" and "8571935789325698" hashing to the same value, "89232323". This matters to an attacker because when passwords are stored in a database as hashes, the attacker could enter the password "8571935789325698" and log in successfully even though the actual password is "hello". Although brute-forcing inputs to find a hash collision takes time, attackers can instead wait until computational power increases enough to make finding such collisions faster. In the meantime, they can search for servers or sites that still use outdated hash functions, which can be broken with less computational power.
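The attack described above can be sketched in a few lines. The 16-bit `toy_hash` below is a deliberately weak stand-in for a real password hash (the function name, the truncation, and the `guess{i}` search space are all assumptions made for this illustration), showing how any colliding input passes the login check:

```python
import hashlib

def toy_hash(s: str) -> int:
    # Deliberately weak 16-bit hash, so collisions are easy to find.
    # Real password hashes are much longer, but the logic is identical.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:2], "big")

stored = toy_hash("hello")  # the database stores only the hash

def login(attempt: str) -> bool:
    # The check compares hashes, not passwords, so ANY input that
    # collides with "hello" under toy_hash is accepted.
    return toy_hash(attempt) == stored

# Brute-force a second preimage: a different string with the same hash.
candidate = next(s for i in range(1_000_000)
                 if (s := f"guess{i}") != "hello" and toy_hash(s) == stored)
assert login(candidate) and candidate != "hello"
```

With only 65,536 possible hash values, a collision is found after roughly 65,000 guesses on average; a full-length cryptographic hash makes the same search infeasible in practice, which is the entire point of using one.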
As computational power increases, hash functions must become correspondingly stronger. An article published in February 2017 describes SHA-1 as a weak hash function: Google researchers were able "to produce two different documents that have the same SHA-1 hash signature". This is a troubling result, because it means that anything relying on SHA-1 can potentially be attacked through deliberately computed collisions, given enough time. For this reason, it is highly recommended that any application using SHA-1 switch to a stronger hash algorithm such as SHA-256, which produces a longer hash value than SHA-1, making a collision far less likely.
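The difference in output length between the two algorithms can be seen directly with Python's standard `hashlib` module (the message `b"example"` is arbitrary):

```python
import hashlib

msg = b"example"

sha1_digest = hashlib.sha1(msg).digest()      # 160 bits = 20 bytes
sha256_digest = hashlib.sha256(msg).digest()  # 256 bits = 32 bytes

assert len(sha1_digest) * 8 == 160
assert len(sha256_digest) * 8 == 256
```

Going from 160 to 256 output bits raises the expected cost of a generic birthday-style collision search from about 2^80 to about 2^128 hash evaluations, independent of any structural weakness in the function itself.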
Moving forward, research is being done to find ways to improve how information is hashed. Methods proposed to mitigate hash collisions include shuffling bits, applying compression algorithms, and using T-functions or linear-feedback shift registers (LFSRs). These methods can be applied to the data in a preprocessing stage, before the hash function is used, and the process can be applied multiple times to further increase the variance in the hashing process.
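One of the listed ideas, LFSR-based preprocessing, can be sketched as follows. This is an illustrative sketch of the general "preprocess, then hash" pattern, not any published scheme: the tap positions, the seed, and the `lfsr_whiten` and `preprocessed_hash` names are all choices made here.

```python
import hashlib

def lfsr_whiten(data: bytes, seed: int = 0xACE1) -> bytes:
    """XOR each input byte with a keystream byte drawn from a 16-bit
    Fibonacci LFSR (taps at bits 0, 2, 3, 5, i.e. the polynomial
    x^16 + x^14 + x^13 + x^11 + 1).  Applying it twice with the same
    seed restores the original data, since XOR is its own inverse."""
    state = seed
    out = bytearray()
    for byte in data:
        ks = 0
        for _ in range(8):  # advance the LFSR 8 steps per keystream byte
            bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            ks = (ks << 1) | (state & 1)
        out.append(byte ^ ks)
    return bytes(out)

def preprocessed_hash(data: bytes) -> str:
    # Preprocessing stage (LFSR whitening) followed by the hash proper.
    return hashlib.sha256(lfsr_whiten(data)).hexdigest()
```

Because the whitening step is a bijection on the input, it does not change how many collisions exist; its purpose in such proposals is to scramble structured or similar inputs before they reach the hash function.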
- Jered Floyd (2008-07-18). "What do Hash Collisions Really Mean?". Permabits and Petabytes (permabit.wordpress.com). Retrieved 2011-03-24.
For the long explanation on cryptographic hashes and hash collisions, I wrote a column a bit back for SNW Online, “What you need to know about cryptographic hashes and enterprise storage”. The short version is that deduplicating systems that use cryptographic hashes use those hashes to generate shorter “fingerprints” to uniquely identify each piece of data, and determine if that data already exists in the system. The trouble is, by a mathematical rule called the “pigeonhole principle”, you can’t uniquely map any possible files or file chunk to a shorter fingerprint. Statistically, there are multiple possible files that have the same hash.
- A more in-depth analysis and description of this process can be found at: https://www.peertopatent.org/method-for-preventing-and-detecting-hash-collisions-of-data-during-the-data-transmission/