• The2b@lemmy.vg · 2 days ago

      Not only is md5sum not proven to have equal distribution, it is specifically known not to have equal distribution, only nearly equal distribution.

      A hash function is any function that deterministically converts input of arbitrary size into output of a fixed size. There are no other requirements. A hash for a simple job could be as basic as adding up the ASCII values of the input and returning the sum. Needless to say, that would not have an even distribution.
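
      To make that concrete, here’s a minimal Python sketch of such an “add the ASCII values” hash (the function name and bucket count are just illustrative):

          def toy_hash(text, buckets=256):
              # Deterministic and fixed output range: technically a hash function.
              return sum(ord(c) for c in text) % buckets

          # Anagrams always collide, and short inputs can only reach small sums,
          # so the outputs are nowhere near evenly distributed.
          print(toy_hash("listen"), toy_hash("silent"))  # prints the same value twice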

      • FrederikNJS@lemmy.zip · 2 days ago

        You are correct for regular hash functions, but a cryptographic hash function has stronger requirements.

        MD5 was supposed to be a cryptographic hash function, but it was found to be flawed all the way back in 1996, and it has been discouraged ever since… Now it’s too weak to be used in a cryptographic setting, and too slow to be used in non-cryptographic settings.

        This is why a hash like xxHash is considered a non-cryptographic hash function, while SHA-256 is considered a cryptographic hash function.
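
        For illustration, Python’s standard hashlib covers the cryptographic side; xxHash lives in a separate third-party package (this is just a sketch of the distinction, not a benchmark):

            import hashlib

            data = b"hello world"
            print(hashlib.md5(data).hexdigest())     # 128-bit digest, no longer trusted for security
            print(hashlib.sha256(data).hexdigest())  # 256-bit digest, still considered collision-resistant

            # Non-cryptographic hashes like xxHash aren't in the standard library;
            # the third-party "xxhash" package provides e.g. xxhash.xxh64(data).hexdigest().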

      • squaresinger@lemmy.world · 2 days ago

        This is about cryptographic hash functions (to be fair, I could have spelled that out in my prior comment, but when someone talks about anything security-relevant in conjunction with hashing, they generally mean cryptographic hash functions).

        MD5 is not a cryptographic hashing function for exactly these reasons.

        Also, the example you gave in your original comment wasn’t actually about distribution but about symbol space.

        By multiplying by four (and I guess you implicitly meant that the bit length of the hash stays the same, so multiplying by four is a left shift by two: the two high bits are discarded and the two low bits are always zero), you are reducing the number of possible hashes by a factor of four, because 3/4 of all potential hashes can no longer occur. So sure, if your 64-bit hash is actually only a 62-bit hash padded with two constant 0 bits, then of course you have to calculate the collision chance for 62 bits and not 64 bits.
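
        A rough birthday-bound estimate (the billion-item count below is just an illustrative number) shows that factor of four directly:

            def approx_collision_chance(n, bits):
                # First-order birthday approximation, valid while the result is small.
                return n * (n - 1) / (2 * 2**bits)

            n = 1_000_000_000
            print(approx_collision_chance(n, 64))  # genuine 64-bit hash
            print(approx_collision_chance(n, 62))  # "64-bit" hash with two constant bits: 4x the chance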

        But if all hashes are still possible, and the distribution is merely not perfectly even (as is the case with MD5), then the overall chance of collisions barely changes. Some hash values are slightly more likely and therefore slightly more collision-prone, but with a nearly even distribution the effect is negligible.

        • tyler@programming.dev · 1 day ago

          The article calls out hash functions and links to the relevant Wikipedia page, so I don’t think this is solely about cryptographic hash functions, though that seems to be what you were talking to the other user about.

          • squaresinger@lemmy.world · 1 day ago

            I see what you are saying. But if you aren’t using a cryptographic hash function then collisions don’t matter in your use case anyway, otherwise you’d be using a cryptographic hash function.

            For example, you’d use a non-cryptographic hash function for a hashmap. While collisions aren’t exactly desirable in that use case, they also aren’t bad; in fact, the whole process is designed with them in mind. And it doesn’t matter at all that the distribution might not be perfect.

            So when we are talking about a context where collisions matter, there’s no question whether you should use a cryptographic hash or not.

            • tyler@programming.dev · 8 hours ago

              Why wouldn’t collisions matter in a hash map? They directly affect the speed of the hash map. In fact, I would venture to say that collisions directly affect speed in all situations. That matters, right? Especially at the language level.

              • squaresinger@lemmy.world · 5 hours ago

                If you have a hash collision in a cryptography context, you have a broken system. For example, MD5 became useless for validating files, because anyone can create collisions without a ton of effort, and thus comparing an MD5 sum doesn’t tell you whether you have an unmodified file or not.
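
                As a sketch of what that validation looks like (the file name is made up, and the mechanics are identical for MD5 and SHA-256; only the collision resistance differs):

                    import hashlib

                    def file_digest(path, algorithm="sha256"):
                        h = hashlib.new(algorithm)
                        with open(path, "rb") as f:
                            for chunk in iter(lambda: f.read(65536), b""):
                                h.update(chunk)
                        return h.hexdigest()

                    # file_digest("download.iso", "md5")     # collisions can be manufactured, so a match proves little
                    # file_digest("download.iso", "sha256")  # still considered collision-resistant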

                In a hash map, collisions are part of the system. Sure, you’d prefer not to have collisions if possible, but if you do, you’ll just have two values in the same bucket; no big issue.

                In fact, using a more complex hash function to push the collision rate down further will likely hurt your performance more overall, because calculating the hash takes that much longer.
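
                A toy separate-chaining map (illustrative only, not how any particular language implements it) shows why a collision is just an extra entry in a bucket rather than a failure:

                    class ChainedMap:
                        def __init__(self, buckets=8):
                            self.buckets = [[] for _ in range(buckets)]

                        def put(self, key, value):
                            bucket = self.buckets[hash(key) % len(self.buckets)]
                            for i, (k, _) in enumerate(bucket):
                                if k == key:
                                    bucket[i] = (key, value)  # same key: overwrite
                                    return
                            bucket.append((key, value))       # collision: just append to the bucket

                        def get(self, key):
                            bucket = self.buckets[hash(key) % len(self.buckets)]
                            for k, v in bucket:               # a collision only means scanning a short list
                                if k == key:
                                    return v
                            raise KeyError(key)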