This essay was originally written 25 Sept 2021 for Professor Leinecker's Digital Forensics I course.
Data hiding, or steganography, is the practice of hiding data within other data. The hidden data is called the payload; it can be text, images, video, or any other kind of data. The data it is hidden in is called the carrier, which can likewise be any type of data. Payloads are often hidden in slack space or free space. They can also be hidden by replacing carrier data values with payload data values.
Data hiding has been practiced for centuries. Early techniques included writing in invisible ink and using wax to cover stones engraved with a secret message. Today, data hiding is done electronically: you can break any file you want to hide into bits, place those bits somewhere they will not be noticed, and reconstruct the file at a later time. The technique is effective because the result is invisible to the human eye. An image or audio file that has data hidden in it usually looks or sounds normal, and the hidden data goes undetected unless the file is forensically analyzed.
To find a file hidden inside another file, begin by looking for its file signature, or header. This is a sequence of bytes, usually written as hex values, that every file of that type begins with. For a JPEG, it is FF D8 FF: every JPEG image, including those in the JFIF format, begins with these three bytes (six hex digits). Make those bytes the start of your block, and have it end at the file trailer, the last bytes in every file of that type (FF D9 for a JPEG). Extract that block to its own file and you have the original file that was hidden.
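The header-to-trailer search described above can be sketched in a few lines of Python. This is a minimal illustration, not a forensic tool: the function name carve_jpeg is made up for this example, and it assumes the first FF D9 after the header is the real end of the image, which is not always true (an embedded thumbnail, for instance, carries its own trailer).

```python
from typing import Optional

JPEG_HEADER = bytes.fromhex("FFD8FF")   # file signature from the text
JPEG_TRAILER = bytes.fromhex("FFD9")    # JPEG end-of-image marker


def carve_jpeg(blob: bytes) -> Optional[bytes]:
    """Return the first header-to-trailer block found in blob, or None.

    Hypothetical helper: assumes the first trailer after the header
    really ends the hidden image.
    """
    start = blob.find(JPEG_HEADER)
    if start == -1:
        return None
    end = blob.find(JPEG_TRAILER, start + len(JPEG_HEADER))
    if end == -1:
        return None
    # Include the trailer bytes themselves in the carved block.
    return blob[start:end + len(JPEG_TRAILER)]
```

In practice you would read the carrier with open(path, "rb").read(), pass the bytes to carve_jpeg, and write the result to a new file for inspection.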
Finding hidden text in a file can be more difficult because of encryption. If human-readable text is hidden inside a file, you can easily find it by extracting strings of consecutive human-readable characters from the file. This works because most of the bytes in a binary file do not decode to readable text (letters, digits, and punctuation), so long readable runs are rare. For example, a typical stretch of a file might read: “ÀYŒ€c¸+k·£‘»zzŠñït”. If, however, you find a string of consecutive characters that reads “This is a code”, it was probably inserted on purpose. If the text inserted into a file is encrypted into a form that mixes readable and non-readable characters, it becomes very difficult to distinguish from the original file’s data.
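As a rough sketch of the string-extraction idea, the Python below pulls runs of printable ASCII out of raw bytes. The function name printable_runs and the minimum run length of 4 are assumptions chosen for illustration; a longer minimum reduces false positives at the cost of missing short messages.

```python
import re
from typing import List


def printable_runs(blob: bytes, min_len: int = 4) -> List[str]:
    """Return every run of at least min_len printable ASCII characters.

    Hypothetical helper: 'printable' here means bytes 0x20 through 0x7E
    (space through tilde), which is an assumption of this sketch.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(blob)]
```

Encrypted text defeats this approach for the reason given above: its bytes are spread across the whole range of values, so no long printable run appears.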
One example of replacing carrier values with payload values is hiding data within image color values. An image stores its RGB values as bytes. The least significant bit of each byte often makes no noticeable difference in the color or quality of an image, because changing it alters the value by at most 1 out of 255, less than half of a percent. The human eye struggles to perceive differences that small. To perform the data hiding, you work with two bit streams: one is the bits of the data you wish to hide, called the payload bits, and the other is the least significant bits of the RGB values in the image, called the carrier bits. You simply replace the carrier bits with the payload bits. Afterward, the image should look no different to the human eye, but your data is hidden within it. To recover your data, you go through the RGB values in the same order, extract the least significant bit of each, and concatenate the bits. When you have finished, the payload bit stream will be exactly the same as before. The only problem with this method is that you cannot return the carrier image to its original, unchanged form unless you saved the bits that were replaced.
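The bit replacement described above can be sketched as follows. To stay self-contained, this example treats the carrier as a plain sequence of color-channel bytes rather than decoding a real image file, and the names embed_lsb and extract_lsb are hypothetical. It also assumes the recipient knows the payload length in advance, which the essay does not address.

```python
def embed_lsb(carrier: bytes, payload: bytes) -> bytes:
    """Replace the least significant bit of each carrier byte with a payload bit."""
    # Flatten the payload into bits, most significant bit of each byte first.
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(carrier):
        raise ValueError("carrier too small for payload")
    out = bytearray(carrier)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # clear the low bit, then set it
    return bytes(out)


def extract_lsb(carrier: bytes, payload_len: int) -> bytes:
    """Rebuild payload_len bytes from the low bits of the carrier bytes."""
    if payload_len * 8 > len(carrier):
        raise ValueError("carrier too small for requested length")
    out = bytearray()
    for i in range(payload_len):
        value = 0
        for channel in carrier[i * 8:(i + 1) * 8]:
            value = (value << 1) | (channel & 1)  # concatenate the low bits
        out.append(value)
    return bytes(out)
```

Each payload byte consumes eight carrier bytes, so a carrier of N color values can hold at most N // 8 bytes of payload under this scheme. Every modified value differs from the original by at most 1.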
Another method of data hiding uses slack space. Slack space exists when a file fills only a portion of the last cluster allocated to it. The rest of that cluster is simply unused, and you can write whatever data you wish to it, which makes it a convenient carrier for payloads. The only caveat is that if that space gets overwritten through normal use of the file system, you may lose your payload bits. This method therefore works best on media that are rarely written to, such as archive or backup file systems.
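To get a feel for how much room slack space offers, the sketch below computes the unused bytes in a file's last cluster. The function name slack_bytes and the default cluster size of 4096 bytes are assumptions for illustration; real cluster sizes depend on the file system and how it was formatted, and some file systems store very small files without allocating a cluster at all.

```python
def slack_bytes(file_size: int, cluster_size: int = 4096) -> int:
    """Return the number of unused bytes in the last cluster of a file.

    Simplified model: the file occupies whole clusters, and only the
    final cluster can be partially filled.
    """
    if file_size < 0 or cluster_size <= 0:
        raise ValueError("sizes must be non-negative, cluster size positive")
    remainder = file_size % cluster_size
    # A file that ends exactly on a cluster boundary leaves no slack.
    return 0 if remainder == 0 else cluster_size - remainder
```

For example, under these assumptions a 10,000-byte file occupies three 4096-byte clusters (12,288 bytes), leaving 2,288 bytes of slack that could hold a payload.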