Can any file and its data be converted to a String or plaintext?
I've noticed some files I opened in a text editor have all kinds of crazy unrenderable chars
Can it? Sure, almost any arrangement of bits can be converted into some kind of Unicode text. Can it be converted to something meaningful or readable? No. Some formats are plain text meant to be read by humans (.txt, .ini, .json, .html, for some random examples), and others are binary formats that are only meaningful when decoded by software into specific data structures.
Yes, see Binary-to-text encoding (e.g., Base64).
Can you comment on the specific makeup of a "rendered" audio file in plaintext, how is the computer representing every little noise bit of sound at any given point, the polyphony etc?
What are the conventions of such representation? How can a spectrogram tell pitches are where they are, how is the computer representing that?
Is it the same to view plaintext as analysing it with a hex-viewer?
There are two things at play here.
MP3 (or WAV, OGG, FLAC, etc.) provides a way to encode polyphony, stereo and such into a sequence of bytes.
And then separately, there's Unicode (or ASCII) for encoding letters into bytes. These are just big tables which say e.g.:
01000001 = uppercase 'A'
01000010 = uppercase 'B'
01100001 = lowercase 'a'
So what your text editor does is look at the sequence of bytes that the MP3 encoder produced, look each one up in its table, and somewhat erroneously interpret them as individual letters.
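To make that table lookup concrete, here's a small Python sketch. The "binary" bytes are made up for the example (they're not taken from a real MP3), and Latin-1 and UTF-8 are just two encodings an editor might plausibly try.

```python
# The three bytes from the table above: valid ASCII letters.
letters = bytes([0b01000001, 0b01000010, 0b01100001])
print(letters.decode("ascii"))  # ABa

# Made-up bytes standing in for a chunk of non-text data.
blob = bytes([0xFF, 0xFB, 0x90, 0x00, 0x41])

# Latin-1 assigns a character to every byte value, so decoding never fails,
# but the result is meaningless as text.
as_latin1 = blob.decode("latin-1")

# UTF-8 is stricter: bytes that don't form a valid sequence are shown as
# U+FFFD, the replacement character many editors draw as a box or "?".
as_utf8 = blob.decode("utf-8", errors="replace")

print(repr(as_latin1))
print(repr(as_utf8))
```

Same bytes, two different "texts", which is why different editors can show different garbage for the same file.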
I think you are conflating a few different concepts here.
Can you comment on the specific makeup of a “rendered” audio file in plaintext, how is the computer representing every little noise bit of sound at any given point, the polyphony etc?
What are the conventions of such representation? How can a spectrogram tell pitches are where they are, how is the computer representing that?
This is a completely separate concern from how data can be represented as text, and will vary by audio format. The "simplest", PCM encoded audio like in a .wav file, doesn't really concern itself at all with polyphony and is just a quantised representation of the audio wave amplitude at any given instant in time. It samples that tens of thousands of times per second. Whether it's a single pure tone or a full symphony the density of what's stored is the same. Just an air-pressure-over-time graph, essentially.
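Here's a rough Python sketch of that idea, assuming 16-bit samples at 44,100 samples per second (a common choice, not something the format forces). The `pcm16`, `tone` and `chord` names are made up for illustration. It samples a single tone and a three-note chord and shows that both take exactly the same number of bytes.

```python
import math
import struct

SAMPLE_RATE = 44_100  # samples per second (assumed; CD audio uses this rate)

def pcm16(signal, milliseconds):
    """Sample signal(t), with values in -1..1, into 16-bit little-endian PCM."""
    n = SAMPLE_RATE * milliseconds // 1000
    out = bytearray()
    for i in range(n):
        value = max(-1.0, min(1.0, signal(i / SAMPLE_RATE)))
        out += struct.pack("<h", int(value * 32767))  # one amplitude, 2 bytes
    return bytes(out)

def tone(t):
    # A single pure 440 Hz sine wave.
    return math.sin(2 * math.pi * 440 * t)

def chord(t):
    # Three sine waves added together, scaled to stay within -1..1.
    return sum(math.sin(2 * math.pi * f * t) for f in (262, 330, 392)) / 3

tone_bytes = pcm16(tone, 10)
chord_bytes = pcm16(chord, 10)
print(len(tone_bytes), len(chord_bytes))  # 882 882
```

The chord isn't stored as "three notes". The waves are simply added together before sampling, and it's up to whatever analyses the samples later (a spectrogram, your ear) to pull the pitches back apart.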
Is it the same to view plaintext as analysing it with a hex-viewer?
"Plaintext" doesn't really have a fixed definition in this context. It can be the same as looking at it in a hex viewer, if your "plaintext" representation is hexadecimal encoding. Binary data, like in audio files, isn't plaintext, and opening it directly in a text editor is not expected to give you a useful result, or even a consistent result. Different editors might show you different "text" depending on what encoding they fall back on, or how they represent unprintable characters.
There are several methods of representing binary data as text, such as hexadecimal, base64, or uuencode, but none of these representations, if saved as-is, is the original file, strictly speaking.
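A minimal Python sketch of two of those encodings, using only the standard library. The input is just every possible byte value once, chosen so that plenty of unprintable bytes are included.

```python
import base64

original = bytes(range(256))  # every possible byte value, once each

as_hex = original.hex()                              # 2 characters per byte
as_b64 = base64.b64encode(original).decode("ascii")  # 4 characters per 3 bytes

# Neither string is the original data, but both decode back to it exactly.
assert bytes.fromhex(as_hex) == original
assert base64.b64decode(as_b64) == original

print(len(original), len(as_hex), len(as_b64))  # 256 512 344
```

The text versions are bigger than the data they represent, which is the price of sticking to printable characters.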
Most binary-to-text encodings don’t attempt to make the text human-readable—they’re just intended to transmit the data over a text-only medium to a recipient who will decode it back to the original binary format.
At the end of the day, data is just binary, i.e. it's composed of 0s and 1s. What those 0s and 1s represent is mostly irrelevant to this discussion. The short version is that 01000001 can mean A, or it can mean that a given pixel's red channel is at 65 out of 255, or that the speaker cone should be at a specific position at that instant, etc.
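As a quick Python illustration of one byte wearing several hats: the pixel and audio readings below assume an 8-bit colour channel and unsigned 8-bit PCM (where 128 is silence), which are common conventions but not the only ones.

```python
import struct

byte = bytes([0b01000001])

as_letter = byte.decode("ascii")     # 'A'
as_number = byte[0]                  # 65
as_red = as_number / 255             # fraction of full red in an 8-bit channel
as_sample = (as_number - 128) / 128  # unsigned 8-bit PCM, centred on 128

print(as_letter, as_number, round(as_red, 3), as_sample)

# The same goes for longer runs: these four bytes are the text "RIFF",
# and also a perfectly valid little-endian 32-bit integer.
four = b"RIFF"
(as_int,) = struct.unpack("<I", four)
print(four.decode("ascii"), hex(as_int))
```

Nothing in the bytes themselves says which reading is "right". That comes from the file format, or from whatever program you open the file with.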
So what happens when you open a file that's not text in a text editor? Well, some of the 0s and 1s decode to gibberish, or to characters that are not meant to be printed. Fun fact: you should be able to do this the other way around too, i.e. open a text file as an image. Again it will be gibberish, and most likely it won't load at all, since image formats carry a lot of information about size, compression, etc., and if that's incorrect the program won't know what to do. Text has no such required structure, so an editor can almost always show something, although it might show weird things where there's a non-printable character.
Technically, yes. Any binary data can be encoded using just 64 printable characters, but the resulting string may not be English or any human language.
But it still contains the actual data in a faithfully reproducible/usable way?
Yes. Decoding a base64 encoded string will give you back the exact original data.
Importantly though, this isn't what you're seeing when you open files in a text editor as you describe in your original post, and if you copied the text of those files and saved a new copy it's very likely that it would not reproduce correctly.
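A small Python sketch of the difference, with made-up bytes. It assumes the editor decodes as UTF-8 and substitutes U+FFFD for anything invalid. Real editors vary, but many do something along these lines.

```python
import base64

# Made-up binary data: "RIFF" followed by bytes that aren't valid UTF-8.
original = bytes([0x52, 0x49, 0x46, 0x46, 0xFF, 0xFE, 0x80, 0x00])

# Roughly what "open in an editor, copy the text, save a new file" does.
shown = original.decode("utf-8", errors="replace")
resaved = shown.encode("utf-8")

# A proper binary-to-text round trip.
via_base64 = base64.b64decode(base64.b64encode(original))

print(resaved == original)     # False
print(via_base64 == original)  # True
```

Once a byte has been swapped for a replacement character, there's no way to tell which byte it used to be, so the copy is damaged for good.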
Yes, this method doesn't lose any bits. One of its primary uses was email, which was strictly text-only.
Are those binary files by any chance?
I just mean like any file (pdf, jpeg, mp4, mp3, exe—
mp4/mp3 most famously for me
I find it so damn cool and incredible that I can record something/anything right now, open the audio as a text file, and it's all right there, albeit in an incomprehensible format, but there altogether.
It's like a thinking rock etching sound into stone
If you're on Linux, you can convert that to plain printable text by piping it to base64. It works with any file, but I'll use an image here:
cat image.webp | base64
Which yields:
UklGRroEAABXRUJQVlA4WAoAAAAgAAAAYwAAQgAASUNDUKACAAAAAAKgbGNtcwRAAABtbnRyUkdC IFhZWiAH6AAIABoADgAJACBhY3NwQVBQTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9tYAAQAA AADTLWxjbXMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA1k ZXNjAAABIAAAAEBjcHJ0AAABYAAAADZ3dHB0AAABmAAAABRjaGFkAAABrAAAACxyWFlaAAAB2AAA ABRiWFlaAAAB7AAAABRnWFlaAAACAAAAABRyVFJDAAACFAAAACBnVFJDAAACFAAAACBiVFJDAAAC FAAAACBjaHJtAAACNAAAACRkbW5kAAACWAAAACRkbWRkAAACfAAAACRtbHVjAAAAAAAAAAEAAAAM ZW5VUwAAACQAAAAcAEcASQBNAFAAIABiAHUAaQBsAHQALQBpAG4AIABzAFIARwBCbWx1YwAAAAAA AAABAAAADGVuVVMAAAAaAAAAHABQAHUAYgBsAGkAYwAgAEQAbwBtAGEAaQBuAABYWVogAAAAAAAA 9tYAAQAAAADTLXNmMzIAAAAAAAEMQgAABd7///MlAAAHkwAA/ZD///uh///9ogAAA9wAAMBuWFla IAAAAAAAAG+gAAA49QAAA5BYWVogAAAAAAAAJJ8AAA+EAAC2xFhZWiAAAAAAAABilwAAt4cAABjZ cGFyYQAAAAAAAwAAAAJmZgAA8qcAAA1ZAAAT0AAACltjaHJtAAAAAAADAAAAAKPXAABUfAAATM0A AJmaAAAmZwAAD1xtbHVjAAAAAAAAAAEAAAAMZW5VUwAAAAgAAAAcAEcASQBNAFBtbHVjAAAAAAAA AAEAAAAMZW5VUwAAAAgAAAAcAHMAUgBHAEJWUDgg9AEAALAQAJ0BKmQAQwA+8WSmTqmlKCYvmWqp MB4JZQDLnNaF2NMD2L3xQGb5nmLiGhGWxQuD8kwUSXF0u2UTgX0YrR3MY2SsRCNEQ8hZ6WkCUTih LdmsElHZVzoMwO/fj4X/ZSNT2R9qgxwqgEed891j4KCNRLK/tUbG3hZ3Mw2kixguSFIEcAgBtv8w eAu0PwAA/upMzBqq+dcN8viO7FpqpV6GvPcRILm+HsOQblnpHx03lASjGlSyGbkKUD3xA5KOqgq/ VEUJ4qF9VoAYFbFhQRAgkvmREk5umMj8sr9Np95+n/oP2Aq2VW5xU4F1xpD8Vd4Dp7Phwm9w/Dnf 94djRROFRYPZeg/1Q/qiROFRVRu2nBcgndbhc0x0h+kgvT/naeJOEqwNjYPlIiw/DGuxav7+x09R mf2mJto3ineDqfyMWUN83PmKqzGHkYGhZrTU478qjlQucDzWkwobnUmzhE6I+mDYkfiUVPcHyXbf xXRStyPiPZAkJZrE9OrjFNUeljRQdVTQqeBsy+O9VwDLU5GcKhBQHa4cj+/DGqUhi74WH0EuHsb3 EgZVNc1FbRm5QFOpjDSprGIRYxe6sFFDrDOg4DhWZRnOa7s68pGaDDpbqrORxzPHXPbs55/1HTas DDGzKFmTG4hJ2GUZKqjPcQ+MAAAA
Copy that into a text file and pass it to base64 with the decode flag, and you'll get the original binary:
cat data.txt | base64 -d > data.bin
Inspect it to see what kind of file it is:
file data.bin
-> data.bin: RIFF (little-endian) data, Web/P image
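If you're curious how it can tell, here's a rough Python sketch that decodes just the first 16 characters of the base64 text above and reads the 12-byte header. This is only the basic idea of checking "magic bytes". The real `file` tool consults a much larger database of signatures.

```python
import base64
import struct

# First 16 characters of the base64 output above; they decode to 12 bytes.
header = base64.b64decode("UklGRroEAABXRUJQ")

# RIFF layout: 4-byte tag, little-endian 32-bit size, 4-byte format tag.
magic, size, kind = struct.unpack("<4sI4s", header)

print(magic, size, kind)  # b'RIFF' 1210 b'WEBP'
```

The size field counts the bytes that come after it, so the whole file should be 1210 + 8 = 1218 bytes.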
Rename it so you can just double-click it to open it:
mv data.bin data.webp
Enjoy the surprise.
You can also print files like that, scan them using OCR, and then restore them. A very inefficient way to do backups, but it works.
You can use a hex editor to view those files and even change them in some cases.
Something like this https://github.com/WerWolv/ImHex
You're looking for https://en.m.wikipedia.org/wiki/Character_encoding which explains the funny characters.
Spank you, much :D
Ace Ventura on lemmy trying to understand file encodings
Spank?