How to identify encoding of a text string? [closed]

https://softwareengineering.stackexchange.com/questions/404675

07-03-2021
Question

Most of you have probably run into them: strings you can't really read. They come from your data sources, show up in your logs, or appear in the output of your legacy systems.

To derive any useful information from them, you need to decode them first. With files, it is often possible to see the header. With text strings that have no header, you need to guess.

Many have been in that situation, and Stack Overflow has an endless supply of one-shot questions on the topic.

Some are fine with that practice... but others don't want to litter SO with overly specific questions and wait for help. In the long run, it's faster to invest some skill points into identification anyway.

Let's do that.

How do you identify that a string is in Base64, JSON, Bencode, or any other data exchange format? What resources do you use: cheat sheets, online tools, something else? Are there any techniques that can be learned, apart from "just seeing it"?

Answer

@MartinGrey, I think your question is perfectly sensible, but as Bob says in a comment, the misconception lies in the idea that handling arbitrary data in arbitrary encodings, and decoding it based only on inferences from the data itself, is a common problem with a well-defined procedural solution.

There is potentially an infinite number of encodings, including pairs that differ only by small variations and generally look similar. With limited data, it may be undecidable exactly which of two encodings is in use, even if the possibilities are finite and well-defined.
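
To make the ambiguity concrete, here is a minimal Python sketch. The sample string and the helper name `plausible_decodings` are my own illustrations, not something from the question. It shows one short string that decodes without error as both hex and Base64, so the data alone cannot settle which was intended:

```python
import base64
import binascii
from typing import Dict


def plausible_decodings(text: str) -> Dict[str, bytes]:
    """Return every candidate encoding under which `text` decodes without error."""
    results = {}
    try:
        # Raises ValueError on non-hex characters or an odd number of digits.
        results["hex"] = bytes.fromhex(text)
    except ValueError:
        pass
    try:
        # validate=True rejects characters outside the standard Base64 alphabet.
        results["base64"] = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        pass
    return results


# "deadbeef" is 8 characters drawn from an alphabet shared by both encodings.
candidates = plausible_decodings("deadbeef")
for name, raw in candidates.items():
    print(name, raw)
```

Both decoders accept the input and produce different bytes. Only outside knowledge (where the string came from, what the result is supposed to look like) can break the tie.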

In terms of a human attempting to interpret the data, there are not really any cheat sheets. People will be familiar with the patterns employed in some common encodings and recognise them quickly.
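
The kind of pattern recognition described above can be partly mechanised. The following is a rough Python sketch under my own assumptions: the function name `guess_formats`, the choice of four formats, and the specific rules are illustrative, and Bencode is left out entirely. It returns hints rather than a verdict, and it can return more than one:

```python
import base64
import binascii
import json
import re
from typing import List


def guess_formats(text: str) -> List[str]:
    """Return names of formats whose surface pattern `text` matches.

    Several names may come back; treat the list as hints, not a verdict.
    """
    text = text.strip()
    guesses = []

    # JSON: only objects and arrays are considered, so bare numbers
    # or quoted strings are not reported as JSON.
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            guesses.append("json")
        except ValueError:
            pass

    # Hex: an even number of hex digits and nothing else.
    if text and len(text) % 2 == 0 and re.fullmatch(r"[0-9a-fA-F]+", text):
        guesses.append("hex")

    # Base64: standard alphabet, length a multiple of 4, optional "=" padding.
    if text and len(text) % 4 == 0:
        try:
            base64.b64decode(text, validate=True)
            guesses.append("base64")
        except (binascii.Error, ValueError):
            pass

    # Percent-encoding: at least one %XX escape somewhere in the string.
    if re.search(r"%[0-9a-fA-F]{2}", text):
        guesses.append("percent-encoded")

    return guesses


for sample in ['{"a": 1}', "SGVsbG8gd29ybGQ=", "deadbeef", "a%20b"]:
    print(repr(sample), guess_formats(sample))
```

A short all-hex string comes back tagged as both hex and Base64, which is the limit of this approach: the rules encode familiarity with a few known formats and say nothing about an encoding they were not written for.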

If a person isn't familiar with an encoding, they are moving into the realm of code-breaking, which relies on a broad range of inferences to establish the encoding and the decoding method.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange