Question

Why are flat text files the state of the art for representing source code?

Sure - the preprocessor and compiler need to see a flat file representation of the file, but that's easily created.

It seems to me that some form of XML or binary data could represent lots of ideas that are very difficult to track, otherwise.

For instance, you could embed UML diagrams right into your code. They could be generated semi-automatically, and annotated by the developers to highlight important aspects of the design. Interaction diagrams in particular. Heck, embedding any user drawing might make things more clear.

Another idea is to embed comments from code reviews right into the code.

There could be all sorts of aids to make merging multiple branches easier.

Something I'm passionate about is not just tracking code coverage, but also looking at the parts of code covered by an automated test. The hard part is keeping track of that code, even as the source is modified. For instance, moving a function from one file to another, etc. This can be done with GUIDs, but they're rather intrusive to embed right in the text file. In a rich file format, they could be automatic and unobtrusive.

So why are there no IDEs (to my knowledge, anyway) which allow you to work with code in this way?

EDIT (October 7th, 2009):

Most of you got very hung up on the word "binary" in my question. I retract it. Picture XML, very minimally marking up your code. The instant before you hand it to your normal preprocessor or compiler, you strip out all of the XML markup and pass on just the source code. In this form, you could still do all of the normal things to the file: diff, merge, edit, work with it in a simple and minimal editor, feed it into thousands of tools. Yes, the diff, merge, and edit, directly with the minimal XML markup, do get a tad more complicated. But I think the value could be enormous.
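As a minimal sketch of the strip-before-compile step described above (the markup scheme and names here are hypothetical, not any real tool's format), a few lines of Python can remove XML-carrying comment lines while leaving the ordinary source, and ordinary comments, untouched:

```python
import re

# Hypothetical scheme: XML metadata hidden inside line comments, e.g.
#   // <comment author="...">...</comment>
# Only the tag-carrying comment lines are removed; the human-readable
# comment text in between survives as a plain comment.
MARKUP_COMMENT = re.compile(r"^\s*//\s*<[^>]+>.*$\n?", re.MULTILINE)

def strip_markup(source: str) -> str:
    """Remove comment lines that carry XML markup, leaving plain code."""
    return MARKUP_COMMENT.sub("", source)

annotated = """\
int add(int a, int b) {
    // <comment author="mcruikshank" date="2009-10-07">
    // Please refactor to Delegate.
    // </comment>
    return a + b;
}
"""

print(strip_markup(annotated))  # plain C source, ready for the compiler
```

The compiler never sees the markup, so every existing toolchain keeps working.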

If an IDE existed which respected all of the XML, you could add so much more than what we can do today.

For instance, your DOxygen comments could actually look like the final DOxygen output.

When someone wanted to do a code review, like Code Collaborator, they could mark up the source code, in place.

The XML could even be hidden behind comments.

// <comment author="mcruikshank" date="2009-10-07">
// Please refactor to Delegate.
// </comment>

And then if you want to use vi or emacs, you can just skip over the comments.

If I want to use a state-of-the-art editor, I can see that in about a dozen different helpful ways.

So, that's my rough idea. It's not "building blocks" of pictures that you drag on the screen... I'm not that nuts. :)

Was it helpful?

Solution

  • you can diff them
  • you can merge them
  • anyone can edit them
  • they are simple and easy to deal with
  • they are universally accessible to thousands of tools

OTHER TIPS

In my opinion, any possible benefits are outweighed by being tied to a particular tool.

With plain-text source (that seems to be what you're discussing, rather than flat files per se) I can paste chunks into an email, use simple version control systems (very important!), write code into comments on Stack Overflow, use one of a thousand text editors on any number of platforms, etc.

With some binary representation of code, I need to use a specialized editor to view or edit it. Even if a text-based representation can be produced, you can't trivially roll back changes into the canonical version.

Smalltalk is an image-based environment. You are no longer working with code in a file on disk; you are working with, and modifying, the live objects at runtime. It is still text, but classes are not stored in human-readable files. Instead, the whole object memory (the image) is stored in a file in a binary format.

But the biggest complaint from those trying out Smalltalk is that it doesn't use files. Most of the file-based tools we have (vim, emacs, Eclipse, VS.NET, Unix tools) have to be abandoned in favor of Smalltalk's own tools. Not that the tools provided in Smalltalk are inferior; they are just different.

Why are essays written in text? Why are legal documents written in text? Why are fantasy novels written in text? Because text is the single best form - for people - of persisting their thoughts.

Text is how people think about, represent, understand, and persist concepts - and their complexities, hierarchies, and interrelationships.

Lisp programs are not flat files. They are serializations of data structures. This code-as-data idea is old, and actually one of the greatest ideas in computer science.

<?xml version="1.0" encoding="UTF-8"?><code>Flat files are easier to read.</code>

Here's why:

  • Human readable. That makes it a lot easier to spot a mistake, in both the file and the parsing method. It can also be read out loud; that's something you just cannot get with XML, and it might make a difference, especially in customer support.

  • Insurance against obsolescence. As long as regular expressions exist, it is possible to write a pretty good parser in just a few lines of code.

  • Leverage. Almost everything there is, from revision control systems to editors to filters, can inspect, merge, and operate on flat files. Merging XML can be a mess.

  • Ability to integrate them rather easily with UNIX tools, such as grep, cut or sed.
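To illustrate the "parser in a few lines" and "leverage" points above, here is a toy sketch (the C source and the pattern are made up for illustration): with nothing but the standard regex library, you can pull function names out of a flat source file.

```python
import re

# Toy illustration: a serviceable parser for a flat-text format
# fits in a few lines, with no special tooling at all.
source = """\
int add(int a, int b) { return a + b; }
int sub(int a, int b) { return a - b; }
"""

# Naive pattern for C-style function definitions (illustrative only;
# a real C parser needs far more than this).
func_def = re.compile(r"^\s*\w+\s+(\w+)\s*\([^)]*\)\s*{", re.MULTILINE)

print(func_def.findall(source))  # function names, recovered from plain text
```

The same file is equally accessible to grep, cut, sed, or any editor, which is the leverage being described.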

It's a good question. FWIW, I'd love to see a Wiki-style code management tool. Each functional unit would have its own wiki page. The build tools pull together the source code out of the wiki. There would be a "discuss" page linked to that page, where people can argue about algorithms, APIs and such like.

Heck, it wouldn't be that hard to hack one up from a pre-existing Wiki implementation. Any takers...?

Ironically there ARE programming constructs that use precisely what you describe.

For example, SQL Server Integration Services, which involve coding logic flow by dragging components into a visual design surface, are saved as XML files describing precisely that back end.

On the other hand, SSIS is pretty difficult to source-control. It is also fairly difficult to design any sort of complex logic in it: if you need a little more "control", you'll need to write VB.NET code into the component, which brings us back to where we started.

I guess that, as a coder, you should consider the fact that every solution to a problem has consequences that follow. Not everything can (and, some argue, should) be represented in UML. Not everything can be represented visually. Not everything can be simplified enough to have a consistent binary file representation.

That being said, I would posit that the disadvantages of relegating code to binary formats (most of which would also tend to be proprietary) far outweigh the advantages of having it in plain text.

IMHO, XML and binary formats would be a total mess and wouldn't give any significant benefit.

OTOH, a related idea would be to write into a database, maybe one function per record, or maybe a hierarchical structure. An IDE created around this concept could make navigating source more natural, and easier to hide anything not relevant to the code you're reading at a given moment.

People have tried for a long time to create an editing environment that goes beyond the flat file and everyone has failed to some extent. The closest I've seen was a prototype for Charles Simonyi's Intentional Programming but then that got downgraded to a visual DSL creation tool.

No matter how the code is stored or represented in memory, in the end it has to be presentable and modifiable as text (without the formatting changing on you) since that's the easiest way we know to express most of the abstract concepts that are needed for solving problems by programming.

With flat files you get this for free and any plain old text editor (with the correct character encoding support) will work.

Steve McConnell has it right, as always - you write programs for other programmers (including yourself), not for computers.

That said, Microsoft Visual Studio must internally manage the code you write in a very structured format, or you wouldn't be able to do such things as "Find All References" or rename or re-factor variables and methods so readily. I'd be interested if anyone had links to how this works.

Actually, roughly 10 years ago, Charles Simonyi's early prototype for intentional programming attempted to move beyond the flat file into a tree representation of code that can be visualized in different ways. Theoretically, a domain expert, a PM, and a software engineer could all see (and piece together) application code in ways that were useful to them, and products could be built on a hierarchy of declarative "intentions", digging down to low-level code only as needed.

ETA (per request in the questions) There's a copy of one of his early papers on this at the Microsoft research web site. Unfortunately, since Simonyi left MS to start a separate company several years ago, I don't think the prototype is still available for download. I saw some demos back when I was at Microsoft, but I'm not sure how widely his early prototype was distributed.

His company, IntentSoft is still a little quiet about what they're planning to deliver to the market, if anything, but some of the early stuff that came out of MSR was pretty interesting.

The storage model was some binary format, but I'm not sure how much of those details were disclosed during the MSR project, and I'm sure some things have changed since the early implementations.

Why do text files rule? Because of McIlroy's test. It is vital to have the output of one program be acceptable as the source code for another, and text files are the simplest thing that works.
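McIlroy's test, as invoked above, can be sketched in a few lines (the getter-generator here is a made-up example, not from any real tool): one program emits text, and that text is immediately acceptable as source to another.

```python
# McIlroy's test in miniature: one program's text output becomes
# another program's source input, with nothing but flat text between them.
def generate_accessor(field: str) -> str:
    """Emit Python source for a trivial getter (code written by code)."""
    return f"def get_{field}(obj):\n    return obj['{field}']\n"

generated = generate_accessor("name")
print(generated)            # readable, diffable, greppable flat text

namespace = {}
exec(generated, namespace)  # and directly consumable as source
print(namespace["get_name"]({"name": "flat files"}))
```

Because the intermediate form is plain text, it can also be inspected, versioned, or piped through any other tool along the way.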

Labview and Simulink are two graphical programming environments. They are both popular in their fields (interfacing to hardware from a PC, and modeling control systems, respectively), but not used much outside of those fields. I've worked with people who were big fans of both, but never got into them myself.

You mention that we should use "some form of XML"? What do you think XHTML and XAML are?

Also XML is still just a flat file.

Old habits die hard, I guess.

Until recently, there weren't many good-quality, high-performing, widely available libraries for general storage of structured data. And I would emphatically not put XML in that category, even today: too verbose, too intensive to process, too finicky.

Nowadays, my favorite choice for data that doesn't need to be human-readable is to use SQLite and make a database. It's incredibly easy to embed a full-featured SQL database into any app... there are bindings for C, Perl, Python, PHP, etc., and it's open source and really fast and reliable and lightweight.

I <3 SQLite.
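As a small sketch of how little code embedding SQLite takes (using Python's built-in sqlite3 bindings; the table and keys here are invented for the example):

```python
import sqlite3

# Embedding a full SQL database really is just a few lines.
conn = sqlite3.connect(":memory:")  # or a file path for persistent storage
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO settings VALUES (?, ?)", ("theme", "dark"))
conn.commit()

row = conn.execute(
    "SELECT value FROM settings WHERE key = ?", ("theme",)
).fetchone()
print(row[0])
conn.close()
```

The whole database lives in a single file (or in memory), with no server process to manage.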

Anyone ever tried Mathematica?

The pic above is from an old version, but it was the best Google could give me.

Anyway... compare the first equation there to Math.Integrate(1/(Math.Pow("x",3)-1), "x"), as you would have to write it if you were coding in plain text in most common languages. IMO the mathematical representation is much easier to read, and that is still a pretty small equation.

And yes, you can both input and copy-paste the code as plain text if you want.

See it as the next generation of syntax highlighting. I bet there is a lot of stuff other than math that could benefit from this kind of representation.

It's pretty obvious why plain text is king. But it is equally obvious why a structured format would be even better.

Just one example: If you rename a method, your diff/merge/source control tool would be able to tell that only one thing had changed. The tools we use today would show a long list of changes, one for every place and file that the method was called or declared.
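The rename scenario above can be sketched with Python's built-in ast module (the function names are invented for the example): a structure-aware tool sees the change as a single identifier swap, where a line-based diff would flag every call site.

```python
import ast

# Two versions of a program that differ only in a function's name.
before = "def total(xs):\n    return sum(xs)\n\nprint(total([1, 2]))\n"
after  = "def grand_total(xs):\n    return sum(xs)\n\nprint(grand_total([1, 2]))\n"

def names(tree):
    """Collect identifiers: referenced names plus defined function names."""
    return ({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
            | {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)})

a, b = names(ast.parse(before)), names(ast.parse(after))
# The symbol sets differ in exactly one identifier pair: the rename.
print(a ^ b)
```

A real structure-aware diff would of course track scopes and positions, but the principle is the same: operate on the tree, not on the lines.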

(By the way, this post doesn't answer the question as you might have noticed)

The trend we are seeing with DSLs is the first thing that comes to mind when reading your question. The problem has been that there is no 1-to-1 relationship between models (like UML) and an implementation. Microsoft, among others, is working on getting there, so that you can create your app as something UML-like, and then code can be generated. And the important thing: when you change your code, the model will reflect it again.

Windows Workflow Foundation is a pretty good example. Of course there are flat files and/or XML in the background, but you usually end up defining your business logic in the orchestration tool. And that is pretty cool!

We need more of the "software factories" thinking, and we will see a richer IDE experience in the future, but as long as computers run on zeroes and ones, flat text files can and (probably) will always be an intermediate stage. As stated by several people already, simple text files are very flexible.

I've wistfully wondered the same thing, as described in the answer to: What tool/application/whatever do you wish existed?

While it's easy to imagine a great number of benefits I think the biggest hurdle that would have to be addressed is that no-one has produced a viable alternative.

When people think of alternatives to storing source as text, they seem to often immediately think in terms of graphical representations (I'm referring here to the commercial products that have been available, e.g. HP VEE). And if we look at the experience of people like the FPGA designers, we see that programming (exclusively) graphically just doesn't work; hence languages like Verilog and VHDL.

But I don't see that the storage of source necessarily needs to be bound to the method of writing it in the first place. Entry of source can largely be done as text, which means that copying and pasting still work. But I also see that by allowing merges and rollbacks to be done on the basis of tokenised meta-source, we could achieve more accurate and more powerful manipulation tools.

Visual FoxPro uses dbf table structures to store code and metadata for forms, reports, class libs, etc. These are binary files. It also stores code in prg files that are actual text files...

The only advantage I see is being able to use the built-in VFP data language to do code searches on those files... other than that, it is a liability, IMO. At least once every few months, one of these files will become corrupted for no apparent reason. Integration with source control and diffs is very painful as well. There are workarounds, but they involve converting the file to text temporarily!

For an example of a language that does away with traditional text programming, see the Lava Language.

Another nifty thing I just recently discovered is subtext2 (video demo).

The code of your program defines the structure that would otherwise be created with XML or a binary format. Your programming language is a more direct representation of your program's structure than an XML or binary representation would be. Have you ever noticed how Word misbehaves as you give structure to your document? WordPerfect would at least "reveal codes" to let you see what lay beneath your document. Flat files do the same thing for your program.

Neat ideas. I have myself wondered, on a smaller scale... much smaller, why can't IDE X generate this or that?

I don't know if I am yet capable, as a programmer, of developing something as cool and complex as what you're talking about, or what I am thinking about, but I would be interested in trying.

Maybe start out with some plugins for .NET, Eclipse, Netbeans, and so on? Show off what can be done, and start a new trend in coding.

I think another aspect of this is that the code is what is important. It is what is going to be executed. For example, in your UML case, I would think that having UML (presumably created in some editor, not directly related to the "code") included in your "source blob" would be almost useless. Much better would be to have the UML generated directly from your code, so that it describes the exact state the code is in, as a tool for understanding the code, rather than as a reminder of what the code should have been.

We've been doing this for years with automated doc tools. While the actual programmer-generated comments in the code might get out of sync with it, tools like JavaDoc and the like faithfully represent the methods on an object, return types, arguments, etc. They represent them as they actually exist, not as some artifact that came out of endless design meetings.

It seems to me that if you could arbitrarily add random artifacts to some "source blob", they would likely be out of date and less than useful right away. If you can generate such artifacts directly from the code, then the small effort of getting your build process to do so is vastly better than the previously mentioned pitfalls of moving away from plain-text source files.

Relatedly, an explanation of why you'd want to use a plain-text UML tool (UMLGraph) seems to apply nearly as well to why you'd want plain-text source files.

This might not answer your question exactly, but here is an editor that allows a higher-level view of code: http://webpages.charter.net/edreamleo/front.html

I think the reason text files are used in development is that they are universal across development tools. You can look inside one, or even fix some errors, using a simple text editor (you can't do that with a binary file, because you never know how any fix might destroy other data). That doesn't mean, however, that text files are best for all of those purposes.

Of course, you can diff and merge them. But that doesn't mean the diff/merge tool understands the distinct structure of the data encoded by the text file. You can do the diff/merge, but (as is especially visible with XML files) the diff tool won't show you the differences correctly; that is, it will show you where the files differ and which parts of the data the tool "thinks" are the same. It will not show you the differences in the structure of the XML file; it will just match lines that look the same.

Regardless of whether we're using binary files or text files, it's always better if the diff/merge tools deal with the data structure the file represents rather than with lines and characters. For C++ or Java files, for example, they should report that some identifier changed its name, or that some section was surrounded with an additional if(){}, while ignoring changes in indentation or EOL characters. The best approach would be for a file to be read into internal structures and dumped back out using specific formatting rules. That way, diffing would be done on the internal structures, and the merge result would be generated from the merged internal structure.
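The "read into internal structures" idea above can be sketched with Python's built-in ast module: two sources that differ only in layout normalize to the same internal tree, so a structure-aware comparison reports no change at all.

```python
import ast

# Two versions of the same function that differ only in line breaks
# and indentation; a line-based diff would flag every line.
v1 = "def f(a, b):\n    return a + b\n"
v2 = "def f(a,\n      b):\n        return a + b\n"  # reflowed, re-indented

# Parse both into internal structures and compare the structures,
# not the characters. ast.dump omits positions by default.
same = ast.dump(ast.parse(v1)) == ast.dump(ast.parse(v2))
print(same)
```

Dumping the tree back out through a pretty-printer would likewise give a canonical text form, which is the round-trip the answer describes.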

Modern programs are composed of flat pieces, but are they really flat? There are usings, and includes, and libraries of objects, and so on. An ordinary function call is a peek into a different place. And the logic isn't flat either, given multiple threads, etc.

I have the same vision! I really wish this existed.

You might want to take a look at Fortress, a research language by Sun. It has special support for formulas in source code. The quote below is from Wikipedia

Fortress is being designed from the outset to have multiple syntactic stylesheets. Source code can be rendered as ASCII text, in Unicode, or as a prettied image. This will allow for support of mathematical symbols and other symbols in the rendered output for easier reading.

The major reason for the persistence of text as source is the lack of power tools, e.g. version control, for non-text data. This is based on my experience working with Smalltalk, where plain bytecode is kept in a core dump all the time. In a non-text system, with today's tools, team development is a nightmare.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow