cat puzzle: How would you?

Let's say that a.html and b.html should be concatenated into a single HTML document. As you may know, the cat command creates an invalid document with an <html> tag in the middle of the document.

cat a.html b.html > c.html

How would you do this with POSIX-style commands? I don't know. I want something like head and tail, but to skip the first/last line of text. Does grep work? Assuming an HTML document has fewer than 9999 lines...

grep --before-context=9999 \</body a.html > top.html
grep --after-context=9999 \<body\> b.html > bottom.html
cat top.html bottom.html

Argh, the following tags are in the middle of the document.
<body bgcolor="#ffffff">

The --context=-1 is an invalid argument.


My answers ...

0) This is off-topic for the JNode website

1) This is not a cat puzzle. Not even a Berkeley student would expect 'cat' to be able to handle this task.

2) Any solution that uses something as simple as grep and tail is bound to fail for some input documents. It will get tripped up by stupid things like tag capitalization, embedded attributes, extra blank lines, unexpected elements, etc.

3) Define the problem better: can your input HTML documents have <head>? Doctypes? What version(s) of HTML / XHTML? Do you need to cope with malformed or pathological HTML?

4) Assuming reasonable answers to 3) I'd use Awk if I had to use a POSIX command, and Perl or something like that otherwise. (While I dislike Perl, it is well-suited to moderately complicated text bashing applications like this.) A scriptable HTML / XML editor would probably give a neater, more robust solution, but I don't know of one to recommend.

And no ... I'm not going to implement it for you :-)

P.S. Concatenating HTML files is a kind of strange thing to do, like concatenating (say) image files or word-processor files. A tool/script for concatenating HTML would have ... ummm ... limited utility, IMO.

Extending the grep command

Thank you for your thoughtful reply. Even if this sounds like a waste of time, I believe that something valuable comes from this discussion.

Here is the rest of the story. While the puzzle may be a bit trivial and contrived, it is a real problem that I ran into in my day job. All HTML documents are generated, thus very uniform. The body tag is always in all lowercase. It is always on a line by itself. Documents are generated by two different tools. And oh, by the way, the publishing tools are written in Java.

For the sake of discussion, let's say that JNode has already implemented the grep command with a command class. I do not want to change the grep command. Rather, I want to quickly create a new command like grep, but very slightly different. I expect to reuse a grep command class to create a new command that supports --context=-1. I put emphasis on quick and dirty.

Does JNode require a grep command class to be final? Is the grep command class easily extended? Is command-line syntax inherited and extended, too?

In a classic operating system, the answer to the puzzle is, "No, of course not." But what about JNode? Should this be the answer?

None of the commands are

None of the commands are declared as final, but they would not be easily extended, as almost all of them do their work in private methods and their structure is generally private. The only public method is generally execute(), and if you override that, you might as well be implementing your own command. Rewriting commands so they can be extended introduces a lot of pitfalls, and it only makes creating and maintaining them more complicated, which is the opposite of what we want.

I think that it is simply the wrong angle to approach the problem from. Using a proper scripting language is the right approach. Commands like sed and awk provide a very terse syntax for processing streams of text. If they are not enough, there is bound to be a Perl module that will do what you want.
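To give an idea of how terse this can be, here is a rough sed sketch. It is not a tested tool, and it leans entirely on the assumptions stated earlier in the thread: the documents are generated, the body tags are lowercase, and each tag sits on a line by itself. The function name html_join is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): join two generated
# HTML files into one document on standard output.
# Assumes one lowercase <body ...> and one </body> per file, each on its
# own line, and that <body ...> is never the first line of the file.
html_join() {
    # First file: delete from the closing </body> line through the end,
    # which keeps the head and the body content.
    sed '/<\/body>/,$d' "$1"
    # Second file: delete from line 1 through the opening <body ...> line,
    # which keeps the body content plus the closing </body> and </html>.
    sed '1,/<body/d' "$2"
}
```

Something like "html_join a.html b.html > c.html" would then stand in for the original cat command. Any input that breaks the assumptions (uppercase tags, a body tag sharing a line with content) will defeat it.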

This type of problem is what I would classify as a 'throw-away' conversion tool. It generally sees relatively limited use, and may never be used again. Java was not really designed for write-once-and-throw-away tools, as it generally takes longer to write a quick and dirty tool in Java than in a scripting language. Even if you don't wind up actually throwing it away, it still falls into the same scope. I could probably write a script to do what you're asking in a matter of a few minutes; I really doubt the same could be said for extending a grep command, even if we did export a viable interface that allowed this.

I expect to reuse a grep command class to create a new command that supports --context=-1
That really doesn't make any sense. I know what you're trying to say with it, but again, it is the wrong approach. Not to mention the fact that you would have to completely rewrite the ContextLineWriter inside GrepCommand, which uses a clever but fragile method of printing context with minimal buffering.

I think we should make cat

I think we should make cat able to read the user's mind and therefore do anything it very well pleases...

Seriously though, what Steve is saying is that you have to expand beyond just a couple of commands. There are hundreds, if not thousands, of tools with millions of variations. For text processing, sed/awk/perl are THE tools. cat/head/tail are 'dumb' in the sense that they do strictly as they are told and nothing more, and that will never change.
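As an illustration of what awk buys you over cat/head/tail, here is a single-pass sketch. It assumes each file has one opening and one closing body tag, each on its own line, but it lowercases each line before matching so tag capitalization does not trip it up. The function name html_join_awk and the variable names are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): join two HTML files
# in one awk pass, writing the combined document to standard output.
# Assumes the body tags each sit on a line of their own.
html_join_awk() {
    awk '
        # At the first line of each input file, note which file we are in.
        # File 1 is copied from the top; file 2 is not copied yet.
        FNR == 1 { nfile++; copying = (nfile == 1) }

        # Lowercased copy of the line, used only for tag matching.
        { line = tolower($0) }

        # File 1: stop copying at the closing body tag.
        nfile == 1 && line ~ /<\/body>/ { copying = 0 }

        # File 2: start copying on the line after the opening body tag.
        nfile == 2 && !copying && line ~ /<body/ { copying = 1; next }

        copying { print }
    ' "$1" "$2"
}
```

Usage would be along the lines of "html_join_awk a.html b.html > c.html". It is still line-oriented text bashing, so malformed or unusual HTML will break it.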

In all honesty though, I don't think I would use shell commands, POSIX or otherwise. If your HTML is valid XHTML, then any set of XML tools is going to do this much more accurately and with fewer errors than traditional commands. A Perl script with an HTML or XML module would do this properly.