I'm writing a macro that converts Word documents to HTML cleanly. I would like to save a Word file in plain text format and give it the same base name as the original, but the file extension .shtml. For example, if the original is named v59_no5_carliner.doc, I want to save it as a text file named v59_no5_carliner.shtml.
The following macro selects the contents of the active document, copies it, creates a new document, and pastes in the contents of the original document, then it saves the document as plain text. However, I can't figure out how to tell Word to save the document with the .shtml extension. Any suggestions?
Sub SaveAsSHTML() ActiveDocument.Select Selection.Copy Documents.Add Selection.Paste ActiveDocument.SaveAs FileFormat:=wdFormatText End Sub
Also, do you think it would be better to just save the original with the same base name but a different extension, instead of creating a new file and copying in the contents of the original?
Below is the code which I posted. Let me explain it bit by bit.
The first line isn't actually a piece of executable code but is what is known as a Compiler Directive. This Option Explicit tells the compiler that all variables must be declared. This stops code doing all the wrong things if a variable is misnamed in the code. When I used to interview people who wanted to work at the last place I would always ask "when do you not use 'Option Explicit'?". I just wanted to see what people would say -- as far as I am concerned the answer is never. So, always make sure that this line is at the top of every code module in every piece of VB, VBA or VBS that you write. There should be no exceptions.
Then we have two routines; the first one, SaveAsSHTML() is basically what you wrote and the other one, GetAlternativeFileName(), is a function which I wrote to expand the code.
The first thing that I do in the SaveAsSHTML() routine is to declare two variables; these will hold the names of the current document name and also of the target file name (the file which you are going to create).
To get the current file name all I did was to call the property, .FullName, of the ActiveDocument object. The ActiveDocument is actually a pointer to one of the current open document. I am not sure if you are aware of objects and pointers at this stage; I am thinking of doing something on my site in the near future so that people can get to grips with the idea. Anyway, consider there is a thing called a pointer which is the software equivalent of a small finger signpost which points to the location of something. For example there is a signpost at Land's End in England which points to the location of John O'Groats (the opposite end of this green and pleasant land).
A pointer is much the same; it is something which points to the location of something else. So if one knows where the pointer is then you can be redirected to the object in question. In Word VBA there is always a pointer called ActiveDocument. It always points to the currently active document. If there is no active document then the pointer points to Nothing.
So, in our code we have to get hold of the file name of the active document. Now, I am adding some extra code in here. What if, for example, there are no documents open? Then this code will fail horribly. So, I am now adding the following line:
If Documents.Count > 0 then
Right, this ought to make sense straight off. All I am saying here is that if there is at least one document open then carry on with the code. If there are no documents open then there is nothing to copy, paste and save. Does this make sense?
I am doing this by looking at the collection of Documents. You are happy about what a Document is, I assume. A Document is each instance of a Word document open. Now, in VB and VBA we have things called collections. A collection is a group of like-minded things; for example every Document which is open is in the Documents collection.
Note the usage of the singular and plural. A singular instance belongs to a plural collection. So there could be zero, one or more open documents in a Word application at any one time. Each of these Document instances belongs to the Documents collection.
So, in this code I am asking Word to look into the collection of Documents and to tell me how many Document instances there are. If there are more than zero (i.e. something is open) then we carry on.
In Word VBA there are all sorts of Collections. Hundreds. For example I posted some code to the Word Tips newsletter recently (which was printed, he says smugly) because someone wanted to remove all the Bookmarks in a Document. You may start to work out how I did this. First of all I opened the ActiveDocument object and then looked at the Bookmarks collection. As you may guess each Bookmark in a document is in the Bookmarks collection. All I did was to rattle through the code and delete each Bookmark. I didn't have to worry about finding the next Bookmark on the printed page and then mess on with selecting it and then deleting; I just ran through the collection and removed them.
What other collections are there? Well, each Section, for example, is in the Sections collection. Each Paragraph is in the Paragraphs collection, there are Tables collections containing all the Tables, CommandBars for each CommandBar (toolbars to you and I), there are Dictionaries, Styles, Shapes; in fact almost anything you can put into a Word document seems to fit into a collections somewhere. This includes the Word environment as well.
If we go to Excel there are Workbooks, in Access there are Tables and Queries, in PowerPoint there are Presentations and PointyHairedBosses (joking!), in Outlook and Exchange there are Folders. There are over 200 applications with VBA which means that they expose this sort of Object Model. So after you get to grips with VBA in one application then there isn't a huge Quantum Leap to messing with VBA in another application. I am starting to digress...
Right, having satisfied myself that there is indeed one more open documents and, thus, an ActiveDocument object I then get the .FullName of the ActiveDocument. The .FullName property (which is read only) tells us the full name, including the path, of the document. Obvious really.
I saved this into a variable called sOldFileName which is defined as a string. A string really means a string of characters rather than a variable containing a numeric value, for example. Having got the name of the current document we go into the code which you wrote; you select the current document, pop it into the clipboard, create a new Document and then paste the contents into the new one.
Note that the Documents.Add actually extends the Documents collection by one Document; in other words this is the program equivalent of clicking on File then New.
Now, we know the name of the old ActiveDocument (the process of creating a new Document means that the ActiveDocument points to the newly created Document and no long the original) and so I have written this routine to take the name, which is held in sOldFileName, and then work with this information and then return the new file name.
This is done in the function, GetAlternativeFileName().
In most languages a sub-routine is either a procedure or a function. Both go off and do things but a function differs in that it returns a value. For example, if I wanted to compare something with the real world; I could instruct my son to go and chop the firewood. He would go and to it and then come back and settle down on the sofa next to me. Or I could tell him to go and chop the firewood and then come back and tell me how many hundredweights of wood he has chopped as he collapses next to me on the sofa.
In both of these cases the job is done but the function returns a value. In this case it could be how much wood he has cut, where he has put the stacked and bundled wood (usually a pointer to the new heap of wood) or how many life-threatening injuries he has sustained (the lad is only eight). So, you can see here that some tasks require feedback and some don't.
Now, in normal programming speak the sub routines which don't return anything are called procedures and those which return something and these are called functions. Some languages, such as C, only have functions and rather than have a procedure which just have a function which returns nothing (or in the C language, void). However, in the land of the giant, Microsoft, things aren't named as well as they ought to. Functions are still functions but the procedures are called Subroutines, or Sub.
This means that old farts like myself have to worry about getting the correct name across when talking to more recent programmers.
So, we have this function which I have called GetAlternativeFileName(). From the declaration we can see the following:
Function GetAlternativeFileName(sDocumentFilePath As String, sSuffix As String) As String
The first word is Function. This tells the compiler that this is will go off and do something and then return some sort of String. This is the 'As String' part at the end of the declaration. So, in the calling part of the code we call this routine by the following:
sNewFileName = GetAlternativeFileName(sOldFileName, ".shtml")
So this says that we are going to call the function GetAlternativeFileName and this returns a string and then this string is going to to be passed into the sNewFileName string variable. By the way, you can tell it's a variable which holds string types as I have prefixed the name with the 's' to remind myself and whomever follows me that this is a string variable.
Now, there are two parameters which are also passed to this function. They are both defined as strings, sDocumentFilePath and sSuffix. What I am saying is that this function will take these two parameters and then process this information and then spit out (return) the final result.
Now, if you consider the problem we have. We have a file name and all we want to do is to replace the suffix with the one which is passed to the function.
Could I have done this without parameters? Yes, I could have but it would have meant using Global variables which is not a good idea. There isn't scope here to go into this discussion. But by using parameters the routine is effectively self contained. This means that if I want to use this code elsewhere then all I have to do is to copy this routine and then drop into in a new VB project and it will work providing that when it is called it has two parameters.
Now, into the code of the function I Trim$() the filename. I always do this so that I can be more confident with having 'clean' data. Depends on my level of paranoia I could be checking for non-printable characters, such as carriage-return and line-feeds or even tabs. Here we are not. So, having removed the blank spaces from each end of the filename (not that I was expecting any -- but this is Defensive Programming -- expect the worst and you won't be too far wrong) and then work with the cleaned filename.
Now, the next thing that I do is to see if there is a filename to work with. This is why I use the Len() function. If there is no filename to manipulate then there is no point doing much more, is there? Again, Defensive Programming. Once we have established that there is a filename to mess with then I use the InstrRev() function.
The InstrRev() function is similar to the more commonly used Instr() function. The Instr() function returns the position number of the first instance of a substring. With the InstrRev() function we are looking for the last instance of a substring. The substring to search for? Why, the simple '.'; which in Windows is the filename suffix delimiter. Note that I am searching from the end of the string forwards. This is because in recent years Windows does allow us to have more than one period in a file name. For example when I am posting a zip file containing version 6.2, say, of an application then the filename could well be 6.2.zip . So this is the reason why I am searching from the right hand end of the string.
The value returned by this function is a Long (i.e. a numeric) which tells us which position this substring was located. If there was one which was located, that is. If there is, and this is what we expect, then what we do is to shave off the characters to the Left of this position and then place them into a string and then append this string with the required suffix. We use the Left$() function to do this. The Left$() function has two parameters; the first is the string which is wanted to be chopped up and then the how many characters are to be used. Rather straight forward, really.
Then we have the new name in the sBuffer variable (I tend to have my string work variables called sBuffer, again this old fart has old habits). Then the last line of the function is:
GetAlternativeFileName = sBuffer
Which means that we are 'passing the value back' to the calling procedure.
So, we are back in the SaveAsSHTML() routine and now we know what we are going to call the new filename. Then it's quite simply a case of saving the new document with this name then there we are. All done.
Now, would I do it this way if I had to write this code? No. The biggest reason is that I am reluctant to use the clipboard in this manner. What happens if the user has something in the clipboard at the time? Well, we've done and zapped it for him. I were he then I would be less than chuffed. Remember the clipboard should only be used by the user; if we are going to program then let's do the job right.
But, why use the clipboard and, in fact, why create a new document? We have the document open in front of us so why not just do the following?
Sub SaveAsSHTML() Dim sNewFileName As String Dim sOldFileName As String If Documents.Count > 0 then sOldFileName = ActiveDocument.FullName sNewFileName = GetAlternativeFileName(sOldFileName, ".shtml") ActiveDocument.SaveAs FileName:=sNewFileName, FileFormat:=wdFormatText End If End Sub
Wouldn't this be better? The difference is that we are doing a .SaveAs which means that a new filename is expected. A .Save call would just overwrite the existing file.
Option Explicit Sub SaveAsSHTML() Dim sNewFileName As String Dim sOldFileName As String If Documents.Count > 0 then sOldFileName = ActiveDocument.FullName ActiveDocument.Select Selection.Copy Documents.Add Selection.Paste sNewFileName = GetAlternativeFileName(sOldFileName, ".shtml") ActiveDocument.SaveAs FileName:=sNewFileName, FileFormat:=wdFormatText End If End Sub Function GetAlternativeFileName(sDocumentFilePath As String, sSuffix As String) As String Dim sBuffer As String Dim nPosn As Long sBuffer = Trim$(sDocumentFilePath) If Len(sBuffer) > 0 Then nPosn = InStrRev(sBuffer, ".", , vbTextCompare) If nPosn > 0 Then sBuffer = Left$(sBuffer, nPosn - 1) & sSuffix End If End If GetAlternativeFileName = sBuffer End Function
If there are any suggestions for updates then please drop me a mail at firstname.lastname@example.org.