
This was a big thing long before PCs existed but you don’t see it so much nowadays. Say we have a lot of static data, perhaps text strings which can vary in length from a few bytes to hundreds of bytes long. What is an efficient method to store them, i.e. in both space and time?
The answer is ISAM (Indexed Sequential Access Method) which sounds more complicated than it is. We will use two binary files- one is the data file and one is the index file. The data file holds the raw text strings. We’ll write the raw strings using the C file write. At the same time as write them, we also build up an array of structs. This struct will look something like this:
typedef struct {
int id;
unsigned long offset;
unsigned long length;
} indexRec;
The id could be a char * string. It’s just a way to search for and find the data.
As each text string is written to the data file we populate a struct with the file offset and length. Eventually after all strings are written you can write the arrays of structs to the index file.
Then anytime you wish to read text strings back, load the structs into an array in ram and scan for the matching id then call fgetpos with offset (this moves the file pointer to the specified byte offset) and read in the length number of bytes from the data file.
An even simpler approach
If you just use an index number then you can scrap the id field and instead store the index file as a collection of offset and length fields. To read string #100 just do a fgetpos to byte at 100 * 8 (four for the size of offset plus four for the size of length) in the index file, read the eight bytes for offset and length then do a fgetpos in the data file to the offset and read in length bytes as before. Very efficient and very fast. Just two reads in two files and one of those is only eight bytes.
If you wish to change the text strings in the data file, use the same method to retrieve them. If the new string is smaller then just write it and change the length field of the index record and rewrite it. If the new string is longer than the old, write it on to the end of the data file and change the index offset and length fields and rewrite them in the index file.
Both of these edit methods (shorter or longer string edits) will ‘lose’ bytes in the data file. When replacing with a shorter string and adjusting the length, the extra bytes that were in the previous string are no longer referenced. With a longer edit the entire previous string is no longer referenced.
Reclaim lost bytes
However it’s very easy to compact the data file and reclaim these lost bytes. Just process the index file and read each string and write them on the end of a new initially-empty file. You also change each index file record to have the new offset and length of each string in the new data file. Once it’s rewritten just delete or backup the old data file and rename the new data file to be the one used from now on. Depending on how often the data file is edited, you will need to “compact it” every few days or weeks. But compacting is a pretty quick process, even for mega byte or larger sized files.
I used this technique albeit in Turbo Pascal not C to store game data for postal games I wrote back in the late 1980s. Games that are still run today but on the internet, not by post. For example in a map location there can be several parties of adventurers. The file record for that location has a file pointer to the start of a chain of party file ptr records. Each index file of adventurer parties has a binary file offset to the party in a binary file and a file ptr (or -1 if the end) of the next party’s file pointer. Like a pointer linked list but using file pointers instead of real pointers. Similarly the map location has a file pointer to a dungeon if one exists in this location (or -1 if it doesn’t).
This technique kept the map file fairly small but there were lots of binary file pairs for parties, dungeon, characters (in parties), and towns and shops in towns. Who needed a database? I didn’t… (That’s cos databases didn’t really exist back in the day. Had they existed I might have used them… maybe).
The minus one problem
In the struct definition, you’ll notice that I use unsigned longs for file pointers and length. There is no -1 for unsigned longs. It’s 4294967295. Using this as an end of chain pointer is ok because it is never going to be used as an offset. If I had used signed numbers then I could have used -1 but remember when I wrote this, it ran on a 16 bit computer so I used 65535 instead of -1. I could easily get binary files 40 Kb or 50 KB in size, so a signed number would have overflowed after 32767 bytes.