[OT] Resource file

Hi All,

I’m designing a resource file format to hold game data (scripts, images,
music, sound effects etc.) and would like some input from the community.

First, why another resource file? Because I want two things other
resource files don’t offer (at least that I’m aware of):

A. Fast access through mmap (the file can be mmapped to the process
address space and used with very little setup)
B. Spare space where I could insert the digital signature of the file.
This space must be filled with zeroes while computing the hash for the
signature.
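
A minimal sketch of requirement B, assuming the signature lives in a fixed byte range of the file. The function name and the use of FNV-1a are illustrative assumptions only; a real signature would be computed over a cryptographic hash such as SHA-256. The point is just that the bytes in the signature slot are read as zeroes while hashing, so writing the signature into the slot later does not change the hash.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a step, used here only as a small stand-in for a real
   cryptographic hash. */
static uint64_t fnv1a_step(uint64_t h, uint8_t byte) {
    return (h ^ byte) * UINT64_C(0x100000001b3);
}

/* Hashes data[0..size) as if every byte in [sig_off, sig_off + sig_len)
   were zero, so the stored signature never affects its own hash. */
uint64_t hash_skipping_signature(const uint8_t *data, size_t size,
                                 size_t sig_off, size_t sig_len) {
    uint64_t h = UINT64_C(0xcbf29ce484222325);
    for (size_t i = 0; i < size; i++) {
        int in_sig = i >= sig_off && i - sig_off < sig_len;
        h = fnv1a_step(h, in_sig ? 0 : data[i]);
    }
    return h;
}
```

The signing tool and the verifier would both call the same function, so neither has to zero the slot in place.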

The preliminary version is working quite well. Data is accessed by name
(char *), with a reader that supports basic data input operations being
returned. Since the file is mmaped, the location of a given chunk of
data within the file is quickly found via a bsearch call. Data can be
stored without compression (good for mp3, ogg, jpeg…) or bzipped. The
resource file can even be part of another file, the most common use
being to append the resource file to an executable.
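
The lookup by name could look roughly like this. It is a sketch only: the `dir_entry_t` layout and `dir_find` are hypothetical, since the actual on-disk directory is not shown in this post. It assumes the entries are stored sorted by name in `strcmp` order, which is what makes `bsearch` usable.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical directory entry; the real layout is not given here. */
typedef struct {
    const char *name;  /* entry name, the sort key */
    uint64_t offset;   /* where the entry's data starts */
    uint64_t size;     /* stored size in bytes */
} dir_entry_t;

/* bsearch comparator: the key is a plain C string, the element an entry. */
static int compare_name(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const dir_entry_t *)elem)->name);
}

/* Returns the entry called name, or NULL if there is none.
   entries must be sorted by name in strcmp order. */
const dir_entry_t *dir_find(const dir_entry_t *entries, size_t count,
                            const char *name) {
    return bsearch(name, entries, count, sizeof *entries, compare_name);
}
```

With the directory kept together at the start of the file, a lookup touches only the pages that hold the directory.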

There is also support for transparently using a directory instead of a
real resource file just like zziplib, and reading entries via SDL_RWops.

So my questions are:

  1. Is there anything bad about mmapping a resource file? The file size
    can easily be greater than 4 GiB.
  2. Are there other resource file formats that provide A and B above?
  3. Are there other important characteristics for a resource file format
    I’m missing?
  4. Do you have requirements for a new resource file?

The exposed interface so far is:

--------------------8<--------------------
/* Seek modes. */
#define AF_SEEK_SET 0
#define AF_SEEK_CUR 1
#define AF_SEEK_END 2

/* The opaque file type. */
struct af_file_t;
typedef struct af_file_t af_file_t;

/* The opaque reader type. */
struct af_reader_t;
typedef struct af_reader_t af_reader_t;

/* The callback type used to return the address of needed functions,
e.g. BZ2_bzDecompress and friends. This allows the application
to statically link support libraries such as bzip2 or dynamically load
them from anywhere in the file system. */
typedef void *(*af_dynamic_link_t)(const char *library_name,
                                   const char *function_name);

/* Information filled by af_entry_get_info and af_iterate. */
typedef struct {
    const char *name;
    uint8_t compression[4];
    uint32_t compressed_size;
    uint32_t uncompressed_size;
    const void *data;
} af_entry_info_t;

/* Sets the callback for the load of support libraries. */
void af_set_dynamic_link_callback(af_dynamic_link_t call_back);

/* Opens the resource file, possibly with an offset from the
beginning of the named file. Size defaults to the number
of bytes left in the file starting at offset, but can be specified
if the resource file is embedded in the middle of some
other file. */
af_file_t *af_open(const char *name, uint32_t offset, uint32_t size);

/* Closes the resource file, freeing all allocated resources. */
void af_close(af_file_t *file);

/* Calls the process callback for all entries in the resource file. */
int af_iterate(af_file_t *file, int (*process)(af_entry_info_t *info,
void *udata), void *udata);

/* Opens an entry for reading. */
af_reader_t *af_entry_open(af_file_t *file, const char *name);

/* Reads size bytes into buffer. */
int af_entry_read(af_reader_t *reader, void *buffer, int size);

/* Changes the location of the next read operation. */
int af_entry_seek(af_reader_t *reader, int offset, int whence);

/* Closes the reader, freeing all allocated resources. */
int af_entry_close(af_reader_t *reader);

/* Returns the current position within the entry. */
int af_entry_tell(af_reader_t *reader);

/* Returns information about the named entry. */
int af_entry_get_info(af_file_t *file, af_entry_info_t *info,
                      const char *name);
--------------------8<--------------------

Cheers,

Andre

  1. Is there anything bad about mmapping a resource file? The file size
    can easily be greater than 4 GiB.

Pages that are mmapped from the file can’t be used for other memory.
Therefore, on most 32-bit platforms, an app using a resource file
that’s near 2 GiB would have serious memory starvation problems. If you
selectively mmap only relevant portions instead, that’s not a problem,
I think; and mmap generally provides a significant performance
improvement in situations like this.

  4. Do you have requirements for a new resource file?

I probably won’t use this (none of my current projects require
anything like it), but were I to need it, portability would be a
primary concern. I live on a distressing variety of CPU architectures
and operating systems, and too often is a nice-looking library ruined
because it won’t work on all of them. (ONE of my computers is x86.)
-:sigma.SB

P.S. Cryptographic hashes for the resource file, complete with
provisions for a PKI, would be pretty sweet.

On 2/19/08, Andre de Leiradella wrote:

Hi All,

I’m designing a resource file format to hold game data (scripts, images,
musics, sound effects etc.) and would like some input from the community.

First, why another resource file? Because I want two things other
resource files don’t offer (at least that I’m aware of):

A. Fast access through mmap (the file can be mmapped to the process
address space and used with very little setup)

Well, mmap is a bit faster than reading/writing a file. I’m rather fond of
it myself. Not to mention that having the data in a binary format is very
nice. Do remember that addresses may not stay meaningful when mmaped into
memory. The worst case is when they do stay meaningful on your machine, and
on some other machines, but are not meaningful on every machine. This can
lead you to putting addresses into the file and then having to redo
everything so that you store offsets.
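
A small sketch of the offset approach, under the assumption that the file stores plain byte offsets relative to the start of the resource data. `resolve_offset` is a hypothetical helper, not part of the interface posted in this thread. The mapping base can differ on every run, so the pointer is computed only at access time.

```c
#include <stddef.h>
#include <stdint.h>

/* Turns an offset stored in the file into a pointer that is valid for
   this particular mapping. Returns NULL when [offset, offset + length)
   does not fit inside the size bytes mapped at base. */
const void *resolve_offset(const void *base, uint64_t size,
                           uint64_t offset, uint64_t length) {
    if (offset > size || length > size - offset)
        return NULL;
    return (const uint8_t *)base + offset;
}
```

The bounds check is written as `length > size - offset` so that it cannot overflow even for hostile offsets.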

B. Spare space where I could insert the digital signature of the file.
This space must be filled with zeroes while computing the hash for the
signature.

That isn’t really a big deal, you can always create a zeroed out item in any
resource file that can be used to store the signature after it is computed.
Even better, you can just put the signature in a wrapper so that you don’t
actually change the signature of the file by adding the signature to the
file. A simple header consisting of the length of the signature followed by
the signature can be prepended to the file to sign it. Or, even simpler, you
can just append the signature to the file and never care about the format.

The preliminary version is working quite well. Data is accessed by name
(char *), with a reader that supports basic data input operations being
returned. Since the file is mmaped, the location of a given chunk of
data within the file is quickly found via a bsearch call. Data can be
stored without compression (good for mp3, ogg, jpeg…) or bzipped. The
resource file can even be part of another file, the most common use
being to append the resource file to an executable.

There is also support for transparently using a directory instead of a
real resource file just like zziplib, and reading entries via SDL_RWops.

So my questions are:

  1. Is there any thing bad about mmapping a resource file? The file size
    can easily be greater than 4 GiB.

Well, by allowing your file to be so large you restrict yourself to 64 bit
architectures. In general 32 bit machines can not address more than 4 gigs
of process space. In reality they can rarely address more than 2 gigs of
process space. If you don’t care about 32 bit machines then there is really
nothing wrong with what you are doing. Just remember that your addresses and
offsets need to be 64 bits.

  2. Are there other resource file formats that provide A and B above?

Well, your “A” requirement is a requirement of the implementation of the
access library and has absolutely nothing to do with the file format. So
basically all and/or no file format gives you that. You could take any
existing file format and create a library for accessing it that uses mmap.
If you look deep down in the file code for your favorite compiler you might
find that it already uses mmap to implement read and write in which case all
libraries have this feature.

And your “B” requirement can be met by simply appending the signature to an
existing file or by writing it to a different file, so pretty much all other
file formats meet this requirement.

  3. Are there other important characteristics for a resource file format
    I’m missing?

That it is available right now and you don’t have to write it from scratch?

  4. Do you have requirements for a new resource file?

No.

The exposed interface so far is:

Hmmm, you aren’t being consistent about the use of uint32_t and int. The
interface as written may blow up if you try to do arithmetic on offsets or
lengths because you are mixing unsigned and signed integers for lengths and
offsets. It depends on whether the “int” variables are 32 bits or 64 bits.
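
One way this can go wrong, sketched with two hypothetical helpers that clamp a requested read length to the bytes left in an entry. The sketch assumes `int` is 32 bits wide, which is the common case. In that situation C converts the signed operand to unsigned before comparing, so a negative request compares as a huge value instead of as an error; the casts in `clamp_naive` only spell out what the compiler would do implicitly.

```c
#include <stdint.h>

/* What `want < left ? want : left` does when int is 32 bits wide: the
   usual arithmetic conversions turn want into an unsigned value first.
   The casts below only make that implicit conversion visible. */
uint32_t clamp_naive(int want, uint32_t left) {
    return (uint32_t)want < left ? (uint32_t)want : left;
}

/* Same clamp, with the signed value checked before any conversion. */
uint32_t clamp_checked(int64_t want, uint32_t left) {
    if (want <= 0)
        return 0;
    return want < (int64_t)left ? (uint32_t)want : left;
}
```

With 100 bytes left, `clamp_naive(-1, 100)` yields 100, so a bogus negative length turns into "read everything", while `clamp_checked(-1, 100)` yields 0.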

Above you mentioned that the file size can easily be bigger than 4 gigs but
you seem to only have 32 bits of offset in this format which restricts you
to <= 4 gigs. You need to change this interface to use 64 bit offsets.

Bob Pendleton

SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org

  1. Is there any thing bad about mmapping a resource file? The file size
    can easily be greater than 4 GiB.

Pages that are mmapped from the file can’t be used for other memory.
Therefore, on most 32-bit platforms, an app using a resource file
that’s near 2GiB would have serious memory starvation problems. If you
selectively mmap only relevant portions instead, that’s not a problem,
I think; and mmap generally provides a significant performance
improvement in situations like this.

But only pages actually accessed by the application are actually brought
to RAM by the OS, right? Moreover, the OS will swap out pages to disk if
the application needs physical RAM, won’t it?

  4. Do you have requirements for a new resource file?

I probably won’t use this (none of my current projects require
anything like it), but were I to need it, portability would be a
primary concern. I live on a distressing variety of CPU architectures
and operating systems, and too often is a nice-looking library ruined
because it won’t work on all of them. (ONE of my computers is x86.)
-:sigma.SB

I can test only on Linux (x86 and PPC) and Windows, and I’ll make sure
it works on those platforms.

Thanks for the input, I’ll do more research regarding mmapping files.

Cheers,

Andre

Well, mmap is a bit faster than reading/writing a file. I’m rather fond of
it myself. Not to mention that having the data in a binary format is very
nice. Do remember that addresses may not stay meaningful when mmaped into
memory. The worst case is when they do stay meaningful on your machine, and
on some other machines, but are not meaningful on every machine. This can
lead you to putting addresses into the file and then having to redo
everything so that you store offsets.

I like it a lot too. Implementing decompressing readers with mmapped
files is so much easier than reading chunks of data when the input
buffer of the decompressor is empty.

I’m not storing pointers in the file, not least because the file can be
mapped to different addresses each time, even on the same machine. All I
have are offsets, and they’re all in the file header; a copy-on-write
mmap helps when converting them to platform-dependent pointers. For each
offset I have 16 bytes of extra space in the file to accommodate the
corresponding pointer.

B. Spare space where I could insert the digital signature of the file.
This space must be filled with zeroes while computing the hash for the
signature.

That isn’t really a big deal, you can always create a zeroed out item in any
resource file that can be used to store the signature after it is computed.
Even better, you can just put the signature in a wrapper so that you don’t
actually change the signature of the file by adding the signature to the
file. A simple header consisting of the length of the signature followed by
the signature can be prepended to the file to sign it. Or, even simpler, you
can just append the signature to the file and never care about the format.

Yeah, you’re right.

The preliminary version is working quite well. Data is accessed by name
(char *), with a reader that supports basic data input operations being
returned. Since the file is mmaped, the location of a given chunk of
data within the file is quickly found via a bsearch call. Data can be
stored without compression (good for mp3, ogg, jpeg…) or bzipped. The
resource file can even be part of another file, the most common use
being to append the resource file to an executable.

There is also support for transparently using a directory instead of a
real resource file just like zziplib, and reading entries via SDL_RWops.

So my questions are:

  1. Is there any thing bad about mmapping a resource file? The file size
    can easily be greater than 4 GiB.

Well, by allowing your file to be so large you restrict yourself to 64 bit
architectures. In general 32 bit machines can not address more than 4 gigs
of process space. In reality they can rarely address more than 2 gigs of
process space. If you don’t care about 32 bit machines then there is really
nothing wrong with what you are doing. Just remember that your addresses and
offsets need to be 64 bits.

Sure, they are. Do you know of any gotchas when mmapping files bigger
than 2 GiB into an application’s address space?

  2. Are there other resource file formats that provide A and B above?

Well, your “A” requirement is a requirement of the implementation of the
access library and has absolutely nothing to do with the file format. So
basically all and/or no file format gives you that. You could take any
existing file format and create a library for accessing it that uses mmap.
If you look deep down in the file code for your favorite compiler you might
find that it already uses mmap to implement read and write in which case all
libraries have this feature.

I partially agree. The file format is being designed so that all offsets
are stored at the header so that I don’t have to walk through all the
file with a pointer to find an entry, which would cause the OS to bring
many pages to RAM. I didn’t really check if other formats would behave
the same. The TAR is one I know which doesn’t. Besides, the format I’m
designing will allow a simple bsearch call to find an entry instead of
comparing all entries’ names. I know the speed gain is small if one is
going to read a large entry, but I like to think that many small speed
gains result in an overall speed gain.

I’m aware that some compilers might use mmap behind the scenes for
regular file IO, but as I said before decompressing things with mmapped
files is a breeze.

And your “B” requirement can be met by simply appending the signature to an
existing file or by writing it to a different file, so pretty much all other
file formats meet this requirement.

  3. Are there other important characteristics for a resource file format
    I’m missing?

That it is available right now and you don’t have to write it from scratch?

Yeah :) I’m trying to use readily available code for everything in my
projects. zziplib already does almost everything I need, and I could
implement the signature like you suggested. But although it can be used
to open ZIP files that are part of larger files, the documentation does
not say how this can be done. I’m not really into studying someone
else’s source code; I prefer to study the problem and do this myself.

  4. Do you have requirements for a new resource file?

No.

The exposed interface so far is:

Hmmm, you aren’t being consistent about the use of uint32_t and int. The
interface as written may blow up if you try to do arithmetic on offsets or
lengths because you are mixing unsigned and signed integers for lengths and
offsets. It depends on whether the “int” variables are 32 bits or 64 bits.

Entries are 4 GiB maximum each. But you are right, af_entry_read,
af_entry_seek and af_entry_tell should take and return uint32_t values
too. I was just closely following the SDL_RWops interface; my file
format will be used in an SDL application later on and I wanted them to
have the same interface.

Above you mentioned that the file size can easily be bigger than 4 gigs but
you seem to only have 32 bits of offset in this format which restricts you
to <= 4 gigs. You need to change this interface to use 64 bit offsets.

Bob Pendleton

Thanks for your input.

Cheers,

Andre

[…]

But only pages actually accessed by the application are actually
brought to RAM by the OS, right? Moreover, the OS will swap out
pages to disk if the application needs physical RAM, won’t it?
[…]

Yes, but physical memory isn’t the problem. The application’s address
space is. All you have is 4 GiB of address space, some of which is
already reserved by the system (usually 1-2 GiB, depending on OS),
and the remaining 2-3 GiB is all you have left for addressing memory,
VRAM, memory mapped files and whatnot.

Thus, in a 32 bit environment, you need to “bank switch” large files;
ie map only part of the file at a time.
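
The arithmetic behind mapping only part of a file can be sketched as follows. `map_window_t` and `window_for` are hypothetical names. The sketch assumes the mapping call only accepts offsets that are a multiple of some power-of-two granularity (the page size for POSIX mmap, the allocation granularity for MapViewOfFile on Windows), so the window has to start a little before the entry.

```c
#include <stdint.h>

typedef struct {
    uint64_t map_offset;  /* file offset to pass to the mapping call */
    uint64_t map_length;  /* number of bytes to map */
    uint64_t delta;       /* where the entry starts inside the view */
} map_window_t;

/* Computes the smallest aligned window that covers one entry.
   granularity must be a power of two. */
map_window_t window_for(uint64_t entry_offset, uint64_t entry_size,
                        uint64_t granularity) {
    map_window_t w;
    w.map_offset = entry_offset & ~(granularity - 1);  /* round down */
    w.delta = entry_offset - w.map_offset;
    w.map_length = w.delta + entry_size;
    return w;
}
```

The entry's data then starts at the view's base address plus `delta`, and only `map_length` bytes of address space are used no matter how large the whole file is.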

//David Olofson - Programmer, Composer, Open Source Advocate

.------- http://olofson.net - Games, SDL examples -------.
| http://zeespace.net - 2.5D rendering engine |
| http://audiality.org - Music/audio engine |
| http://eel.olofson.net - Real time scripting |
'-- http://www.reologica.se - Rheology instrumentation --'

On Thursday 21 February 2008, Andre de Leiradella wrote:

[…]

But only pages actually accessed by the application are actually
brought to RAM by the OS, right? Moreover, the OS will swap out
pages to disk if the application needs physical RAM, won’t it?

[…]

Yes, but physical memory isn’t the problem. The application’s address
space is. All you have is 4 GiB of address space, some of which is
already reserved by the system (usually 1-2 GiB, depending on OS),
and the remaining 2-3 GiB is all you have left for addressing memory,
VRAM, memory mapped files and whatnot.

Thus, in a 32 bit environment, you need to “bank switch” large files;
ie map only part of the file at a time.

So it means I have to mmap only pieces of the file when af_entry_open is
called…

Thanks for the info.

Cheers,

Andre

This may be a bit of a n00b question, but when I looked
up “mmapping”, all the information I found seemed to indicate that
it’s just a UNIX/Linux thing. Will this be cross-platform?

That’s a term based on the POSIX mmap() call, and AFAIK, Win32 has a
completely different native API. Try “memory mapped I/O” or something
like that.

//David Olofson - Programmer, Composer, Open Source Advocate

.------- http://olofson.net - Games, SDL examples -------.
| http://zeespace.net - 2.5D rendering engine |
| http://audiality.org - Music/audio engine |
| http://eel.olofson.net - Real time scripting |
'-- http://www.reologica.se - Rheology instrumentation --'

On Thursday 21 February 2008, Mason Wheeler wrote:

This may be a bit of a n00b question, but when I looked
up “mmapping”, all the information I found seemed to indicate that
it’s just a UNIX/Linux thing. Will this be cross-platform?

That’s a term based on the POSIX mmap() call, and AFAIK, Win32 has a
completely different native API. Try “memory mapped I/O” or something
like that.

Simple copy-on-write code to map and unmap a file on Windows:

--------------------8<--------------------
#include <stdint.h>
#include <windows.h>

typedef struct {
    HANDLE hFile;    /* handle of the opened file */
    HANDLE hMap;     /* handle of the file-mapping object */
    void *contents;  /* address of the mapped view */
} mmap_t;

/* Maps size bytes of the named file starting at offset (size 0 maps up
   to the end of the file). offset must be a multiple of the system
   allocation granularity. Returns 0 on success, -1 on failure. */
int mmap(mmap_t *mm, const char *name, uint32_t offset, uint32_t size) {
    /* CreateFileA is the ANSI variant, matching the char * name. */
    mm->hFile = CreateFileA(
        name,                          // pointer to name of the file
        GENERIC_READ | GENERIC_WRITE,  // access (read-write) mode
        0,                             // share mode
        NULL,                          // pointer to security attributes
        OPEN_EXISTING,                 // how to create
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING |
            FILE_FLAG_RANDOM_ACCESS,   // file attributes
        NULL                           // handle to file with attributes to copy
    );
    if (mm->hFile == INVALID_HANDLE_VALUE)
        return -1;
    mm->hMap = CreateFileMapping(
        mm->hFile,                     // handle to file to map
        NULL,                          // optional security attributes
        PAGE_WRITECOPY | SEC_COMMIT,   // protection for mapping object
        0,                             // high-order 32 bits of object size
        0,                             // low-order 32 bits of object size
        NULL                           // name of file-mapping object
    );
    if (mm->hMap == NULL) {
        CloseHandle(mm->hFile);
        return -1;
    }
    mm->contents = MapViewOfFile(
        mm->hMap,                      // file-mapping object to map into address space
        FILE_MAP_COPY,                 // access mode
        0,                             // high-order 32 bits of file offset
        offset,                        // low-order 32 bits of file offset
        size                           // number of bytes to map
    );
    if (mm->contents == NULL) {
        CloseHandle(mm->hMap);
        CloseHandle(mm->hFile);
        return -1;
    }
    return 0;
}

/* Unmaps the view and releases both handles. */
void munmap(mmap_t *mm) {
    UnmapViewOfFile(mm->contents);
    CloseHandle(mm->hMap);
    CloseHandle(mm->hFile);
}
--------------------8<--------------------

The contents of the file can be accessed through the contents member of
the mmap_t structure.
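
For comparison, a rough POSIX counterpart of the Windows code above. `posix_map_t`, `posix_map` and `posix_unmap` are names made up for this sketch, and error handling is kept to a minimum. `MAP_PRIVATE` gives the same copy-on-write behaviour as `FILE_MAP_COPY`: writes land in private copies of the pages and never reach the file.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    void *contents;  /* start of the mapped bytes */
    size_t size;     /* length of the mapping, needed by munmap */
} posix_map_t;

/* Maps size bytes of the named file starting at offset, copy-on-write.
   offset must be a multiple of the page size; size 0 means "up to the
   end of the file". Returns 0 on success, -1 on failure. */
int posix_map(posix_map_t *pm, const char *name, off_t offset, size_t size) {
    struct stat st;
    int fd = open(name, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || offset < 0 || offset >= st.st_size) {
        close(fd);
        return -1;
    }
    if (size == 0)
        size = (size_t)(st.st_size - offset);
    /* MAP_PRIVATE: writes go to private page copies, never to the file. */
    pm->contents = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                        fd, offset);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (pm->contents == MAP_FAILED)
        return -1;
    pm->size = size;
    return 0;
}

/* Releases the mapping created by posix_map. */
void posix_unmap(posix_map_t *pm) {
    munmap(pm->contents, pm->size);
}
```

As with the Windows version, the file contents are reached through the `contents` member.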

Cheers,

Andre