In regard to the WAL (write-ahead logging) support being added to SQLite for
version 3.7.0, I have a few comments, questions, and suggestions. Most of them
refer to parts of http://www.sqlite.org/draft/wal.html .
1. Congratulations on this move, which should greatly improve SQLite's utility
relative to 3.6.x. I look forward to the multi-versioning capabilities that
this will provide.
2. Quoth the raven:
"The wal-index greatly improves the performance of readers, but the use of
shared memory means that all readers must exist on the same machine. This is
why the write-ahead log implementation will not work on a network filesystem."
I don't understand the need to sacrifice WAL on a network filesystem. My naive
interpretation of the situation is that WAL itself would work on a network
filesystem, but just that processes using it that way would have slower
performance as they can't use the wal-index.
I would think it better for WAL to be supported over a network, with the caveat
that its performance would be slower than for a process on the same machine.
Then even network users still get the multi-versioning and other benefits, just
with slower read performance. And concurrent processes on the same machine as
the database would still use the wal-index at the same time.
Is there more to this that I'm not seeing, or could SQLite's WAL be altered as I
suggest above so that network users are supported but just more slowly?
3. Quoth the raven:
"A checkpoint operation takes all the content from the WAL file and
transfers it back into the original database file. As long as all readers are
looking at the entire WAL file (in other words, as long as no reader is using an
end mark that is earlier than the actual end of the WAL) then it will be OK for
the checkpoint to run in parallel with the readers.  Thus, a checkpoint will
never block a reader and readers will not block a checkpoint as long as all
readers are looking at the entire WAL. But, older readers that only see a prefix
of the WAL will block a checkpoint. "
I don't understand why checkpointing needs to be an all-or-nothing affair. If
not all readers are looking at the entire WAL file, why can't just a prefix of
the WAL be checkpointed: the prefix visible to the earliest-started reader, so
that afterwards the WAL holds only the frames from that reader's end mark
onward?
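The partial-checkpoint idea above reduces to a simple rule: the safe prefix is
the minimum end mark across all active readers. A minimal sketch in C (the
names and representation are my own for illustration, not SQLite's):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: each active reader snapshots an "end mark" (the WAL
 * frame count at the start of its read transaction).  The largest WAL prefix
 * that can be safely checkpointed is the minimum of those marks; frames past
 * that point may still be invisible to some reader and must stay in the WAL. */
static uint32_t checkpointable_prefix(const uint32_t *reader_marks,
                                      size_t n_readers,
                                      uint32_t wal_frames) {
    uint32_t limit = wal_frames;   /* with no readers, the whole WAL is safe */
    for (size_t i = 0; i < n_readers; i++) {
        if (reader_marks[i] < limit)
            limit = reader_marks[i];
    }
    return limit;
}
```

With three readers whose marks are 100, 40, and 250 against a 300-frame WAL,
only the first 40 frames could be checkpointed under this rule.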
-- Darren Duncan
Correct. All clients that have the database open must be on the same SMP.
The easiest way to enforce that is to just say "no network files".
Note that the current implementation uses mmap() on an ordinary file to get
shared memory. We may well change that to use shm_open() before the
release. Does anybody know if mmap() works for network files?
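The mmap()-on-an-ordinary-file approach described above can be sketched as
follows (the path and size are illustrative only, not SQLite's actual
wal-index layout):

```c
/* Sketch: map an ordinary file MAP_SHARED so that every process on the same
 * machine mapping the same file sees one copy of the memory.  On a network
 * filesystem, the kernel's page-cache coherence guarantees for this mapping
 * vanish, which is the risk being discussed. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_shared_file(const char *path, size_t len) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return (p == MAP_FAILED) ? NULL : p;
}
```

Any other process calling the same function with the same path gets a view of
the same bytes, which is what the wal-index needs from same-machine readers.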
We've been working on an incremental checkpointing mechanism for about a
week. With incremental checkpointing, the checkpoint can be run more or
less continuously in a background thread or process. Writers never block
readers, readers never block writers, and checkpointers never block
anybody. There can still only be one writer at a time, though, so writers
still block other writers. And during recovery from a crash, everybody is
blocked, but that's an unusual case. You can arrange for a fiendish reader to
prevent the checkpointer from ever making any progress and hence cause the WAL
to grow without bound. But we
do not anticipate that being a problem in the common case.
If you set synchronous=NORMAL, then the checkpoint thread/process is the
only thread/process that ever calls fsync() so you can do queries and
updates in a GUI thread with much less risk of freezing the system due to a
slow disk. You pay for this by giving up durability following a power loss or
hard reset.
Please note that SQLite databases with WAL will continue to feature fully
automatic recovery from power-loss or hard-reset; applications do not have
to do anything special to recover from a crash. Also, any client (reader,
writer, or checkpointer) can take a SIGKILL at any instant and the system
will automatically recover and continue operating.
Code for the above will appear in the source tree soon, I hope - initially
in a branch and then on the trunk as it stabilizes.
D. Richard Hipp
Richard, thank you for responding to my questions about SQLite's WAL design.
I look forward to this, which should be a considerable improvement without
really increasing complexity. I'm not going to advocate trying to support
multiple concurrent writers, which is probably where most of any added
complexity would lie, beyond what SQLite currently does.
Now, another item, which I forgot to state earlier ...
4. Quoth the raven:
"3. Transactions that involve changes against multiple ATTACHed databases
are atomic for each individual database, but are not atomic across all databases
as a set."
I greatly hope that this limitation could go away. I consider that SQLite's
ability to make multiple databases subject to a common transaction is very
powerful, and I would even argue, essential.
I hope that some variation of the method used now with rollback journals can be
applied to databases with WALs, so that the latter can also take part in atomic
cross-database transactions.
I don't see anything in the WAL design suggesting this couldn't be done without
much difficulty.
Thank you in advance.
-- Darren Duncan
It does. However there is no way to ensure that everyone seeing the file
will have a consistent view of it. This is especially the case when they
are on different machines.
However, there can be consistency issues even when all client processes are
on the same machine. This is because the client operating system should be
segmenting its cache by user - i.e. if my user and your user map the same
network file, they should not share the client cache - as shared caching is a
wonderful way of enabling cache-poisoning attacks.
Good to know. Would moving to shm_open() help? I haven't done that yet
because (I read) shm_open() requires -lrt when you link and I like keeping
the number of dependencies to a minimum.
D. Richard Hipp
The problem is that unless the networked filesystem provides exclusive locks
that can be retained by only one client, there will always be race conditions
between that client's view of the underlying file and the contents on the
server.
the server. (And if the client's filesystem cache is segmented by user then
only one segment will be able to hold the lock.)
The API (mmap/shm_open) won't make any difference as this is all a property
of the user space API the client application is using, the file system/vmm
implementation in the client kernel, the network file system implementation
in the client and what requests/semantics are available in the network
protocol. One weak link in the chain and race conditions are possible.
Windows and CIFS/SMB do have the exclusive locking, while NFS does not. And if
the backend is Samba, then Samba can't prevent concurrent access by Unix apps
on the server alongside the CIFS clients.
Basically if the shared memory file is networked then there will be hard to
diagnose corruption problems.
Note that shm_open (on Linux anyway) makes a file in /dev/shm (typically
tmpfs) which means it is always on the local machine.
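A sketch of the shm_open() route for comparison (the object name is
illustrative; as noted earlier in the thread, older glibc requires linking
with -lrt for this call):

```c
/* Sketch: POSIX shared memory objects live in a machine-local namespace
 * (on Linux, a tmpfs typically mounted at /dev/shm), so unlike a mapped
 * ordinary file they can never land on a network filesystem. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *open_local_shm(const char *name, size_t len) {
    int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : p;
}
```

The trade-off mentioned above is exactly this: shm_open() guarantees locality,
at the cost of an extra link dependency on some platforms.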
I'd just go with saying that there are absolutely zero circumstances under
which SQLite is supported on a networked filesystem, that there is no way to
reliably diagnose when it is being misused that way, and that the resulting
corruption will be hard to detect as it happens and can't be fixed after the
fact.
For people who must use a network, the solution is a networked database
server, since such servers can reliably ensure data integrity.
* Richard Hipp:
I think it's difficult to clean up the shared memory segments when a
process which had opened the database crashes. And access control
would be very hard to get right, too.
The semantics of ATTACH imply Partition. In the new WAL design, readers
never block, which is the same as Available. Hence, we must forgo
cross-database atomic commits (what the CAP theorem calls "Consistent").
D. Richard Hipp
Thanks for the URL; that was an interesting read.
I believe that in SQLite's case, unless perhaps some of the attached databases
are on different physical machines from each other or from the main one, CAP
(you can have at most 2 of 3) isn't applicable, or at least that there are
some common situations where it isn't applicable. (From my reading, CAP mainly
speaks to situations where the database is split across multiple physical
machines.)
Primarily, I speak to the simplified situation where all SQLite databases that
are open or attached by a SQLite process are all on the same machine, and that
all simultaneous SQLite processes using any of the same databases at once are on
the same machine.
In this situation, the semantics of ATTACH either do *not* imply Partition,
because everything is on the same machine, or it might be reason to tweak the
semantics of ATTACH such that they would not imply Partition.
I note from the SQLite documentation that one cannot detach a database in the
middle of an active transaction; this is a good thing and would contribute
towards ATTACH not implying Partition. I don't know if ATTACH is similarly
restricted or not, as the documentation doesn't say, though arguably this is
less important; for consistency I might restrict ATTACH to not be possible in a
transaction either, unless there is a reason to do otherwise.
In the common scenario that I mention, with everything on one machine, would it
not be reasonable to support transactions that are atomic across all attached
databases when WAL is in use? If it is reasonable, then perhaps SQLite could
offer a partial guarantee, such that transactions when WAL is in use are
guaranteed atomic only across the subset of attached databases that are on the
same machine as the main database.
I think that this matter may correspond somewhat to the limits of the
wal-index, a same-machine limit, though I wouldn't go so far as to assume they
are connected.
On a tangential matter, where WAL isn't necessarily in use, the documentation
for ATTACH seems to say that a cross-database transaction wasn't even possible
before if the main database was ":memory:". While I can understand why this
might have been the case, given the extra super-journal file that marks the
collection of linked journals/databases, I wonder if it can be changed
somehow. For example, could SQLite be updated to create this extra file even
when there is no on-disk main database directly associated with it, so long as
on-disk databases are attached? It would be nice for the availability of
cross-database transactions to be orthogonal to which database is ":memory:",
so that only a multiplicity of on-disk databases needs synchronizing.
Since journal or WAL files are named after the on-disk databases they belong
to, could the user not be allowed to specify, some other way, a file name to
use when the main database is ":memory:"?
For example, the syntax for creating/opening a database could be generalized
such that one always gives a file name but also provides an extra parameter
saying whether the database is temporary. A TRUE value for this parameter
would cause the database to live only in memory, and FALSE would mean on disk.
Then a possible file name is always provided. That structure might also have
other benefits.
-- Darren Duncan
Partition means that one part of the database cannot communicate with
another part. In the usual use of the CAP theorem, this means that the
database is spread across multiple machines and the network link is broken
(or in practice, simply congested and sluggish). In the context of ATTACH
it means that the two databases A.db and B.db cannot communicate with each
other. Of course, they can communicate with each other since they are on
the same machine, but they cannot communicate with each other in the context
set up by the current implementation of ATTACH. ATTACH treats each attached
database as a separate and independent entity. The key point for atomicity
of commits is that each attached database has its own write-ahead log file
and none of those separate write-ahead logs (WALs) know where to find any
other. And so the WALs are effectively partitioned from one another, though
by software design instead of by physical hardware limitations.
So it seems we could fix this situation by somehow storing in each WAL
information about the location of all the other WALs that participate in
each transaction, thus removing the partition between the components of the
transaction.
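For what it's worth, the mechanism described here would parallel the
master-journal file that SQLite's rollback mode already uses for
multi-database commits: a small file naming every participant, consulted
during recovery so the group commits or rolls back as one unit. A hypothetical
sketch of writing such a file (format and names are mine, not SQLite's):

```c
/* Hypothetical sketch: before committing a cross-database transaction,
 * write a "super" file listing the path of every participating WAL, one per
 * line.  Crash recovery could then locate all participants and treat their
 * pending commits as a single atomic unit. */
#include <stdio.h>

static int write_super_wal(const char *super_path,
                           const char *const *wal_paths, int n) {
    FILE *f = fopen(super_path, "w");
    if (!f) return -1;
    for (int i = 0; i < n; i++)
        fprintf(f, "%s\n", wal_paths[i]);
    return fclose(f) == 0 ? 0 : -1;
}
```

The file would be deleted once all participants have committed, mirroring how
the rollback-mode master journal is removed after a successful commit.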
D. Richard Hipp
Yes, exactly. I thought/hoped the solution could be as simple as that.
I believe that SQLite is at its best when a multiplicity of databases used in
the same connection, as ATTACH provides, are treated as a single database with
respect to transactions. An application should be able to use ATTACH when they
want these semantics, and the application instead should use distinct
connections when they want multiple database accesses that expressly are not
subject to common transactions.
Generally speaking, I believe that the role of individual SQLite database files
is to provide the maximum context for definition consistency, such that every
database taken on its own includes all of the context to fully interpret it, and
keep it self-consistent. Meaning for example that a table or view or constraint
definition is entirely within the database file, and there are no cross-file
constraints or foreign keys, say.
Both of these principles are orthogonal to whether WAL or journals are used.
Therefore, with my WAL comments, I'm not trying to coerce SQLite into being a
clone of the more complex multi-file or client-server databases. One might
think that from the scope I believe transactions should always have. But my
position on the boundaries of constraints and other definitions should clearly
counter that impression.
Anyway, keep up the good work; I and others greatly appreciate it.
Thank you. -- Darren Duncan