If this is too off topic, please let me know and I'll sign up on a Postgres list to get help. But this is related to my use of PostGIS, and if anyone knows this stuff, it's you guys.
I have an example query that I expect to be much faster, but my main concern is that we are about to do some visualization of historical congestion data, which will require queries across much larger data sets - like 150 million records a day. We are about to test using partitions, but the number of rows per table will still be much larger than what I am dealing with now.
So here is a query I would think would be much faster than 43 seconds for two tables, one with about 97k rows, and the other 3.2 million.
explain select count(l.*)
from links l, source_link ld where l.link_id = ld.link_id;
'Aggregate (cost=174731.72..174731.73 rows=1 width=32)'
' -> Hash Join (cost=13024.27..166784.14 rows=3179029 width=32)'
' Hash Cond: (ld.link_id = l.link_id)'
' -> Seq Scan on source_link ld (cost=0.00..58282.29 rows=3179029 width=10)'
' -> Hash (cost=10963.12..10963.12 rows=96812 width=42)'
' -> Seq Scan on links l (cost=0.00..10963.12 rows=96812 width=42)'
Each table has an index on link_id, defined like this:
CREATE INDEX links_link_id_idx ON links (link_id);
CREATE INDEX source_link_link_id_idx ON source_link (link_id);
Shouldn't this index prevent these sequential scans, or am I misreading this? Should this really take 43 seconds?
thanks for any advice,
Can you try this returning some sort of value (like the keys) instead of a
count(*)? count(*) can be pretty slow in Postgres, sometimes
(I think) forcing a seq scan.
I am not particularly confident this will fix your problem,
but it is worth a shot.
I would also experiment with DISTINCT and LIMIT, after
making sure ANALYZE has been run appropriately.
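For example, something roughly like this (just a sketch using your table names; adjust as needed):

-- make sure the planner has fresh statistics first
ANALYZE links;
ANALYZE source_link;

-- then return actual keys with a LIMIT instead of count(*)
SELECT l.link_id
FROM links l
JOIN source_link ld ON l.link_id = ld.link_id
LIMIT 100;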
I'm no expert at this, but my limited understanding is that you are asking for the whole table, so the index doesn't really get used - indexing helps find the pages for a subset of the data more quickly than scanning through the whole lot.
Also, you might be able to get some speedup by using a different join type (an outer join with a NOT NULL where clause)?
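That is, something like this (just a guess at what was meant; with link_id non-nullable it should return the same count as the inner join):

select count(l.*)
from links l
left join source_link ld on l.link_id = ld.link_id
where ld.link_id is not null;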
The issue here is the count(*) which forces a full table scan in
postgresql as fork mentioned. You need to look at a real query, unless
you are really doing a count(*).
Does this just apply to count(*), or is count(id) just as bad? I was originally a MySQL user, and count(*) could be very efficient there.
My understanding is that Postgres does not keep a row count for any of its tables internally, so there is no way to get an exact count without a seq scan. The trade-off is that inserts and deletes are faster, but a very common query is much slower. I don't know if the planner could benefit in any way from the count being available, though.
The lists say to use a trigger on inserts and deletes to update a metadata table
if you really do need to know how many elements are in it exactly, but that is a
far less frequent need than you may think (for example an EXISTS can get you an
answer to "are there any records" more quickly than a count(*)). I think you
can do a quick and rough estimate by doing some math with the table size on
disk, but I never have.
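A rough sketch of that trigger approach (all names here are made up, and it is untested):

-- one-row table holding the maintained count
CREATE TABLE links_rowcount AS SELECT count(*) AS n FROM links;

CREATE OR REPLACE FUNCTION links_rowcount_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE links_rowcount SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE links_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER links_rowcount_maint
    AFTER INSERT OR DELETE ON links
    FOR EACH ROW EXECUTE PROCEDURE links_rowcount_trig();

-- and the "are there any records?" check with EXISTS, which can stop at the first match
SELECT EXISTS (SELECT 1 FROM source_link WHERE link_id = 12345);  -- 12345 is just an example id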
It is unfortunate that the first (rather lame) "benchmark" anyone tries with a new database is "select count(*) from x" -- I am sure lots of people have been turned off from PG because this particular query is slow compared to MySQL. (MySQL always wins in the more Hollywood competitions against PG, but fails in the long run, IMHO.)
I think PostgreSQL 9.2 will have index-only scans that should improve
the performance of SELECT count(*) queries.
Granted, this is in trunk so it won't help for any production databases.
I haven't noticed count(*) causing additional query time, but I was just using it to rule out overhead of pulling all those rows back.
All of your feedback has been most helpful. Yes, this query is contrived, but I *thought* it was representative of a worst-case scenario that might be similar to future data sets, and it's likely not. Due to your prompting I did figure out why this particular problem I was seeing was slow, and it was really just user stupidity.
The real use case is using GeoServer to visualize this data. In the worst-case scenario someone zooms out nice and far and effectively gets all the links in that join. GeoServer seems to be able to get all the links with something like "select * from links" and generate the tiles at this zoom level fairly quickly, so I figured, all other overhead being equal, the join (sans the count(*)) would be the worst case. Although it is, I think you are right that it simply has to join across all those ids, and there is no way to improve that if selecting all.
I'll show a more reasonable case of what usually happens, but I'll explain the actual problem. I thought adding a distinct on (link_id) would speed up the join since the source_link table has many rows for each link_id, but it turns out this was what was making it slow (and I didn't realize that's what the real query was doing). I also thought for some reason an index would make distinct run quickly, since the index is effectively all unique values, right?
So here is what typically goes on for the query; after changing the "select distinct on (link_id) l.*" to "select l.*" it performs reasonably, with an additional level of filtering by a bounding box.
explain analyse SELECT "link_id",encode(ST_AsBinary(ST_Force_2D("the_geom")),'base64') as "the_geom"
FROM (select l.* from links l, source_link ld where l.link_id = ld.link_id) as "vtable"
WHERE "the_geom" && ST_GeomFromText('POLYGON ((-74.79526213727112 40.11142841966784, -74.79526213727112 40.21270521911115, -74.66273955428638 40.21270521911115, -74.66273955428638 40.11142841966784, -74.79526213727112 40.11142841966784))', 4326)
'Hash Join (cost=3722.92..74955.80 rows=58813 width=378) (actual time=412.729..5946.610 rows=44469 loops=1)'
' Hash Cond: (ld.link_id = l.link_id)'
' -> Seq Scan on source_link ld (cost=0.00..58282.29 rows=3179029 width=10) (actual time=0.026..2823.685 rows=3179029 loops=1)'
' -> Hash (cost=3706.60..3706.60 rows=1306 width=378) (actual time=7.805..7.805 rows=1285 loops=1)'
' -> Bitmap Heap Scan on links l (cost=74.42..3706.60 rows=1306 width=378) (actual time=0.944..6.093 rows=1285 loops=1)'
' Recheck Cond: (the_geom && '0103000020E61000000100000005000000E6D42993E5B252C0BF285549430E4440E6D42993E5B252C0ADE2B3EC391B44403DDB29536AAA52C0ADE2B3EC391B44403DDB29536AAA52C0BF285549430E4440E6D42993E5B252C0BF285549430E4440'::geometry)'
' -> Bitmap Index Scan on links_geom_idx (cost=0.00..74.09 rows=1306 width=0) (actual time=0.883..0.883 rows=1285 loops=1)'
' Index Cond: (the_geom && '0103000020E61000000100000005000000E6D42993E5B252C0BF285549430E4440E6D42993E5B252C0ADE2B3EC391B44403DDB29536AAA52C0ADE2B3EC391B44403DDB29536AAA52C0BF285549430E4440E6D42993E5B252C0BF285549430E4440'::geometry)'
'Total runtime: 5983.473 ms'
So what prompted my initial concern is solved. This slowness caught me at a time when, in the back of my mind, I am contemplating our next data challenge, which is why I looked into it. If I run into problems with our other data set I'll be sure to cover all my options and give realistic queries if I ask for help.
Thanks again for all the advice.
True for a single table; in the given example there was a join.
A possible outcome could have been a seq scan on the smaller table and a nested loop with index lookups on the bigger table. The optimizer must have thought this was more expensive than the full scan with hash join.
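If you want to see what that alternative plan would have cost, one diagnostic trick (not a fix) is to steer the planner away from the hash join for the session and compare the estimates:

SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN SELECT count(l.*)
FROM links l, source_link ld WHERE l.link_id = ld.link_id;
RESET enable_hashjoin;
RESET enable_mergejoin;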
I think we are all glad to help. The count(*) assumption is reasonable enough
to be made by LOTS of people.
One thing -- while we hope that you ask lots of questions on this list, would you mind not top-posting, and trimming out non-germane text? Threading and trimming make a conversation MUCH easier to follow.
Also -- if you are developing an app that will be rolled out later or that is
somewhat academic, I would consider Postgres on trunk if it has features you
really need. It is easy to build, just make sure you back up your data...
Sadly it's for immediate production use, and I'm forced to use Windows, which limits my version choices a bit given my lack of skill at building PostGIS under Windows :(
Hehe -- none necessary -- welcome to the light side.
Yeah, so much for index only queries ;)
You ask interesting questions -- don't be a stranger.
Best way to evaluate this is with explain. But yes both have the same
full table scan.
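e.g. a quick check on the links table from earlier; both should show essentially the same plan (an Aggregate over a Seq Scan):

EXPLAIN SELECT count(*) FROM links;
EXPLAIN SELECT count(link_id) FROM links;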
MySQL maintains row count as one of its table stats, so it just pulls that from the stats metadata. If you just want an approximate row count, there are some statistics based on the last analyze. If you use pgAdmin3 you can see it in the table properties:
Rows (estimated) 1957620
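The same estimate is available straight from the catalog; roughly like this (it is only as fresh as the last ANALYZE/VACUUM):

SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'links';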
This is one of the largest hurdles for MySQL users moving over to
postgresql because it is so obvious and in your face. But the things I
can do today with postgresql and with postgis, I could never do in MySQL.
You know we do have pretty much the latest builds of PostGIS (even trunk builds) on the PostGIS website for Windows: http://www.postgis.org/download/windows/experimental.php
and as far as PostgreSQL -- they have released windows binaries for even
PostgreSQL 9.1 RC1 http://www.enterprisedb.com/products-services-training/pgdev[..]
Can't get too much more current than that (all without having to compile
anything unless you are talking about the 64-bit versions).
Sorry, I should have been more clear. I also use pgRouting and afaik I am tied to 8.4 on windows without building this stuff myself. If I am wrong, please let me know!
I am revisiting this again. How much of a performance difference should one expect to see between the 32-bit version and the 64-bit version of Postgres when using PostGIS for typical GIS queries like filtering by bounding box, locating nearest points, etc.? Depending on how I partition my data, I'll have anywhere from 200k to 260M records per table, on a machine with 32GB of RAM.
I am trying to make a case to use Linux for a specific project, but without being able to say there are significant gains (in performance) I'm just fighting an uphill battle. At best right now I can use the latest PostGIS/PostgreSQL under Windows, but only 32-bit.
Honestly I have little idea. In theory you should fare better with Linux, but I think how much better depends on the kinds of processes you run and how much RAM you have. One advantage that Linux has over Windows (an advantage even on 32-bit Linux, but more so, I think, on 64-bit Linux) is that you can allocate more shared memory, so workloads that benefit from a lot of shared memory should gain (basically cases where you are using more or less the same data in different ways, so it can live in RAM). Windows is limited not just in the 32-bit but also the 64-bit build, and I think this is just a fundamental flaw in PostgreSQL on Windows.
So generally speaking, on Windows I can't boost my shared memory much past about 700MB - 1GB without running into crashing issues. As far as work_mem and so forth is concerned, on 32-bit Windows you can have as much as 4GB per Postgres process (and more with 64-bit Windows), but of course if you are running Linux, and 64-bit at that, you can go up way more.
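For reference, the two settings being talked about here can be checked from psql (the values discussed above are observations from this thread, not recommendations):

SHOW shared_buffers;  -- the shared memory cache with the Windows ceiling discussed above
SHOW work_mem;        -- memory per sort/hash operation, per backend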
I'm sure Linux folks will shoot me for saying this and chew me out, but I haven't really noticed much of a difference running my processes on 64-bit Linux vs. 32-bit Windows. Then again, my processes are probably different from other people's, and I don't have a 32GB-RAM Linux box to take advantage of the massively larger shared memory I could allocate.
So on the low end (say the 4-8GB RAM range) I suspect there isn't much of a difference, but when you get higher, into the 32GB/64GB range, you would probably do a lot better with Linux.
32-bit Windows is actually limited to 2 GB per process, or 3 GB if you start Windows with the special /3GB option.