[lug] Speed up osm2pgsql

Bear Giles bgiles at coyotesong.com
Fri Feb 9 07:59:25 MST 2018


Is it possible to load the data in a much more piecemeal fashion? It sounds
like you're loading it in one shot. I've worked with similar data from
commercial vendors, and it's always broken down into much smaller files
that can be loaded individually.
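
For instance - a sketch only, not something I've tested against OSM data -
if you can split the input into regional extracts, osm2pgsql's slim mode
should let you load the first one normally and the rest in append mode.
The file names, database name and cache size below are placeholders:

# Sketch: load regional extracts one at a time instead of the whole
# planet in one shot.  File names, database name and cache size are
# placeholders.
import subprocess

extracts = ["north-america.osm.pbf", "europe.osm.pbf", "asia.osm.pbf"]

for i, pbf in enumerate(extracts):
    mode = "--create" if i == 0 else "--append"
    subprocess.run(
        ["osm2pgsql", mode, "--slim",
         "-d", "gis",                 # target database
         "--cache", "4000",           # node cache in MB; size to your RAM
         pbf],
        check=True)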

Check whether there are options for things like disabling indexing. I
would be surprised if osm2pgsql isn't already doing this, but you want to
turn off all indexing while loading the data (the sole exception is the
btree index for the primary key) and only turn it back on after all of the
data has been loaded. In your case there's so much data that building the
indexes will require massive amounts of memory or swap, so it might be
cost-effective (if it's possible) to perform the initial import locally,
export it, import it on a hefty AWS instance, build the indexes (the
expensive sorting) there, export it again, then import the finished result
locally.
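
As an illustration of the index handling - a sketch with made-up index
names and DSN, and osm2pgsql may well do the equivalent for you already -
you can record the definitions of the secondary indexes, drop them before
the load, and recreate them once the load finishes:

# Sketch: drop secondary indexes before a bulk load and recreate them
# afterwards.  The table/index names and the connection string are made up.
import psycopg2

SECONDARY_INDEXES = {
    # index name -> DDL used to recreate it after the load
    "planet_osm_ways_nodes":
        "CREATE INDEX planet_osm_ways_nodes "
        "ON planet_osm_ways USING gin (nodes)",
}

def drop_secondary_indexes(conn):
    with conn.cursor() as cur:
        for name in SECONDARY_INDEXES:
            cur.execute("DROP INDEX IF EXISTS " + name)
    conn.commit()

def recreate_secondary_indexes(conn):
    with conn.cursor() as cur:
        for ddl in SECONDARY_INDEXES.values():
            cur.execute(ddl)
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=gis")
    drop_secondary_indexes(conn)
    # ... run the bulk load here (osm2pgsql, COPY, etc.) ...
    recreate_secondary_indexes(conn)
    conn.close()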

Another possibility - but possibly too expensive - is to get an AWS RDS
instance with a lot of provisioned IOPS. Let them pay for the hardware. If
you do that you might want to simply move your database to AWS instead of
moving the data both ways. (Re provisioning: convert a representative
sample to get an estimate of what you'll need for the full conversion.)
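
Something like this back-of-the-envelope calculation, with placeholder
numbers - and treat it as a lower bound, since index builds scale worse
than linearly:

# Placeholder numbers from a hypothetical sample run; plug in your own.
sample_gb = 2.0       # size of the representative extract
sample_hours = 0.5    # measured wall-clock time to import it
full_gb = 40.0        # size of the full data set

scale = full_gb / sample_gb
print(f"Naive linear estimate: {sample_hours * scale:.1f} hours")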

A final thing to check is whether there are already public exports of
previously loaded data. As you point out, it's pretty resource-intensive to
import the data, so it might be available "predigested" somewhere.

On Fri, Feb 9, 2018 at 7:45 AM, Bear Giles <bgiles at coyotesong.com> wrote:

> I forgot to mention this earlier - the quoted price per vCPU seems pretty
> high. My single-vCPU AWS and DigitalOcean instances cost about $5/month.
> You can't do much with them, but they give you a baseline price - it works
> out to something like $0.003/hr on average. (I have a mix of long-term
> prepaid systems and on-demand instances.)
>
> The price you quoted sounds more like the much heftier systems I use for
> small (Cloudera Express) Hadoop clusters, which run a lot of services.
>
>
>
> On Fri, Feb 9, 2018 at 6:51 AM, <stimits at comcast.net> wrote:
>
>> I was using a mix of small embedded systems to do disk copies from
>> full-sized systems, followed by restores to the full-sized systems
>> (using regular disks).
>>
>> While checking other uses out of curiosity, I noticed that any read
>> operation gets cached on the server: the second pass over the same data
>> was tremendously faster than the first, because the little embedded
>> system was using the full-sized system's cache over gigabit. That tiny
>> ARM system was outperforming my old workstation (24 cores, 48 GB of RAM,
>> three 15k RPM disks striped).
>>
>> The reason this intrigued me is that many years ago I wrote management
>> software for diskless clusters. When the clusters used their own disks
>> it took about 18 minutes to boot. With NFS, despite using the same disks
>> on the host, the entire cluster booted in about 25 seconds once POST was
>> done.
>>
>> Now imagine the iSCSI server uses a RAM disk instead. You don't care
>> whether the data is preserved across a reboot; it's just swap. If a lot
>> of the work is being slowed down by PostgreSQL reading ahead into the
>> page cache, it rather defeats the purpose to read ahead from a disk and
>> then swap it back out to a disk. I'm reminded of the phrase "the
>> operation was a success, but the patient died". I can't guarantee it,
>> but my feeling is that swap over iSCSI to a server with a lot of RAM
>> would outperform local swap, if for no other reason than the caching on
>> the faster server. Create a RAM disk on a server with lots of RAM,
>> export it over iSCSI on gigabit - or even 10GbE - and I think you will
>> end up with something far faster than you expect, without needing the
>> hardware physically sitting next to you.
>>
>> The real question is how to create a RAM disk approximately the size
>> you need and then format it as a swap device; a rough sketch follows.
>> Once you have that, everything about iSCSI is standard and not difficult
>> if you don't need things like encryption or special authentication.
>>
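>> Here's the rough shape of it as a Python wrapper around the shell
>> commands (a sketch only - the size, the IQN, the server address and the
>> /dev/sdX name the initiator ends up with are all made up, and the
>> targetcli/iscsiadm invocations are from memory, so check the man pages
>> before running any of this as root):
>>
>> import subprocess
>>
>> def run(cmd):
>>     print("+", cmd)
>>     subprocess.run(cmd, shell=True, check=True)
>>
>> # On the big-RAM server (the iSCSI target):
>> def export_ram_disk(size_kib=64 * 1024 * 1024,   # 64 GiB RAM block device
>>                     iqn="iqn.2018-02.example:ramswap"):
>>     run(f"modprobe brd rd_nr=1 rd_size={size_kib}")   # creates /dev/ram0
>>     run("targetcli /backstores/block create name=ramswap dev=/dev/ram0")
>>     run(f"targetcli /iscsi create {iqn}")
>>     run(f"targetcli /iscsi/{iqn}/tpg1/luns create /backstores/block/ramswap")
>>     # ACLs and authentication omitted; lock this down on a real network.
>>
>> # On the machine doing the import (the iSCSI initiator):
>> def swap_on_ram_disk(server="192.0.2.10",
>>                      iqn="iqn.2018-02.example:ramswap",
>>                      dev="/dev/sdb"):    # whatever device the login creates
>>     run(f"iscsiadm -m discovery -t sendtargets -p {server}")
>>     run(f"iscsiadm -m node -T {iqn} -p {server} --login")
>>     run(f"mkswap {dev}")
>>     run(f"swapon -p 10 {dev}")           # prefer it over any local swap
>>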
>> FYI, I wouldn't touch NFS in such a situation, for many reasons. iSCSI
>> is, for all practical purposes, just a local drive to the system using
>> it. You can use dd or just about anything else requiring a physical disk
>> (or loopback device) on iSCSI. My operations never had a single error
>> while transferring 300 GB images around.
>>
>> ----- Original Message -----
>> From: Lee Woodworth <blug-mail at duboulder.com>
>> To: lug at lug.boulder.co.us
>> Sent: Fri, 09 Feb 2018 09:02:36 -0000 (UTC)
>> Subject: Re: [lug] Speed up osm2pgsql
>> On 02/08/2018 03:48 PM, stimits at comcast.net wrote:
>> > Something I would ponder just because it is interesting, is that I've
>> > found some very good performance in the past using iSCSI. Would it be
>> > possible to have a system with this much RAM at a remote location
>> > serving iSCSI off of a very large RAM disk, and mount that on the
>> > local system as swap?
>>
>> Were you heavily loading a software-based iSCSI setup (client and
>> target)? I was considering a similar idea of using swap over NFS, but it
>> has been reported as having problems under load in the past.
>>
>> With just straight-up sequential ops (e.g. shredding a large file) I
>> have had GlusterFS lock up where the only recovery was a client reboot.
>> Same for NBD (network block device) and NFS on 4.14 kernels. So I wonder
>> about software-based iSCSI reliability under heavy load.