-
Notifications
You must be signed in to change notification settings - Fork 48
Chunking of larger tables in DB
Documenting a major development in this project that took a long time to fine-tune and got finalized with v2.0.0 in Sep 2018, but was in backburner development since May 2018.
Will share documentation regarding it here.
The stop_times.txt
file (and shapes.txt
also in some cases) can be HUGE, million+ rows Reading and writing these took extremely long time, and in many cases, simply errored out with a memory error, in the older json-based DB system. And the json files created there were also huge. This is given in detail in #4 .
This prompted a transition from json-based db to HDF5 format which took care of a lot of things at the DB storage and retrieval level. But the problem of working with a very large table in memory remained. v.long waiting time and constant threat of out-of-memory errors.
So from there, an idea for CHUNKING emerged.
- tables like stop_times.txt and shapes.txt are anyways never actually handled in full. ie, at no point in managing GTFS are you processing the entire table. You're just working with one trip_id's rows in stop_times, and one shape_id's rows in shapes.
- So why not store them in chunks
- Split the stop_times file into multiple chunks: stop_times_1.h5, stop_times_2.h5, ... having a million or half a million rows each.
- make a lookup json having the trip_id's mapped to the file containing them.
Note: from this point onwards,
stop_times
andtrip_id
will be mentioned but that should be assumed to apply to any table that's being chunked, and the corresponding primary key.
Then, when reading the table,
- load the lookup json and figure out which chunk our trip_id is mapped to.
- load that chunk only, leave the remaining data alone.
When creating the data for the first time during a feed import, this chunking feature was leveraged : the csv itself was read in chunks, and each chunk is processed separately and deleted from memory before loading the next chunk. This way huge feeds having huge stop_times.txt files can be read by the program without requiring large memory hardware.
- one challenge : how to ensure that one trip_id doesn't spill over from one chunk to the other.
- See the code for this in the GTFS import function:
Sample console output of an import:
Storing stop_times in chunks.
stop_times_1.h5: 499987 rows
stop_times_2.h5: 500004 rows
stop_times_3.h5: 499994 rows
stop_times_4.h5: 500007 rows
stop_times_5.h5: 499993 rows
stop_times_6.h5: 500007 rows
stop_times_7.h5: 500004 rows
stop_times_8.h5: 499970 rows
stop_times_9.h5: 500013 rows
stop_times_10.h5: 500020 rows
stop_times_11.h5: 4333 rows
Appending the 22 rows of last ID to last chunk stop_times_11.h5
Lookup json: stop_times_lookup.json created for mapping ID trip_id to stop_times_n.h5 chunk files.
An explanation of the workflow:
- from the csv chunk loaded, take all trip_ids except the last one (which may be continued in the next chunk). Store these into a .h5 file.
- store the last trip_id's rows in a "CarryOver" df.
- when the next csv chunk is loaded, concatenate the CarryOver rows with these. Even here, repeat the earlier process.
- this will result in chunks that have slightly less or more rows than the chosen chunk size.
- After processing the last chunk, append the last set of CarryOver rows back to the last chunk.
(this is a lot easier to code than it is to explain, so apologies if this is still confusing)