Fetching Data from a Large MongoDB Collection

March 1, 2018

I have been trying to fetch data from a large MongoDB collection (more than 400 GB). Since I couldn't find a great way to do it, my only option was to filter on a field that has an index, in this case the date column.

Like everyone else, I Googled it, and there were some suggestions such as skip(), limit(), mongoexport, etc. But as the MongoDB documentation mentions, skip() and limit() put load on the DB and you might encounter duplicate values, so they are not very efficient, and mongoexport takes forever. So I used a Python script to fetch the data in a way that minimizes the performance impact.
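To illustrate why a range condition on an indexed field is cheaper than skip(): skip(n) has to walk past n entries on every call, while a range query starts right where the previous page ended. Here is a minimal sketch of that idea, assuming the default index on _id; the helper name iter_in_pages is hypothetical, and this is not the approach used in the rest of this post, which ranges over the date field instead.

```python
def iter_in_pages(collection, page_size=1000):
    """Yield every document in _id order, one page at a time, without skip()."""
    last_id = None
    while True:
        # After the first page, continue from the last _id that was seen.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        page = list(collection.find(query).sort("_id", 1).limit(page_size))
        if not page:
            return
        yield from page
        last_id = page[-1]["_id"]
```

Each page costs roughly the same no matter how deep into the collection it is, because the server seeks straight to last_id in the index.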

##Import necessary libraries
import datetime
import pymongo
import pytz

## Build timezone-aware dates, since MongoDB stores datetimes in UTC

tz = pytz.timezone('Asia/Tehran')
start_date = datetime.datetime(2019, 1, 17, 0, 0, 0, 0)
end_date = datetime.datetime(2019, 9, 17, 0, 0, 0, 0)

localize_startDate = tz.localize(start_date).replace(hour=0, minute=0, second=0, microsecond=0)
localize_endDate = tz.localize(end_date).replace(hour=23, minute=59, second=59, microsecond=0)


## Function to iterate through the days between two dates (inclusive)
def daterange(start_date, end_date):
    # Use the arguments rather than the module-level variables, and walk
    # backwards when the start is later than the end.
    first, last = start_date.date(), end_date.date()
    step = 1 if first <= last else -1
    for n in range(abs((last - first).days) + 1):
        day = first + datetime.timedelta(days=step * n)
        # Localize each day separately so the UTC offset follows DST changes.
        yield tz.localize(datetime.datetime.combine(day, datetime.time.min))


## Define the MongoDB connection
myclient = pymongo.MongoClient("mongodb://[username]:[password]@[host/IP]:[port]/[DBName]")
mydb = myclient["[DBName]"]
mycol = mydb["Collection_Name"]

## Write the per-day counts to a file
with open('dates_iter', 'w') as file:
    for date in daterange(localize_startDate, localize_endDate):
        # Exclusive upper bound: local midnight of the following day.
        next_day = date.replace(tzinfo=None) + datetime.timedelta(days=1)
        dateplus = tz.localize(next_day)
        # find() only returns a cursor; count_documents() (PyMongo 3.7+)
        # returns the number of matching documents.
        count = mycol.count_documents({
            "service.createdDate": {"$gte": date, "$lt": dateplus}
        })
        file.write("Total count for one day starting at {0} and ending at {1} is >> {2}\n".format(date, dateplus, count))
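The loop above works one day at a time, and the same window can be used to pull the documents themselves. Below is a minimal sketch of that, assuming a PyMongo-style collection object; the helper name export_window, the JSON-lines output format, and the batch size of 1000 are my own illustrative choices, not something MongoDB requires.

```python
import datetime
import json


def export_window(collection, start, end, out,
                  field="service.createdDate", batch_size=1000):
    """Write every document with start <= field < end to `out` as JSON lines.

    Returns the number of documents written.
    """
    query = {field: {"$gte": start, "$lt": end}}
    written = 0
    # batch_size() caps how many documents each round trip pulls from the server.
    for doc in collection.find(query).batch_size(batch_size):
        # default=str renders ObjectId and datetime values as plain strings.
        out.write(json.dumps(doc, default=str) + "\n")
        written += 1
    return written
```

With the variables from the script above, it could be called inside the loop as export_window(mycol, date, dateplus, file), so that only one day's worth of documents is streamed at a time.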

I would be really glad if someone who uses MongoDB for big data, ETL, or data warehousing could recommend a more robust way of fetching data.

P.S.: This code is mainly useful when you have limited access and a fairly old version of MongoDB (3.2.6). Otherwise you can use mongodump, or the cloneDatabase command on versions that still ship it (it was deprecated in MongoDB 4.0 and removed in 4.2).