Implementing TPC-H DB with MongoDB and Python
In this article I show some of what I learned by translating the TPC-H queries and database to MongoDB format. If you already know SQL and are interested in studying MongoDB, this is the right place for you!
TPC-H is a benchmark developed to evaluate large-scale SQL databases, and its specification is a great document to study if you want to learn about relational DBs. Among other important content, it defines 22 queries that cover a wide range of business problems and exercise most of the features SQL has to offer. For that reason, it is a widely used benchmark, and it can also be used to learn about NoSQL DBs by replicating the database and the queries it proposes. It is important to note that we present here a “literal” translation of SQL to MongoDB, so, in a real implementation, many changes would be needed in order to achieve satisfactory performance.
Preparing the environment
So, let’s cut to the chase. The first question to answer is: “How do I set up the environment?”. Besides MongoDB, you will need to install the Python driver, also known as PyMongo. You can install it with pip using the command below. For more information, see the official PyMongo documentation.
$ python -m pip install pymongo
Once PyMongo and MongoDB are installed, you should be ready to go!
Creating a new database and tables
Now we have to create the database. Luckily, we have MongoDB! First, we need to set up a client with the MongoClient class.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
To create the database you can use the code below. Notice that you can use the exact same code to open a DB that is already created. If the DB does not exist, it is automatically created.
db = client["db_tpch"]  # dictionary-style access
db = client.db_tpch     # attribute-style access (equivalent)
Then, to create a table (a “collection” in MongoDB terminology), we only need to initialize it with the desired name on the Database object. In this case we used the name “PART”.
# Create the table
part = db.PART
Importing data
The data used here is not the full TPC-H data set, and it comes in .tbl format, which is very similar to the .csv format. For that reason, we can import the data using the read_csv function from Pandas. Column 9 is empty, so it is dropped.
import pandas as pd

# Import data
part_data = pd.read_csv('data/PART.tbl', sep='|', header=None)
# Drop empty columns
part_data = part_data.drop(9, axis=1)
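Column 9 is empty because each .tbl row ends with a trailing “|”, which pandas reads as one extra, all-NaN column. A minimal sketch of this behavior with an inline sample (toy values, not real TPC-H data):

```python
import io

import pandas as pd

# Two toy rows in .tbl style: note the trailing '|' on each line
sample = io.StringIO("1|goldenrod|Manufacturer#1|\n2|blush|Manufacturer#2|\n")
df = pd.read_csv(sample, sep='|', header=None)

print(df.shape)            # the trailing '|' yields an extra column: (2, 4)
print(df[3].isna().all())  # the last column is entirely NaN: True

# Dropping it leaves only the real columns
df = df.drop(3, axis=1)
print(df.shape)            # (2, 3)
```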
Since we will later need to pass the column names as dictionary keys, we must first set them on the Pandas DataFrame. Our case is a literal translation from SQL to MongoDB, so the names are the same as in the original SQL database. Notice that “_id” is the name of the ID key in any table and needs to be unique; if you don’t pass an “_id”, MongoDB will generate a unique id for your document. In the present case, we reuse the id generated by the SQL DB, which will make it possible to perform more complex queries later.
# Send columns names
part_data.columns = ['_id', 'P_NAME', 'P_MFGR', 'P_BRAND', 'P_TYPE', 'P_SIZE', 'P_CONTAINER', 'P_RETAILPRICE', 'P_COMMENT']
To import the data, we need to pass each row as a dict with the column names and the respective values. We can insert many rows at once by passing a list of dicts to the insert_many method.
# Convert each row to a dict before import
dict_part_data = [row.to_dict() for _, row in part_data.iterrows()]
# Insert dictionary data on table
part.insert_many(dict_part_data)
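The list-of-dicts conversion above can also be written with DataFrame.to_dict('records'), which avoids the explicit loop over iterrows. A sketch on a toy DataFrame (illustrative values, not real TPC-H rows):

```python
import pandas as pd

part_sample = pd.DataFrame({
    '_id': [1, 2],
    'P_NAME': ['goldenrod', 'blush'],
    'P_RETAILPRICE': [901.0, 902.0],
})

# Each row becomes one dict keyed by column name, ready for insert_many
docs = part_sample.to_dict('records')
print(docs[0])  # {'_id': 1, 'P_NAME': 'goldenrod', 'P_RETAILPRICE': 901.0}
```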
Finally, we print the number of documents in the PART table to double-check.
# Double check
print(part.count_documents({}))
Conclusions
In this article we created a table and imported data in .csv-like format into it. As you can see, this is quite an easy task with MongoDB. The code for implementing the other tables can be found on my GitHub page.
In the next articles I will show how to execute each SQL query from TPC-H in MongoDB with PyMongo. Stay tuned!