Downloading Datastore Data from a Java Application on Google App Engine

One of the most frequently talked barriers for getting onto the cloud is the data lock-in. In this post we would try to look at how we can download data as a CSV from the app engine datastore. This would help us get the data back from the app engine, just in case you want to move away from the datastore.

So there exists a tool which can help you get data, that is the good part. The not so good part is that there is nothing available in Java (as of now) to accomplish this. We have to depend on the python sdk. That should give you a hint that we would need to download the python sdk.

It might also be worthwhile to check that you have python installed on your system. When I type python on the command prompt for my Ubuntu 10.04 system, I get

[sourcecode language=”bash”]
vikas@vikas-laptop:~$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
[/sourcecode]

So, good to go. Assuming that you have downloaded the python sdk, you would find the appcfg.py file in the root of the distribution.

Now, we have to set up a remote servlet with our Java application with which would be able to communicate with the app engine remotely. The following needs to be set up in your web.xml

[sourcecode language=”xml”]
<servlet>
<servlet-name>RemoteApi</servlet-name>
com.google.apphosting.utils.remoteapi.RemoteApiServlet
</servlet>

<servlet-mapping>
<servlet-name>RemoteApi</servlet-name>
<url-pattern>/remote_api</url-pattern>
</servlet-mapping>
[/sourcecode]

Deploy your application to the app engine.

Now we would connect to the app engine and generate a bulkloader configuration file. A configuration file could either be auto-generated, the easier way or be written by hand. It contains information about all our entities and their fields.

There are 2 ways to generate the configuration file, either by using appcfg.py or bulkloader.py. Bothe the examples are as

[sourcecode language=”bash”]
appcfg.py create_bulkloader_config –url=http://<your app>.appspot.com/remote_api –application=<your app_id> –filename=config.yml
[/sourcecode]

[sourcecode language=”bash”]
bulkloader.py –create_config –url=http://<your app>.appspot.com/remote_api –application=<your app_id> –filename=generated_bulkloader.yaml
[/sourcecode]

Once you execute this command, after the necessary authentication the following file is generated

[sourcecode language=”text”]
# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
# * Fill in connector and connector_options
# * Review the property_map.
# – Ensure the ‘external_name’ matches the name of your CSV column,
# XML tag, etc.
# – Check that __key__ property is what you want. Its value will become
# the key name on import, and on export the value will be the Key
# object. If you would like automatic key generation on import and
# omitting the key on export, you can remove the entire __key__
# property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
– import: base64
– import: re
– import: google.appengine.ext.bulkload.transform
– import: google.appengine.ext.bulkload.bulkloader_wizard
– import: google.appengine.ext.db
– import: google.appengine.api.datastore
– import: google.appengine.api.users

transformers:

– kind: Audit
connector: # TODO: Choose a connector here: csv, simplexml, etc…
connector_options:
# TODO: Add connector options here–these are specific to each connector.
property_map:
– property: __key__
external_name: key
export_transform: transform.key_id_or_name_as_string

– property: auditActionType
external_name: auditActionType
# Type: String Stats: 146 properties of this type in this kind.

– property: date
external_name: date
# Type: Date/Time Stats: 146 properties of this type in this kind.
import_transform: transform.import_date_time(‘%Y-%m-%dT%H:%M:%S’)
export_transform: transform.export_date_time(‘%Y-%m-%dT%H:%M:%S’)
[/sourcecode]

As you would notice, it would create configurations for all the entities and then leave out some TODOs for you to change. So for example we replace the

[sourcecode language=”text”]
connector: # TODO: Choose a connector here: csv, simplexml, etc…
connector_options:
# TODO: Add connector options here–these are specific to each connector.
[/sourcecode]

with

[sourcecode language=”text”]
connector: csv
connector_options:
encoding: utf-8
columns: from_header
[/sourcecode]

Once our configuration file is ready, you can start downloading / uploading data to the engine using the configuration file.

To download data, again use any one of the following commands

[sourcecode language=”bash”]
appcfg.py download_data –config_file=config.yml –filename=UserRole.csv –kind=UserRole –url=http://<your app>.appspot.com/remote_api –application=<your app_id>
[/sourcecode]

[sourcecode language=”bash”]
bulkloader.py –download –url=http://0-0-4d.latest.bookhours.appspot.com/remote_api –application=bookhours –config_file=generated_bulkloader.yaml –kind=UserRole –filename=UserRole.csv
[/sourcecode]

When the command completes, you would get a csv file for the UserRole which would be something similar to this depending on your entity.

[sourcecode language=”text”]
roleName,role,key
Consultant,ROLE_CONSULTANT,85004
PM,ROLE_PROJECTMANAGER,88003
Report role,ROLE_REPORT,90002
Administrator,ROLE_ADMIN,126002
[/sourcecode]

There is also a command, which lets you download all the entities all at once

[sourcecode language=”bash”]
appcfg.py download_data –application= –url=http://.appspot.com/remote_api –filename=data.csv
[/sourcecode]

One of the problems of this usage is that the csv file is not a text file. It seems to be a sqlite file. I have to figure out whether I can open it with sqlite.

Another issue that I found with bulkloader is that it seems to ignore the relationships in the generated configuration file. For example, if there is a relationship between User and UserRole then that is not evident from the auto generated configuration file. Refer to my issue raised on GAE forums.

Nevertheless, we are able to get data back from the app engine, albeit for each individual entity as a csv file. Once we have the csv, it becomes easy for us to put the concerns about data lock-in to rest.

Written by 

Vikas is the CEO and Co-Founder of Knoldus Inc. Knoldus does niche Reactive and Big Data product development on Scala, Spark, and Functional Java. Knoldus has a strong focus on software craftsmanship which ensures high-quality software development. It partners with the best in the industry like Lightbend (Scala Ecosystem), Databricks (Spark Ecosystem), Confluent (Kafka) and Datastax (Cassandra). Vikas has been working in the cutting edge tech industry for 20+ years. He was an ardent fan of Java with multiple high load enterprise systems to boast of till he met Scala. His current passions include utilizing the power of Scala, Akka and Play to make Reactive and Big Data systems for niche startups and enterprises who would like to change the way software is developed. To know more, send a mail to hello@knoldus.com or visit www.knoldus.com

Leave a Reply

%d bloggers like this: