Downloading Datastore Data from a Java Application on Google App Engine


One of the most frequently discussed barriers to moving onto the cloud is data lock-in. In this post we look at how to download data as CSV files from the App Engine datastore. This helps you get your data back out of App Engine, just in case you ever want to move away from the datastore.

So there exists a tool which can help you get your data out; that is the good part. The not-so-good part is that there is nothing available in Java (as of now) to accomplish this. We have to depend on the Python SDK. That should give you a hint that we need to download the Python SDK.

It might also be worthwhile to check that you have Python installed on your system. When I type python at the command prompt on my Ubuntu 10.04 system, I get

vikas@vikas-laptop:~$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

So, good to go. Assuming that you have downloaded the Python SDK, you will find the appcfg.py file in the root of the distribution.

Now we have to set up a remote API servlet in our Java application, through which we will be able to communicate with App Engine remotely. The following needs to be added to your web.xml:

<servlet>
    <servlet-name>RemoteApi</servlet-name>
    <servlet-class>com.google.apphosting.utils.remoteapi.RemoteApiServlet</servlet-class>
</servlet>

<servlet-mapping>
    <servlet-name>RemoteApi</servlet-name>
    <url-pattern>/remote_api</url-pattern>
</servlet-mapping>
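At the container level, the mapping above leaves /remote_api open, so it is common to restrict the path to administrators. A sketch of such a constraint (the admin role name is the usual App Engine convention; verify it against your SDK version):

```xml
<!-- Restrict /remote_api to App Engine administrators -->
<security-constraint>
    <web-resource-collection>
        <web-resource-name>remote_api</web-resource-name>
        <url-pattern>/remote_api</url-pattern>
    </web-resource-collection>
    <auth-constraint>
        <role-name>admin</role-name>
    </auth-constraint>
</security-constraint>
```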

Deploy your application to App Engine.

Now we connect to App Engine and generate a bulkloader configuration file. The configuration file can either be auto-generated (the easier way) or written by hand. It contains information about all our entities and their fields.

There are two ways to generate the configuration file: using appcfg.py or bulkloader.py. The two examples are:

appcfg.py create_bulkloader_config --url=http://<your app>.appspot.com/remote_api --application=<your app_id> --filename=config.yml
bulkloader.py --create_config --url=http://<your app>.appspot.com/remote_api --application=<your app_id> --filename=generated_bulkloader.yaml

Once you execute either command, after the necessary authentication a file like the following is generated:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.ext.db
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Audit
  connector: # TODO: Choose a connector here: csv, simplexml, etc...
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

    - property: auditActionType
      external_name: auditActionType
      # Type: String Stats: 146 properties of this type in this kind.

    - property: date
      external_name: date
      # Type: Date/Time Stats: 146 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

As you will notice, it creates configurations for all the entities and leaves some TODOs for you to complete. So, for example, we replace

  connector: # TODO: Choose a connector here: csv, simplexml, etc...
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.

with

  connector: csv
  connector_options:
    encoding: utf-8
    columns: from_header

Once our configuration file is ready, you can start downloading data from (or uploading data to) App Engine using it.

To download data, again use either of the following commands:

appcfg.py download_data --config_file=config.yml --filename=UserRole.csv --kind=UserRole --url=http://<your app>.appspot.com/remote_api --application=<your app_id>
bulkloader.py --download --url=http://0-0-4d.latest.bookhours.appspot.com/remote_api --application=bookhours --config_file=generated_bulkloader.yaml --kind=UserRole --filename=UserRole.csv

When the command completes, you get a CSV file for the UserRole kind, which would look something like this, depending on your entity:

roleName,role,key
Consultant,ROLE_CONSULTANT,85004
PM,ROLE_PROJECTMANAGER,88003
Report role,ROLE_REPORT,90002
Administrator,ROLE_ADMIN,126002
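Once you have the CSV, processing it further (say, staging it for another database) is straightforward. A minimal sketch with Python's standard csv module, using the UserRole export above as sample data:

```python
import csv
import io

# Sample rows as exported by the bulkloader (header row first).
exported = """roleName,role,key
Consultant,ROLE_CONSULTANT,85004
PM,ROLE_PROJECTMANAGER,88003
Report role,ROLE_REPORT,90002
Administrator,ROLE_ADMIN,126002
"""

# DictReader maps each row to a dict keyed by the header columns,
# which matches the columns: from_header connector option we chose.
roles = list(csv.DictReader(io.StringIO(exported)))

for row in roles:
    print(row["key"], row["role"], row["roleName"])
```

In a real run you would pass an open file handle for UserRole.csv instead of the embedded string.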

There is also a command which lets you download all the entities at once:

appcfg.py download_data --application=<your app_id> --url=http://<your app>.appspot.com/remote_api --filename=data.csv

One problem with this usage is that the resulting file is not a plain-text CSV; it seems to be a SQLite file. I have yet to figure out whether I can open it with SQLite.
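One way to check is to look for the SQLite magic header and, if present, list the tables inside. A sketch (the file name and the expectation that the export really is SQLite are both assumptions):

```python
import sqlite3

def list_tables(path):
    """Return table names if `path` is a SQLite database, else None."""
    # Every SQLite 3 file starts with this 16-byte magic string.
    with open(path, "rb") as f:
        if f.read(16) != b"SQLite format 3\x00":
            return None
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()
```

Calling list_tables("data.csv") would then either show what the bulkloader stored or confirm the file is something else entirely.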

Another issue that I found with the bulkloader is that it seems to ignore relationships in the generated configuration file. For example, if there is a relationship between User and UserRole, that is not evident from the auto-generated configuration file. Refer to my issue raised on the GAE forums.

Nevertheless, we are able to get data back from App Engine, albeit as one CSV file per entity. Once we have the CSV, it becomes easy to put the concerns about data lock-in to rest.

About Vikas Hazrati

Vikas is the Founding Partner @ Knoldus which is a group of software industry veterans who have joined hands to add value to the art of software development. Knoldus does niche Reactive and Big Data product development on Scala, Spark and Functional Java. Knoldus has a strong focus on software craftsmanship which ensures high-quality software development. It partners with the best in the industry like Lightbend (Scala Ecosystem), Databricks (Spark Ecosystem), Confluent (Kafka) and Datastax (Cassandra). To know more, send a mail to hello@knoldus.com or visit www.knoldus.com
This entry was posted in Architecture, Cloud, Java.
