Thursday, March 12, 2015

How to use Google Apps APIs to handle large files


Editor's Note: This post is the second in a series written by Michael Cohn and Steve Ziegler from Cloud Sherpas, a software development and professional services company. We invited Cloud Sherpas to share their experiences building an application on top of Google Apps utilizing some of our APIs. Cloud Sherpas will also be participating in the Developer Sandbox at Google I/O this May, where they'll be demoing their use of Google technologies and answering questions from attendees.

SherpaTools Directory Manager allows administrators to easily manage User Profiles and Shared Contacts in the Google Apps directory. SherpaTools is built on Google App Engine (GAE) utilizing the Google Web Toolkit (GWT), and makes heavy use of the Google Apps Management and Application APIs to provide new administrator and end-user features.

Last week we wrote about how SherpaTools Directory Manager uses Two-Legged OAuth for authentication, and the GAE Task Queue and Datastore APIs to divide large retrievals over long intervals into smaller chunks. This week we will discuss how we used the User Profile API to retrieve the data sets and the Documents List Data API to populate a Google Spreadsheet.


The features in SherpaTools use a number of additional Google Apps APIs.  For example, contact data may be imported from or exported to Google Docs Spreadsheets.  In the case of export, contacts are retrieved via the User Profiles or Shared Contacts API and written to memcache.  Next, a new Spreadsheet is created via the Documents List Data API and the CSV byte stream is added as the media type. Below is a technical explanation of how we retrieved the User Profiles and exported them to a Google spreadsheet.


Retrieving and Storing User Profile Entries

The User Profile API makes it easy to divide retrieval of all profile entries into multiple fetch operations. The one setting to tune is the page size: the number of entries that can reliably be returned within the 10-second request deadline on Google App Engine. In our testing, 100 entries per page is about right. Once configured, we retrieve the feed containing the first set of entries, parse the response, and persist the results to cache. If that feed contains a link to another page of entries, we queue up a task to handle the next page, and repeat until all pages are processed:
public void fetchUserProfilesPage(String spreadsheetTitle, String loggedInEmailAddress,
        String nextLink, String memcacheKey) throws IOException, ServiceException {
    ContactQuery query = null;
    //if this is the first page, there is no next link, so construct the initial page url
    if (nextLink == null) {
        String domain = getDomain(loggedInEmailAddress);
        String initialLink = PROFILES_FEED + domain + PROJECTION_FULL;
        query = new ContactQuery(new URL(initialLink));
        query.setMaxResults(GAE_OPTIMAL_PAGE_SIZE);
    } else {
        query = new ContactQuery(new URL(nextLink));
    }
    query.setStringCustomParameter(TWO_LEGGED_OAUTH_PARAM, loggedInEmailAddress);
    //fetch the next profile feed containing entries
    ProfileFeed feed = contactsService.query(query, ProfileFeed.class);
    List<Map<String, String>> currentMaps =
            (List<Map<String, String>>) memcacheService.get(memcacheKey);
    for (ProfileEntry entry : feed.getEntries()) {
        //secret sauce: convert entry into a csv column header/value map
        currentMaps.add(converter.flatten(entry));
    }
    //store the updated list of converted entry maps back into memcache
    memcacheService.put(memcacheKey, currentMaps);
    if (feed.getNextLink() != null) {
        //start a task to get the next page of entries
        tasksService.fetchUserProfilesPageTask(spreadsheetTitle, loggedInEmailAddress,
                feed.getNextLink().getHref(), memcacheKey);
    } else {
        //no more pages to retrieve, start a task to publish the csv
        tasksService.exportMapsToSpreadsheet(spreadsheetTitle, loggedInEmailAddress, memcacheKey);
    }
}

Exporting Profiles as a Google Docs Spreadsheet

One of the trickiest obstacles to work around in this effort is generating the Spreadsheet file, since GAE restricts the ability to write to the file system. The Spreadsheets Data API was one possibility we considered, but we felt it was overkill: we would have had to first create the Spreadsheet using the Docs List API and then populate it one record per request, potentially generating thousands of requests for a single Spreadsheet. Instead, we leveraged an open source library to write csv file data directly to a byte array, and then sent the file data as the content of the newly created Docs List entry in a single request:
public void saveRowsAsSpreadsheet(String spreadsheetTitle, String loggedInEmailAddress,
        String memcacheKey) throws IOException, ServiceException {
    //get the list of csv column header/value maps from cache
    List<Map<String, String>> rows =
            (List<Map<String, String>>) memcacheService.get(memcacheKey);
    //secret sauce: convert csv maps into a byte array
    byte[] csvBytes = converter.getCsvBytes(rows);
    SpreadsheetEntry newDocument = new SpreadsheetEntry();
    MediaByteArraySource fileSource = new MediaByteArraySource(csvBytes, "text/csv");
    MediaContent content = new MediaContent();
    content.setMediaSource(fileSource);
    content.setMimeType(new ContentType("text/csv"));
    //add the MIME-typed byte array as the content of the SpreadsheetEntry
    newDocument.setContent(content);
    //set the title of the SpreadsheetEntry
    newDocument.setTitle(new PlainTextConstruct(spreadsheetTitle));
    URL feedUri = new URL(DOCS_URL + "?" + TWO_LEGGED_OAUTH_PARAM + "=" + loggedInEmailAddress);
    docsService.insert(feedUri, newDocument);
    //we are done, so delete the data stored in cache
    memcacheService.delete(memcacheKey);
}
Once completed, this method results in a newly created, populated Google Docs Spreadsheet viewable in the logged-in user's Google Docs.
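The converter.getCsvBytes call above is part of our internal library (built on an open source CSV writer). For illustration only, a minimal hand-rolled equivalent, assuming every row map shares the same keys in the same order, might look like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class CsvConverter {

    // Flattens a list of column-header/value maps into CSV bytes.
    // The header row comes from the keys of the first map.
    public static byte[] getCsvBytes(List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder();
        if (!rows.isEmpty()) {
            sb.append(String.join(",", rows.get(0).keySet())).append('\n');
            for (Map<String, String> row : rows) {
                boolean first = true;
                for (String value : row.values()) {
                    if (!first) {
                        sb.append(',');
                    }
                    sb.append(escape(value));
                    first = false;
                }
                sb.append('\n');
            }
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Quote fields containing commas, quotes, or newlines per RFC 4180.
    private static String escape(String value) {
        if (value == null) {
            return "";
        }
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }
}
```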


The one difference between these code examples and the function we've ended up with in production is that, for customers with over 1,000 Google Apps users, we break Spreadsheet creation into worksheets of 1,000 rows each instead of creating one Spreadsheet after all of the User Profiles have been retrieved. This avoids running into the default maximum object size in Memcache, the maximum Google Docs List API upload size, and the Spreadsheet cell limit. Tacking on this logic is not particularly complicated, but it makes the example quite a bit harder to follow, so we left it out of this discussion.
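The partitioning itself is straightforward. As a sketch (the helper name is ours, and the 1,000-row limit is the per-worksheet batch size described above), splitting the cached rows into per-worksheet batches could look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WorksheetPartitioner {

    private static final int MAX_ROWS_PER_WORKSHEET = 1000;

    // Splits the full list of csv row maps into batches of at most
    // MAX_ROWS_PER_WORKSHEET rows, one batch per worksheet.
    public static List<List<Map<String, String>>> partition(List<Map<String, String>> rows) {
        List<List<Map<String, String>>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += MAX_ROWS_PER_WORKSHEET) {
            int end = Math.min(start + MAX_ROWS_PER_WORKSHEET, rows.size());
            // copy the sublist so each batch is independent of the source list
            batches.add(new ArrayList<>(rows.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch can then be written as its own worksheet, keeping every cached object and upload comfortably under the limits.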

Though not explicitly covered in this post, these remote calls are, despite best efforts, subject to all the instability of making calls across the web. We continue to experience occasional timeouts and other infrequent network errors. Fortunately, the Task Queue API automatically retries a failed task until it succeeds. Since the API is designed this way, we need to make sure that any unrecoverable errors are caught to avoid endless retries of "bad tasks."
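That defensive pattern can be sketched as follows. The exception taxonomy here is illustrative (the production code inspects the specific GData and network exceptions it encounters): recoverable failures are rethrown so the task fails and the Task Queue retries it, while unrecoverable ones are logged and swallowed so the task completes and is never retried.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TaskErrorHandler {

    private static final Logger LOG = Logger.getLogger(TaskErrorHandler.class.getName());

    // Transient network failures are worth retrying; anything else
    // (bad input, programming errors) would just fail again forever.
    public static boolean isRecoverable(Exception e) {
        return e instanceof IOException;
    }

    // Wraps a task body so that only recoverable failures propagate;
    // a propagated exception is what signals the queue to retry the task.
    public static void runTask(Callable<Void> taskBody) throws Exception {
        try {
            taskBody.call();
        } catch (Exception e) {
            if (isRecoverable(e)) {
                throw e; // task fails, Task Queue schedules a retry
            }
            LOG.log(Level.SEVERE, "Unrecoverable task failure, not retrying", e);
        }
    }
}
```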

Conclusion

This post demonstrates an approach to exporting a large set of Google Apps contact information into a Google Docs Spreadsheet. Many similar long-running, divisible operations can use this approach, spreading work across a string of tasks queued in the GAE Task Queue. The same approach to writing Google Docs files could also be used to publish a variety of reports from GAE. We would love to hear what you think of this approach and whether you have come up with your own solution for similar issues.

Thanks to the Cloud Sherpas team for authoring this post. Check out SherpaTools at www.sherpatools.com
