Create a Data Marvel — Part 3: Hydrating the Model

It simply builds upon syntax we have already discussed above.WITH apoc.date.format(timestamp(), “ms”, ‘yyyyMMddHHmmss’) AS tsWITH “&ts=” + ts + “&apikey=” + $marvel_public + “&hash=” + apoc.util.md5([ts,$marvel_private,$marvel_public]) as suffixCALL apoc.periodic.iterate(‘MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100’,‘CALL apoc.util.sleep(2000)CALL apoc.load.json(c.resourceURI+”/comics?format=comic&formatType=comic&limit=100"+$suffix)YIELD valueWITH c, value.data.results as results WHERE results IS NOT NULLUNWIND results as resultMERGE (comic:ComicIssue {id: result.id})ON CREATE SET comic.name = result.title, comic.issueNumber = result.issueNumber, comic.pageCount = result.pageCount, comic.resourceURI = result.resourceURI, comic.thumbnail = result.thumbnail.path + ”.” + result.thumbnail.extensionWITH c, comic, resultMERGE (comic)-[r:INCLUDES]->(c)WITH c, comic, result WHERE result.series IS NOT NULLUNWIND result.series as comicSeriesMERGE (series:Series {id: toInt(split(comicSeries.resourceURI,”/”)[-1])})ON CREATE SET series.name = comicSeries.name, series.resourceURI = comicSeries.resourceURIWITH c, comic, series, resultMERGE (comic)-[r2:BELONGS_TO]->(series)WITH c, comic, result, result.creators.items as items WHERE items IS NOT NULLUNWIND items as itemMERGE (creator:Creator {id: toInt(split(item.resourceURI,”/”)[-1])})ON CREATE SET creator.name = item.name, creator.resourceURI = item.resourceURIWITH c, comic, result, creatorMERGE (comic)-[r3:CREATED_BY]->(creator)WITH c, comic, result, result.stories.items as items WHERE items IS NOT NULLUNWIND items as itemMERGE (story:Story {id: toInt(split(item.resourceURI,”/”)[-1])})ON CREATE SET story.name = item.name, story.resourceURI = item.resourceURI, story.type = item.typeWITH c, comic, result, storyMERGE (comic)-[r4:MADE_OF]->(story)WITH c, comic, result, result.events.items AS items WHERE items IS NOT NULLUNWIND items as itemMERGE (event:Event {id: toInt(split(item.resourceURI,”/”)[-1])})ON CREATE SET event.name = item.name, event.resourceURI = item.resourceURIMERGE (comic)-[r5:PART_OF]->(event)’,{batchSize: 20, iterateList:false, retries:2, params:{suffix:suffix}});To help process this lengthy query, we will break it up into sections and explain each block..Let us start with the first two sections together.//First sectionWITH apoc.date.format(timestamp(), “ms”, ‘yyyyMMddHHmmss’) AS tsWITH “&ts=” + ts + “&apikey=” + $marvel_public + “&hash=” + apoc.util.md5([ts,$marvel_private,$marvel_public]) as suffixCALL apoc.periodic.iterate(‘MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100’,‘CALL apoc.util.sleep(2000)CALL apoc.load.json(c.resourceURI+”/comics?format=comic&formatType=comic&limit=100"+$suffix)YIELD valueJust as with our initial load query, we start the query using the WITH clause to set up and pass the timestamp and url suffix parameters that we will use further down..The next lines of code calls the familiar apoc.periodic.iterate to pull all the characters in Neo4j (MATCH statement). Notice the criteria starting from the WHERE clause. We check the Character nodes to see if the resourceURI field contains a value. Marvel puts the url path for most entities in the resourceURI field, so this is a simple check to see if the Character has a url path for us to retrieve data. If it doesn’t, our call will fail, and it won’t find data.* Hint: this is also a good way to trim the number of API calls. If we know a call will fail, then we should not waste precious resources on it. :)The next criteria checks if a relationship type of INCLUDES already exists for the node. This sees if we have already retrieved and inserted comics for a character. If a relationship exists, then we do not pull the comic info for that character again. This avoids duplicate calls for entities where we have already added that information.Finally, we add a LIMIT 100 to that query to only pull 100 characters at a time from our Neo4j database. We ran across issues where queries would timeout because the server would stop responding. Marvel’s server instances probably have a timeout value to ensure users do not hog resources. Or, it could simply be that they need to bolster the architecture a bit to support heavier requests. ;)Either way, we wanted to reduce the time taken to pull in batches of data, so my colleague suggested a LIMIT clause to create smaller processing for each call..While this would increase the number of calls made to the database, it was better than larger batches failing frequently.* Note: at this point, we had 26 calls to load all of the characters for each alphabet letter..That gave us around 1,000 characters in our Neo4j instance..If we pull 100 at a time, we could have a maximum of 11 batches of up to 100 calls (one for each character in a batch).The second statement within the apoc.periodic.iterate adds 2 seconds of sleep between calls to the API for each character..This tries to avoid the timeout of our Marvel server in the middle of one of our calls..Once we wait, we use the apoc.load.json to hit the API endpoint for the comics pertaining to that character from the resourceURI field on our Character nodes..Again, we yield back the JSON object (YIELD value).//Second sectionWITH c, value.data.results as results WHERE results IS NOT NULLUNWIND results as resultMERGE (comic:ComicIssue {id: result.id})ON CREATE SET comic.name = result.title, comic.issueNumber = result.issueNumber, comic.pageCount = result.pageCount, comic.resourceURI = result.resourceURI, comic.thumbnail = result.thumbnail.path + ”.” + result.thumbnail.extensionWITH c, comic, resultMERGE (comic)-[r:INCLUDES]->(c)To start the next code paragraph, we check that the subsection containing our comics is not null, and unwind the object to get the subsection..Just as we did with the loading of characters, we use MERGE on the comic id to find or create the ComicIssue node and the ON CREATE SET clause to set property values if the node is created..The one thing that is different is the WITH statement and the MERGE line after it..Because we need to use the newly-created node and create a relationship between it and a Character node, we need to pass the ComicIssue node to the next MERGE statement.. More details

Leave a Reply