TagOverflow — Correlating Tags in Stackoverflow

We imported the whole dump of StackOverflow into Neo4j, ran the algorithms and then visualized the results using Neo4j Browser and the large graph visualization tool Graphistry.In September we had the opportunity of running the GraphConnect “Buzzword-Bingo” GraphHack in the offices of StackOverflow in New York which was really cool..Each question is also tagged with one or more Tags.Simple StackOverflow Graph ModelFor the import we:downloaded the StackOverflow dump from the internet archiveextracted it using 7zipconverted the XML files we’re interested in into CSV’s for nodes and relationshipsimport the data using the Neo4j bulk importer in a few minutescreate some indexes and constraints in another few minutesThen the Neo4j Graph Database of StackOverflow was ready to be used.The import steps are all documented in the GitHub repository, we already covered the process a while ago at the 10M question celebration.Our test instance is currently available as a Neo4j Cloud instance:https://1cec42ea.databases.neo4j.io/browser/with user “stackoverflow”password “stackoverflow”for a read-only user.Data ExplorationWe used Google’s Colab Notebooks to work within the Hackathon team which worked really well..neo4j and javascript.Correlated Tags to “neo4j”If we ran a co-ocurrence query for the ruby tag ordered by frequency.MATCH (q:Question)-[t:TAGGED]->(tag:Tag {name:$tagName}), (q)-[:TAGGED]->(other:Tag)WITH tag,other, count(*) AS freqORDER BY freq desc LIMIT 15RETURN other.name, freqTag correlations within the “ruby” tag.You’d see that the results make sense, many of those tags are either major ruby projects or libraries.We could also render these correlations as virtual relationships in Neo4j Browser, by using the apoc.create.vRelationship function on our aggregated data to represent a SIMILAR relationship with the count as a property.MATCH (q:Question)-[t:TAGGED]->(tag:Tag {name:"ruby"}), (q)-[:TAGGED]->(other:Tag)WITH tag,other, count(*) as freqORDER BY freq DESC LIMIT 50RETURN tag, other, apoc.create.vRelationship(tag,'SIMILAR',{freq:freq}, other);Correlation Visualization with Virtual RelationshipsNext we wanted to see how frequently are those other tags used, by looking at their degrees.MATCH (q:Question)-[t:TAGGED]->(tag:Tag {name:$tagName}), (q)-[:TAGGED]->(other:Tag)WITH other, count(*) as freqRETURN other.name, freq, size((other)<-[:TAGGED]-()) AS degreeORDER BY freq DESC LIMIT 10It turned out that rails,arrays and javascript have really high usage..292 Million comparisons (17k²).// tags with at least 100 questionsMATCH (tag:Tag) WHERE size((tag)<-[:TAGGED]-()) > 100 WITH tag// get the questionsMATCH (q:Question)-[:TAGGED]->(tag)// create dict with tag as item and questions as categoriesWITH {item:id(tag), categories: collect(id(q))} as entryWITH collect(entry) as entries// run jaccard, write back resultsCALL algo.similarity.jaccard(entries, {topK:5,similarityCutoff:0.1, write:true})YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100RETURN *;On our contended shared test machine back then ran for 13 minutes to compute the data, on dedicated hardware it would be faster.With the quite high min-similarity of 0.1 and writing only the 5 most similar neighbours, we create 2864 SIMILAR relationships that we can then use to run other graph algorithms on top.In the visualization we saw that we only created “very tight” groups of similarities, like scheme<->racket or sed<->awk, or some small clusters around each of rdf, hadoop, flash and quickbooks!So we re-ran the computation with a lower similarity cutoff of 0.01, which resulted in a total of 44728 SIMILAR relationships.We only persisted the top 5 neighbours of each node, so some similarities that you might expect could be missing, because they didn’t make the cut.Utilize Similiarity RelationshipsNow we used the newly created relationships to run other algorithms, for instance something straightforward as shortest path..how were correlated tags connected transitively.MATCH (from:Tag {name:'html'}),(to:Tag {name:'neo4j'})MATCH path = shortestPath((from)-[:SIMILAR*]-(to))RETURN [n IN nodes(path) | n.name] as nodesWhich leads to this path:["html", "javascript", "json", "jackson", "spring-mvc", "spring-boot", "spring-data", "spring-data-neo4j", "neo4j"]Or this 4th degree neighbourhood of “javascript”.MATCH path=(:Tag {name:"javascript"})-[:SIMILAR*..4]->() RETURN pathBesides that we also quickly ran other graph algorithms on our inferred graph and wrote the results back to our database.call algo.pageRank('Tag','SIMILAR');call algo.labelPropagation('Tag','SIMILAR');call algo.betweenness('Tag','SIMILAR');Now our tags also carried pagerank, partition, centrality attributes that captured their relevance and place in our graph.match (t:Tag) return t limit 5;(:Tag {partition: 26, centrality: 406233.80006818444, name: ".net", count: 268970, pagerank: 2.532907, wikiPostId: 3607476})(:Tag {partition: 4, centrality: 2545764.1141965324, name: "html", count: 752349, pagerank: 6.3226235, wikiPostId: 3673182})(:Tag {partition: 415, centrality: 2731837.0951582957, name: "javascript", count: 1624044, pagerank: 5.2686405, wikiPostId: 3607052})(:Tag {partition: 415, centrality: 642718.2814995827, name: "css", count: 537685, pagerank: 5.447395500000001, wikiPostId: 3644669})(:Tag {partition: 204, centrality: 5703506.726861835, name: "php", count: 1200404, pagerank: 5.8298785, wikiPostId: 3607050})VisualizationNow that the nodes of our graph were enriched with graph metrics, we could visualize them, e.g.. More details

Leave a Reply