Community detection of survey responses based on Pearson correlation coefficient with Neo4j

The answer is no, because this variable will have zero variance.

We will use the standard deviation metric, which is just the square root of the variance.

    MATCH (p:Person)
    WITH p LIMIT 1
    WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol',
         'Education','Smoking','House - block of flats','Village - town','Punctuality',
         'Internet usage']) as all_keys
    UNWIND all_keys as key
    MATCH (p:Person)
    RETURN key, avg(p[key]) as average, stdev(p[key]) as std
    ORDER BY std ASC LIMIT 10

Results

We can observe that everybody likes to listen to music, watch movies and have fun with friends.
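To make the low-variance filter concrete outside the database, here is a minimal Python sketch of the same ranking. The answer values are made up for illustration and are not taken from the survey; `rank_by_stdev` is a hypothetical helper name. It uses the sample standard deviation, which is also what Cypher's `stdev()` computes.

```python
from statistics import mean, stdev

# Hypothetical toy answers on a 1-5 scale (not real survey data).
answers = {
    "Music": [5, 5, 5, 4, 5],
    "Metal": [1, 5, 2, 4, 3],
    "Movies": [5, 4, 5, 4, 5],
}


def rank_by_stdev(columns):
    """Return (name, mean, sample stdev) tuples, lowest stdev first."""
    stats = [(name, mean(vals), stdev(vals)) for name, vals in columns.items()]
    return sorted(stats, key=lambda row: row[2])


ranked = rank_by_stdev(answers)
for name, avg, std in ranked:
    print(f"{name}: mean={avg:.2f} stdev={std:.2f}")
```

Questions at the top of this ranking are the ones almost everybody answers the same way, so they are the candidates for removal.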

Due to the low variance, we will eliminate the following questions from our further analysis: “Personality”, “Music”, “Dreams”, “Movies”, “Fun with friends”, “Comedy”.

High correlation filter

High correlation between two variables means they have similar trends and are likely to carry similar information.

This can bring down the performance of some models drastically (linear and logistic regression models, for instance).

We will use the Pearson correlation coefficient for this task.

Pearson correlation adjusts for different location and scale of features, so any kind of linear scaling (normalization) is unnecessary.
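The scale-invariance claim can be checked with a small Python sketch. The heights and weights below are invented for illustration, and `pearson` is a plain textbook implementation, not the Neo4j library function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column has zero variance, so the denominator would be zero
    # and the coefficient is undefined for it.
    return cov / (dev_x * dev_y)


# Hypothetical values, not taken from the survey.
height_cm = [160.0, 172.0, 181.0, 168.0, 190.0]
weight_kg = [52.0, 70.0, 80.0, 61.0, 95.0]

# Rescale height linearly to the range 1-5.
height_scaled = [1 + 4 * (h - 160.0) / (190.0 - 160.0) for h in height_cm]

r_raw = pearson(height_cm, weight_kg)
r_scaled = pearson(height_scaled, weight_kg)
print(r_raw, r_scaled)
```

Both calls return the same coefficient up to floating-point error, because subtracting the mean removes the shift and dividing by the deviations removes the positive scale factor.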

Find the top 10 correlations for the gender feature.

    MATCH (p:Person)
    WITH p LIMIT 1
    WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol',
         'Education','Smoking','House - block of flats','Village - town','Punctuality',
         'Internet usage','Personality','Music','Dreams','Movies','Fun with friends',
         'Comedy']) as all_keys
    MATCH (p1:Person)
    UNWIND ['Gender_vec'] as key_1
    UNWIND all_keys as key_2
    WITH key_1, key_2,
         collect(coalesce(p1[key_1],0)) as vector_1,
         collect(coalesce(p1[key_2],0)) as vector_2
    WHERE key_1 <> key_2
    RETURN key_1, key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
    ORDER BY pearson DESC limit 10

Results

The feature most correlated with gender is weight, which makes sense.

The list includes some other stereotypical gender differences like the preference for cars, action, and PC.

Let’s now calculate the Pearson correlation between all the features.

    MATCH (p:Person)
    WITH p LIMIT 1
    WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol',
         'Education','Smoking','House - block of flats','Village - town','Punctuality',
         'Internet usage','Personality','Music','Dreams','Movies','Fun with friends',
         'Comedy']) as all_keys
    MATCH (p1:Person)
    UNWIND all_keys as key_1
    UNWIND all_keys as key_2
    WITH key_1, key_2, p1
    WHERE key_1 > key_2
    WITH key_1, key_2,
         collect(coalesce(p1[key_1],0)) as vector_1,
         collect(coalesce(p1[key_2],0)) as vector_2
    RETURN key_1, key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
    ORDER BY pearson DESC limit 10

Results

Results show nothing surprising.

The only one I found interesting was the correlation between snakes and rats.

We will exclude the following questions from further analysis due to high correlation: “Medicine”, “Chemistry”, “Shopping centres”, “Physics”, “Opera”, “Animated”.

Pearson similarity algorithm

Now that we have completed the preprocessing step, we will infer a similarity network between nodes based on the Pearson correlation of the features (answers) of nodes that we haven’t excluded.

In this step, all the features we use in our analysis need to be normalized to the range one to five, because we will now fit all the features of a node into a single vector and calculate correlations between those vectors.

Min-max normalization

Three of the features are not normalized between one and five. These are “Height”, “Number of siblings” and “Weight”.

We will normalize the height property to that range. We won’t use the other two.
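As a reference for the arithmetic, here is a minimal Python sketch of min-max normalization to an arbitrary target range. The function name, the `low` and `high` parameters and the sample heights are assumptions made for illustration. With `low=1` and `high=5` the smallest value maps to one and the largest to five; a formula without the offset, such as `5.0 * (x - min) / (max - min)`, maps the smallest value to zero instead.

```python
def min_max_scale(values, low=1.0, high=5.0):
    """Linearly rescale values so that min -> low and max -> high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant column cannot be rescaled (division by zero).
        raise ValueError("cannot scale a constant column")
    span = largest - smallest
    return [low + (high - low) * (v - smallest) / span for v in values]


# Hypothetical heights in centimetres, not taken from the survey.
heights = [150, 165, 180, 195, 210]
scaled = min_max_scale(heights)
print(scaled)
```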

    MATCH (p:Person)
    // get the max and min values
    WITH max(p.`Height`) as max, min(p.`Height`) as min
    MATCH (p1:Person)
    // rescale Height: the minimum maps to 0 and the maximum to 5
    SET p1.Height_nor = 5.0 * (p1.`Height` - min) / (max - min)

Similarity network

We grab all the features and infer the similarity network.

We always want to use the similarityCutoff parameter, and optionally the topK parameter, to avoid ending up with a complete graph in which every node is connected to every other node.

Here we use similarityCutoff: 0.75 and topK: 5.

Find more information in the documentation.
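The effect of the two parameters can be illustrated with a short Python sketch. This is only a model of the idea, not the library's implementation: the function name, the input shape and the choice of `>=` for the cutoff comparison are assumptions, and the scores are made up.

```python
def filter_similarities(similarities, cutoff, top_k):
    """Keep, per node, at most top_k neighbours scoring at least cutoff.

    similarities maps each node to a dict of {other_node: score}.
    """
    edges = {}
    for node, scores in similarities.items():
        kept = [(other, s) for other, s in scores.items() if s >= cutoff]
        kept.sort(key=lambda pair: pair[1], reverse=True)  # best first
        edges[node] = kept[:top_k]
    return edges


# Hypothetical pairwise scores for two nodes.
scores = {
    "a": {"b": 0.90, "c": 0.80, "d": 0.30},
    "b": {"a": 0.90, "c": 0.76, "d": 0.95},
}
edges = filter_similarities(scores, cutoff=0.75, top_k=2)
print(edges)
```

The cutoff drops weak pairs such as a-d, and topK then caps the number of relationships per node, so the resulting network stays sparse.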

    MATCH (p:Person)
    WITH p LIMIT 1
    WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol',
         'Education','Smoking','House - block of flats','Village - town','Punctuality',
         'Internet usage','Personality','Music','Dreams','Movies','Fun with friends',
         'Comedy','Medicine','Chemistry','Shopping centres','Physics','Opera','Animated',
         'Height','Weight','Number of siblings']) as all_keys
    MATCH (p1:Person)
    UNWIND all_keys as key
    WITH {item:id(p1), weights: collect(coalesce(p1[key],3))} as personData
    WITH collect(personData) as data
    CALL algo.similarity.pearson(data, {similarityCutoff: 0.75, topK: 5, write: true})
    YIELD nodes, similarityPairs
    RETURN nodes, similarityPairs

Results

nodes: 1010
similarityPairs: 4254

Community detection

Now that we have inferred a similarity network in our graph, we will try to find communities of similar persons with the help of the Louvain algorithm.

    CALL algo.louvain('Person','SIMILAR')
    YIELD nodes, communityCount

Results

nodes: 1010
communityCount: 105

apoc.nodes.group

For a quick overview of community detection results in Neo4j Browser, we can use apoc.nodes.group.

We define the labels we want to include and group by a certain property.

In the config part, we define which aggregations we want to perform and get returned in the visualization.

Find more in the documentation.

    CALL apoc.nodes.group(['Person'], ['community'],
         [{`*`:'count', Age:['avg','std'], Alcohol_vec:['avg']}, {`*`:'count'}])
    YIELD nodes, relationships
    UNWIND nodes as node
    UNWIND relationships as rel
    RETURN node, rel;

Results

Community preferences

To get to know our communities better, we will examine their average top and bottom 3 preferences.
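The aggregation behind this step can be sketched in plain Python: group the answers by community, average each question, sort by the average, and take the first and last entries. The rows below are invented for illustration, `top_bottom` is a hypothetical helper, and the sketch uses n=2 only to keep the toy data small.

```python
from collections import defaultdict
from statistics import mean


def top_bottom(rows, n=3):
    """Return {community: (top_n, bottom_n)} ranked by average answer.

    rows is a list of (community, {question: answer}) pairs.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for community, answers in rows:
        for question, answer in answers.items():
            grouped[community][question].append(answer)

    result = {}
    for community, questions in grouped.items():
        ranked = sorted(questions, key=lambda q: mean(questions[q]), reverse=True)
        result[community] = (ranked[:n], ranked[-n:])
    return result


# Hypothetical answers on a 1-5 scale for two communities.
rows = [
    (0, {"Romantic": 5, "PC": 3, "Writing": 2, "Metal": 1}),
    (0, {"Romantic": 4, "PC": 3, "Writing": 2, "Metal": 2}),
    (1, {"Romantic": 2, "PC": 5, "Writing": 1, "Metal": 4}),
]
preferences = top_bottom(rows, n=2)
print(preferences)
```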

    MATCH (p:Person)
    WITH p LIMIT 1
    WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol',
         'Education','Smoking','House - block of flats','Village - town','Punctuality',
         'Internet usage','Personality','Music','Dreams','Movies','Fun with friends',
         'Height','Number of siblings','Weight','Medicine','Chemistry','Shopping centres',
         'Physics','Opera','Age','community','Comedy','Gender_vec','Internet',
         'Height_nor']) as all_keys
    MATCH (p1:Person)
    UNWIND all_keys as key
    WITH p1.community as community, count(*) as size,
         SUM(CASE WHEN p1.Gender = 'male' THEN 1 ELSE 0 END) as males,
         key, avg(p1[key]) as average, stdev(p1[key]) as std
    ORDER BY average DESC
    WITH community, size, toFloat(males) / size as male_percentage, collect(key) as all_avg
    ORDER BY size DESC limit 10
    RETURN community, size, male_percentage, all_avg[..3] as top_3, all_avg[-3..] as bottom_3

Results

Results are quite interesting.

Just looking at the male percentage it is safe to say that the communities are almost all based on gender.

The biggest community consists of 220 ladies, who strongly agree with “Compassion to animals”, “Romantic” and, interestingly, “Borrowed stuff”, but disagree with “Metal”, “Western” and “Writing”.

The second biggest community, mostly male, agrees with “Cheating in school”, “Action” and “PC”.

They also don’t agree with “Writing”.

Makes sense as the survey was filled out by students from Slovakia.

Gephi visualization

Let’s finish off with a nice visualization of our communities in Gephi.

You need to have the streaming plugin enabled in Gephi, and then we can export the graph from Neo4j using the APOC procedure apoc.gephi.add.

    MATCH path = (:Person)-[:SIMILAR]->(:Person)
    CALL apoc.gephi.add(null,'workspace1',path,'weight',['community']) yield nodes
    return distinct 'done'

After a bit of tweaking in Gephi, I came up with this visualization.

As with the apoc.nodes.group visualization, we can observe that the biggest communities are quite connected to each other.
