Prerequisites:
What is Cypher?
Cypher is a declarative language for querying and manipulating Neo4j graph databases. Essentially, Cypher is to Neo4j graphs what SQL is to relational database systems. Cypher’s functionality is expanding and improving with every version, so keep your eye on the Neo4j Cypher Refcard resource page[1].
Snippets of Cypher code for creating Soil Survey nodes and relationships

1. Create `Hort_Client` nodes
```sql
// import Hort_Client nodes
CREATE INDEX ON :Hort_Client(client);
CREATE INDEX ON :Hort_Client(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///soil_survey.csv" AS line
WITH line LIMIT 10000
MERGE (hc:Hort_Client {client: line.Hort_Client, name: 'hc_' + line.Hort_Client});
```
Also:

- `CREATE INDEX` - adds an index for each property of the node. Note that we have two, `client` and `name`
- `USING PERIODIC COMMIT 1000` - every 1000 records/lines are treated as a single transaction, after which those records are written to disk
- `LOAD CSV WITH HEADERS FROM` - the `soil_survey.csv` file has headers that we added manually using `sed`
- `LIMIT 10000` - the maximum number of lines you wish to import. Even if there are more lines in the file, the import will stop at the limit. This is also handy for testing, if you don’t want to load that 200M+ record file just yet
- `MERGE` - since there are many instances of the same `Hort_Client` inside the import file, we only want to create a single unique node for each
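As a side note on the `sed` step above: the same header row can be prepended with a few lines of Python. A minimal sketch, where the file name, the sample data line, and the column order are all assumptions for illustration:

```python
# Prepend a header row so LOAD CSV WITH HEADERS can address fields by name.
# The post does this with sed; this is an equivalent sketch in Python.
# File name, sample line, and column order are assumptions for illustration.
header = ("Hort_Client,Soil_Service,Soil_Issue,Solution,Date_Reported,"
          "Date_Actioned,DaysToAction,Contractor,Locality,Region")

with open("soil_survey.csv", "w") as f:  # stand-in for the raw, header-less export
    f.write("159,54593,Erosion,5397,2007-05-07,2008-02-18,287,1091,3656,Northbury\n")

with open("soil_survey.csv") as f:
    body = f.read()
with open("soil_survey.csv", "w") as f:  # rewrite the file with the header first
    f.write(header + "\n" + body)

print(open("soil_survey.csv").readline().strip())  # the new header row
```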
2. Create `Soil_Service` nodes

```sql
// import Soil_Service nodes
CREATE INDEX ON :Soil_Service(ss_id);
CREATE INDEX ON :Soil_Service(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///soil_survey.csv" AS line
WITH line LIMIT 10000
MERGE (ss:Soil_Service {ss_id: line.Soil_Service, name: '_' + line.Soil_Service});
```

Also:

- As above, but imports `Soil_Service` nodes
3. Create `REQUESTS` relationship between `Hort_Client` and `Soil_Service`
```sql
// Hort_Client-->Soil_Service
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///soil_survey.csv" AS line
WITH line LIMIT 10000
MATCH (hc:Hort_Client {client: line.Hort_Client})
MATCH (ss:Soil_Service {ss_id: line.Soil_Service})
MERGE (hc)-[:REQUESTS]->(ss);
```
Also:

- `MATCH` - locates `Hort_Client` and `Soil_Service` nodes whose properties match the values of the current file line being read in
- `MERGE (hc)-[:REQUESTS]->(ss)` - once the two nodes above are found, creates a relationship between them, labelled `REQUESTS`
- `import/` - stores files, such as JSON or CSV, that you can import into the graph. I also put my Cypher scripts in here. More on that later
- `conf/` - stores a customised `neo4j.conf` file. More later about modifying specific settings
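One way to see why `MERGE` matters here: the CSV repeats the same client and service identifiers across many lines, and `MERGE` keeps each node and relationship unique no matter how often its line recurs. A rough sketch of that idempotent behaviour, modelled with plain Python sets (the sample rows are made up for illustration):

```python
# Model MERGE's idempotent behaviour: repeated CSV lines must not create
# duplicate nodes or duplicate REQUESTS relationships.
# The sample rows below are assumptions for illustration.
rows = [
    {"Hort_Client": "159", "Soil_Service": "54593"},
    {"Hort_Client": "159", "Soil_Service": "54593"},  # duplicate line in the file
    {"Hort_Client": "170", "Soil_Service": "3796"},
]

hort_clients, soil_services, requests = set(), set(), set()
for line in rows:
    hort_clients.add(line["Hort_Client"])        # MERGE (hc:Hort_Client {...})
    soil_services.add(line["Soil_Service"])      # MERGE (ss:Soil_Service {...})
    requests.add((line["Hort_Client"], line["Soil_Service"]))  # MERGE (hc)-[:REQUESTS]->(ss)

print(len(hort_clients), len(soil_services), len(requests))  # 2 2 2
```

Three input lines, but only two of each element survive, just as repeated `MERGE` calls leave a single node or relationship in the graph.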
The resulting Cypher file will be a series of statements that will index node properties, create nodes and build relationships between them.
Running preliminary data exploration on soil_survey.csv with your Neo4j Browser
NB: Ensure you’ve parked your CSV file in `~/neo4j/import` and that the Neo4j service is running.

Here we explore different aspects of the CSV file about to be imported, using Cypher queries that you can execute inside your Neo4j Browser.
- Count the total number of lines

```sql
LOAD CSV WITH HEADERS FROM "file:///soil_survey_sample.csv" AS line
WITH line
RETURN count(line) as line_count;
```

╒════════════╕
│"line_count"│
╞════════════╡
│2837        │
└────────────┘

- Get fields and values from a typical line

```sql
LOAD CSV WITH HEADERS FROM "file:///soil_survey_sample.csv" AS line
WITH line LIMIT 1
RETURN line as fields_and_values;
```

{
  "Solution": "5397",
  "Soil_Service": "54593",
  "Region": "Northbury",
  "Contractor": "1091",
  "Soil_Issue": "Erosion",
  "Date_Reported": "2007-05-07",
  "DaysToAction": "287",
  "Date_Actioned": "2008-02-18",
  "Locality": "3656",
  "Hort_Client": "159"
}

- Get fields and values from a typical line, nicely formatted

```sql
LOAD CSV WITH HEADERS FROM "file:///soil_survey_sample.csv" AS line
WITH line LIMIT 1
RETURN line.Hort_Client as Hort_Client, line.Soil_Service as Soil_Service,
       line.Soil_Issue as Soil_Issue, line.Solution as Solution,
       line.Date_Reported as Date_Reported, line.Date_Actioned as Date_Actioned,
       line.DaysToAction as DaysToAction, line.Contractor as Contractor,
       line.Locality as Locality, line.Region as Region;
```

╒═════════════╤══════════════╤════════════╤══════════╤═══════════════╤═══════════════╤══════════════╤════════════╤══════════╤═══════════╕
│"Hort_Client"│"Soil_Service"│"Soil_Issue"│"Solution"│"Date_Reported"│"Date_Actioned"│"DaysToAction"│"Contractor"│"Locality"│"Region"   │
╞═════════════╪══════════════╪════════════╪══════════╪═══════════════╪═══════════════╪══════════════╪════════════╪══════════╪═══════════╡
│"159"        │"54593"       │"Erosion"   │"5397"    │"2007-05-07"   │"2008-02-18"   │"287"         │"1091"      │"3656"    │"Northbury"│
└─────────────┴──────────────┴────────────┴──────────┴───────────────┴───────────────┴──────────────┴────────────┴──────────┴───────────┘

- Get a count of unique values found in fields of interest

```sql
LOAD CSV WITH HEADERS FROM "file:///soil_survey_sample.csv" AS line
WITH line
WITH line.Hort_Client as client, line.Soil_Service as service,
     line.Soil_Issue as issue, line.Solution as solution,
     line.Locality as locality, line.Region as region
RETURN count(DISTINCT client) as Hort_Client, count(DISTINCT service) as Soil_Service,
       count(DISTINCT issue) as Soil_Issue, count(DISTINCT solution) as Solution,
       count(DISTINCT locality) as Locality, count(DISTINCT region) as Region;
```

╒═════════════╤══════════════╤════════════╤══════════╤══════════╤════════╕
│"Hort_Client"│"Soil_Service"│"Soil_Issue"│"Solution"│"Locality"│"Region"│
╞═════════════╪══════════════╪════════════╪══════════╪══════════╪════════╡
│21           │2670          │12          │255       │1925      │4       │
└─────────────┴──────────────┴────────────┴──────────┴──────────┴────────┘

- Derive statistics from a numerical field of interest, e.g. `DaysToAction`

```sql
LOAD CSV WITH HEADERS FROM "file:///soil_survey_sample.csv" AS line
WITH line
WITH toInt(line.DaysToAction) as action_delay
RETURN min(action_delay) as min_delay, max(action_delay) as max_delay,
       avg(action_delay) as mean_delay, stDev(action_delay) as std_dev,
       percentileDisc(action_delay, 0.25) as _25percentile,
       percentileDisc(action_delay, 0.5) as _50percentile,
       percentileDisc(action_delay, 0.75) as _75percentile,
       percentileDisc(action_delay, 0.9) as _90percentile;
```

╒═══════════╤═══════════╤══════════════════╤═════════════════╤═══════════════╤═══════════════╤═══════════════╤═══════════════╕
│"min_delay"│"max_delay"│"mean_delay"      │"std_dev"        │"_25percentile"│"_50percentile"│"_75percentile"│"_90percentile"│
╞═══════════╪═══════════╪══════════════════╪═════════════════╪═══════════════╪═══════════════╪═══════════════╪═══════════════╡
│-140       │365        │103.14698625308418│77.27355038508219│49             │84             │133            │216            │
└───────────┴───────────┴──────────────────┴─────────────────┴───────────────┴───────────────┴───────────────┴───────────────┘

Also:

- Take note that the calculated field `min_delay` indicates an error in the data, where `Date_Actioned` occurs before `Date_Reported`. How about inspecting the data and fixing this error before importing it into the graph?
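That negative `min_delay` can be hunted down before import. A minimal sketch in Python, assuming the field names shown in the exploration output above (the inline two-row CSV is made up for illustration; in practice you would read `soil_survey_sample.csv`):

```python
import csv
import io

# Find rows where Date_Actioned precedes Date_Reported (the source of the
# negative min_delay). The inline sample data is an assumption for illustration.
sample = io.StringIO(
    "Hort_Client,Soil_Service,Soil_Issue,Solution,Date_Reported,"
    "Date_Actioned,DaysToAction,Contractor,Locality,Region\n"
    "159,54593,Erosion,5397,2007-05-07,2008-02-18,287,1091,3656,Northbury\n"
    "165,9999,Erosion,5397,2008-02-25,2007-10-08,-140,1091,1234,Northbury\n"
)

# ISO-8601 dates (YYYY-MM-DD) sort lexicographically, so a plain string
# comparison is enough to spot the bad rows - no date parsing needed.
bad_rows = [row for row in csv.DictReader(sample)
            if row["Date_Actioned"] < row["Date_Reported"]]

for row in bad_rows:
    print(row["Hort_Client"], row["Date_Reported"], row["Date_Actioned"])
```

Using `csv.DictReader` keeps the check independent of column order, since fields are addressed by header name, exactly as `LOAD CSV WITH HEADERS` does.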
- View a random sample of five records from the file about to be imported

```sql
LOAD CSV WITH HEADERS FROM "file:///soil_survey_sample.csv" AS line
WITH line SKIP toInteger(100*rand()) + 1 LIMIT 5
RETURN line.Hort_Client as Hort_Client, line.Soil_Service as Soil_Service,
       line.Soil_Issue as Soil_Issue, line.Solution as Solution,
       line.Date_Reported as Date_Reported, line.Date_Actioned as Date_Actioned,
       line.DaysToAction as DaysToAction, line.Contractor as Contractor,
       line.Locality as Locality, line.Region as Region;
```

╒═════════════╤══════════════╤════════════════╤══════════╤═══════════════╤═══════════════╤══════════════╤════════════╤══════════╤═══════════╕
│"Hort_Client"│"Soil_Service"│"Soil_Issue"    │"Solution"│"Date_Reported"│"Date_Actioned"│"DaysToAction"│"Contractor"│"Locality"│"Region"   │
╞═════════════╪══════════════╪════════════════╪══════════╪═══════════════╪═══════════════╪══════════════╪════════════╪══════════╪═══════════╡
│"170"        │"3796"        │"HighAlkalinity"│"766"     │"2008-02-18"   │"2008-07-08"   │"141"         │"2295"      │"2616"    │"Eastling" │
├─────────────┼──────────────┼────────────────┼──────────┼───────────────┼───────────────┼──────────────┼────────────┼──────────┼───────────┤
│"170"        │"2135"        │"Erosion"       │"2104"    │"2008-02-18"   │"2008-11-18"   │"274"         │"2295"      │"1471"    │"Eastling" │
├─────────────┼──────────────┼────────────────┼──────────┼───────────────┼───────────────┼──────────────┼────────────┼──────────┼───────────┤
│"159"        │"52067"       │"HighAlkalinity"│"765"     │"2008-02-25"   │"2008-05-20"   │"85"          │"1091"      │"3487"    │"Northbury"│
├─────────────┼──────────────┼────────────────┼──────────┼───────────────┼───────────────┼──────────────┼────────────┼──────────┼───────────┤
│"159"        │"6116"        │"HighAlkalinity"│"765"     │"2008-02-25"   │"2008-06-24"   │"120"         │"1091"      │"409"     │"Northbury"│
├─────────────┼──────────────┼────────────────┼──────────┼───────────────┼───────────────┼──────────────┼────────────┼──────────┼───────────┤
│"160"        │"6241"        │"Erosion"       │"8775"    │"2008-02-25"   │"2008-04-15"   │"50"          │"1250"      │"2777"    │"Eastling" │
└─────────────┴──────────────┴────────────────┴──────────┴───────────────┴───────────────┴──────────────┴────────────┴──────────┴───────────┘
You now have a background understanding of Cypher and how its statements relate to the graph model, plus you have been able to take a preliminary peek at the structure of the CSV file you are about to import.