Generate your own CSV/TSV data quickly with urandom + hexdump

Written by mannem. Posted in Data-sets.

In this article, we will use hexdump and /dev/urandom, which are included in most Linux distributions, to quickly generate random data. The values of the first column should be unique and can also be used as a hash or index key.

This can be useful if you want to upload the data to NoSQL databases like DynamoDB, with the col1 values as the primary key (or with the second column values as the sort key).

Here are some one-liners to generate data:
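A minimal sketch of one such one-liner, wrapped in a function for reuse. The function name gen_csv, the header names col1 to col3, and the output file random_values.csv are assumptions made for illustration. hexdump renders 8 random bytes per line as 16 hex characters, and awk splits them into three comma-separated columns. The row number is prefixed to the first column so that it is guaranteed to be unique rather than just very likely unique.

```shell
#!/bin/sh
# Sketch: generate N rows of random CSV data from /dev/urandom.
# gen_csv, col1..col3 and random_values.csv are hypothetical names.
gen_csv() {
    rows="$1"
    # -v: never collapse repeated lines; -n: read exactly rows*8 bytes;
    # -e: print each group of 8 bytes as 16 hex characters plus a newline.
    hexdump -v -n "$((rows * 8))" -e '8/1 "%02x" "\n"' /dev/urandom |
        awk -v OFS=',' '
            BEGIN { print "col1", "col2", "col3" }   # header record
            {
                # col1: row number + 8 hex chars (the row number makes it unique)
                # col2: next 4 hex chars; col3: last 4 hex chars
                print NR "-" substr($0, 1, 8), substr($0, 9, 4), substr($0, 13, 4)
            }'
}

# Write a header plus 5 data rows.
gen_csv 5 > random_values.csv
```

For TSV output, the same sketch works with OFS set to a tab (-v OFS='\t') instead of a comma.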

UK police data

Written by mannem. Posted in Data-sets.

This data set provides a complete snapshot of crime, outcome, and stop and search data, as held by the Home Office at a particular point in history.

The actual data is located on S3 in the bucket policeuk-data and can be accessed with a per-month URL, where yy and mm are the year and month and can be replaced accordingly.

The Structure:

All files are organized by YEAR and MONTH.

Each month has a ZIP file containing CSV files.

The January 2015 file contains data for all months from 2010-12 to 2015-01.

Contents of a sample file:

The columns in the CSV files are as follows:

Reported by: The force that provided the data about the crime.
Falls within: At present, also the force that provided the data about the crime. This is currently being looked into and is likely to change in the near future.
Longitude and Latitude: The anonymised coordinates of the crime. See Location Anonymisation for more information.
LSOA code and LSOA name: References to the Lower Layer Super Output Area that the anonymised point falls into, according to the LSOA boundaries provided by the Office for National Statistics.
Crime type: One of the crime types listed in the Police.UK FAQ.
Last outcome category: A reference to whichever of the outcomes associated with the crime occurred most recently. For example, this crime's 'Last outcome category' would be 'Offender fined'.
Context: A field provided for forces to provide additional human-readable data about individual crimes. Currently, for newly added CSVs, this is always empty.

The Challenge:

  • The given data contains some built-in errors in the Easting, Northing, and Crime_type fields.
  • The data is in CSV format, with commas inside the data values themselves.
  • The CSV files contain column headers, i.e. the first record in each CSV file is a header record containing the column (field) names.
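The last two points mean that a naive split on commas (for example cut -d,) will misalign columns. Below is a rough sketch of a quote-aware field extractor in POSIX awk. It assumes that values containing commas are wrapped in double quotes, which the article does not state, and it does not handle quoted fields that span several lines. The helper name csv_field and the sample record are made up for illustration and are not real police data.

```shell
#!/bin/sh
# Sketch: print field N of every data record, skipping the header record
# and treating commas inside double quotes as part of the value.
# csv_field is a hypothetical helper; quoting of such values is an assumption.
csv_field() {
    awk -v want="$1" '
        NR == 1 { next }                      # skip the header record
        {
            n = 1; field = ""; inq = 0
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                if (c == "\"") {
                    inq = !inq                # toggle "inside quotes" state
                } else if (c == "," && !inq) {
                    if (n == want) break      # reached the end of the wanted field
                    n++; field = ""           # move on to the next field
                } else {
                    field = field c
                }
            }
            # Print an empty string if the record has fewer than N fields.
            print (n == want ? field : "")
        }'
}

# Made-up sample: a header record plus one record with a comma inside a value.
printf '%s\n' \
    'Reported by,Crime type,Context' \
    '"Example Force, North",Burglary,' |
    csv_field 1
```

Called with 2 instead of 1, the same helper prints the second column of each data record.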

What is unique?

  • The same data can be accessed over an API. The API is implemented as a standard JSON web service using HTTP GET and POST requests. Full request and response examples are provided in the documentation.
  • The response contains the ID of the crime, which may be unique and can be used as a HashKey when storing and querying in NoSQL.
  • The JSON file can also be used as an index document for Elasticsearch.

Example API call via REST:

Example Response:

More details on API access can be found here: