Some clean up is useful for your Data Sets #1

Open amaajemyfren opened this issue · 1 comment
amaajemyfren commented over 1 year ago


The problem as I see it:

I heard about your data sets from a tweet by David Ndii and as of now I love them. As far as Kenyan Data is concerned it allows me to be able to read the data without having to go to the work of extracting the data from KNBS PDF documents. That is helpful.

KNBS data is written, I think, to be read in a book. As such it has a lot of intermittent summaries - your data is usually the same as the data in the KNBS and so it is hard to process because you have to remove the summaries before you can analyse it.

Take this data for example. The first row shows the country Kenya - this is the summary of the whole data! So if I was to sum the Conventional Households column e.g. It would be entering the data twice. Well not twice ... let me explain.

The Rural and Urban columns are also summaries ... but they are useful summaries Many counties are in both rural and urban setting it is important to have that summary - but it adds an additional layer of duplication. Further more the actual data also has county summaries. e.g. Take MOMBASA, It is comprised of the following sub-counties.

  2. JOMVU
  5. MVITA
  6. NYALI

(Look at the Conventional Households data -> CHANGAMWE, 46,439 + JOMVU, 53,214 + KISAUNI, 88,159 + LIKONI, 80,002 + MVITA, 38,710 +NYALI, 69, 771 = MOMBASA 376,295

So summing up that column would repeat the number of households 4 times (the actual number, plus the summary per county, plus the summary rural/urban plus the total of Kenya)


What I would propose is that you please clean this out e.g:

  1. First remove the Kenya column - anyone can use excel to calculate that one.
  2. The summary Rural/Urban could be left there but have a space separating it with the rest of the data - alternatively you could put it (with a space) after the end of the data. It should also have its own title row with County/Sub-County replaced by Area Type. In Excel spreadsheets it could be in another workbook.
  3. The whole sheet could have a new column added on the left titled county and the name of the county added to it. This means you remove e.g. MOMBASA from being its own row and for each of its sub-counties place it on the left of the sub-county... like so:

Proposed Header

And you are done ...

Now incase you do not know the county level data for Kenya they are the following that I was using to clean up your data could help it has the County and Sub Counties in the wrong order ... but you can change that.

( In case your formatter removes it here it is)

I think the rule is - the first time you find the county name remove it and start adding it on the left column. As you can see there are some sub-counties that have the same name as the county.

amaajemyfren commented over 1 year ago

Well .... maybe the issues don't work!