Let's begin with a few questions about the data:
A.) Where does the data come from? Is it entered manually, and therefore subject to human error and inconsistency?
A person entering the data might decide on their own to rename a key field, change a naming convention, or use shorthand for something that is normally typed out. For example, they may be busy that day and rush through an entry, intending to go back and fix it later, and then never get the opportunity or simply forget.
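One way to catch shorthand or renamed values in manually entered data is to compare each entry against a list of spellings you already know about. The following is only a minimal sketch: the `CANONICAL` map, the `normalize` function, and the sample entries are all hypothetical and would need to be replaced with the values from your own data.

```python
# Hypothetical map from known spellings and shorthand to the agreed-upon name.
CANONICAL = {
    "accounts payable": "Accounts Payable",
    "a/p": "Accounts Payable",
    "ap": "Accounts Payable",
    "accounts receivable": "Accounts Receivable",
    "a/r": "Accounts Receivable",
}


def normalize(value):
    """Return (canonical_value, recognized) for one manually entered value."""
    # Collapse extra whitespace and ignore case before looking the value up.
    key = " ".join(value.split()).lower()
    if key in CANONICAL:
        return CANONICAL[key], True
    # Unknown spellings are passed through unchanged and flagged for review.
    return value, False


# Made-up sample entries, including one shorthand that is not in the map.
entries = ["Accounts Payable", "a/p", " AP ", "Acct Pay"]
normalized = [normalize(entry) for entry in entries]
unrecognized = [entry for entry in entries if not normalize(entry)[1]]
```

Anything that ends up in `unrecognized` is a value someone typed in a new way, so it can be reviewed and added to the map rather than silently dropped from the analysis.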
B.) Is it downloaded from a database or system?
How often is this database or system subject to changes, upgrades, or anything else that might alter the output of the data you download? Can you rely on the data being consistent in its form, verbiage, key fields, naming conventions, etc.?
You will need to allow for the possibility of changes in your data and build in an easy way to locate, update, and account for them when they happen.
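As an illustration of locating such changes, the sketch below compares the header row of a download against the columns the project expects. The column names, the `compare_columns` function, and the sample export are assumptions made up for this example, not part of any particular system.

```python
import csv
import io

# Hypothetical list of the columns this project expects in each download.
EXPECTED_COLUMNS = ["order_id", "customer", "order_date", "amount"]


def compare_columns(header, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, ignoring case and padding."""
    seen = {name.strip().lower() for name in header}
    wanted = {name.strip().lower() for name in expected}
    return sorted(wanted - seen), sorted(seen - wanted)


# Made-up export in which the source system renamed "amount" to "total".
sample = io.StringIO(
    "Order_ID,Customer,Order_Date,Total\n"
    "1001,Acme,2024-01-15,250.00\n"
)
header = next(csv.reader(sample))
missing, unexpected = compare_columns(header)
```

Keeping the expected names in one place means that when the source does change, there is a single list to update instead of references scattered throughout the project.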
C.) Are you in control of the date range, time frame, or period to which the data relates? How much detail and control do you have to ensure consistency in the date range, time frame, or period of the data you are using?
This knowledge serves several purposes.
You can control the size of the files you are dealing with by narrowing the time frame, which allows faster downloads from a system and easier file sharing between people. File size limits on email attachments are one example of where this matters.
Do you need to store the downloads from each time frame so that you can access them quickly without downloading them over and over? Even more critical: is it possible that you might lose access to older time frames of information from previous months or years should something change beyond your control?
If you assume you will always have access to whatever time frame you need, and then one day you don't and there is no way to regain access to the older data, what are you going to do?
If you are dealing with data from a particular time frame, does the data itself contain information about that time frame, such as a date column or some other way to know the period to which it relates, or do you need to create one?
Do not assume you will remember that a certain file's contents relate to a particular time frame if you do not have a date column, a file naming convention, or some other means of identifying the time frame for the data.
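A file naming convention can be as simple as putting the start and end dates in the file name. The sketch below shows one possible convention; the `prefix_start_to_end.ext` pattern and both function names are hypothetical choices for illustration, and any consistent scheme would serve the same purpose.

```python
import re
from datetime import date

# Matches names such as "sales_2024-01-01_to_2024-01-31.csv".
_PERIOD_PATTERN = re.compile(r"_(\d{4}-\d{2}-\d{2})_to_(\d{4}-\d{2}-\d{2})\.")


def period_filename(prefix, start, end, ext="csv"):
    """Build a file name that records the period the data covers."""
    return f"{prefix}_{start.isoformat()}_to_{end.isoformat()}.{ext}"


def period_from_filename(name):
    """Return (start, end) dates read from the file name, or None if absent."""
    match = _PERIOD_PATTERN.search(name)
    if match is None:
        return None
    return date.fromisoformat(match.group(1)), date.fromisoformat(match.group(2))


name = period_filename("sales", date(2024, 1, 1), date(2024, 1, 31))
period = period_from_filename(name)
```

A file whose name does not follow the convention returns `None`, which makes it easy to spot downloads that were saved without any record of their time frame.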
These are just some of the questions you should ask yourself before starting a project. There are many more that can affect the design of your project and the process by which you create it to analyze the data.
You are not a fortune teller and you do not know the future. Accept this, and once you have identified the variables that might change, design your project with the flexibility to adapt to changes as they happen.
Set checkpoints in your project's data analysis that serve as simple binary tests to ensure you catch a change if one or more occur. This could be as simple as a check box that gives you a visual indicator that something has been done, or a formula that produces a result and compares it to a constant value to draw attention to a change in your data. Even if the change does not break your project, it may be a sign that something else could change or cause an issue you did not plan for.
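The idea of a binary checkpoint can be sketched in a few lines of code. In the example below, each check produces a plain True or False by comparing a computed result to a known constant. The field names, the expected values, and the `run_checkpoints` function are assumptions invented for illustration.

```python
def run_checkpoints(rows, expected_row_count, expected_total, tolerance=0.01):
    """Return a dict mapping each checkpoint name to True (pass) or False (fail)."""
    total = sum(row["amount"] for row in rows)
    return {
        # Compare computed results against known constant values.
        "row_count_matches": len(rows) == expected_row_count,
        "total_matches": abs(total - expected_total) <= tolerance,
        # Every row should carry a non-empty key field.
        "no_blank_keys": all(row["order_id"] not in ("", None) for row in rows),
    }


# Made-up sample data with one blank key to trigger a failed checkpoint.
rows = [
    {"order_id": "1001", "amount": 250.0},
    {"order_id": "1002", "amount": 100.0},
    {"order_id": "", "amount": 50.0},
]
results = run_checkpoints(rows, expected_row_count=3, expected_total=400.0)
failed = [name for name, passed in results.items() if not passed]
```

Here the totals and row count still agree with the expected constants, yet `failed` contains `no_blank_keys`. That is the kind of change that does not break anything outright but is worth noticing early.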