Test Data Sets
In my last post, I introduced the idea of creating test data sets for the purpose of finding an algorithm to tease apart the influence of individual news articles. I have done just that and am posting the data sets for further analysis.
I generated the test files using the method described by the following pseudocode:
-Take 3 parameters: TIME-STEP, TIME-FRAME, and COUNT.
-Create COUNT news articles, each with the following encoded in its summary field:
  -Time-frame equal to TIME-FRAME
  -Influence randomly set between [-1,1]
-For each timestep 0 through (TIME-STEP * COUNT):
  -Find all news articles before current time, within their Time-frame value of now
  -Add the sum of those news articles’ Influence values to the current price
  -Record the current price
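For anyone who wants to reproduce something similar, here is a minimal Python sketch of the pseudocode above. It is not the exact script I used to build the corpus. The pseudocode leaves a few things open, so the sketch makes some assumptions: article i is published at tick i * TIME-STEP, the price starts at 0, "before current time" means strictly before, and "0 through" is inclusive. The names Article, generate, start_price, and seed are made up for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Article:
    published: int    # assumed: tick at which the article appears
    time_frame: int   # number of ticks the article keeps moving the price
    influence: float  # per-tick contribution, drawn uniformly from [-1, 1]


def generate(time_step, time_frame, count, start_price=0.0, seed=None):
    """Return (articles, prices) for one synthetic data set."""
    rng = random.Random(seed)
    # Assumption: one article every `time_step` ticks, starting at tick 0.
    articles = [
        Article(published=i * time_step,
                time_frame=time_frame,
                influence=rng.uniform(-1.0, 1.0))
        for i in range(count)
    ]

    price = start_price  # assumption: the starting price is not specified
    prices = []
    for now in range(time_step * count + 1):
        # Articles published strictly before `now` and still within
        # their own time-frame.
        active = [a for a in articles
                  if a.published < now and now - a.published <= a.time_frame]
        price += sum(a.influence for a in active)
        prices.append(price)
    return articles, prices


articles, prices = generate(time_step=1, time_frame=1, count=5, seed=0)
```

Each article carries its own time_frame even though every article gets the same value here, which is what lets the same loop handle variable time-frames later.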
Because TIME-FRAME is constant within each data set, a simpler algorithm would have sufficed, but I plan to experiment with variable time-frames at a later date, so storing the time-frame on each article was a sensible way to save myself some work in the future.
I created 6 data sets, each with 500 data points, as follows:
Data set 0: TIME-STEP 1, TIME-FRAME 1
Data set 1: TIME-STEP 1, TIME-FRAME 2
Data set 2: TIME-STEP 1, TIME-FRAME 5
Data set 3: TIME-STEP 1, TIME-FRAME 10
Data set 4: TIME-STEP 1, TIME-FRAME 50
Data set 5: TIME-STEP 3, TIME-FRAME 17
The motivation for choosing the values for data sets 1-4 is simple: to see the effects of using longer and longer time-frames relative to time-steps. Data set 5 exists for the sole purpose of seeing if any problems are present with weird offsets, since its time-frame is not a multiple of its time-step. If we see anything unexpected there, further research may be necessary.
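To make the "weird offsets" concern a bit more concrete, here is a small Python sketch that counts how many articles are influencing the price at each tick for the six parameter pairs. It rests on assumptions that are mine rather than anything encoded in the files: article i is published at tick i * TIME-STEP, and an article is active at tick now when 0 < now - published <= TIME-FRAME. The names DATA_SETS and active_counts are made up for this example.

```python
# (TIME-STEP, TIME-FRAME) pairs for data sets 0 through 5.
DATA_SETS = [(1, 1), (1, 2), (1, 5), (1, 10), (1, 50), (3, 17)]


def active_counts(time_step, time_frame, ticks):
    """Number of articles influencing the price at each tick.

    Assumes article i is published at tick i * time_step and is active
    at tick `now` when 0 < now - published <= time_frame.
    """
    counts = []
    for now in range(ticks):
        # Candidate publication ticks lie in [now - time_frame, now).
        lo = max(0, now - time_frame)
        n = sum(1 for p in range(lo, now) if p % time_step == 0)
        counts.append(n)
    return counts


for idx, (step, frame) in enumerate(DATA_SETS):
    # Skip the warm-up ticks where the window is not yet full.
    steady = active_counts(step, frame, ticks=200)[frame + step:]
    print(f"data set {idx}: {min(steady)} to {max(steady)} active articles per tick")
```

Under these assumptions, data sets 0-4 always have exactly TIME-FRAME articles active once the window has filled, while data set 5 moves between 5 and 6 active articles from tick to tick. That uneven overlap is the kind of thing data set 5 is meant to stress.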
I have attached a zip file of the corpus, if you are interested: here.




