VOL 154 .... No. 8

WEDNESDAY, MAY 16, 1979

Two Approaches to News Rating

Categories: Programming

flan

Last time, I generated a few data sets for testing different methods of rating the training news articles.  This time, I actually implemented two of them, the naive approach I had used before, and the new-and-improved version taking into account temporal interference.

Let us first examine the architecture I am using.  I created an interface Sentiment, which is to be implemented by classes that take a ticker and can find some sentiment estimations from it.  The signature for the method is as follows:

/**
* Get the sentiment for the news articles in a ticker
* @param ticker   the ticker to inspect
* @param forecast the forecast parameter
* @param index	   the data index to inspect
* @return		 an array where
* 					array[0][i] = influence of i
* 					array[1][i] = time-frame of i, in terms of multiples of forecast
*/
public double[][] getSentiment(Ticker ticker, long forecast, int index);

The idea behind doing it this way, returning a two dimensional array, is that I may eventually want to implement a more complex learning method that can also tease apart what scale it is that a news article has an effect, since I imagine some news articles represent more long-term impact, while others represent more short-term impact.  But that’s for the future.

The implementation for the naive approach is reproduced below:

public double[][] getSentiment(Ticker ticker, long forecast, int index) {
	DataHistory data = ticker.getDataHistory(index);
	NewsCorpus corpus = ticker.getCorpus();
 
	double[][] result = new double[2][corpus.getCount()];
	for(int i = 0; i < corpus.getCount(); i++) {
		NewsStory story = corpus.getNews(i);
		long time = story.getTime();
		double price1 = data.getData(time);
		time += forecast;
		double price2 = data.getData(time);
 
		result[0][i] = (price2 - price1) / forecast;
		result[1][i] = 1;
	}
	return result;
}

The essential idea is that you just look ahead and see how the price moved some time after the news article. This was the approach I had used on my original project. Having now had the chance to test this method on my theoretically generated data, I can see just how ineffective this is.

For data set 1, the performance (in terms of proportion of sentiments in which the correct sign was estimated), was 0.84. That value is not entirely bad, and would be acceptable, except we find that the data set is actually quite simple. For the other sets, the performance was 0.72, 0.62, 0.56, and 0.70 respectively. On the hard data set (#4), the value was only slightly better than chance by random guessing.

I also tried to see what would happen on a larger data set, a newly generated one, which can be found here.  This corpus had a time-frame of 50 and a time-step of 1, making it similar to data set 4.  Performance on data set 6 was poor, again at about 0.56, indicating the performance is very heavily correlated to the ratio of time-frames to time-steps.

The summary of results for the naive approach can be found here.

It’s no wonder my initial attempts failed so horribly, at the very base of the methodology was a very poorly thought-out concept.

My next attempt was to try the method that teased apart temporal influences.  The general approach was to observe a price change, and then subtract off the effect from all news articles before it, within the forecast parameters.  The value we used to subtract off was an assumed value, initialized to the naive price move.  This approach is then repeated several times, until we come to some steady state.

My initial findings were that this method performed with almost identical accuracy as the previous test.  There had to be something wrong.

It was then that I realized that I also needed to be subtracting off the influence from news articles ahead of the target news article.  This increased the necessary number of iterations, but the results amazed me.

Keep in mind, there is a parameter here, the epsilon value, the acceptable error.  High values of epsilon meant longer run-times, but often more accurate results.

With an epsilon value of 1E-4, the number of iterations required for the 5 main data sets were: 18, 40, 60, 125, 33.

At that epsilon, the accuracy values were: 0.96, 0.93, 0.92, 0.87, 0.99.

Detailed results for this test can be found here.

At that value of epsilon, I was not able to run the test on the data set number 6.  Early anecdotal notes on performance, are that the accuracy seems to also be correlated to the ratio of time-frame to time-step, but does not seem to drop off as quickly.  Number of iterations seems to be almost linear to the same ratio.

Wanting to do some more experiments on performance, I tried setting the value of epsilon to 1E-8.

The first data set required 1387 iterations, but had accuracy 0.97.  The second data set took too long to finish.  The fifth data set took only 189 iterations, and only got a single sentiment wrong.

Reducing the epsilon value to 1E-2, we needed iterations  4, 11, 23, 27, 13 and got accuracies 0.94, 0.90, 0.88, 0.82, 0.94.  I gave the algorithm 3 minutes to do data set 6, and it was still unable to finish.

At 1E-1, or 0.1, we found runtime to be:  3, 8,  20, 21, 11 and accuracy to be: 0.93, 0.90,  0.88, 0.80,  0.94.  Here we start to notice less significant dropoff.  Keep in mind that each iteration probably causes the values to move less and less, and the values at each iteration are deterministic and not dependent on epsilon.  At this epsilon, data set 6 could be processed in 120 iterations (which took about 6 minutes), and got an accuracy of 0.89, which is actually greater than the epsilon for data set 4 at epsilon of 1E-4, but not by much.

Seeing how this one performed, however, led me to discover a wildly inefficient chunk of code that would make this algorithm painfully slow as we increase the number of news articles, the nested for loops that check every single article against every other at each iteration.  I did it this was, because the news articles might not necessarily be sorted, however, this really only needs to be done once, and we can cache the results.

Making this change did not change the results obtained at all, but did make the runtime seriously faster.

At 1E-1, data set 6 was now complete in under 30 seconds.  I then ran data set 6 with different values of epsilon.  At 1E-2, we required 142 iterations, with an accuracy of 0.89.  At 1E-4, we required 221 iterations, with accuracy 0.91.  Feeling good, I tried 1E-8, but lost patience.  At 1E-5, 313 iterations with accuracy 0.92.  At 1E-6: 1173 iterations (a sharp jump!), but the accuracy too increased to 0.94.

You can draw your own conclusions about what’s really going on here from these several data points, I’m not making any real claims yet, but I can imagine there is still a lot to learn.

Here is the code, as it currently exists:

private double epsilon = 1E-8;
 
@SuppressWarnings("unchecked")
@Override
public double[][] getSentiment(Ticker ticker, long forecast, int index) {
	//Get the observed price moves
	double[] priceMove = (new NaivePriceMove()).getSentiment(ticker, forecast, index)[0];
	int count = priceMove.length;
 
	//Create the empty result table
	double[][] result = new double[2][count];
	for(int i = 0; i < count; i++) {
		result[0][i] = priceMove[i];
		result[1][i] = 1;
	}
 
	NewsCorpus corpus = ticker.getCorpus();
 
	//Cache the proximities
	List<Integer>[] proximities = (List<Integer>[]) new List<?>[count];
	for(int current = 0; current < count; current++) {
		proximities[current] = new ArrayList<Integer>();
		long time = corpus.getNews(current).getTime();
		List<NewsStory> range = corpus.getNews(time - forecast + 1, time + forecast - 1);
		//Go through the other news events
		for(NewsStory story : range) {
			int other = story.getId();
			if(other != current) 
				proximities[current].add(other);
		}
	}
 
	//Keep going until we reach an acceptable error
	double error = Double.MAX_VALUE;
	int iter = 0;
	while(error > epsilon) {
		iter++;
		error = 0;
		//Go through each news event
		for(int current = 0; current < count; current++) {
			double observed = forecast * priceMove[current];
			long time = corpus.getNews(current).getTime();
			//Go through the other news events
			for(int other : proximities[current]) {
				long otherTime = corpus.getNews(other).getTime();
				long difference = Math.abs(time - otherTime);
				//Depending on how much overlap, discount the influence of the other
				long effect = forecast - difference;
				observed -= result[0][other] * effect;
			}
			observed /= forecast;
			//Error is max error
			double difference = observed - result[0][current];
			difference *= difference;
			error = Math.max(difference, error);
			result[0][current] = observed;
		}
	}
	System.out.println(iter + " iterations.");
	return result;
}

related post

Tags: , ,
Comments are closed.