Model Assessment ================ In this analysis, we’ll use a case study looking at an individual’s email habits. We hypothesize that there are two types of emails that arrive in a person’s inbox and require a response: 1. Those that have no task associated other than to reply with an immediately available piece of information 2. Those that require a task to be accomplished before they can be replied to. The outflow of the first of these types of emails can be modeled as a first order delay on their inflow, which would be equivalent to assuming that email sending and email checking follow a poisson process, and the time between message arrival and the next inbox check follows an exponential distribution. The second type can be modeled as a higher order delay, as the ‘task’ associated with the email’s arrival first must be acted upon and then written about. SD Model ~~~~~~~~ Such a model would look something like this: .. figure:: ../../../source/models/Emails/full_model.png :alt: Model Screenshot Model Screenshot Normally, emails would arrive over a period of time, and would be responded to continuously, and the distribution itself would get all mixed up in the day to day fluctuations. However, if we time-shift the arrival of each email to arrive at time 0, then the output of each of these stocks will represent the distribution of time between arrival and reply to each type of message. To do this we create a 1-timestep pulse input with the full volume of the messages, and build our delay around these values: :: Easy Arrival= Easy Email Volume / TIME STEP * PULSE(0, TIME STEP ) Easy Reply= DELAY N(Easy Arrival, Easy Reply Time, 0, 1) Below we see how the sum of replies to easy and hard emails creates a distribution of response times. .. code:: ipython2 %pylab inline import pysd import pandas as pd import scipy.optimize .. parsed-literal:: Populating the interactive namespace from numpy and matplotlib .. code:: ipython2 model = pysd.read_vensim('../../models/Emails/Emails_in_2_fixed_categories.mdl') params = {'total_emails':1000, 'easy_fraction':.5, 'easy_reply_time':1, 'hard_reply_time':10} model.run(params, return_columns=['net_email_output', 'easy_reply', 'hard_reply']).plot() plt.title('Timeseries output representing a distribution over time deltas') plt.xlim(0,20); .. image:: Comparing_Models_files/Comparing_Models_2_0.png Data ~~~~ Now our task will be to compare this output distribution with data collected on response times from an actual inbox. In this case, mine. The data comes (from our `data processing script `__) with a single time delta per row, so we take a histogram to get a distribution we can compare with model output. As we’ve taken a histogram bin size of 1 day, then the height of the bins themselves represents the number of messages per day we can expect to come out of our model. If we changed the bins size, we would need to scale these values to preserve the units. .. code:: ipython2 response_times = pd.read_csv('../../data/Emails/days_to_join_conversation.csv', names=['id','Days'], index_col='id') num_conversations = len(response_times) print('Loaded %i conversations' % num_conversations) counts, edges, _ = plt.hist(response_times['Days'], bins=range(25), histtype='stepfilled', alpha=.5) data = pd.Series(data=counts, index=1.*(edges[:-1]+edges[1:])/2) #take the average location in the bin data.plot(linewidth=2, color='k', xlim=(0,25), title='Histogram of response times'); plt.xlabel('Days to first response'); .. parsed-literal:: Loaded 1403 conversations .. image:: Comparing_Models_files/Comparing_Models_4_1.png Fitting ~~~~~~~ Our task is now to fit the model output to the observed distribution. To do this we’ll construct a helper function to take an array of parameters (which the optimizer provides), run the model, and calculate the sum of squared errors between the model output and the data distribution. .. code:: ipython2 param_names = ['easy_fraction', 'easy_reply_time', 'hard_reply_time'] def error(param_list, data, num_conversations): params = dict(zip(param_names, param_list)) params['total_emails'] = num_conversations #run the model with the new parameters, returning the info we're interested in result = model.run(params=params, return_columns=['net_email_output'], return_timestamps=list(data.index.values)) #the sum of the squared errors sse = sum((result['net_email_output'] - data)**2) return sse error([.5, .5, 20], data, num_conversations) .. parsed-literal:: 128962.22585551594 Fitting is relatively straightforward, we pass the error function to scipy’s optimizer, with a few bounds, and an initial guess. .. code:: ipython2 params = {'easy_fraction':.8, 'easy_reply_time':1.1, 'hard_reply_time':7} bounds = {'easy_fraction':[0.001,.999], 'easy_reply_time':[0.001,21], 'hard_reply_time':[.001,52]} res = scipy.optimize.minimize(error, [params[key] for key in param_names], args=(data, num_conversations), method='L-BFGS-B',#'L-BFGS-B', #'SLSQP', #'TNC' bounds=[bounds[key] for key in param_names]) res .. parsed-literal:: status: 0 success: True nfev: 512 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64> fun: 10986.275201802197 x: array([ 0.84894292, 0.71866884, 20.6090577 ]) message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' jac: array([ 9946.84851321, 31919.28553861, 95.6661097 ]) nit: 12 Plotting the result ~~~~~~~~~~~~~~~~~~~ The fit parameters can be seen below, with the calculated data distribution in black. .. code:: ipython2 params=dict(zip(param_names, res['x'])) params['total_emails'] = num_conversations model.run(params, return_columns=['net_email_output', 'easy_reply', 'hard_reply']).plot() data.plot(linewidth=2, color='k', xlim=(0,25), title='Histogram of response times'); plt.xlabel('Days to first response'); .. image:: Comparing_Models_files/Comparing_Models_10_0.png