Waiting for R2...
How long does it take for a computational neuroscience paper to get accepted after a submission to Nature Neuroscience? I wanted to know, so I built a scraper.
Every academic knows that quite some time can pass between submitting a paper to a journal and its acceptance and eventual publication, whether that's due to tedious communications with the editors about the font size of a figure label or suspiciously detailed questions from Reviewer 2.
Past publications in a journal can indicate how long your own submission process might take. For this case study, I looked at how long it takes for papers in Computational Neuroscience (the field I work in) to be accepted at Nature Neuroscience, one of the most prestigious neuroscience journals out there.
To get this information, we can crawl publicly available data from the Nature website. Every article's page lists the dates of submission, acceptance, and publication, which we can gather to compute an average. An example is shown below.
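The article pages expose these dates in ISO format, so once they are extracted, the turnaround times reduce to simple date arithmetic. A minimal sketch with made-up dates (the real values are scraped below):

```python
from datetime import date

# Hypothetical dates in the ISO format used on article pages
received = date.fromisoformat("2019-02-11")
accepted = date.fromisoformat("2019-09-23")
published = date.fromisoformat("2019-11-04")

print((accepted - received).days)    # days under review: 224
print((published - received).days)   # days until publication: 266
```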
We use the wonderful functions provided in `requests_html` to scrape the data. The script below goes through all search results from the Nature website, filtered by a specific subject, in this case "computational-neuroscience". We get a list of all articles by filtering the relevant elements we want to iterate through using the function `base_html.html.find()`. Then we can easily get the relevant elements of the search results (for example the title and the HTML link to each article) using the attributes `.text` and `.links`.
#hide_output
import pandas as pd
from dateutil import parser
from requests_html import HTMLSession

session = HTMLSession()
data = pd.DataFrame(columns=['journal', 'url', 'title', 'received', 'accepted',
                             'published', 'timeForAcceptance', 'timeForPublishing'])
for page_nr in range(1, 5):
    base_url = ("https://www.nature.com/search?order=relevance&journal=neuro"
                "&subject=computational-neuroscience&article_type=research&page=")
    base_url += str(page_nr)
    base_html = session.get(base_url)
    articles_journal = base_html.html.find('div.grid.grid-7.mq640-grid-12.mt10')
    articles = base_html.html.find("a[href*=articles]")
    for ai, a in enumerate(articles_journal):
        filter_journal = 'Nature Neuroscience'
        if filter_journal in a.text:
            print(" - - - - - - - - ")
            print('Article nr {} on page {} in Journal {}'.format(ai, page_nr, filter_journal))
            article_title = articles[ai].text
            print("\"{}\"".format(article_title))
            article_suburl = list(articles[ai].links)[0]
            article_url = "https://www.nature.com{}".format(article_suburl)
            print("Getting {}".format(article_url))
            article_html = session.get(article_url)
            # Extract the <time> elements holding the received/accepted/published dates
            dates = article_html.html.find("time")
            print('> Received: ', dates[1].attrs['datetime'])
            print('> Accepted: ', dates[2].attrs['datetime'])
            print('> Published: ', dates[3].attrs['datetime'])
            received_date = parser.parse(dates[1].attrs['datetime'])
            accepted_date = parser.parse(dates[2].attrs['datetime'])
            published_date = parser.parse(dates[3].attrs['datetime'])
            timeForAcceptance = accepted_date - received_date
            timeForPublishing = published_date - received_date
            print(timeForAcceptance, 'between', received_date, 'and', accepted_date)
            data = data.append({'journal': filter_journal,
                                'url': article_suburl,
                                'title': article_title,
                                'received': received_date,
                                'accepted': accepted_date,
                                'published': published_date,
                                'timeForAcceptance': timeForAcceptance,
                                'timeForPublishing': timeForPublishing},
                               ignore_index=True)
Let's have a look at the aggregated DataFrame:
data
To compute the averages, we bucket the data by publication year and convert the timedeltas into day counts.
data['year'] = data.apply(lambda row: row.published.year, axis=1)
data['days'] = data.apply(lambda row: row.timeForAcceptance.days, axis=1)
data['daysp'] = data.apply(lambda row: row.timeForPublishing.days, axis=1)
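To illustrate how the yearly averages used in the plot below come together, here is a minimal sketch of a groupby-mean on toy records (the numbers are invented, not scraped):

```python
import pandas as pd

# Toy records standing in for the scraped data (values are made up)
toy = pd.DataFrame({'year': [2018, 2018, 2019],
                    'days': [100, 200, 150]})
print(toy.groupby(by='year').days.mean())
# 2018 -> 150.0, 2019 -> 150.0
```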
Let's plot the data:
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as md
from dateutil import parser

plt.figure(figsize=(6, 3), dpi=300)
ax = plt.gca()
xfmt = md.DateFormatter('%Y')
ax.xaxis.set_major_formatter(xfmt)
# One x position per publication year, anchored at January 1st
years = [parser.parse(str(y)) for y in data.groupby(by='year').days.mean().index]
years_beginning = [datetime.datetime(y.year, month=1, day=1) for y in years]
mean_time = data.groupby(by='year').days.mean()
std_time = data.groupby(by='year').days.std()
plt.plot(years_beginning, mean_time, label='Yearly mean until accepted', c='C3')
plt.plot(years_beginning, data.groupby(by='year').daysp.mean(), label='Yearly mean until published', c='C0')
plt.legend(fontsize=10)
# Scatter the individual papers on top of the yearly means
for di in range(len(data)):
    plt.scatter(data.iloc[di].received, data.iloc[di].days, c='C3', s=5, edgecolor='k', linewidth=0.5)
plt.xlabel("Time of submission")
plt.ylabel("Days")
plt.title("Time for acceptance of Computational\nNeuroscience papers in Nature Neuroscience")
plt.savefig("../images/icon_natureneuroscience.png");