Very happy to share that my DataCamp course “Survey and Measure Development in R” is now available.

Check it out here — the first chapter is free.

From the course description:

How can we measure something like “brand loyalty?” It’s an obvious measure of interest to marketers, but we can’t quite take a ruler to it. Instead, we can design and analyze a survey to indirectly measure such a so-called “latent construct.” In this course, you’ll learn how to design and analyze a marketing survey to describe and even predict customers’ behavior based on how they rate items on “a scale of 1 to 5.”

You’ll wrangle survey data, conduct exploratory & confirmatory factor analyses, and conduct various survey diagnostics such as checking for reliability and validity.
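
To give a flavor of the kind of analysis the course covers, here is a small illustrative sketch in R using the psych package and simulated data (this is not course code; the item-generating step is entirely made up):

library(psych)

#simulate five correlated 1-to-5 survey items driven by one latent trait
set.seed(1)
latent <- rnorm(100)
items <- as.data.frame(
  sapply(1:5, function(i) pmin(pmax(round(3 + latent + rnorm(100)), 1), 5))
)

#reliability (Cronbach's alpha) and a one-factor exploratory factor analysis
alpha(items, check.keys = TRUE)
fa(items, nfactors = 1)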

About DataCamp:

DataCamp provides online data science learning services to 3.3 million users in over 190 countries and 1,000 business customers including Airbnb, Kaiser Permanente and HSBC. Learn more and get started here.


I’ve dabbled in eBay sales since college and, wanting to add a little more rigor to what I sell and how I price it, looked to the spreadsheet for guidance.

If you’ve ever sold online before, you know the cost of shipping is a make-or-break input to profitability. Consumers more or less expect free shipping nowadays, so it’s easy to forget to factor it into the final tally and end up losing money (not to mention your precious time, which I’m sure you would rather not spend in line at the post office).

Download the exercise file here.

So, you want to “bake this into” the final selling price to some extent while still pricing the item at a level someone will actually pay. What a dilemma! Let’s use Excel to plot a course forward.

Video: Break-even analysis in Excel for eBay and e-commerce (YouTube)

Setting up the P&L

Even something as simple as shipping a package on eBay can include multiple costs and inputs. Let’s keep things simple here by focusing on the sales price, cost of item sold, and cost of postage. These will be our “variable input” cells that can change and drive the rest of the model. 

We’ll also add some “fixed costs” for the cost of packaging supplies such as a mailer, paper and ink. 

Our final “P&L” will look like this. Each calculated cell’s formula is displayed next to it, courtesy of the FORMULATEXT() function:

Expecting to see cell references rather than words in the formulas in cells B6 and B8? That’s because those formulas point to cells I named SalesPrice and PostageCost, respectively. You can name a cell by clicking inside the Name Box to the left of the formula bar and typing over that cell’s reference (B1, in this case).
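
If it helps to see the arithmetic spelled out, here is a quick sketch in R with made-up numbers, assuming profit is simply the sales price less the item, postage, and packaging costs (in the workbook these live in the input cells and in Profit, cell B15):

#hypothetical inputs mirroring the workbook's variable and fixed cost cells
sales_price  <- 20.00   #SalesPrice
item_cost    <- 8.00    #cost of item sold
postage_cost <- 4.00    #PostageCost
fixed_costs  <- 1.50    #mailer, paper, ink

#profit, the value calculated in cell B15
profit <- sales_price - item_cost - postage_cost - fixed_costs
profit   #6.5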

Double-check and toggle between these named cells by clicking the drop-down button in this menu: 

Our very simple shipping model is set up. We want to experiment with what levels of pricing and shipping costs lead to a profitable outcome. We could of course just do this by keying in figures of interest and seeing “how the cookie crumbles:”

However, this feels like walking and chewing gum at the same time: we are trying to balance the competing interests of sales price and postage cost, keying in new figures while trying to remember earlier results so we can compare them.

It’s like we want to compare several combinations of each of the two at once! 

That’s precisely what we are going to do, using two-way data tables. This is a stalwart Excel tool for financial modeling — which is a fancy term to describe what we are doing by trying to make a buck on eBay! 

Preparing the data table

A data table will allow us to compare the profits that result from various combinations of price and postage costs. Let’s build our table in cells A19:F24. Across the top row of the table (row 19) we have various potential prices. We can put these in any interval we want, but it’s typical for prices to be in nice round numbers, so we’ll do intervals of five.

Shipping costs go down column A. We’ll do these in intervals of three to reflect the wider likely values of postage costs. 

Then we have our value of interest, profit, in cell B15 (and yes, I named it!). 

What we are going to do next is “tether” this table to the model above so that each combination of values is “plugged into” the model and we get a resulting answer inside the table. First things first: locate the data table by heading to the Data tab of the ribbon and selecting “What-If Analysis.” Data Table will be the final selection in this drop-down.

Have cells A19:F24 selected for the next part. 

Now we want to “tether” these two values to the table. Remember that sale prices go across the row and postage costs down the column; in other words, cell B1 is our “Row input cell” and cell B2 is our “Column input cell.”

Fantastic! Now we have the resulting profit of each given combination of price and postage. 
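
If you want to sanity-check the data table outside of Excel, here is a rough R equivalent of the same two-way grid. The cost assumptions and intervals are made up for illustration; only the structure mirrors the worksheet:

#assumed model inputs (see the P&L above)
item_cost   <- 8.00
fixed_costs <- 1.50

#candidate prices across the top, postage costs down the side
prices   <- seq(10, 30, by = 5)
postages <- seq(2, 14, by = 3)

#profit for every combination, like the two-way data table
profit_grid <- outer(postages, prices,
                     function(postage, price) price - item_cost - postage - fixed_costs)
dimnames(profit_grid) <- list(postage = postages, price = prices)
profit_grid

#TRUE marks the combinations that lose money
profit_grid < 0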

Conditional formatting to top it off

Why don’t we go ahead and make this table “pop” by applying conditional formatting? You can do this by heading to Conditional Formatting on the Home tab of the ribbon, selecting “Highlight Cells Rules,” then “Less Than,” and formatting cells that are less than 0 as red. This will highlight the unprofitable combinations of price and postage.

Cleaning the corner

Remember cell A19? That’s our “plug” value that isn’t really doing much — we are interested in the profit values inside the table. We can’t delete this number because that will “unlink” our model. What we can do is reformat it so it’s blank. 

To do that, click inside the cell, then press Ctrl + 1 on your keyboard. Go to “Custom” formatting and create a format type of ;;;. Hit OK and you will see that the displayed value is gone, but the cell’s contents are retained and the data table stays linked.

Doing it with modeling

Not only can this exercise be applied to real-life scenarios (I personally will be using it to estimate eBay profits), it shows the real power of Excel for designing and innovating with data. Setting up a financial model like this in R is possible, but it just wouldn’t have the “pop” we get from visually laying out our inputs, outputs, and the combinations thereof. That’s why, of all the use cases for Excel, financial modeling may have the best staying power, and it’s an area to check out if you want to improve your Excel skills.

For this, I would recommend Financial Modeling in Excel for Dummies by my friend Danielle Stein Fairhurst. Check out my review here.

Now, get out there and sell, and remember me when you’re a retail mogul!


It’s wild to think that everything we do with computers comes down to manipulating 0 and 1 states. So it makes sense that Boolean logic (that is, the manipulation of 0/1, or FALSE/TRUE, states) can prove incredibly powerful in data manipulation.

Download the exercise file here.

Let’s take an example. In cells A1:I49, we have a weather forecast in tabular format, with the destination and weather attribute down the first two columns and the dates across the top row. We want to populate the “itinerary”-style table starting in cell L1.

We want to “look up” values here, so an obvious choice might be VLOOKUP().

However, consider that we are looking up two things at once (both the destination and the date), and that the values we want to retrieve sit on a different “axis” of the lookup table (they are stored in one column rather than one row). Those are some heady obstacles. Our runner-up, then, might be a PivotTable simply to “reshape” the data. That would work just fine.

But let’s “hack” a solution ourselves by using SUMPRODUCT() for array multiplication.

But first, SUMJOKE()

The first thing I like to bring up when discussing SUMPRODUCT() is this knee-slapper from our friend Jordan Goldmeier:

Q: How does the #Excel developer style their hair? A: With SUMPRODUCT(). #mvpbuzz

— Jordan Goldmeier (@Option_Explicit) October 8, 2013

Now that we have that comedic relief out of the way, let’s continue with the hacking!

SUMPRODUCT() with conditional logic

Generally we use SUMPRODUCT() to multiply entire arrays together to, for example, calculate a weighted average.

We can also combine it with conditional logic to essentially “look up” a value by multiplying corresponding 1s and 0s together with our lookup value. Each cell in our table is essentially the value at a combination of three dimensions: weather attribute (Max Temp, Min Temp, etc.), destination, and date.

Let’s “slice” our lookup table starting in cell N2:

=SUMPRODUCT(($A$2:$A$49=$L2)*($B$2:$B$49=N$1)*($C$1:$I$1=$M2))

You will see that each cell is flagged as a “1.” So far, so good! Essentially you just multiplied 1 * 1 * 1 for each cell. Boolean logic at work! 

By modifying any of these attributes, we will get a 0 as a resulting value. It’s essentially like multiplying 1 * 1 * 0… which returns a 0.

From here, we’ll round out our formula by multiplying our three arrays by a fourth: the weather data itself (cells $C$2:$I$49). Our formula in N2 then becomes

=SUMPRODUCT(($A$2:$A$49=$L2)*($B$2:$B$49=N$1)*($C$1:$I$1=$M2)*$C$2:$I$49)

In “pseudo-code,” it’s like our final output is 1 * 1 * 1 * the information we want.
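
If you happen to think in R, the same Boolean trick can be sketched with a few vectors (toy data and made-up place names, purely for illustration):

#toy version of the lookup table: three dimensions plus the value we want
destination <- c("Akron", "Akron", "Denver", "Denver")
attribute   <- c("Max Temp", "Min Temp", "Max Temp", "Min Temp")
obs_date    <- c("6/1", "6/1", "6/3", "6/3")
reading     <- c(82, 60, 95, 64)

#the TRUE/FALSE flags coerce to 1s and 0s, so only the matching reading survives
sum((destination == "Denver") * (attribute == "Max Temp") * (obs_date == "6/3") * reading)
#returns 95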

Awesome! Now we have a tidy itinerary for a nice and dry cross-country trip that looks suspiciously like my own Summer of George 2016 road trip!


The Formula: The Universal Laws of Success by Albert-László Barabási, available for pre-order.

My initial impression of this book was one of skepticism. The Universal Laws of Success, really? Most books with such a sweeping title over-promise and under-deliver, offering a smattering of anecdotes to surprise and delight without really defining or systematically applying the principles in question.

Fortunately, this latest popular title by physicist-turned-network-scientist Albert-László Barabási proves an exception, finding a groove between academic rigor and real-life relevance.

Video: Success Is Not About You | The Laws of Success with Albert-László Barabási (YouTube)
Network science is cutting-edge research, and Barabási has a knack for distilling his respected scientific findings into something suitable for reading on an airplane or at a coffee shop. By the time you are done reading, the “Universal Laws of Success” may still feel like a sweeping claim, but you cannot say that Barabási has reached his conclusions through anecdote or logical sleight of hand.

As someone with academic training of my own, I could see the veins of “academese” pulsing through the pages of this book. Academics love nothing so much as definitions, and while the exercise can be tedious, it is often necessary to set the stage for any kind of systematic review of a subject.

“The popular definition of success,” Barabási argues, “reinforces the perception that ‘success’ is as loose a concept as ‘love.’ The topic’s vagueness kept scientists away—they assumed that it couldn’t be studied.” The author is thus quite careful to define his terms and how they will be measured.

And as a network scientist, Barabási, we might expect, arrives at his findings through network analysis. But that framing is an innovation in itself, as he argues: “Realizing that success is a collective phenomenon throws that perception out the window.”

Elaborating on that, the author writes: “Your success isn’t about you and your performance. It’s about us and how we perceive your performance. Or, to put it simply, your success is not about you, it’s about us.” So while the intrinsic satisfaction gained from personal mastery is a noble pursuit, that kind of success, the author admits, is beyond the scope of his research as framed by his definitions and methods.

While this academically inflected perspective may seem somewhat incongruous with our everyday notion of success, it makes for some powerful research. Barabási derives five “universal laws” of success, along the way introducing concepts like “fitness” and “popularity” and offering some algebraic equations for how these concepts interrelate.

It’s hard to call the findings “anecdotal,” as they are backed by some pretty mind-boggling research. By mining datasets as huge as Wikipedia, along with comprehensive statistics from professional tennis, modern art galleries, and more, Barabási had rigorous research to draw from while writing this book.

Why are wine judges inconsistent in their scoring? Why do the last performers at concert competitions almost always win? What makes some researchers win a Nobel Prize while other perhaps more-deserving researchers do not? Walk through these cases and more in the book.

As I said, this book, while digestible, is steeped in academic terms and definitions. Some of it goes a little too far, even for me. Take, for example, the idea of “preferential attachment,” which explains that things that succeed early on become more successful over time because of that early success. To me this just sounds like a positive feedback loop, and indeed the Wikipedia page does mention that term, along with plenty of other equations and research. It is obviously more complicated than that to researchers, but for a lay reader it was a bridge too far.

And, as you would expect in any business book, there is some lighthearted motivational talk in The Formula. Take its fifth law: “Success can come at any time as long as we are persistent.” Yes, we get the usual anecdotes about late success in life, like Ray Kroc and Alan Rickman. You get these kinds of anecdotes in every business book.

But what distinguishes The Formula from the usual business book is that it moves beyond delightful anecdotes. The premise of the book comes from, and is backed up by, rigorous research from a highly respected researcher. Thus a book about “the science of success” makes satisfying reading for those interested in science, success, or both, and it comes well recommended.


I have a long history with Excel TV, and, like Excel itself, the channel has changed over the years.

Gone are their regular live-streaming interviews with leading Excel authorities. Excel TV’s main product is now online courses. These are pre-recorded classes taught by the same caliber of talent as that of the interviews.

Excel TV’s first course, on dashboards, remains among my favorites, and I was thrilled to learn that in its latest course Excel TV turns its attention to the world of big data and the Microsoft BI ecosystem writ large.

The course, entitled “Data Science with R and Power BI,” is a well-researched course on combining these applications to deliver insight from data.

Course basics

Course instructor Ryan Wade.

This course is taught by Ryan Wade who has over 20 years of experience in business intelligence. It is delivered on Excel TV’s easy-to-use course platform; if you are a student of its other courses, it is easy to navigate between them.

Lectures are delivered as screencast video with crisp audio and visuals and include a link to download all source code. Ideally this would include the source data as well, to reproduce the results, but I have been able to modify the supplied code to apply to my own data.

Understanding R’s place in Microsoft’s World

Probably the biggest strength of this course is how clearly it positions R as a tool to use within the Microsoft ecosystem. This pairing should come as no surprise to astute Microsoft watchers: Microsoft has for years maintained its own distribution of R with Microsoft R Open and has for some time made R visuals available in Power BI.

Ryan makes full use of the Microsoft BI stack: Microsoft’s R distribution, the R Tools for Visual Studio development environment, SQL Server to store data, and (of course) Power BI to present it. There is a lot going on across these applications, and an outside primer on SQL Server and Power BI might be useful.

Pain points defined and accounted for

Every application has its strengths and weaknesses and it appears that by incorporating R so handily into its BI stack Microsoft has tacitly noted some places where R can fill in some gaps.

Ryan does a great job of explicitly identifying and providing examples of these “pain points.” For example, much of the course focuses on mining unstructured text data from the web and on using regular expressions to clean text, both weaker points of Power BI.

The course also includes solid introductions to the popular ggplot2 package for data visualization and the dplyr package for data manipulation.
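
To give a sense of what that pairing looks like in practice, here is a generic illustration with R’s built-in mtcars data (not the course’s own code):

library(dplyr)
library(ggplot2)

#summarize with dplyr, then chart the result with ggplot2
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = avg_mpg)) +
  geom_col()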

Of course, R itself is not without its problems, one of which is memory management and capacity. For this, Ryan shows how to use Microsoft SQL Server to overcome the pain point, and soon enough you will have R integrated with SQL Server, Power BI, and Visual Studio on your computer. This is a very sensible and well-constructed BI stack.

Meeting in the middle

As alluded to before, this course’s curriculum is aimed at getting you started in data science at the intersection of R and Power BI, which I illustrate with the Venn diagram above. What I hope to show is that this course is best suited not as an introduction to R or Power BI but as an introduction to using these tools together (plus SQL Server, I would add). While the course does go over some basic data types in R, novices might have difficulty comprehending the videos and code. This holds true, to a lesser extent, for Power BI.

At the risk of a shameless plug, for a more comprehensive introduction to R for the Excel user, I suggest (wait for it) my own course, “R Explained for Excel Users.” Here you will get a more brass-tacks introduction to R which will leave you in a better position to tackle more advanced courses such as Excel TV’s.

On the whole, I recommend Excel TV’s Data Science with R and Power BI. The ability to construct data science applications by combining tools in this way is quite powerful and impressive, and the course does a nice job of tailoring a curriculum to this specific use case.

Ready to get started? Learn more about the class here.


A recent project of mine has been setting up a Twitter bot that posts innovation quotes. I enjoy this project because, in addition to curating a great set of content and growing an audience around it, I have also learned a lot about coding.

From web scraping to regular expressions to social media automation, I’ve learned a lot collecting a list of over 30,000 quotes related to innovation.

Lately I’ve been turning my attention to finding quotes about computer programming, as digital savvy is crucial to innovation today. These exercises make great blog post material, and they’re quite “meta,” too: writing code to read quotes about writing code. Below I will cover the first of what I hope to make a series. For this example…

Scraping DevTopics.com’s “101 Great Computer Programming Quotes”

This is a nice set of quotes, but we can’t simply copy and paste them into a .csv file: doing so splits each quote across multiple rows and leaves its numeric position at the start. I also want to eliminate the quotation marks and parentheses from these quotations, as stylistically I tend to avoid them on Twitter.

While we might despair about the orderliness of this page based on that first attempt, make no mistake: there is well-reasoned structure running underneath the page in its HTML, and that is where we will need to go instead.

Part I: Scrape

To do this I will load up the rvest package for R and the SelectorGadget extension for Chrome.

I want to identify the HTML nodes which hold the quotes we want, then collect that text. To do that, I will initialize the SelectorGadget, then hover and click on the first quote.

In the bottom toolbar we see the value is set as li, a common HTML tag for items of a list.

Knowing this, we will use the html_nodes function in R to parse those nodes, then html_text to extract the text they hold.

Doing this will return a character vector, but I will convert it to a dataframe for ease of manipulation.

Our code thus far is below.

#initialize packages and URL
library(rvest)
library(tidyverse)
library(stringr)

link <- c("http://www.devtopics.com/101-great-computer-programming-quotes/")

#read in our url
quotes <- read_html(link)

#gather text held in the "li" html nodes
quote <- quotes %>% 
  html_nodes("li") %>% 
  html_text()

is.vector(quote)

#convert to data frame
quote <- as.data.frame(quote)

Part II: Clean

Gathering our quotes via rvest rather than copying and pasting, we get one quote per line, which is much more legible to store in our final workbook. We’ve also left behind the numeric position of each quote. But some issues with the text remain.

First off, looking through the gathered text, we see that not all of the text held in the li nodes is a quote. This takes some manual intervention to spot, but here I will use dplyr’s slice function to keep only rows 26 through 126 (corresponding to the 101 quotes).

We still want to eliminate the parentheses and quotation markers, and to do this I will use regular expression functions from stringr to replace them.

a. Replace “(”, “)”, and the opening curly quote “ with nothing ("")

This is not meant as a comprehensive guide to the notorious regular expression; if you are not familiar, I suggest Chapter 14 of R for Data Science. I assume some familiarity here, as otherwise this becomes quite tedious.

Because “(” and “)” are both metacharacters, we will need to escape them. Stringing these three characters together with the “or” pipe (|), we then use the str_replace_all function to replace strings matching any of the three with nothing ("").

b. Replace the closing curly quote ” with a space (" ")

The end of a quotation is handled differently, as we need a space between the quotation and the author; this replacement therefore gets its own call, and we use str_replace to replace matches with a space (" ").

Bonus: Set it up for social media

Because I intend to send these quotes to Twitter, I will put a couple of finishing touches on them here.

First, using the paste function from base R, I will concatenate our quotes with a couple select hashtags.

Next, I use dplyr’s filter function to exclude lines that are longer than 240 characters, using another stringr function, str_length.

The code for Part II is displayed below.

#get the rows I want
quote <- slice(quote, 26:126)

#delete the characters I don't want

charsd <- c("\\(|\\)|“")

quote$quote <- str_replace_all(quote$quote,charsd,"")

quote$quote <- str_replace(quote$quote,"”"," ")

#add hashtags for social media
quote$quote <- paste(quote$quote, "#quote #coding")

#keep only lines under 240 characters
quote <- filter(quote, str_length(quote) < 240)

#write csv
write.csv(quote,"C:/RFiles/tech2quotes.csv")

Finally, find the complete code below.
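
Assembled from the snippets in Parts I and II above (the output path is just my own):

#initialize packages and URL
library(rvest)
library(tidyverse)
library(stringr)

link <- c("http://www.devtopics.com/101-great-computer-programming-quotes/")

#read in our url
quotes <- read_html(link)

#gather text held in the "li" html nodes
quote <- quotes %>% 
  html_nodes("li") %>% 
  html_text()

#convert to data frame (the column will be named "quote")
quote <- as.data.frame(quote)

#keep only the rows that hold the quotes
quote <- slice(quote, 26:126)

#delete the characters I don't want
charsd <- c("\\(|\\)|“")
quote$quote <- str_replace_all(quote$quote, charsd, "")
quote$quote <- str_replace(quote$quote, "”", " ")

#add hashtags, then keep only lines under 240 characters
quote$quote <- paste(quote$quote, "#quote #coding")
quote <- filter(quote, str_length(quote) < 240)

#write csv
write.csv(quote, "C:/RFiles/tech2quotes.csv")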

From web scraping to dataframe manipulation to regular expression, this exercise packs a punch in dealing with real-world unstructured text data — and it comes with some enjoyable reading, too.

I hope this post inspires you to tackle the world of text, and I plan to walk through a couple more of these.


“I love to travel” is such a platitude these days, usually synonymous with “I like to go on vacation in faraway places.” But the best travel I have found often starts locally, with those quirky places you drive by all the time and never take the time to stop by and explore.

I have been on a kick recently adding these kinds of places to the Atlas Obscura, a site dedicated to chronicling the world’s unusual sites. So, while I have never been to Scandinavia, I did travel less than an hour away to the Finnish Heritage Museum, became a friend to my local Finnish community, and wrote a post about it which has received nearly 500 shares on Atlas Obscura.

Not bad for no jet lag, right?

Firsthand experience is also data

This blog’s mission is to help analysts tackle their data such that they have more time to “get out of the building” and think creatively. Research often requires not just spreadsheets and models but in-the-field reporting and experience.

Unfortunately, too many analysts are given virtually no time to experience firsthand the transactions or products they spend their days reporting on. If you’ve got a chance to walk through a location or experience a customer’s journey, do it. You’ll learn a ton, and hey, you’ve got to do something with the time you’re saving now that your reports are automated, right?

Visiting and writing about obscure local places has benefited me greatly in this aspect of being a good analyst. Here’s how:

I learn to be resourceful locally. 

“I have traveled far and wide in Concord,” Henry David Thoreau stated. Although a well-traveled individual, Thoreau was most proud of becoming a frequent traveler in his own hometown.

“The grass is always greener,” and we often apply this to both our personal and professional pursuits. Perhaps there is a way to “travel far and wide” in your own organization or profession? Sometimes your business does need a “paradigm shift”: expanding into new markets, adopting new software, and so on. Maybe you really do need to relocate to make it in X or Y profession. But often the most fulfilling “travel” is the most local.

For me, visiting odd local places has reminded me to leave no stone unturned and to assume nothing when approaching a new business challenge. Become familiar enough with something and you will assume it can’t surprise you. Look more closely and you may indeed be surprised.

I learn to work across media.

As an analyst, you not only have to make sure that you understand the data but also that your manager, co-workers in other departments, executives, and so on understand it. And for this you need to master presenting information in multiple media, whether via a dashboard, a presentation, or a written report.

If there is one thing I’ve learned from visiting odd sites it’s that information must be gleaned from a variety of media. Talk to people. Sit silently in a corner and reflect. Take pictures. Sketch. These different data points are going to intersect to make some interesting findings.

But findings are really nothing without an audience, and here you must take this variety of “raw” data and synthesize it yet again into whatever medium your audience will understand.

My Atlas Obscura posts are a combination of photographs, web-based research and first-hand experience which come together and form a web of information which future visitors can learn from.

I get out of the building. 

This is Steve Blank’s famous dictum for entrepreneurs, and it holds up for people in any business capacity. As I mentioned earlier, I find in-the-field research critical for analysts, but unfortunately they are often bound in a straitjacket of difficult-to-produce spreadsheets and reports.

It’s one thing to know a product’s selling price and turnover, but what’s it like to go to the store, purchase one and watch a loved one unwrap it for their birthday? Big data is becoming increasingly sophisticated at analyzing so-called “unstructured” data like online reviews that might shed some light on it. There is still a ton to learn from directly experiencing this process. 

For me, visiting obscure places has encouraged me to seek firsthand experience of whatever I am modeling.

Everywhere an analyst

To me, an analyst is ultimately someone who senses patterns across multiple media and uses this information toward a stated goal. To that end, an analyst is interested not only in the world of spreadsheets and reporting but in any data source that may provide insight. Firsthand experience is a powerful source of information, one that the practice of visiting and writing about obscure local places has sharpened for me.


This post serves as a follow-up on a previous post about scheduled collection of Weather.gov’s XML feed in R, which itself was a follow-up to retrieving real-time data from Weather.gov in Excel.

Reflecting on the best way to accomplish this automation, I noticed something back on Weather.gov’s update page: an option for a two-day weather history! Duh! Why automate collection every hour when I could use this link to get the history every day (or more)?

It turns out this link brings you to an HTML page with a table recording weather updates on the hour, with lots of information (more than the XML feed, in fact!).

Video: Automated Daily Weather Collection from Weather.gov in R (YouTube)

For this I will use R’s htmltab package to read the table into an R data frame, then do some manipulation before getting it into our workbook.

Let’s get started. First time using R? Check out my free course, “5 Things Excel Users Should Know About R.”

1. Inspect the table

To figure out how to pull this table into R, we need to look under the hood of the website. To do that, right-click somewhere in the table in your browser and choose “Inspect” (or “Inspect element,” depending on your browser).

2. Copy the table’s XPath

Here an editor pane comes up on your page. Notice that when you hover over different parts of this markup, different parts of the web page are highlighted. Keep hovering until the table we want to download is highlighted. We need some information about this table to write a script to collect it.

Once you have the table highlighted, right-click on this line of code and select Copy – XPath.

We will be using this in the R Script below.

3. Assemble R Script

This script will save the weather information for the past 24 hours as a .csv file based on today’s date. I read in the web page, point to the html table based on the XPath which we identified above, keep the first 24 rows for the first 24 hours (aka, today) of the weather data, and save the file as today’s date.
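
A minimal sketch along those lines follows; the URL and the XPath here are placeholders (use the history page for your station and the XPath you copied in step 2):

library(htmltab)

#placeholder URL and XPath; substitute your own station page and copied XPath
url   <- "https://w1.weather.gov/data/obhistory/KAKR.html"
xpath <- '//*[@id="obsTable"]'

#read the html table into a data frame
weather <- htmltab(doc = url, which = xpath)

#keep the first 24 rows, i.e., the most recent 24 hours
weather <- weather[1:24, ]

#save the file named for today's date
write.csv(weather,
          paste0("C:/RFiles/weather_", Sys.Date(), ".csv"),
          row.names = FALSE)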

4. Schedule it to run

For this again you could use the Windows Task Scheduler, setting the script to run every day at midnight. Check out the previous post for more on the Task Scheduler.

So there you have it, a daily download of hourly weather readings from any recording site of the National Weather Service delivered directly to you via the power of R.


One of my most popular posts is on retrieving real-time weather into Excel.

A common question here is how to get recorded weather updates; that is, keep a log of the weather updates for a specific place.

For that I will write an R script to parse weather.gov’s XML feed for a location and then use the task scheduler to automate going to the site, pulling the weather info & appending to our historical records.

Let’s get started. First time using R? Check out my free course, “5 Things Excel Users Should Know About R.”

1. Read our XML site

We will use the xml2 library to read in our .xml feed from the web and parse the various strings we are interested in. This is similar to what we did with WEBSERVICE and FILTERXML in our Excel example.

First, let’s load the xml2 library and read in our site. I will be using the feed from the Akron airport for this example.

I am also going to create a blank data frame called “history.” You’ll see why in a minute:

library(xml2)

#first time set a blank history data frame
history <- c()

#read in the web file
weather <- read_xml("https://w1.weather.gov/xml/current_obs/KAKR.xml")

2. Parse our nodes of interest

Next we will parse our nodes of interest using the xml_find_first function. I will use XPath to select these nodes – if you remember from our previous post, these will always start with “.//” and then the tag.

After that, we use xml_text and related functions to extract the information. Because our temperature variable is numeric, I use xml_double instead.

#get the temperature  
temp_f <- xml_find_first(weather, ".//temp_f")
temp_f <- xml_double(temp_f)

#get the weather description
temp_desc <- xml_find_first(weather, ".//weather")
temp_desc <- xml_text(temp_desc)

#get the observation times
obs_time <- xml_find_first(weather, ".//observation_time_rfc822")
obs_time <- xml_text(obs_time)

3. Merge the real-time update with your historical records

Now you will see why we set up the blank data frame above. Each time we run this script we will want to append our new update to our historical records. Of course at this point we don’t have any history, so the data frame is blank. That will change after this, and we will modify our code accordingly.

For now, though, I will combine our three pieces of real-time information using the cbind function (which binds columns), then I’ll append this to our historical records using rbind (which binds rows).

At this point you can write the file to a .csv file if you’d like.

#put this info together
realtime <- cbind(temp_f, temp_desc, obs_time)

#merge the realtime and historical
historyupdated <- rbind(history, realtime)


#write to a csv if you want
write.csv(historyupdated, "C:/RFiles/historyupdate.csv",
          row.names = FALSE)

4. Save the R data and set it up to load next time 

We want to save this log as an R file to read back in as our history next time. To do this I will save an RDS file (R’s format for saving a single object to disk).

So next time instead of starting up with a blank historical data frame, we’ll read this log in instead.

#RESAVE THE FILE
saveRDS(historyupdated, "C:/RFiles/weatherlog.rds")

#AFTER THAT, load up the history
history <- readRDS("C:/RFiles/weatherlog.rds")

See the full code below to get a better sense of how this works.
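
Putting the pieces together, the full script looks like this (on the very first run, use the blank-history line instead of readRDS, as described above):

library(xml2)

#first run only: start with a blank history
#history <- c()

#every run after that: load up the saved history
history <- readRDS("C:/RFiles/weatherlog.rds")

#read in the web file
weather <- read_xml("https://w1.weather.gov/xml/current_obs/KAKR.xml")

#get the temperature
temp_f <- xml_find_first(weather, ".//temp_f")
temp_f <- xml_double(temp_f)

#get the weather description
temp_desc <- xml_find_first(weather, ".//weather")
temp_desc <- xml_text(temp_desc)

#get the observation time
obs_time <- xml_find_first(weather, ".//observation_time_rfc822")
obs_time <- xml_text(obs_time)

#put this info together
realtime <- cbind(temp_f, temp_desc, obs_time)

#merge the realtime and historical
historyupdated <- rbind(history, realtime)

#write to a csv if you want
write.csv(historyupdated, "C:/RFiles/historyupdate.csv",
          row.names = FALSE)

#resave the log for next time
saveRDS(historyupdated, "C:/RFiles/weatherlog.rds")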

One last thing…  Save your R Script somewhere convenient and simple. For example I have mine in C:\RScripts. This will make the next step easier.

5. Automate retrieval with the task scheduler

The above script is pretty cool but requires the user to run it at regular times, which would get tedious.

Instead we are going to use the Task Scheduler in Windows to automate running this code.

Find the Task Scheduler using your Windows search bar and head up to Action | Create Task.

Name the task whatever you’d like. I’d also suggest you set it to “Run with highest privileges.” Remember that this script will only run when you are logged on to your computer (unless you say otherwise) and certainly will not run if the computer is off.

Next, go to Triggers. We are going to automate this script to run every hour, which is how often the weather is updated at this location. I will set the task to repeat every hour for an indefinite duration.

Almost there! This next part takes some practice. Go to Actions and create a new action.

Here you first select where your Rscript.exe file resides on your computer. For me it is C:\Program Files\R\R-3.4.3\bin\i386\Rscript.exe. Whew! Fortunately you can browse to this location.

Under Add arguments, type the name of the R script from above. Mine is named WeatherLog.R.

Now here is why I suggested you keep the file somewhere simple. Under Start in, you will put the folder where this file is located.

Now the weather comes to you.

So long as nothing in your system changes, you will get weather updates on the hour delivered to your CSV file. From here you could feed this into a workbook using Get & Transform. Bonus — from here you can use the Query Editor to transform the observation time column to your preferred format.
