Scrum is a framework for organizing work in a team. We use scrum ceremonies to organize all software development and course content creation at Cognitive Class.
Scrum is well suited for agile software development. It’s possible to use scrum to organize other kinds of teams. For example, The Rhesus Chart is a fun Sci-Fi novel by Charles Stross about a coven of vampires using Scrum to take over London. For now, my scrum experience is focused on software development.
Agile software development is a fuzzy term that encompasses everything that’s not waterfall software development; waterfall being a multi-month cycle of planning, developing, and testing with discrete stages. Scrum is a specific, iterative way of doing agile.
Scrum is composed of several scrum ceremonies:
Planning and scheduling
Daily standup
Review or demo
Retrospective
Scrum is important both from a task-management perspective — what will I accomplish today? — and a project management perspective — when will the feature be finished?
As a vampire, Mhari needs to use Scrum to take over London
Who am I?
My name is Leons Petrazickis and I am a Platform Architect for Cognitive Class. Cognitive Class is an educational site that lets you take free online courses in data science and AI at your pace and at your place. Course completion is rewarded with badges and recognized with certificates backed by IBM. Many courses have hands-on labs, and the technical underpinnings of that lab environment are my area of specialty.
We use scrum ceremonies at Cognitive Class both for organizing our software development and for organizing our course content creation.
I’m fifteen years into my career as a developer or, as they say in California, engineer. I’ve built a lot of cool things. In the context of our scrum, I am the Scrum Master.
Our team has many developers, data scientists, and paid interns. Scrum ceremonies have helped us organize our work and deliver results.
Did you know you can take free courses in Python, Deep Learning, and AI at Cognitive Class?
Who are the stakeholders in scrum ceremonies?
There are three types of contributors in scrum:
The scrum master, who acts as a mentor.
The product owner, who insists on business value and deadlines.
Older descriptions of scrum divided people into pigs and chickens. Pigs have a lot of stake in a project and chickens contribute occasionally. All three above are pigs in that sense.
Planning and scheduling as scrum ceremonies
A key component of scrum is breaking up work into small chunks. If a feature is going to take more than two weeks to implement, it doesn’t fit in a scrum context. You need to break it up into smaller, self-contained features before you can start on it.
Once you have a backlog of small features, you can schedule and assign them into 1-week or 2-week sprints for your team. It is seldom sensible to plan more than one sprint in advance. As Helmuth von Moltke observed, no plan survives contact with the enemy.
Low tech is often best for managing a small backlog. There are many software tools for planning work, but the closer the tools are to post-its on a whiteboard, the more effective they will be, at least from a task management perspective. Post-its on a whiteboard are less valuable from a project management perspective, which is more concerned with due dates and velocity of work.
A feature has to be described as a user story. Purely technical tasks aren’t stories, nor are simple bug fixes. A story takes the following format:
As a [role], [name] needs [feature description] in order to [benefit description].
As a data scientist, Alice needs a visualization library in order to communicate the key data in her reports to executives.
As a miser, Ebenezer Scrooge needs three ghost visits in order to learn to be a nicer person.
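The story template can be sketched in a few lines of Python. This is only an illustration: the UserStory class and its field names are invented here and are not part of any scrum tool.

```python
from dataclasses import dataclass


@dataclass
class UserStory:
    role: str      # who the person is, e.g. "data scientist"
    name: str      # the named person the feature is for
    feature: str   # what they need
    benefit: str   # why they need it

    def render(self) -> str:
        # Pick "an" before a vowel so the sentence reads naturally.
        article = "an" if self.role[0].lower() in "aeiou" else "a"
        return (
            f"As {article} {self.role}, {self.name} needs {self.feature} "
            f"in order to {self.benefit}."
        )


story = UserStory(
    role="data scientist",
    name="Alice",
    feature="a visualization library",
    benefit="communicate the key data in her reports to executives",
)
print(story.render())
```

Forcing every field to be filled in is the point: a task that has no named person or no benefit is probably a technical task or a bug fix, not a story.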
Daily standup as a scrum ceremony
The daily standup can be synonymous with scrum itself; often we just call it the scrum.
It’s called a “stand-up” because people who are standing are uncomfortable and so less likely to ramble about unrelated topics. Stand-ups have to be quick to not waste people’s time. There’s rarely a true need for someone to say more than three sentences in a standup.
The traditional standup consists of every person speaking in turn and answering these three questions:
What did I accomplish yesterday?
What will I accomplish today?
Am I blocked by something?
An important phrase to remember for a scrum master: “Let’s discuss that AFTER we finish the stand up.”
As a scrum master, my role is to mentor everyone on the team. The scrum master should help individual contributors get past any roadblocks, as well as recognize architectural, security, technical and other challenges in work descriptions.
This year my focus has shifted from scrum being used to organize developers working on a shared project to something more like a Scrum of Scrums being used to organize our interns, each working on their own individual project. There, another question becomes important:
When will I finish all my work in this sprint?
“Finished” is a contentious word. To a developer, “finished” can sound the same as “development-complete”, but that doesn’t finish the user story. “Finished” means “available to the person named in the story”. Your feature is not finished until it’s available in all the production environments to all the end users.
Review as a scrum ceremony
One way to prove that a new feature that you delivered is finished is to demonstrate it to the rest of the team. The more unconnected everyone’s projects are, the more important it is to have regular demos. The demos communicate the value and nature of your work to your teammates.
The end of a sprint is a good time to demo.
Retrospective as a scrum ceremony
After the team finishes a big release, it’s good to discuss what worked and what didn’t, what was learned and what mistakes were made. The discussion should be blameless — not in the sense that you need to anonymize faux pas, but in the sense that we all make mistakes and that every mistake is an opportunity to learn. This discussion is the retro or retrospective scrum ceremony.
How do we streamline our development process?
Part of the scrum process is to refine the process itself and to make it fit your team. Applying critical thought and introspection to the previous sprint can only make the next sprint more effective.
Still waiting…it’s been several hours and still nothing. I watch the clock, get some tea, ruminate on the structure of dark matter….
My data science colleague is trying to extract course enrollment data from a very large database and format it in a nice dashboard, but getting this data takes far too long. Perhaps dark matter is to blame.
Let me back up.
Last year, one of the data scientists on my team was tinkering with a dashboard to summarize course enrollment and completion stats for some of our Python courses. As any good data scientist would do, he ensured his data experiment would be repeatable and usable with other data sets, so he wrote his code in a Jupyter notebook.
My colleague presented the MVP of his dashboard to the team, and we were all quite impressed. I asked him to make a small change, and he quickly updated a couple of cells in his Python notebook and declared that the change was made. So I asked to see the updated dashboard, and he replied that the notebook was running and the new visualization would probably be ready in an hour.
He saw the puzzled look on my face and laughingly said, “You’re a database guy; you should know it’s going to take more than an hour to pull all the data from the database.” Now this guy is really bright, and an amazing data scientist, so I didn’t want to sound unintelligent. But even more confused by now, I asked, “What do you mean ALL the data?”
And he calmly explained that the database has millions of rows for enrollment and completion data, and it would take some time to get all the data into his notebook. Then the Python code on his machine would process the data, but that part should be faster in production when he deploys it on a server.
That’s when another data scientist from the team peered at the code in the notebook and remarked that it was pretty good code. “The database queries are going to the right tables and the Python code to process the data looks pretty efficient.”
So I got up to look at the code myself and saw something that left me amazed. And when I read aloud “SELECT * FROM ENROLLMENTS”, the other data scientists were even more amazed at my being so very amazed. This led to everyone being amused.
Now, let me explain what really amazed me about a data scientist doing “SELECT * FROM ENROLLMENTS”. Imagine you want to buy one item from an online retailer. Would you order all the millions of items in the retailer’s warehouse to get just the one you want and then discard or return the rest of the items? Can you imagine how long it would take to have the entire inventory shipped to you? Even if all of the contents managed to reach you somehow, would you even have enough capacity and resources in your house to receive and process the entire inventory?
But apparently that is exactly what a lot of data scientists actually do. They “order” all of the items in the data warehouse and then use tools like Pandas dataframes to sift through the data they need and discard the rest.
A lot of data that data scientists work with comes from databases. This is especially true in enterprise environments where business data resides in relational database systems, data marts, and data warehouses.
Shortly after our dashboard discussion, I met with a Database Administrator (DBA) at one of the big banks. Their CEO was sold on the idea that data science could help transform the company, and data science teams had been cropping up all over the company in recent months, but that’s when his job had started to become “hell”.
DBAs run a tight ship. They tune the system and queries to the umpteenth degree so the database can hum along fine responding to predictable queries efficiently.
And then comes along a hotshot data scientist running a data experiment who somehow managed to get hold of the database credentials, and runs a huge query like “SELECT * FROM ENROLLMENTS” against an operational database. The database slows to a crawl, and the company’s clients on the website start seeing database errors and timeouts. And the DBA responsible for the database gets called to the boss’s office.
I may have exaggerated a bit but I think you get the point. If you want to get some specific data from a relational database, it’s highly wasteful to run a query like “SELECT * FROM ENROLLMENTS”, especially if the table contains millions of rows.
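To make the difference concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table layout, course IDs, and rows are made up for illustration and are not our real schema. It computes the same per-course numbers twice: once by pulling every row to the client, and once by letting the database filter and aggregate.

```python
import sqlite3

# In-memory stand-in for the real database; schema and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ENROLLMENTS (user_id INTEGER, course_id TEXT, completed INTEGER)"
)
conn.executemany(
    "INSERT INTO ENROLLMENTS VALUES (?, ?, ?)",
    [
        (1, "PY0101", 1),
        (2, "PY0101", 0),
        (3, "PY0101", 1),
        (4, "DL0101", 0),
        (5, "DL0101", 1),
    ],
)

# Wasteful: ship every row to the client, then count in Python.
all_rows = conn.execute("SELECT * FROM ENROLLMENTS").fetchall()
client_side = {}
for _user_id, course_id, completed in all_rows:
    enrolled, done = client_side.get(course_id, (0, 0))
    client_side[course_id] = (enrolled + 1, done + completed)

# Better: the database filters and aggregates, returning one row per course.
query = """
    SELECT course_id, COUNT(*) AS enrolled, SUM(completed) AS completed
    FROM ENROLLMENTS
    WHERE course_id LIKE ?
    GROUP BY course_id
"""
summary = conn.execute(query, ("PY%",)).fetchall()
print(summary)
```

With five rows the two approaches are indistinguishable. With millions of rows, the first moves the whole table over the network and into memory, while the second moves only a handful of summary rows.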
Not long after this meeting with the bank DBA, I realized that Data Scientists needed help working with databases (and so did DBAs who now had to deal with Data Scientists). By leveraging both my well-honed database skills and my newly minted Data Science skills, I could help Data Scientists (or those aspiring to become one) work more efficiently with databases and SQL.
Working with my colleagues Hima Vasudevan and Raul Chong, we launched the course Databases and SQL for Data Science on Coursera. It is an online self-study course that you can complete at your own pace.
This course introduces relational database concepts and helps you learn and apply knowledge of the SQL language. It also shows you how to perform SQL access in a data science environment like Jupyter notebooks.
The course requires no prior knowledge of databases, SQL, Python, or programming. It has four modules and each requires 2 – 4 hours of effort to complete. Topics covered include:
– Introduction to Databases
– How to Create a Database Instance on Cloud
– CREATE Table Statement
– SELECT Statement
– INSERT Statement
– UPDATE and DELETE Statements
– Optional: Relational Model Concepts
– Using String Patterns, Ranges
– Sorting Result Sets
– Grouping Result Sets
– Built-in Functions, Dates, Timestamps
– Sub-Queries and Nested Selects
– Working with Multiple Tables
– Optional: Relational Model Constraints
– How to access databases using Python
– Writing code by Using DB-API
– Connecting to a Database by Using ibm_db API
– Creating Tables, Loading Data, and Querying Data from Jupyter Notebooks
– Analyzing Data with SQL and Python
– Optional: INNER JOIN, LEFT, RIGHT OUTER JOIN
– Working with Real-world Data Sets
– Assignment: Analyzing Chicago Data Sets using SQL and Python
The emphasis in this course is hands-on and practical learning. As such, you will work with real databases, real data science tools, and real-world datasets. You will create a database instance in the cloud. Through a series of hands-on labs, you will practice building and running SQL queries using cloud based tools. You will also learn how to access databases from Jupyter notebooks by using SQL and Python.
Anyone can audit this course at no charge. If you want a certificate and access to graded components of the course, there is currently a limited time price of $39 USD. And if you are looking for a Professional Certificate in Data Science, this course is one of the 9 courses in the IBM Data Science Professional Certificate.
So if you are interested in learning SQL for Data Science, you can enroll now and audit for free.
Today IBM and Coursera launched an online Data Science Professional Certificate to address the shortage of skills in data-related professions. This certificate is designed for those interested in a career in Data Science or AI, and equips people to become job-ready through hands-on, practical learning.
IBM Data Science Professional Certificate
In this post we look at
why this certificate is being created (the demand),
what is being offered,
how it differs from other offerings,
who it is for,
the duration and cost,
what outcomes you should expect,
and your next steps.
Why this Professional Certificate
Data is collected in every aspect of our existence. The true transformative impact of data is realizable only when we can mine and act upon the insights contained within the data. Thus it is no surprise to see phrases such as “data is the new oil” (Economist).
We see organizations in most spaces seeding data-related initiatives. Companies that leverage and act upon the gems of information contained within data will get ahead of the competition – or even transform their industries. The transformative aspect of data is also applicable to the not-for-profit sector, for the betterment of society and improving our existence.
A variety of data related professions are relevant: Data Scientist, Data Engineer, Data Analyst, Database Developer, Business Intelligence (BI) Analyst, etc., and the most prominent of those is Data Scientist. It has been called “the sexiest job of the 21st century” by the Harvard Business Review, and Glassdoor calls it the “best job in America”.
Job listings and salary profiles for this profession clearly reflect this. When we talk to our clients we see a common thread: they can’t find enough qualified people to staff their data projects. This has created a tremendous opportunity for data professionals, especially Data Scientists.
In a recent report, IBM projected that “by 2020 the number of positions for data and analytics talent in the United States will increase by 364,000 openings, to 2,720,000”. The global demand is even higher.
Even though Data Science is “hot and sexy” and might enable you to get a great job today, the question is will it continue to be in demand and important going forward?
I certainly believe so, at least for another decade or more. Data is being created and collected at a rapid pace, and the number of organizations leveraging data is also expected to increase significantly.
“IBM’s Data Science Professional Certificate on Coursera fulfills a massive need for more data science talent in the US and globally,” said Leah Belsky, Vice President of Enterprise at Coursera. “Coursera offers online courses on everything from computer science to literature, but over a quarter of all enrollments from our 7 million users in the US are in data science alone. We expect IBM’s certificate will become a valuable credential for people wanting to start a career in data science.”
What we offer
It is with this in mind that IBM developed the Data Science Professional Certificate. It consists of 9 courses that are intended to arm you with the latest job-ready skills and techniques in Data Science.
The courses cover a variety of data science topics including: open source tools and libraries, methodologies, Python, databases and SQL, data visualization, data analysis, and machine learning. You will practice hands-on in the IBM Cloud (at no additional cost) using real data science tools and real-world data sets.
The courses in the Data Science Professional Certificate include:
What is Data Science
Tools for Data Science
Data Science Methodology
Python for Data Science
Databases and SQL for Data Science
Data Visualization with Python
Data Analysis with Python
Machine Learning with Python
Applied Data Science Capstone
How it is different
This professional certificate has a strong emphasis on applied learning. Except for the first course, all other courses include a series of hands-on labs and are performed in the IBM Cloud (without any cost to you).
Throughout this Professional Certificate you are exposed to a series of tools, libraries, cloud services, datasets, algorithms, assignments and projects that will provide you with practical skills with applicability to real jobs that employers value, including:
Libraries: Pandas, NumPy, Matplotlib, Seaborn, Folium, ipython-sql, Scikit-learn, SciPy, etc.
Projects: random album generator, predict housing prices, best classifier model, battle of neighborhoods
Who this is for
Data Science is for everyone – not just those with a Master’s or Ph.D. Anyone can become a Data Scientist, whether or not you currently have computer science or programming skills. It is suitable for those entering the workforce as well as for existing professionals looking to upskill/re-skill themselves.
I consider a Data Scientist as someone who can find the right data, prepare it, analyze and visualize data using a variety of tools and algorithms, build data experiments and models, run these experiments, learn from them, adjust and re-iterate as needed, and eventually be able to tell the story hidden within data so it can be acted upon – either by a human or a machine.
If you are passionate about pursuing a career line that is in high demand with above average starting salaries, and if you have the drive and discipline for self-learning, this Data Science Professional Certificate is for you.
“Now is a great time to enter the Data Science profession and IBM is committed to help address the skills gap and promote data literacy” says Leon Katsnelson, CTO and Director, IBM Developer Skills Network. “Coursera, with over 33 million registered learners, is a great platform for us to partner with and help with our mission to democratize data skills and build a pipeline of data literate professionals.”
Cost and Duration
The courses in this certificate are offered online for self-learning and available for “audit” at no cost. “Auditing” a course gives you the ability to access all lectures, readings, labs, and non-graded assignments at no charge. If you want to learn and develop skills you can audit all the courses for free.
The graded quizzes, assignments, and verified certificates are only available with a low-cost monthly subscription (just $39 USD per month for a limited time). So if you require a verified certificate to showcase your achievement with prospective employers and others, you will need to purchase the subscription. Enterprises looking to skill their employees in Data Science can access the Coursera for Business offering. Financial aid is also available for those who qualify.
The certificate requires completion of 9 courses. Each course typically contains 3-6 modules with an average effort of 2 to 4 hours per module. If learning part-time (e.g. 1 module per week), it would take 6 to 12 months to complete the entire certificate. If learning full-time (e.g. 1 module per day) the certificate can be completed in 2 to 3 months.
Upon completing the courses in this Professional Certificate you will have done several hands-on assignments and built a portfolio of Data Science projects to provide you with the confidence to plunge into an exciting profession in Data Science.
Those pursuing a paid certificate will not only receive a course completion certificate for every course they complete but also receive an IBM open badge. Successfully completing all courses earns you the Data Science Professional Certificate as well as an IBM digital badge recognizing your proficiency in Data Science. These credentials can be shared on your social profiles such as LinkedIn, and also with employers.
Sample of IBM Data Science Professional Certificate
IBM and Coursera are also working together to form a hiring consortium. Learners who obtain the verified Certificate will be able to opt-in to have their resumes sent to employers in the consortium.
All courses in this Data Science Professional Certificate are available today. Enroll today, start developing skills employers are looking for, and kickstart your career in Data Science.
IBM has partnered with edX.org, the leading online learning destination founded by Harvard and MIT, for the delivery of several Professional Certificate programs. Professional Certificate programs are a series of in-demand courses designed to build or advance critical skills for a specific career.
“We are honored to welcome IBM as an edX partner,” said Anant Agarwal, edX CEO and MIT Professor. “IBM is defined by its commitment to constant innovation and its culture of lifelong learning, and edX is delighted to be working together to further this shared commitment. We are pleased to offer these Professional Certificate programs in Deep Learning and Chatbots to help our learners gain the knowledge needed to advance in these incredibly in-demand fields. Professional Certificate programs, like these two new offerings on edX, deliver career-relevant education in a flexible, affordable way, by focusing on the skills industry leaders and successful professionals are seeking today.”
EdX.org is a great partner for us too, not just because they have an audience of over 17 million students, but because their mission of increasing access to high-quality education for everyone so closely aligns with our own.
“Today we’re seeing a transformational shift in society. Driven by innovations like AI, cloud computing, blockchain and data analytics, industries from cybersecurity to healthcare to agriculture are being revolutionized. These innovations are creating new jobs but also changing existing ones—and require new skills that our workforce must be equipped with. We are therefore taking our responsibility by partnering with edX to make verified certificate programs available through their platform that will enable society to embrace and develop the skills most in-demand” said IBM Chief Learning Officer Gordon Fuller.
The IBM Skills Network (of which Cognitive Class is a part) also relies on Open edX, the open source platform that powers edX.org, and we plan to contribute back our enhancements as well as support the development of this MOOC project. To learn more about how we use (and scale) Open edX, check out our [recent post] on the topic.
We are kicking off this collaboration with two Professional Certificate programs that might be of interest to you.
– Deep Learning (the first course in the program, Deep Learning Fundamentals with Keras, is open for enrollment today and starts September 16)
– Building Chatbots Powered by AI (the first course in the program, How to Build Chatbots and Make Money, is open for enrollment today and already running)
Those of you who are familiar with my chatbot course on Cognitive Class will recognize the first course on the list. The key difference is that this version on edX includes a module on making money from chatbots.
The last course in this chatbot program focuses on other Watson services, specifically the powerful combination of Watson Assistant and Watson Discovery to create smarter chatbots that can draw answers from your existing knowledge base.
All in all, this program is still accessible to people with limited programming skills, though you will get the most out of it if you are a programmer.
The Deep Learning program is aimed at professionals and students interested in machine learning and data science. Once completed, it will include five courses:
The goal of these programs is to get you ready to use exciting new technologies in the emerging fields of Data Science, Machine Learning, AI, and more. The skills you’ll acquire through these highly practical programs will help you advance your career, whether at your current job or when seeking new employment.
It’s a competitive market out there, and we are confident that these programs will serve you well. If you are an employer looking to re-skill your workforce, these programs are also an ideal way to do so in a structured manner.
The certificates also look quite good on a resume (or LinkedIn) as passing these courses and completing the programs demonstrates a substantial understanding of the topics at hand. This isn’t just theory. You can’t complete these Professional Certificate programs without getting your hands dirty, so to speak.
We also plan to launch more Professional Certificates in collaboration with edX.org, but if you have an interest in advancing your career in Data Science and AI, we recommend that you start with these two.
Ever since we launched, Cognitive Class has hit many milestones. From name changes (raise your hand if you remember DB2 University) to our 1,000,000th learner, we’ve been through a lot.
But in this post, I will focus on the milestones and evolution of the technical side of things, specifically how we went from a static infrastructure to a dynamic and scalable deployment of dozens of Open edX instances using Docker.
Open edX 101
Open edX is the open source code behind edx.org. It is composed of several repositories, edx-platform being the main one. The official method of deploying an Open edX instance is by using the configuration repo which uses Ansible playbooks to automate the installation. This method requires access to a server where you run the Ansible playbook. Once everything is done you will have a brand new Open edX deployment at your disposal.
This is how we run cognitiveclass.ai, our public website, since we migrated from a Moodle deployment to Open edX in 2015. It has served us well, as we are able to serve hundreds of concurrent learners over 70 courses every day.
But this strategy didn’t come without its challenges:
Open edX mainly targets Amazon’s AWS services and we run our infrastructure on IBM Cloud.
Deploying a new instance requires creating a new virtual machine.
Open edX reads configurations from JSON files stored in the server, and each instance must keep these files synchronized.
While we were able to overcome these in a large single deployment, they would be much harder to manage for our new offering, the Cognitive Class Private Portals.
Cognitive Class for business
When presenting to other companies, we often hear the same question: “How can I make this content available to my employees?” That was the main motivation behind our Private Portals offer.
A Private Portal represents a dedicated deployment created specifically for a client. From a technical perspective, this new offering would require us to spin up new deployments quickly and on-demand. Going back to the points highlighted earlier, numbers two and three are especially challenging as the number of deployments grows.
Creating and configuring a new VM for each deployment is a slow and costly process. And if a particular Portal outgrows its resources, we would have to find a way to scale it and manage its configuration across multiple VMs.
At the same time, we were experiencing a similar demand in our Virtual Labs infrastructure, where the use of hundreds of VMs was becoming unbearable. The team started to investigate and implement a solution based on Docker.
The main benefits of Docker for us were twofold:
Increase server usage density;
Isolate services’ processes and files from each other.
These benefits are deeply related: since each container manages its own runtime and files we are able to easily run different pieces of software on the same server without them interfering with each other. We do so with a much lower overhead compared to VMs since Docker provides a lightweight isolation between them.
By increasing usage density, we are able to run thousands of containers on a handful of larger servers that can be pre-provisioned ahead of time, instead of having to manage thousands of smaller instances.
For our Private Portals offering this means that a new deployment can be ready to be used in minutes. The underlying infrastructure is already in place so we just need to start some containers, which is a much faster process.
Herding containers with Rancher
Docker in and of itself is a fantastic technology but for a highly scalable distributed production environment, you need something on top of it to manage your containers’ lifecycle. Here at Cognitive Class, we decided to use Rancher for this, since it allows us to abstract our infrastructure and focus on the application itself.
In a nutshell, Rancher organizes containers into services and services are grouped into stacks. Stacks are deployed to environments, and environments have hosts, which are the underlying servers where containers are eventually started. Rancher takes care of creating a private network across all the hosts so they can communicate securely with each other.
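The hierarchy is easier to keep straight when written down as data. The sketch below models it with plain Python dataclasses; the class names simply mirror the terms above, and the portal, service, and host names are invented. It does not use or represent Rancher’s actual API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    name: str
    containers: List[str] = field(default_factory=list)


@dataclass
class Stack:
    name: str
    services: List[Service] = field(default_factory=list)


@dataclass
class Environment:
    name: str
    hosts: List[str] = field(default_factory=list)    # servers where containers run
    stacks: List[Stack] = field(default_factory=list)

    def container_count(self) -> int:
        # Walk stacks -> services -> containers and count the leaves.
        return sum(
            len(service.containers)
            for stack in self.stacks
            for service in stack.services
        )


portal = Stack(
    "portal-acme",
    [Service("lms", ["lms-1", "lms-2"]), Service("cms", ["cms-1"])],
)
env = Environment("production", hosts=["host-a", "host-b"], stacks=[portal])
print(env.container_count())
```

Note that hosts hang off the environment, not the stack: a stack describes what should run, and the environment decides where.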
Getting everything together
Our Portals are organized in a micro-services architecture and grouped together in Rancher as a stack. Open edX is the main component and is itself broken into smaller services. On top of Open edX we have several other components that provide additional functionality to our offering. Overall, this is what things look like in Rancher:
There is a lot going on here, so let’s break it down and quickly explain each piece:
lms: this is where students access course content
cms: used for authoring courses
forum: handles course discussions
nginx: serves static assets
rabbitmq: message queue system
glados: admin user interface to control and customize the Portal
companion-cube: API to expose extra functionality of Open edX
compete: service to run data hackathons
learner-support: built-in learner ticket support system
lp-certs: issues certificates for students who complete multiple courses
cms-workers and lms-workers: execute background tasks for `lms` and `cms`
glados-worker: executes background tasks for `glados`
letsencrypt: automatically manages SSL certificates using Let’s Encrypt
load-balancer: routes traffic to services based on request hostname
mailer: proxies SMTP requests to an external server, or sends emails itself otherwise
ops: group of containers used to run specific tasks
rancher-cron: starts containers following a cron-like schedule
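As an illustration of what “routes traffic to services based on request hostname” means for the load-balancer, here is a minimal sketch. The hostnames and the routing table are hypothetical, and the real load balancer is configured through Rancher rather than written in Python.

```python
from typing import Dict, Optional

# Hypothetical routing table: request hostname -> service name.
ROUTES: Dict[str, str] = {
    "courses.example.com": "lms",
    "studio.example.com": "cms",
    "admin.example.com": "glados",
}


def route(host_header: str) -> Optional[str]:
    """Return the service that should handle a request, or None if unknown."""
    # Drop an optional port and normalise case; hostnames are case-insensitive.
    hostname = host_header.split(":", 1)[0].strip().lower()
    return ROUTES.get(hostname)


print(route("Studio.example.com:443"))
print(route("unknown.example.com"))
```

Because the decision depends only on the hostname, many Portals can share the same servers and the same entry point while still reaching their own services.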
The ops service behaves differently from the other ones, so let’s dig a bit deeper into it:
Here we can see that there are several containers inside ops and that they are usually not running. Some containers, like edxapp-migrations, run when the Portal is deployed but are not expected to be started again unless in special circumstances (such as if the database schema changes). Other containers, like backup, are started by rancher-cron periodically and stop once they are done.
In both cases, we can trigger a manual start by clicking the play button. This gives us the ability to easily run important operational tasks on demand without having to worry about SSHing into specific servers and figuring out which script to run.
One key aspect of Docker is that the file system is isolated per container. This means that, without proper care, you might lose important files if a container dies. The way to handle this situation is to use Docker volumes to mount local file system paths into the containers.
Moreover, when you have multiple hosts, it is best to have a shared data layer to avoid creating implicit scheduling dependencies between containers and servers. In other words, you want your containers to have access to the same files no matter which host they are running on.
Each Portal has its own directory on the NFS drive, and the containers mount only the directory of that specific Portal, so one Portal cannot access the files of another.
One of the most important files is ansible_overrides.yml. As we mentioned at the beginning of this post, Open edX is configured using JSON files that are read when the process starts. The Ansible playbook generates these JSON files when executed.
To propagate changes made by Portal admins on glados to the lms and cms of Open edX we mount ansible_overrides.yml into the containers. When something changes, glados can write the new values into this file and lms and cms can read them.
We then restart the lms and cms containers which are set to run the Ansible playbook and re-generate the JSON files on start up. ansible_overrides.yml is passed as a variables file to Ansible so that any values declared in there will override the Open edX defaults.
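The override step can be pictured as a dictionary merge. This is a simplified sketch, not how Ansible is implemented: real Ansible variable precedence has more levels, and the setting names PLATFORM_NAME and THEME are invented for the illustration.

```python
import json


def apply_overrides(defaults, overrides):
    """Return a new dict in which values from overrides win over defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def render_json_config(defaults, overrides):
    """Serialize the merged settings, similar to the JSON files read at start-up."""
    return json.dumps(apply_overrides(defaults, overrides), indent=2, sort_keys=True)


# Hypothetical defaults and the values an admin changed through glados.
defaults = {"PLATFORM_NAME": "Open edX", "THEME": "default"}
overrides = {"PLATFORM_NAME": "Acme Academy"}

print(render_json_config(defaults, overrides))
```

Settings that the overrides file does not mention (THEME here) keep their default values, which is why the file only needs to list what a Portal admin actually changed.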
By having this shared data layer, we don’t have to worry about containers being rescheduled to another host since we are sure Docker will be able to find the proper path and mount the required volumes into the containers.
By building on top of the lessons we learned as our platform evolved and by using the latest technologies available, we were able to build a fast, reliable and scalable solution to provide our students and clients a great learning experience.
We covered a lot in this post and I hope you were able to learn something new today. If you are interested in learning more about our Private Portals offering, fill out our application form and we will contact you.
Every data scientist I know spends a lot of time handling data that originates in CSV files. You can quickly end up with a mess of CSV files located in your Documents, Downloads, Desktop, and other random folders on your hard drive.
I greatly simplified my workflow the moment I started organizing all my CSV files in my Cloud account. Now I always know where my files are and I can read them directly from the Cloud using JupyterLab (the new Jupyter UI) or my Python scripts.
This article will teach you how to read your CSV files hosted on the Cloud in Python as well as how to write files to that same Cloud account.
I’ll use IBM Cloud Object Storage, an affordable, reliable, and secure Cloud storage solution. (Since I work at IBM, I’ll also let you in on a secret of how to get 10 Terabytes for a whole year, entirely for free.) This article will help you get started with IBM Cloud Object Storage and make the most of this offer. It is composed of three parts:
How to use IBM Cloud Object Storage to store your files;
Reading CSV files in Python from Object Storage;
Writing CSV files to Object Storage (also in Python of course).
The “Storage” part of object storage is pretty straightforward, but what exactly is an object and why would you want to store one? An object is basically any conceivable data. It could be a text file, a song, or a picture. For the purposes of this tutorial, our objects will all be CSV files.
Unlike a typical filesystem (like the one used by the device you’re reading this article on) where files are grouped in hierarchies of directories/folders, object storage has a flat structure. All objects are stored in groups called buckets. This structure allows for better performance, massive scalability, and cost-effectiveness.
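To make the flat layout concrete, here is a toy in-memory model. It is not how IBM Cloud Object Storage is implemented, and the names buckets, put, and list_keys are made up for this illustration. Each bucket is a plain mapping from key to bytes, and a key such as reports/2019/sales.csv is a single string rather than a path through nested folders.

```python
# Toy model of object storage: bucket name -> {object key -> object bytes}.
buckets = {"cc-tutorial": {}}


def put(bucket, key, data):
    """Store data under key; slashes in the key have no special meaning."""
    buckets[bucket][key] = data


def list_keys(bucket, prefix=""):
    """Return the keys in a bucket that start with prefix, sorted."""
    return sorted(k for k in buckets[bucket] if k.startswith(prefix))


put("cc-tutorial", "sales.csv", b"id,amount\n1,10\n")
put("cc-tutorial", "reports/2019/sales.csv", b"id,amount\n2,20\n")

print(list_keys("cc-tutorial"))
print(list_keys("cc-tutorial", prefix="reports/"))
```

Filtering by prefix gives a folder-like view when you want one, while the store itself stays a simple key lookup.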
By the end of this article, you will know how to store your files on IBM Cloud Object Storage and easily access them using Python.
Provisioning an Object Storage Instance on IBM Cloud
Visit the IBM Cloud Catalog and search for “object storage”. Click the Object Storage option that pops up. Here you’ll be able to choose your pricing plan. Feel free to use the Lite plan, which is free and allows you to store up to 25 GB per month.
Sign up (it’s free) or log in with your IBM Cloud account, and then click the Create button to provision your Object Storage instance. You can customize the Service Name if you wish, or just leave it as the default. You can also leave the resource group set to the default. Resource groups are useful for organizing your resources on IBM Cloud, particularly when you have many of them running.
Working with Buckets
Since you just created the instance, you’ll now be presented with options to create a bucket. You can always find your Object Storage instance by selecting it from your IBM Cloud Dashboard.
There’s a limit of 100 buckets per Object Storage instance, but each bucket can hold billions of objects. In practice, how many buckets you need will be dictated by your availability and resilience needs.
For the purposes of this tutorial, a single bucket will do just fine.
Creating your First Bucket
Click the Create Bucket button and you’ll be shown a window like the one below, where you can customize some details of your bucket. All these options may seem overwhelming at first, but don’t worry, we’ll explain them shortly. They are part of what makes this service so customizable, should you have the need later on.
If you don’t care about the nuances of bucket configuration, you can type in any unique name you like and press the Create button, leaving all other options at their defaults. You can then skip to the Putting Objects in Buckets section below. If you would like to learn what these options mean, read on.
Configuring your bucket
Resiliency Options
Cross Region: Your data is stored across three geographic regions within your selected location. High availability and very high durability.
Regional: Your data is stored across three different data centers within a single geographic region. High availability and durability, very low latency for regional users.
Single Data Center: Your data is stored across multiple devices within a single data center.
Storage Class Options
Frequency of Data Access: IBM Cloud Object Storage Class
Weekly or monthly: Vault
Less than once a month: Cold Vault
Feel free to experiment with different configurations, but I recommend choosing “Standard” for your storage class for this tutorial’s purposes. Any resilience option will do.
Putting Objects in Buckets
After you’ve created your bucket, store the name of the bucket into the Python variable below (replace cc-tutorial with the name of your bucket) either in your Jupyter notebook or a Python script.
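A minimal version of that cell, with cc-tutorial as a placeholder for your own bucket name:

```python
# Name of the bucket you created; "cc-tutorial" is only a placeholder.
bucket_name = "cc-tutorial"
```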
There are many ways to add objects to your bucket, but we’ll start with something simple. Add a CSV file of your choice to your newly created bucket, either by clicking the Add objects button, or dragging and dropping your CSV file into the IBM Cloud window.
Whatever CSV file you decide to add to your bucket, assign the name of the file to the variable filename below so that you can easily refer to it later.
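For instance, if the uploaded file were called my_data.csv (a made-up name for this example):

```python
# Key of the CSV file you uploaded; "my_data.csv" is a hypothetical example.
filename = "my_data.csv"
```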
We’ve placed our first object in our first bucket. Now let’s see how we can access it. To access your IBM Cloud Object Storage instance from anywhere other than the web interface, you will need to create credentials. Click the New credential button under the Service credentials section to get started.
In the next window, you can leave all fields as their defaults and click the Add button to continue. You’ll now be able to click on View credentials to obtain the JSON object containing the credentials you just created. You’ll want to store everything you see in a credentials variable like the one below (obviously, replace the placeholder values with your own).
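Here is a sketch of what that variable can look like. The field names follow the JSON that IBM Cloud shows for Object Storage service credentials. Every value is a placeholder, and your own JSON may contain additional fields.

```python
# Placeholder credentials: paste the values from "View credentials" here.
credentials = {
    "apikey": "your-api-key",
    "endpoints": "your-endpoints-url",
    "iam_apikey_description": "your-api-key-description",
    "iam_apikey_name": "your-api-key-name",
    "iam_role_crn": "your-iam-role-crn",
    "iam_serviceid_crn": "your-iam-service-id-crn",
    "resource_instance_id": "your-resource-instance-id",
}
```

Only apikey and resource_instance_id are needed for the connection step later in this tutorial.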
Note: If you’re following along within a notebook be careful not to share this notebook after adding your credentials!
Reading CSV files from Object Storage using Python
The recommended way to access IBM Cloud Object Storage with Python is to use the ibm_boto3 library, which we’ll import below.
The primary way to interact with IBM Cloud Object Storage through ibm_boto3 is by using an ibm_boto3.resource object. This resource-based interface abstracts away the low-level REST interface between you and your Object Storage instance.
Run the cell below to create a resource Python object using the IBM Cloud Object Storage credentials you filled in above.
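A sketch of that cell follows, assuming the ibm-cos-sdk package (which provides ibm_boto3) is installed. The two endpoint URLs are assumptions: the service endpoint depends on where your bucket lives, so copy the one listed on the Endpoints page of your Object Storage instance. The helper names cos_connection_kwargs and create_cos_resource are made up for this example.

```python
# Assumed endpoints; replace SERVICE_ENDPOINT with the one for your bucket's location.
AUTH_ENDPOINT = "https://iam.cloud.ibm.com/identity/token"
SERVICE_ENDPOINT = "https://s3.us.cloud-object-storage.appdomain.cloud"


def cos_connection_kwargs(credentials, service_endpoint, auth_endpoint):
    """Map the service-credentials JSON onto ibm_boto3.resource keyword arguments."""
    return {
        "ibm_api_key_id": credentials["apikey"],
        "ibm_service_instance_id": credentials["resource_instance_id"],
        "ibm_auth_endpoint": auth_endpoint,
        "endpoint_url": service_endpoint,
    }


def create_cos_resource(credentials, service_endpoint=SERVICE_ENDPOINT,
                        auth_endpoint=AUTH_ENDPOINT):
    """Create the resource object used to talk to Object Storage."""
    # Imported here so the helper above works even without the SDK installed.
    import ibm_boto3
    from ibm_botocore.client import Config

    return ibm_boto3.resource(
        "s3",
        config=Config(signature_version="oauth"),
        **cos_connection_kwargs(credentials, service_endpoint, auth_endpoint),
    )
```

In a notebook you would call it once and keep the result, for example resource = create_cos_resource(credentials).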
After creating a resource object, we can easily access any of our Cloud objects by specifying a bucket name and a key (in our case the key is a filename) to our resource.Object method and calling the get method on the result. In order to get the object into a useful format, we’ll do some processing to turn it into a pandas dataframe.
We’ll make this into a function so we can easily use it later:
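One possible shape for that function is sketched below. It takes the resource object as an argument so it does not depend on a global variable, and it assumes pandas is installed. The function names are chosen for this example.

```python
import io


def read_object_bytes(resource, bucket_name, key):
    """Download one object and return its raw bytes."""
    body = resource.Object(bucket_name, key).get()["Body"]
    return body.read()


def read_csv_from_cos(resource, bucket_name, key, **read_csv_kwargs):
    """Load a CSV object into a pandas DataFrame."""
    import pandas as pd  # imported lazily; only this function needs pandas

    data = read_object_bytes(resource, bucket_name, key)
    # Extra keyword arguments (sep, encoding, ...) are passed on to pandas.
    return pd.read_csv(io.BytesIO(data), **read_csv_kwargs)
```

With the variables defined earlier, df = read_csv_from_cos(resource, bucket_name, filename) would give you the dataframe.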
Adding files to IBM Cloud Object Storage with Python
IBM Cloud Object Storage’s web interface makes it easy to add new objects to your buckets, but at some point you will probably want to create objects programmatically from Python. The put_object method allows you to do this.
In order to use it you will need:
The name of the bucket you want to add the object to;
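As a hedged sketch, put_object on a bucket takes the object’s name as Key and its content as Body. The wrapper name write_csv_text is made up for this example, and it assumes a resource object like the one created earlier.

```python
def write_csv_text(resource, bucket_name, key, csv_text):
    """Create (or overwrite) an object whose content is the given CSV text."""
    return resource.Bucket(bucket_name).put_object(
        Key=key,
        Body=csv_text.encode("utf-8"),
    )
```

If your data is in a pandas dataframe df, df.to_csv(index=False) returns the CSV as a string, so write_csv_text(resource, bucket_name, "new_file.csv", df.to_csv(index=False)) would upload it.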
You can now easily access your newly created object using the function we defined above in the Reading CSV files from Object Storage using Python section.
Get 10 Terabytes of IBM Cloud Object Storage for free
You now know how to read from and write to IBM Cloud Object Storage using Python! Well done. The ability to programmatically read and write files to the Cloud will be quite handy when working from scripts and Jupyter notebooks.
If you build applications or do data science, we also have a great offer for you. You can apply to become an IBM Partner at no cost to you and receive 10 Terabytes of space to play and build applications with.
You can sign up by simply filling out the embedded form below. If you are unable to fill out the form, you can click here to open the form in a new window.
Just make sure that you apply with a business email (even your own domain name if you are a freelancer) as free email accounts like Gmail, Hotmail, and Yahoo are automatically rejected.