Scribd Technology Blog
Get the latest articles on Core Infrastructure, Core Platform, Data Engineering, Data Science, and Internal Tools. Scribd technology builds and delivers one of the world's largest libraries, bringing the best books, audiobooks, and journalism to millions of people around the world.
2M ago
Machine Learning Platforms (ML Platforms) have the potential to be a key component in achieving production ML at scale without large technical debt, yet ML Platforms are often poorly understood. This document outlines the key concepts and paradigm shifts that led to the conceptualization of ML Platforms, in an effort to improve understanding of these platforms and how they can best be applied.
Technical Debt and development velocity defined
Development Velocity
Machine learning development velocity refers to the speed and efficiency at which machine learning (ML) projects progress from the ini ..read more
1y ago
We brought a whole team to San Francisco to present at and attend this year’s Data and AI Summit, and it was a blast! I would consider the event a success both in the attendance at the Scribd-hosted talks and in the number of talks which discussed patterns we have adopted in our own data and ML platform. The three talks I wrote about previously were well received and have since been posted to YouTube along with hundreds of other talks.
Christian Williams shared some of the work he has done developing kafka-delta-ingest in his talk:
QP Hou, Scribd Emeritus, presented on his foundational work to ens ..read more
1y ago
We recently migrated Looker to Databricks SQL Serverless, reducing both our infrastructure cost and the footprint of infrastructure we need to worry about! “Databricks SQL” provides a single load-balanced Endpoint for executing Spark SQL queries across multiple Spark clusters behind the scenes. “Serverless” is an evolution of that concept: rather than running a SQL Endpoint in our AWS infrastructure, the entirety of execution happens on the Databricks side. With a much simpler and faster interface, queries executed in Looker now return results to our users much faster than ever be ..read more
2y ago
We are very excited to be presenting at and attending this year’s Data and AI Summit, which will be hosted virtually and physically in San Francisco from June 27th-30th. Throughout the course of 2021 we completed a number of really interesting projects built around delta-rs and the Databricks platform, which we are thrilled to share with a broader audience. In addition to the presentations listed below, a number of Scribd engineers who are responsible for the data and ML platform, machine learning systems, and more will be in attendance if you want to meet up and learn more about how Scribd uses data ..read more
2y ago
Armadillo, now open source, is the fully featured audio player library Scribd uses to play and download all of its audiobooks and podcasts. It specializes in playing HLS or MP3 content that is broken down into chapters or tracks. It leverages Google’s ExoPlayer library for its audio engine. ExoPlayer wraps a variety of low-level audio and video APIs but has few opinions of its own for actually using audio in an Android app.
The leap required from ExoPlayer to audio player is enormous, both in the amount of code needed and in the amount of domain knowledge required about co ..read more
2y ago
Scribd offers a variety of publisher and user-uploaded content to our users, and while the publisher content is rich in metadata, user-uploaded content typically is not. Documents uploaded by users have varied subjects and content types, which can make it challenging to link them together. One way to connect content is through a taxonomy - an important type of structured information widely used in various domains. In this series, we have already shared how we identify document types and extract information from documents; this post will discuss how insights from data were used to help bu ..read more
3y ago
Extracting metadata from our documents is an important part of our discovery and recommendation pipeline, but discerning useful and relevant details from text-heavy user-uploaded documents can be challenging. This is part 2 in a series of blog posts describing a multi-component machine learning system the Applied Research team built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations the team faced when building information extraction NLP models for Scribd’s text-heavy documents and how they were solv ..read more
3y ago
Delta Lake is integral to our data platform, which is why we have invested heavily in delta-rs to support our non-JVM Delta Lake needs. This year I had the opportunity to share the progress of delta-rs at Data and AI Summit. Delta-rs was originally started by my colleague QP just over a year ago, and it has now grown into a multi-company project with numerous contributors and downstream projects such as kafka-delta-ingest.
In the session embedded below, I introduce the delta-rs project which is helping bring the power of Delta Lake outside of the Spark ecosystem. By providing a foundational D ..read more
3y ago
User-uploaded documents have been a core component of Scribd’s business from the very beginning; understanding what is actually in the document corpus unlocks exciting new opportunities for discovery and recommendation. With Scribd, anybody can upload and share documents, analogous to YouTube and videos. Over the years, our document corpus has become larger and more diverse, which has made understanding it an ever-increasing challenge. Over the past year, one of the missions of the Applied Research team has been to extract key document metadata to enrich downstream discovery systems. Our approach ..read more
3y ago
The long-term success of our data platform relies on putting tools into the hands of developers and data scientists to “choose their own adventure”. A big part of that story has been Databricks, which we recently integrated with Terraform to make it easy to scale a top-notch developer experience. At the 2021 Data and AI Summit, Core Platform infrastructure engineer Hamilton Hord and Databricks engineer Serge Smertin presented on the Databricks Terraform provider and how it’s been used by Scribd.
In the session embedded below, they share the details on the Databricks (Labs) Terraform integration ..read more