Data and AI Summit Wrap-up
Scribd Technology Blog
by R Tyler Croy
6M ago
We brought a whole team to San Francisco to present at and attend this year’s Data and AI Summit, and it was a blast! I consider the event a success, both in attendance at the Scribd-hosted talks and in the number of talks that discussed patterns we have adopted in our own data and ML platform. The three talks I wrote about previously were well received and have since been posted to YouTube along with hundreds of other talks. Christian Williams shared some of the work he has done developing kafka-delta-ingest in his talk. QP Hou, Scribd Emeritus, presented on his foundational work to ens ..read more
Accelerating Looker with Databricks SQL Serverless
Scribd Technology Blog
by Hamilton Hord
7M ago
We recently migrated Looker to Databricks SQL Serverless, reducing both our infrastructure cost and the footprint of infrastructure we need to worry about! “Databricks SQL” provides a single load-balanced Endpoint for executing Spark SQL queries across multiple Spark clusters behind the scenes. “Serverless” is an evolution of that concept: rather than running a SQL Endpoint in our AWS infrastructure, the entirety of execution happens on the Databricks side. With a much simpler and faster interface, queries executed in Looker now return results much faster to our users than ever be ..read more
Scribd is presenting at Data and AI Summit 2022
Scribd Technology Blog
by R Tyler Croy
9M ago
We are very excited to be presenting at and attending this year’s Data and AI Summit, which will be hosted virtually and physically in San Francisco from June 27th-30th. Throughout the course of 2021 we completed a number of really interesting projects built around delta-rs and the Databricks platform which we are thrilled to share with a broader audience. In addition to the presentations listed below, a number of Scribd engineers who are responsible for the data and ML platform, machine learning systems, and more, will be in attendance if you want to meet up and learn more about how Scribd uses data ..read more
Armadillo makes audio players in Android easy
Scribd Technology Blog
by Nathan Sass
1y ago
Armadillo is the fully featured audio player library Scribd uses to play and download all of its audiobooks and podcasts, and it is now open source. It specializes in playing HLS or MP3 content that is broken down into chapters or tracks. It leverages Google’s ExoPlayer library for its audio engine. ExoPlayer wraps a variety of low-level audio and video APIs but has few opinions of its own for actually using audio in an Android app. The leap required from ExoPlayer to audio player is enormous, both in the amount of code needed and in the amount of domain knowledge required about co ..read more
Categorizing user-uploaded documents
Scribd Technology Blog
by Monique Alves Cruz
1y ago
Scribd offers a variety of publisher and user-uploaded content to our users, and while the publisher content is rich in metadata, user-uploaded content typically is not. Documents uploaded by users have varied subjects and content types, which can make it challenging to link them together. One way to connect content can be through a taxonomy - an important type of structured information widely used in various domains. In this series, we have already shared how we identify document types and extract information from documents. This post will discuss how insights from data were used to help bu ..read more
Information Extraction at Scribd
Scribd Technology Blog
by Antonia Mouawad
1y ago
Extracting metadata from our documents is an important part of our discovery and recommendation pipeline, but discerning useful and relevant details from text-heavy user-uploaded documents can be challenging. This is part 2 in a series of blog posts describing a multi-component machine learning system the Applied Research team built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations the team faced when building information extraction NLP models for Scribd’s text-heavy documents and how they were solv ..read more
Presenting Rust and Python Support for Delta Lake
Scribd Technology Blog
by R Tyler Croy
1y ago
Delta Lake is integral to our data platform, which is why we have invested heavily in delta-rs to support our non-JVM Delta Lake needs. This year I had the opportunity to share the progress of delta-rs at Data and AI Summit. Delta-rs was originally started by my colleague QP just over a year ago and has now grown into a multi-company project with numerous contributors and downstream projects such as kafka-delta-ingest. In the session embedded below, I introduce the delta-rs project which is helping bring the power of Delta Lake outside of the Spark ecosystem. By providing a foundational D ..read more
Identifying Document Types at Scribd
Scribd Technology Blog
by Jonathan Ramkissoon
1y ago
User-uploaded documents have been a core component of Scribd’s business from the very beginning, and understanding what is actually in the document corpus unlocks exciting new opportunities for discovery and recommendation. With Scribd, anybody can upload and share documents, analogous to YouTube and videos. Over the years, our document corpus has become larger and more diverse, which has made understanding it an ever-increasing challenge. Over the past year one of the missions of the Applied Research team has been to extract key document metadata to enrich downstream discovery systems. Our approach ..read more
Automating Databricks with Terraform
Scribd Technology Blog
by R Tyler Croy
1y ago
The long-term success of our data platform relies on putting tools into the hands of developers and data scientists to “choose their own adventure”. A big part of that story has been Databricks, which we recently integrated with Terraform to make it easy to scale a top-notch developer experience. At the 2021 Data and AI Summit, Core Platform infrastructure engineer Hamilton Hord and Databricks engineer Serge Smertin presented on the Databricks Terraform provider and how it’s been used by Scribd. In the session embedded below, they share the details on the Databricks (Labs) Terraform integration ..read more
Kafka to Delta Lake, as fast as possible
Scribd Technology Blog
by Christian Williams
1y ago
Streaming data from Apache Kafka into Delta Lake is an integral part of Scribd’s data platform, but it has been challenging to manage and scale. We use Spark Structured Streaming jobs to read data from Kafka topics and write that data into Delta Lake tables. This approach gets the job done, but our experience in production has convinced us that a different approach is necessary to efficiently bring data from Kafka to Delta Lake. To serve this need, we created kafka-delta-ingest. The user requirements are likely relatable to a lot of folks: My application emits data into Kafka that I want to analy ..read more
